USING ROBUST TREE-BASED METHODS FOR OUTLIER AND ERROR DETECTION

Ray Chambers, Xinqiang Zhao, and Adao Hentges

Department of Social Statistics
University of Southampton
Highfield, Southampton, SO17 1BJ, U.K.

Editing in business surveys is often complicated by the fact that outliers due to errors in the data are mixed in with correct, but extreme, data values. In this paper we focus on a technique for error identification in such long-tailed data distributions, based on fitting outlier-robust tree-based models to the outlier- and error-contaminated data. An application to a trial data set created as part of the EUREDIT project, which contains a mix of extreme errors and "real" values, will be demonstrated. The tree-based approach can be carried out on a variable-by-variable basis or on a multivariate basis. Initial results from both these approaches will be contrasted using this data set. Issues associated with "correcting" identified outliers in these data will also be explored.
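
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea for a single variable, not the authors' method. It assumes a regression tree that splits on one covariate by minimising absolute deviations from node medians, then flags values that lie far from their leaf median relative to a MAD-based scale. All function names, the toy data, and the cutoff of 3 are hypothetical choices made for illustration.

```python
from statistics import median

def abs_loss(values):
    """Sum of absolute deviations from the median (a robust node impurity)."""
    m = median(values)
    return sum(abs(v - m) for v in values)

def fit_tree(rows, min_leaf=5, max_depth=2, depth=0):
    """Fit a small regression tree to (x, y) pairs using median-based splits."""
    ys = [y for _, y in rows]
    centre = median(ys)
    # Robust scale: 1.4826 * MAD is consistent with the s.d. for normal data.
    node = {"median": centre,
            "scale": 1.4826 * median(abs(y - centre) for y in ys)}
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return node
    rows = sorted(rows)
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between equal x values
        loss = (abs_loss([y for _, y in rows[:i]])
                + abs_loss([y for _, y in rows[i:]]))
        if best is None or loss < best[0]:
            best = (loss, i)
    if best is None or best[0] >= abs_loss(ys):
        return node  # no split reduces the absolute loss
    i = best[1]
    node["threshold"] = (rows[i - 1][0] + rows[i][0]) / 2
    node["left"] = fit_tree(rows[:i], min_leaf, max_depth, depth + 1)
    node["right"] = fit_tree(rows[i:], min_leaf, max_depth, depth + 1)
    return node

def leaf_for(tree, x):
    """Walk down the tree to the leaf that contains x."""
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree

def flag_outliers(rows, tree, cutoff=3.0):
    """Flag y values whose robust z-score within their leaf exceeds cutoff."""
    flags = []
    for x, y in rows:
        leaf = leaf_for(tree, x)
        dev = abs(y - leaf["median"])
        if leaf["scale"] == 0:
            flags.append(dev > 0)
        else:
            flags.append(dev / leaf["scale"] > cutoff)
    return flags

# Hypothetical toy data: x is a size measure, y a reported value.
# Large units (x > 10) legitimately report values about ten times higher.
rows = [(x, 100 + x % 3) for x in range(1, 11)]
rows += [(x, 1000 + 10 * (x % 3)) for x in range(11, 21)]
rows[4] = (5, 100000)  # a gross reporting error planted in a small unit

tree = fit_tree(rows)
flags = flag_outliers(rows, tree)
print([x for (x, _), flagged in zip(rows, flags) if flagged])
```

Because each value is compared with the median of its own leaf, the large but genuine values of the bigger units are not flagged, while the planted error is. Medians and the MAD are used so that the error itself has little influence on the fitted tree.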

Pasi Koikkalainen
Fri Oct 18 19:03:41 EET DST 2002