ALGORITHMS FOR AUTOMATIC ERROR LOCALISATION AND MODIFICATION

Ton de Waal

Statistics Netherlands,
PO Box 4000, 2270 JM Voorburg, Netherlands,
e-mail address: twal@cbs.nl

Automatic editing and imputation can be subdivided into three steps. The first step is error localisation, during which the erroneous fields are identified. The second step is imputation, in which values are imputed for the erroneous fields and for missing data. The final step is modification, during which the imputed values are adjusted slightly so that any edits still violated after imputation become satisfied. Statistics Netherlands has developed algorithms for error localisation and modification in a mix of continuous and categorical data. The algorithm for error localisation is based on the (generalised) Fellegi-Holt paradigm, which says that data should be made to satisfy all edits by changing the smallest (weighted) number of variables. The algorithm determines all optimal solutions to the error localisation problem, given a user-specified upper bound on the number of errors. If a record contains more errors than the specified upper bound, it is not corrected automatically. In this paper we describe the algorithm in some detail. The algorithm for modification is based on the paradigm that, after imputation, the imputed values should be modified as little as possible to satisfy the edits. To measure the distance between an imputed record and the final, modified record, we use a distance function that is the sum of a part involving only the categorical variables and a part involving only the continuous ones. The categorical part consists of a sum of positive weights, where each weight indicates the cost of changing the imputed value into a certain other value. The numerical part consists of a weighted sum of absolute differences between the imputed values and the final values. In this paper we describe a heuristic that has been developed to minimise this distance function subject to the restriction that all edits become satisfied.
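As a small illustration of the distance function described above, the following Python sketch computes the categorical part as a sum of change costs and the numerical part as a weighted sum of absolute differences. The function name, the record layout, the variable names and all cost and weight values are hypothetical and chosen only for this example. The sketch evaluates the distance for a given pair of records and does not implement the minimisation heuristic of the paper.

```python
def distance(imputed, final, cat_costs, num_weights):
    """Distance between an imputed record and its modified version.

    cat_costs[var][(old, new)] is the (positive) cost of changing the
    imputed value `old` of categorical variable `var` into `new`.
    num_weights[var] is the weight of continuous variable `var`.
    All names and structures here are assumptions for illustration.
    """
    # Categorical part: sum of costs over the values that were changed.
    cat_part = sum(
        cat_costs[var][(imputed[var], final[var])]
        for var in cat_costs
        if imputed[var] != final[var]
    )
    # Numerical part: weighted sum of absolute differences.
    num_part = sum(
        weight * abs(imputed[var] - final[var])
        for var, weight in num_weights.items()
    )
    return cat_part + num_part


# Hypothetical record with one categorical and two continuous variables.
imputed = {"sector": "retail", "turnover": 100.0, "costs": 120.0}
final = {"sector": "wholesale", "turnover": 100.0, "costs": 95.0}
cat_costs = {
    "sector": {("retail", "wholesale"): 2.0, ("wholesale", "retail"): 3.0}
}
num_weights = {"turnover": 1.0, "costs": 0.5}

d = distance(imputed, final, cat_costs, num_weights)
print(d)  # 2.0 for the sector change plus 0.5 * 25.0 for costs
```

Because the cost table is indexed by ordered pairs of values, changing "retail" into "wholesale" may cost something different from the reverse change, which matches the description of weights that depend on the target value.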
Finally, we also describe modifications of the error localisation algorithm that allow it to be applied to related problems.
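To make the (generalised) Fellegi-Holt paradigm concrete, the sketch below solves a tiny error localisation problem by brute force: it enumerates sets of at most max_errors variables and keeps every set of minimum total weight whose values can be changed so that all edits hold. This exhaustive search is a stand-in for illustration only and is not the algorithm developed at Statistics Netherlands. The function names, the finite candidate domains (continuous values are restricted to a grid here), the edits and the weights are all assumptions.

```python
from itertools import combinations, product


def can_repair(record, subset, domains, edits):
    """True if some assignment to the variables in `subset` satisfies all edits."""
    for values in product(*(domains[v] for v in subset)):
        candidate = {**record, **dict(zip(subset, values))}
        if all(edit(candidate) for edit in edits):
            return True
    return False


def localise_errors(record, domains, edits, weights, max_errors):
    """Return all minimum-weight sets of variables that can be changed to
    satisfy every edit, or [] if no set of size <= max_errors works."""
    best_weight, solutions = None, []
    variables = list(record)
    for k in range(max_errors + 1):
        for subset in combinations(variables, k):
            weight = sum(weights[v] for v in subset)
            # Skip sets that are already heavier than the best one found.
            if best_weight is not None and weight > best_weight:
                continue
            if can_repair(record, subset, domains, edits):
                if best_weight is None or weight < best_weight:
                    best_weight, solutions = weight, []
                solutions.append(set(subset))
    return solutions


# Hypothetical record that violates the balance edit below.
record = {"turnover": 100, "costs": 60, "profit": 10}
grid = range(0, 201, 10)  # assumed candidate values for every variable
domains = {"turnover": grid, "costs": grid, "profit": grid}
edits = [
    lambda r: r["profit"] == r["turnover"] - r["costs"],  # balance edit
    lambda r: r["costs"] >= 0,                            # non-negativity
]
# A higher weight marks a variable as more reliable, hence costlier to change.
weights = {"turnover": 2, "costs": 1, "profit": 1}

solutions = localise_errors(record, domains, edits, weights, max_errors=2)
print(solutions)
```

With max_errors=0 the same call returns an empty list, which corresponds to a record that is left for manual treatment because it needs more changes than the bound allows.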

Pasi Koikkalainen
Fri Oct 18 19:03:41 EET DST 2002