
DETECTING MULTIVARIATE OUTLIERS IN INCOMPLETE SURVEY DATA WITH THE BACON-EM ALGORITHM

Cédric Béguin and Beat Hulliger

Cédric Béguin
Swiss Federal Statistical Office
CH-2010 Neuchâtel
Switzerland
E-mail: cedric.beguin@bfs.admin.ch

The BACON (Blocked Adaptive Computationally-efficient Outlier Nominator) algorithm, one of many forward search methods, is a very efficient method for detecting outliers in multivariate data with an elliptical distribution. Starting from a small subset of good points, BACON iteratively grows this subset using Mahalanobis distances computed from the mean and the covariance matrix of the good observations only. When the good subset stops growing, the observations with the largest Mahalanobis distances are nominated as outliers. Adapting BACON to complete survey data is straightforward: the mean and the covariance matrix are replaced by weighted estimates. Missing values are more problematic. The EM algorithm for multivariate normal data is used to estimate the mean and the covariance matrix at each step of the BACON algorithm, and its adaptation to survey data is presented. We discuss how the two algorithms are merged by splitting EM so that it exploits the growing structure of BACON, as well as the number of EM iterations to use. The assumption on the missingness mechanism is the usual one for EM, namely that the data are missing at random (MAR). Examples on well-known datasets with challenging outliers are shown with up to 30% of the data missing completely at random (MCAR). The BACON-EM algorithm is also applied to datasets from the EUREDIT project.
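The growing step described above, for complete data with survey weights, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`weighted_mean_cov`, `mahalanobis`, `bacon_sketch`), the starting subset (the points closest to the coordinate-wise median, chosen without weights), the default starting size of 4p, and the fixed user-supplied `cutoff` are all choices made here for brevity. It leaves out any small-sample correction of the cutoff and does not handle missing values, so the EM part of BACON-EM is not covered.

```python
import numpy as np

def weighted_mean_cov(X, w):
    """Weighted mean and (maximum-likelihood style) weighted covariance."""
    mean = np.average(X, axis=0, weights=w)
    centered = X - mean
    cov = (centered * w[:, None]).T @ centered / w.sum()
    return mean, cov

def mahalanobis(X, mean, cov):
    """Mahalanobis distance of every row of X from (mean, cov)."""
    centered = X - mean
    solved = np.linalg.solve(cov, centered.T).T
    return np.sqrt(np.einsum("ij,ij->i", centered, solved))

def bacon_sketch(X, cutoff, w=None, start_size=None, max_iter=50):
    """Simplified BACON-style forward search on complete data.

    Assumptions of this sketch: fixed cutoff on the distances, unweighted
    coordinate-wise median start, continuous data (nonzero MAD).
    Returns a boolean outlier flag per row and the last computed distances.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    start_size = 4 * p if start_size is None else start_size

    # Initial good subset: points closest to the coordinate-wise median,
    # with each coordinate scaled by its median absolute deviation.
    med = np.median(X, axis=0)
    scale = np.median(np.abs(X - med), axis=0)
    d0 = np.sqrt((((X - med) / scale) ** 2).sum(axis=1))
    good = np.zeros(n, dtype=bool)
    good[np.argsort(d0)[:start_size]] = True

    for _ in range(max_iter):
        # Estimates use the good observations only, with their weights.
        mean, cov = weighted_mean_cov(X[good], w[good])
        d = mahalanobis(X, mean, cov)
        new_good = d < cutoff
        # Stop when the subset is stable or too small to estimate from.
        if np.array_equal(new_good, good) or new_good.sum() <= p:
            break
        good = new_good
    return ~good, d
```

A call such as `flagged, d = bacon_sketch(X, cutoff=3.0)` returns one flag per observation; a cutoff of 3.0 is roughly the square root of the 0.99 quantile of a chi-squared distribution with 2 degrees of freedom, so it is only a sensible illustrative value for two-dimensional data.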


