next up previous
Next: DETECTING MULTIVARIATE OUTLIERS IN Up: OUTLIER DETECTION Previous: USING ROBUST TREE-BASED METHODS

DETECTING MULTIVARIATE OUTLIERS IN INCOMPLETE SURVEY DATA WITH THE EPIDEMIC ALGORITHM

Beat Hulliger and Cédric Béguin

Beat Hulliger
Swiss Federal Statistical Office
CH-2010 Neuchâtel
Switzerland
E-mail: Beat.Hulliger@bfs.admin.ch

The Epidemic Algorithm is a non-parametric method to detect multivariate outliers which is completely based on interpoint distances. It simulates an epidemic in a point cloud in p-dimensional space. The epidemic starts on the sample spatial median and then spreads through the point cloud with probabilities that decrease with the distance between points. Outliers typically are infected late. Thus the Epidemic Algorithm maps the multivariate space into time and outlying infection times are used to detect outliers. The cut-off for infection times to be considered outlying is arbitrary in a non-parametric context but it can be discussed under multivariate normality or other stochastic models.

Contrary to most of the established methods the complexity of the Epidemic Algorithm only depends linearly on the dimension p. However it depends quadratically on the number of points involved.

The adaption of the algorithm to missing values is straightforward by leaving out missing values from the distance calculations. The adaption to sampling is somewhat more involved since the infection probability in the population must be estimated from the sample.

The Epidemic Algorithm is applied to data sets of the EUREDIT project which contain missing values, sampling weights and various types of outliers. The choice of the infection probability function and its constants is discussed. The Epidemic Algorithm is an alternative to outlier detection methods which use the Mahalanobis distance and therefore assume an elliptical distribution of the bulk of the data. It can be seen as a data depth method.



Pasi Koikkalainen
Fri Oct 18 19:03:41 EET DST 2002