
KERNEL METHODS FOR THE MISSING DATA PROBLEM

Hugh Mallinson and Alex Gammerman

Royal Holloway College
University of London
www.clrc.rhbnc.ac.uk

An imputation problem requires the completion of a dataset that is missing values on some or all variables. A successful imputation preserves the joint probability distribution of the dataset. We compare four imputation algorithms: a linear regressor, group-mean imputation, a neural network, and a Support Vector Machine (SVM); our chief aim is to evaluate the SVM's performance. We artificially induce missing-data patterns in three data sets: Boston Housing (BH), the Danish Labour Force Survey (DLFS), and the Sample of Anonymised Records (SARS). Our performance measures include root-mean-square error and absolute error. We also compare the full set of imputations with the set of true values; this comparison (e.g. using the Kolmogorov-Smirnov distance) measures how well the set of imputations preserves the marginal distribution observed in the missing true values.
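
The distributional comparison mentioned above can be sketched as follows. This is a minimal illustration (using NumPy, which the paper does not specify) of the two-sample Kolmogorov-Smirnov distance between the true values and the imputed values; the data are synthetic, not the paper's.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

# Hypothetical example: imputations drawn from the right distribution
# give a small distance; constant (mean) imputation collapses the
# marginal distribution and gives a large one.
rng = np.random.default_rng(0)
true_vals = rng.normal(0.0, 1.0, 500)
good_imputed = rng.normal(0.0, 1.0, 500)
mean_imputed = np.full(500, true_vals.mean())

print(ks_distance(true_vals, good_imputed))  # small
print(ks_distance(true_vals, mean_imputed))  # large
```

This is why a per-unit error measure such as RMSE is not enough on its own: mean imputation can score well on RMSE while destroying the marginal distribution that the KS distance detects.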

The Support Vector Machine is a new non-parametric prediction technique that has shown state-of-the-art performance in some high-dimensional classification and regression problems, for example digit recognition and text retrieval. These algorithms exploit new regularisation concepts, e.g. the VC dimension, which control the capacity of highly non-linear models. Fitting an SVM requires the solution of a convex quadratic program.
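
As a concrete sketch (not the paper's setup, and using the scikit-learn library, which the paper does not mention), an RBF-kernel support vector regressor fitted to noisy samples of sin(x):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (200, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 200)

# Fitting solves a convex quadratic program; the solution is a sparse
# kernel expansion over a subset of training points (the support vectors).
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

pred = model.predict(np.array([[0.5]]))[0]
print(pred)  # close to sin(0.5)
```

The epsilon parameter sets the width of the insensitive tube: points whose residual stays inside it do not become support vectors, which is what keeps the model sparse.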

The imputation problem is in one sense harder than the standard prediction scenario, as we must often restore values on more than one variable. Moreover, when modelling or predicting a given variable, one may have to use units that lack values on other variables. We propose a method that trains one model for each variable with missing values, and we offer two approaches to selecting a training set for each of these models.
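
The per-variable scheme can be sketched as below. This is a simplified illustration, not the paper's method: the model is a plain linear regressor, and the training-set selection shown is the simple complete-cases strategy (train only on rows with no missing values); rows missing more than one variable are left untouched here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def impute_per_variable(X):
    """Fit one model per column that contains missing values (NaN),
    training on complete cases, then predict the missing entries."""
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        others = [k for k in range(X.shape[1]) if k != j]
        model = LinearRegression().fit(X[complete][:, others], X[complete, j])
        # Only impute rows where the predictors themselves are observed.
        rows = miss & ~np.isnan(X[:, others]).any(axis=1)
        X[rows, j] = model.predict(X[rows][:, others])
    return X

# Hypothetical example: column 1 is a linear function of column 0.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
data = np.column_stack([x, 2.0 * x + 1.0])
data[0, 1] = np.nan
imputed = impute_per_variable(data)
print(imputed[0, 1])  # close to 2*data[0, 0] + 1
```

Handling units that are missing values on several variables at once is exactly the complication noted above; the complete-cases choice sidesteps it at the cost of discarding training data, which is why alternative training-set selections are worth comparing.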

The experiments undertaken show SVMs to be a useful tool. On the BH data the SVM's root-mean-square error (RMSE) is best by a 5% margin, and on DLFS the SVM again has a 5% lower RMSE. The SARS results show the SVM performing best relative to the other methods on scalar variables.



Pasi Koikkalainen
Fri Oct 18 19:03:41 EET DST 2002