Claudio Conversano, Roberta Siciliano
Department of Matematics and Statistics
University of Naples Federico II
Via Cintia, Monte Sant’Angelo, I-80126
Missing or incomplete data are a serious problem in many fields of research because they can lead to bias and ine±ciency in estimating the quantities of interest. The relevance of this problem is strictly proportional to the dimensionality of the collected data. Particularly, in data mining applications a substantial proportion of the data may be missing and predictions might be made for instances with missing inputs.
In recent years, several approaches for missing data imputation have been presented in the literature. Main reference in the field is the Little and Rubin book on statistical analysis with missing data [L&R-87]. In most situations, a common way for dealing with missing data is to discard records with missing values and restrict the attention to the completely observed records. This approach relies on the restrictive assumption that missing data are Missing Completely At Random (MCAR), i.e., that the missingness mechanism does not depend on the value of observed or missing attributes. This assumption rarely holds and, when the emphasis is on prediction rather than on estimation, discarding the records with missing data is not an option [S-98]. An alternative and weaker version of the MCAR assumption is the Missing at Random (MAR) condition. Under a MAR process the fact that data are missing depends on observed data but not on missing data themselves. While the MCAR condition means that the distributions of observed and missing data are indistinguishable, the MAR condition states that the distributions di®er but missing data points can be explained (and therefore predicted) by other observed variables. In principle, a way to meet the MAR condition is to include completely observed covariates that are highly related to the missing ones. Actually, most of the existing methods for missing data imputation discussed in Shafer [Sc-97] just assume the MAR condition and, in these settings, a common way to deal with missing data are conditional mean methods or model imputation, i.e., to fit a model on the complete cases for a given variable treating it as the outcome and then, for cases where this variable is missing, plug-in the available data in the model to get an imputed value. The most popular conditional mean method employs least squares regression but it can be often unsatisfactory for nonlinear data and biased if model misspecification occurs.
In this work, in order to overcome the shortcomings of the conditional mean imputation method, ¤The work was supported by EC Research Project IST-2000-26347 (INSPECTOR) Funds. we propose the iterative use of tree based models for missing data imputation in large data bases. The proposed procedure uses lexicographic order to rank missing values that occur in di®erent variables and deals with these incrementally, i.e, augmenting the data by the previously filled in records according to the defined order.
2. Overview of Tree Based Classifiers
Tree Based Classifiers consists of a recursive binary partition of a set of objects described by some explanatory variables (either numerical or and categorical) and a response variable [B-84]. Data are partitioned by choosing at each step a variable and a cut point along it, generating the most homogeneous subgroups respect to the response variable according to a goodness of split measure. The procedure results in a powerful graphical representation known as decision tree, which express the sequential grouping process. Once the tree is built, a response value or a class label is assigned to each terminal node. According to the nature of the response variable, they usually distinguish between Classification Tree (for the categorical response case) and Regression Tree (for the numerical response case). In the classification tree case, when the response variable takes value in a set of previously defined classes, the node is assigned to the class which presents the highest proportion of observations. Whereas, in the regression tree case, the value assigned to cases in a given terminal node is the mean of the response variable values associated with the observations belonging to the given node. Note that, in both cases, this assignment is probabilistic, in the sense that a measure of error is associated with it. Main advantages of these methods are:
3. Proposed Approach
The basic idea is as follows: given a variable for which data are missing, and set of other d (d < p) observed variables, the method works by using the former as response variable y and the latter as covariates . The resulting tree explains the distribution of the response variable in terms of the values of the covariates. Since the terminal nodes of the tree are homogeneous with respect to the response, they provide candidate imputation values for this variable. To deal with the data presenting missing values in many covariates, we introduce an incremental approach based on a suitably defined data preprocessing schema. The underlying assumption is the MAR process.
3.1 Missing Data Ranking
Let be the original data matrix, with d completely observed variables. For any record, if q is the number of covariates with missing data, which maximum value is Q = (p-d+1), there might be up to type of missingness, ranging from the simplest case where only one covariate is missing to the hardest condition to deal with, where all the Q covariates are missing. We perform a two-way rearrangement of X, one with respect to the columns and one with respect to the rows . We propose to use a lexicographic ordering [C&T-91] [K&U-01] that matches the ordering by value, corresponding to the number of missing values occurring in each record. Practically, we form a string vector of length n that indicates the occurrence and the number of missing values for each row of . This allows to order in a way that the first incomplete column presents the lowest number of missing values and it follows the complete observed ones. Furthermore, columns also are ordered in the way that the first m rows (m ¿ n) contain instances with no missing values and the remaining (n-m) rows present missing values. As a result, is partitioned into four disjoint matrices asfollows:
Note that, as a consequence of the ordering schema, only contains missing values while the other three blocks are completely observed with respect to their rows and columns.
3.2 Incremental Imputation
Once that the different types of missingness have been ranked and coded, the missing data imputation is iteratively made using tree based models. With respect to the records presenting only one missing attribute, a simple tree is used. Here, the variable with missing values is the response and the other observed variables are the covariates. The tree is built on the current complete data cases in and the its results are used to impute the cases in . In fact, terminal nodes of the tree represent candidate “imputed values”. Actual imputed values are get by dropping down the tree the cases of corresponding to the missing values in (for the variable under imputation), till a terminal node is reached. The conjunction of the filled-in cells of with the corresponding observed rows in generates new records which are appended to , that gains the rows whose missing values have been just imputed and a “new” column corresponding to the variable under imputation.
For records presenting multiple missing values, trees are used iteratively. In this case, according to the previously defined lexicographic ordering, the tree is first used to fill in the missing values of the covariate presenting the smallest number of incomplete records. The procedure is then repeated for the remaining covariates under imputation. In this way, we form as many trees as the number of covariates with missing values in the given missingness. In the end, the imputation of joint missingness derives from subsequent trees. A graphical representation of both the data preprocessing and the imputation of the first missing value is given in figure 1.
A formal description of the proposed imputation process can be given as follows. Let be the cell presenting a missing input in the r-th row and the s-th column of the matrix . The imputation of this input derives from the tree grown from the learning sample where . Obviously due to the lexicographical order the imputation is made separately for each set of rows presenting missing data in the same cells. As a result, this imputation process is incremental, because as it goes on more and more information is added to the data matrix, both respect the rows and the columns. In other words, is updated in each iteration, and the additional information is used for imputing the remaining missing inputs. Equivalently, the matrix containing missing values shrinks after each set of records with missing inputs has been filled-in. Furthermore, the imputation is also conditional because, in the joint imputation of multiple inputs, the subsequent imputations are conditioned on previously filled-in inputs.
4. Concluding Remarks
Tree based classifiers appear particularly promising because when dealing with missing data
because they enjoys two important properties:
a) they do not require the specification of a model structure;
b) they can deal with di®erent type of predictors (numerical and categorical). In this framework, we propose an automatic procedure that takes advantage of modern computing and allows to handle nonresponses in an easy and fast way, by defining a lexicographic order of the data. The results concerning the application of the proposed methodology on both artificial and real data set, not showed here due to the lack of space, confirm its e®ectiveness, in most cases when data are nonlinear and heterosckedastic.
Figure 1. An illustration of the incremental imputation process: a) original data matrix; b) data rearrangement based on the number of missingness in each column; c) data rearrangement based on the number of missingness in each row and definition of the lexicographical order; d) the rearranged data matrix after the first iteration of the imputation process.