Next: About this document Up: ADDITIONAL PAPERS Previous: THE APPLICATION OF OUTPUT

TREE BASED MODELS AND CONDITIONAL PROBABILITY FOR AUTOMATIC DERIVATION OF VALIDATION RULES

Photis Stavropoulos

Liaison Systems SA
77 Akadimias Str,
106 78 Athens
Greece
email: photis@liaison.gr

One of the most important steps in any survey that collects large amounts of data is (automatic) editing. The starting point in any editing application is a set of edits defined by a group of subject matter specialists according to some check plan, i.e. a list of "error sources". The present paper is concerned with a recently proposed approach for the automatic derivation of edits from clean datasets. We deal with validation rules (i.e. conditions that data must satisfy), which the approach views as conditional probability statements. In other words, a rule involving certain variables is seen as a statement of which values of some of them are most probable given the values of the others. The approach proceeds by specifying the domains of the variables involved in a given rule and then estimating the conditional probabilities on this probability space. In this way, a generic validation treatment is created that does not require rules to be defined formally.
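As a rough illustration of reading a validation rule as a conditional probability statement, the sketch below estimates the distribution of one variable given others by relative frequencies in a clean dataset, then flags a record whose observed value is improbable. The function names, the example variables (`age_group`, `marital_status`), the data, and the 0.05 threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def estimate_conditionals(records, given, target):
    """Estimate P(target | given) by relative frequencies in clean data."""
    counts = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[g] for g in given)
        counts[key][rec[target]] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }


def is_suspicious(record, probs, given, target, threshold=0.05):
    """Flag a record whose target value is improbable given the other values."""
    key = tuple(record[g] for g in given)
    dist = probs.get(key)
    if dist is None:
        # Assumption: a combination never seen in the clean data is flagged.
        return True
    return dist.get(record[target], 0.0) < threshold


# Hypothetical clean dataset, for illustration only.
clean = [
    {"age_group": "child", "marital_status": "single"},
    {"age_group": "child", "marital_status": "single"},
    {"age_group": "adult", "marital_status": "single"},
    {"age_group": "adult", "marital_status": "married"},
]

probs = estimate_conditionals(clean, given=["age_group"], target="marital_status")
new_record = {"age_group": "child", "marital_status": "married"}
flagged = is_suspicious(new_record, probs, ["age_group"], "marital_status")
```

Here no explicit rule such as "children are not married" is ever written down; the estimated conditional distribution plays that role.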

As a practical tool for estimating the probabilities we propose segmentation via tree based models. Suppose a dataset contains N cases described by K explanatory variables (numerical and/or categorical) and a response variable. The data are partitioned in an optimal way according to the values of the explanatory variables and the result is a tree. Each node corresponds to certain values of the explanatory variables and contains cases with a certain distribution of the response variable. This conditional distribution of each node is a validation rule.
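The segmentation idea can be sketched with a very small tree grower. The abstract does not say which splitting criterion or tree algorithm is used, so the Gini impurity, the binary equality splits on categorical variables, the `min_size` stopping rule, and the example data below are assumptions chosen only to keep the illustration short. Each leaf is returned as a pair of conditions on the explanatory variables and the distribution of the response among the cases in that leaf.

```python
from collections import Counter


def gini(rows, target):
    """Gini impurity of the response variable within a set of cases."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, features, target, min_size):
    """Return the (feature, value) equality split with the lowest weighted
    Gini impurity, or None if no admissible split improves on the node."""
    best, best_score = None, gini(rows, target)
    for f in features:
        for v in sorted({r[f] for r in rows}):
            left = [r for r in rows if r[f] == v]
            right = [r for r in rows if r[f] != v]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = (len(left) * gini(left, target)
                     + len(right) * gini(right, target)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, v), score
    return best


def leaf_rules(rows, features, target, min_size=2, path=()):
    """Grow the tree recursively; return one (conditions, distribution)
    pair per leaf."""
    split = best_split(rows, features, target, min_size)
    if split is None:
        counts = Counter(r[target] for r in rows)
        return [(path, {k: c / len(rows) for k, c in counts.items()})]
    f, v = split
    left = [r for r in rows if r[f] == v]
    right = [r for r in rows if r[f] != v]
    return (leaf_rules(left, features, target, min_size, path + ((f, "==", v),))
            + leaf_rules(right, features, target, min_size, path + ((f, "!=", v),)))


# Hypothetical clean dataset: two explanatory variables and one response.
data = (
    [{"age_group": "child", "employed": "no", "income": "none"}] * 3
    + [{"age_group": "adult", "employed": "yes", "income": "positive"}] * 3
    + [{"age_group": "adult", "employed": "no", "income": "none"}] * 2
    + [{"age_group": "adult", "employed": "no", "income": "positive"}]
)

rules = leaf_rules(data, ["age_group", "employed"], "income")
for conditions, distribution in rules:
    print(conditions, distribution)
```

A leaf whose distribution puts all or nearly all of its mass on one value behaves like a conventional hard edit, while a more spread-out leaf gives a softer, probabilistic rule.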

In the paper we investigate ways of obtaining as many rules as possible and also ways of overcoming the theoretical and computational problems of the approach. The work presented is carried out under the INSPECTOR IST project on automatic data validation.



Pasi Koikkalainen
Fri Oct 18 19:03:41 EET DST 2002