Data splitting for predictive analytics

An important question that a modeler often faces is how to evaluate a model that has been built on a particular set of data. Usually the data is divided into two parts, or two separate data sets.

The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to evaluate its performance.

In other words, even after you build a great model that works perfectly with the training data, it is important to decide which samples will be used to evaluate its performance. Ideally, the model should be evaluated on observations that were not part of the training data. Those samples were used to build or fine-tune the model, so an evaluation based on them will be optimistically biased.

When a modeler is dealing with a large amount of data, she has the freedom to set aside a part of the sample data to evaluate the final model.

The challenge arises when the number of samples is not large. In that case test sets are sometimes avoided because every data point in the sample may be needed for model building.

Moreover, the test set size may not be sufficient to make reasonable judgements.

Also, in many cases, validation using a single test set can be a poor choice.

In such cases, resampling methods, such as cross-validation, can be used to produce appropriate estimates of model performance using the training set.

Resampling techniques, applied properly, often produce better performance estimates than a single test set. This is because they evaluate many alternate versions of the data, whereas a test set offers only a single version.
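To make the resampling idea concrete, here is a minimal sketch of k-fold cross-validation, one common resampling scheme. The function name `k_fold_indices` and its parameters are illustrative assumptions, not part of any particular library. It only produces index splits; fitting and scoring a model on each split is left to the reader.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, holdout_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        # Training indices are all folds except the one held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout

# Each sample is held out exactly once across the k splits.
for train_idx, holdout_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), len(holdout_idx))
```

Averaging the k hold-out scores gives a single performance estimate that draws on every sample, which is the sense in which resampling evaluates many versions of the data.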

Even in cases where a test set is deemed necessary, a proper approach should be taken for splitting the samples. Nonrandom approaches to splitting the data are sometimes appropriate.

Our purpose is to ensure that the model generalizes to similar sets of data. So we may build the test set from data that is current and that was collected from sources similar to those of the training data.

A good example is spam filtering, where it is more important for the model to catch new spamming techniques than older spamming schemes.
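One nonrandom split that fits the spam example is a chronological one: train on the older messages and test on the most recent. The sketch below assumes each record is a dict with a hypothetical `received` field holding a sortable timestamp. Both the field name and the function name are invented for illustration.

```python
def chronological_split(records, test_fraction=0.2, key=lambda r: r["received"]):
    """Hold out the most recent records as the test set.

    `key` extracts a sortable timestamp; the default field name
    "received" is a hypothetical placeholder.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    ordered = sorted(records, key=key)
    n_test = max(1, round(len(ordered) * test_fraction))
    # Oldest records go to training, newest to testing.
    return ordered[:-n_test], ordered[-n_test:]

# Toy messages with integer timestamps standing in for real dates.
messages = [{"id": i, "received": t}
            for i, t in enumerate([5, 3, 9, 1, 7, 2, 8, 4, 6, 10])]
train, test = chronological_split(messages, test_fraction=0.2)
```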

Often, however, the goal is to make the training and test sets as homogeneous as possible. Random sampling methods can be used to create similar data sets.

The simplest way to split the data into a training and test set is to take a simple random sample. This does not control for any of the data attributes, such as the percentage of data in the classes. When one class has a disproportionately small frequency compared to the others, there is a chance that the distribution of the outcomes may be substantially different between the training and test sets.
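A simple random split can be sketched in a few lines. The function name `simple_random_split` and the 25% default are assumptions made for illustration. The toy labels show how a rare class can end up unevenly represented, since nothing in the procedure looks at the class.

```python
import random

def simple_random_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and cut once; class balance is not controlled."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy outcome with a 5% minority class.
labels = ["rare"] * 5 + ["common"] * 95
train, test = simple_random_split(labels, test_fraction=0.25, seed=1)
# The share of "rare" in each part depends on the seed and may differ.
print(train.count("rare") / len(train), test.count("rare") / len(test))
```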

To account for the outcome when splitting the data, stratified random sampling applies random sampling within subgroups. In this way, there is a higher likelihood that the outcome distributions will match. When the outcome is a number, a similar strategy can be used; the numeric values are broken into similar groups (e.g., low, medium, and high) and the randomization is executed within these groups.
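The stratified approach can be sketched as follows. The names `stratified_split` and `tercile_labels` are hypothetical helpers, and splitting a numeric outcome into three equal-sized rank groups is just one possible way to form the low, medium, and high groups.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling randomly within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

def tercile_labels(values):
    """Map numeric outcomes to 'low'/'medium'/'high' by rank thirds."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    names = ["low", "medium", "high"]
    labels = [None] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = names[rank * 3 // len(values)]
    return labels

labels = ["rare"] * 8 + ["common"] * 92
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

For a numeric outcome, the values can first be passed through `tercile_labels` and the resulting group names used as the labels for `stratified_split`.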

Alternatively, the data can be split on the basis of the predictor values.

A dissimilarity-based sample is useful in certain cases to ensure that the points placed in the test set are not similar to one another, so that the test set is spread across the range of predictor values. Dissimilarity between two samples can be measured in a number of ways. The simplest method is to use the distance between the predictor values for two samples. If the distance is small, the points are in close proximity. Larger distances between points are indicative of dissimilarity.
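As a small illustration of distance as a dissimilarity measure, the sketch below uses Euclidean distance, which is one common choice and an assumption here, since the text does not name a specific metric. The function name is hypothetical.

```python
import math

def euclidean_dissimilarity(a, b):
    """Euclidean distance between two samples' predictor values."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of predictors")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_dissimilarity((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

Because predictors on large scales dominate a raw distance, it is usually sensible to center and scale the predictors before computing it.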

To use dissimilarity as a tool for data splitting, suppose the test set is initialized with a single observation. The dissimilarity between this initial observation and the unallocated observations can be calculated. The unallocated observation that is most dissimilar would then be added to the test set. To allocate more observations to the test set, a method is needed to determine the dissimilarities between groups of points (i.e., the two in the test set and the unallocated points). One approach is to use the average or minimum of the dissimilarities.

For example, to measure the dissimilarities between the two observations in the test set and a single unallocated point, we can determine the two dissimilarities and average them. The third point added to the test set would be chosen as having the maximum average dissimilarity to the existing set. This process would continue until the targeted test set size is achieved.
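The procedure described above can be sketched as a greedy loop. This is an illustrative sketch rather than a reference implementation: the function name, the `start` and `combine` parameters, and the use of Euclidean distance via `math.dist` (Python 3.8+) are all assumptions.

```python
import math

def max_dissimilarity_sample(points, n_test, start=0, combine="mean"):
    """Greedily pick test-set indices by maximum dissimilarity.

    `combine` chooses how distances to the current test set are
    summarized: "mean" (average) or "min" (minimum).
    """
    if not 1 <= n_test <= len(points):
        raise ValueError("n_test must be between 1 and len(points)")
    if combine not in ("mean", "min"):
        raise ValueError("combine must be 'mean' or 'min'")
    test_idx = [start]  # initialize the test set with one observation
    unallocated = [i for i in range(len(points)) if i != start]

    def score(i):
        dists = [math.dist(points[i], points[j]) for j in test_idx]
        return min(dists) if combine == "min" else sum(dists) / len(dists)

    while len(test_idx) < n_test:
        # Add the unallocated point most dissimilar to the current test set.
        best = max(unallocated, key=score)
        test_idx.append(best)
        unallocated.remove(best)
    return test_idx

points = [(0, 0), (1, 0), (10, 0), (5, 8), (0.5, 0)]
print(max_dissimilarity_sample(points, n_test=3))  # [0, 2, 3]
```

The remaining indices form the training set. Ties are broken by position in the list, and the result depends on which observation is used to start the test set.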
