An Off-Beat Approach to Train-Test-Validation Split Your Dataset

By Amarpreet Singh


We all need to sample a population to perform statistical analysis and gain insights. When we do so, the aim is to ensure that the sample's distribution closely matches that of the population.

For this, we have various methods: simple random sampling (where every member of the population has an equal chance of being selected), stratified sampling (which involves dividing the population into subgroups and sampling from each one), cluster sampling (where the population is divided into clusters and entire clusters are randomly selected), systematic sampling (which involves selecting every nth member of the population), and so on. Each method has its advantages and is chosen based on the specific needs and characteristics of the study.

In this article, we won't be focusing on the sampling methods themselves, but rather on using these concepts to split a dataset into Train-Test-Validation sets for machine learning. These approaches work for all kinds of tabular data. We will be working in Python here.

Below are some approaches that you might already know:

This approach uses the simple random sampling method. Example code:
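A minimal sketch using scikit-learn's train_test_split, where X and y are placeholders for your features and target:

```python
from sklearn.model_selection import train_test_split

# First split off a 20% test set, then carve a validation set
# out of the remaining 80% (0.25 x 0.8 = 0.2 of the full data).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```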

This approach ensures that the splits maintain the same proportion of classes as the original dataset (again via random sampling), which is useful for imbalanced datasets. Note that it only works when your target variable is categorical rather than continuous.
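A sketch of the same two-step split with stratification (again with placeholder X and y, where y is categorical):

```python
from sklearn.model_selection import train_test_split

# `stratify` keeps class proportions consistent in every subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)
```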

In K-Fold cross-validation, the dataset is split into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times.
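A minimal sketch with scikit-learn's KFold, assuming X and y are NumPy arrays:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate your model on this fold here
```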

As the name suggests, this is a combination of Stratified sampling and K-fold cross-validation.
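A sketch using scikit-learn's StratifiedKFold, which preserves class proportions within each fold:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```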

Full example usage:
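The original example was not preserved here; below is one plausible end-to-end sketch, using scikit-learn's built-in iris dataset and a logistic regression model as stand-ins:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Mean accuracy across folds: {np.mean(scores):.3f}")
```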

Now, you can use these methods to split your dataset, but they share a limitation: when the dataset is small, purely random splits can produce Train, Test, and Validation sets whose distributions differ noticeably from one another.

Suppose you have a small total number of observations in your dataset and it's difficult to ensure similar distributions amongst your splits. In that case, you can combine clustering with random sampling (or stratified sampling).

Below is how I did it for my problem at hand:

In this method, we first cluster our dataset and then apply sampling methods to each cluster to obtain our data splits.

For example, using HDBSCAN:
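A minimal sketch; the exact parameters from the original are not shown, so min_cluster_size here is an assumption (scikit-learn ships an HDBSCAN implementation from version 1.3):

```python
from sklearn.cluster import HDBSCAN

# X is a placeholder feature matrix; tune min_cluster_size to your data.
clusterer = HDBSCAN(min_cluster_size=5)
cluster_labels = clusterer.fit_predict(X)  # label -1 marks noise points
```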

You can also use other clustering methods suited to your problem, for example K-Means clustering:
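A sketch with scikit-learn's KMeans; the number of clusters is an assumption, so pick it via the elbow method or silhouette score for your own data:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
```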

Now you can also add levels of granularity (any categorical variable) to your dataset to get more refined clusters, as follows:
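One way to sketch this, assuming your data lives in a pandas DataFrame df with a hypothetical categorical column "category": concatenate the categorical value with the cluster label to form finer strata.

```python
import pandas as pd

# `df` and its "category" column are hypothetical placeholders.
df["cluster"] = cluster_labels
df["stratum"] = df["category"].astype(str) + "_" + df["cluster"].astype(str)
```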

Once you have obtained cluster labels from any clustering method, you can use random sampling or stratified sampling to select samples from each cluster.

We will select indices randomly and then use these indices to select our train-test-val sets as follows:
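A sketch of per-cluster random index selection; the 60/20/20 ratio is an assumption, and X, y, and cluster_labels are the NumPy arrays from the earlier steps:

```python
import numpy as np

rng = np.random.default_rng(42)
train_idx, test_idx, val_idx = [], [], []

# Within each cluster, shuffle the row indices and split them 60/20/20.
for label in np.unique(cluster_labels):
    idx = rng.permutation(np.where(cluster_labels == label)[0])
    n_train, n_test = int(0.6 * len(idx)), int(0.2 * len(idx))
    train_idx.extend(idx[:n_train])
    test_idx.extend(idx[n_train:n_train + n_test])
    val_idx.extend(idx[n_train + n_test:])

X_train, X_test, X_val = X[train_idx], X[test_idx], X[val_idx]
y_train, y_test, y_val = y[train_idx], y[test_idx], y[val_idx]
```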

As per my use case, it was useful to sort my target variable y and then assign every 1st, 2nd, and 3rd index to the train, test, and validation sets respectively (all mutually exclusive), a.k.a. systematic random sampling, as below:
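A sketch of that systematic pattern on a flat array; in the clustered setting you would apply the same round-robin assignment within each cluster's sorted indices:

```python
import numpy as np

# Sort rows by the target, then deal indices out round-robin:
# every 1st goes to train, every 2nd to test, every 3rd to validation.
order = np.argsort(y)
train_idx, test_idx, val_idx = order[0::3], order[1::3], order[2::3]

X_train, X_test, X_val = X[train_idx], X[test_idx], X[val_idx]
y_train, y_test, y_val = y[train_idx], y[test_idx], y[val_idx]
```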

The approaches discussed above, which combine clustering with different sampling methods, are very useful when you have a small number of observations in your dataset, as they help maintain similar distributions across the Train, Test, and Validation sets.

Thanks for reading, and I hope you find this article helpful!
