In any machine learning project, evaluating the performance of a model accurately requires splitting the data into two disjoint sets: one for training the model (Train Set) and one for testing how well it performs on unseen data (Test Set). This split ensures that the model does not simply memorize the data, but instead learns patterns that can be applied to new, real-world scenarios. If the same data is used for both training and testing, the performance results will be misleadingly optimistic—because the model is being tested on what it has already seen.
A disjoint split means that no data point used during training appears in the test set. This is critical because it simulates the real-world scenario where a model has to make predictions on completely new data. Using overlapping data would result in data leakage, where the model has unfair information during evaluation, leading to unreliable performance metrics.
The data was split into 80% training and 20% testing using train_test_split() from Scikit-learn, with a fixed random_state=42 to ensure reproducibility. To maintain the same distribution of the target variable ('yes' or 'no') across both sets, stratified sampling was used. This ensures that both the training and test sets reflect the class imbalance of the original data (majority 'no', minority 'yes').
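The split described above can be sketched as follows. This is a minimal illustration: the toy DataFrame and its column names stand in for the real dataset, but the train_test_split() call mirrors the stated settings (80/20, random_state=42, stratification on the target).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset; values and size are illustrative only.
df = pd.DataFrame({
    "age": [25, 40, 33, 51, 29, 62, 45, 38, 57, 23],
    "y":   ["no", "no", "yes", "no", "no", "yes", "no", "no", "no", "yes"],
})

X = df.drop(columns="y")
y = df["y"]

# stratify=y preserves the yes/no proportions in both subsets;
# random_state=42 makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Because the same random_state and ratio are reused for every model, each classifier sees exactly the same training and test examples.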
This same base split was applied consistently across all models to allow fair performance comparisons.
MultinomialNB: Applied after binning the numeric features (age, balance, campaign, previous) into 4 discrete bins. Categorical features were one-hot encoded. The model was trained on the processed training set, with the same transformations retained for the test set.
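A hedged sketch of this setup, assuming synthetic data: only age, balance, campaign, and previous come from the text, while the 'job' column and all values are invented for illustration. Wrapping the transformers in a Pipeline fits them on the training fold only and reapplies the same structure at test time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the dataset; 'job' is an assumed categorical column.
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "balance": rng.normal(1000, 500, n),
    "campaign": rng.integers(1, 10, n),
    "previous": rng.integers(0, 5, n),
    "job": rng.choice(["admin", "technician", "blue-collar"], n),
    "y": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
})
numeric = ["age", "balance", "campaign", "previous"]
categorical = ["job"]

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["y"],
    test_size=0.2, random_state=42, stratify=df["y"]
)

clf = Pipeline([
    ("pre", ColumnTransformer([
        # 4 discrete bins per numeric feature, one-hot so counts stay >= 0
        ("bins", KBinsDiscretizer(n_bins=4, encode="onehot-dense"), numeric),
        ("cats", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("nb", MultinomialNB()),
])
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```

One-hot encoding the bins keeps every feature a non-negative count, which is the representation MultinomialNB models.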
BernoulliNB: Numeric features were binarized into 2 quantile-based bins, and categorical features were one-hot encoded into binary indicators. The split was applied after all features had been converted to binary (0/1) values, the input format BernoulliNB expects.
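A minimal sketch of the binary setup, again on invented data (the 'job' column and all values are assumptions). Two quantile bins with ordinal encoding split each numeric feature at its median into 0/1, and the one-hot columns are already binary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

rng = np.random.default_rng(7)
n = 200
# Synthetic stand-in; column names beyond age/balance are assumptions.
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "balance": rng.normal(1000, 500, n),
    "job": rng.choice(["admin", "technician", "blue-collar"], n),
    "y": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "balance", "job"]], df["y"],
    test_size=0.2, random_state=42, stratify=df["y"]
)

clf = Pipeline([
    ("pre", ColumnTransformer([
        # 2 quantile bins -> ordinal codes 0/1, i.e. below/above the median
        ("bin2", KBinsDiscretizer(n_bins=2, encode="ordinal",
                                  strategy="quantile"), ["age", "balance"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ])),
    ("bnb", BernoulliNB()),
])
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```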
CategoricalNB: Numeric features were binned into ordinal categories, and categorical features were ordinally encoded as integers. The split was applied after all transformations, ensuring the model received valid categorical inputs.
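A sketch under the same assumptions (synthetic data, invented 'job' column). Matching the text, all transformations run before the split, so every category index the test set contains was seen during encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the dataset.
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "balance": rng.normal(1000, 500, n),
    "job": rng.choice(["admin", "technician", "blue-collar"], n),
    "y": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
})

# As described above, transformations happen before the split.
X_num = KBinsDiscretizer(n_bins=4, encode="ordinal").fit_transform(
    df[["age", "balance"]])
X_cat = OrdinalEncoder().fit_transform(df[["job"]])
X_all = np.hstack([X_num, X_cat])

X_train, X_test, y_train, y_test = train_test_split(
    X_all, df["y"], test_size=0.2, random_state=42, stratify=df["y"]
)

# min_categories guards against a category index that happens to be
# absent from the training fold after the split.
clf = CategoricalNB(min_categories=4).fit(X_train, y_train)
pred = clf.predict(X_test)
```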
Decision Tree. Split Timing: After discretization and encoding. Split Ratio: 80% training, 20% testing. Special Considerations: Decision trees are not sensitive to feature scaling, so no normalization was performed. The same train-test split was used as in the other models to ensure consistent evaluation.
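A minimal sketch of the tree setup on synthetic data (make_classification stands in for the encoded features; the imbalance weight is an assumption). Note the absence of any scaling step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data standing in for the encoded features
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No scaler: each split compares one feature against a threshold,
# so feature magnitude never matters.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc = tree.score(X_test, y_test)
```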
Logistic Regression. Split Timing: After preprocessing (binarizing and encoding features). Split Ratio: 80/20. Preprocessing Note: Logistic regression benefits from feature scaling, so numeric features were optionally scaled using StandardScaler, fitted after the split to prevent data leakage. This consistent strategy allowed the model to learn from clean training data and be evaluated on truly unseen examples.
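The leakage-free scaling can be sketched as below (synthetic data; the pipeline layout is one reasonable reading of the description, not the exact project code). Putting StandardScaler inside the Pipeline guarantees it is fitted on the training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler learns mean/std from X_train only; test statistics
# therefore cannot leak into training.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```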
SVM. Split Timing: After preprocessing (binarizing and encoding features). Split Ratio: 80/20. Preprocessing Note: SVM models are sensitive to the scale of input features, so numeric features were scaled using StandardScaler after the split to avoid data leakage and improve model performance. This approach helped the SVM find an optimal separating boundary while being fairly evaluated on truly unseen test data.
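The same scale-then-fit pattern for the SVM, sketched on synthetic data (the RBF kernel is an assumption; the text does not name one).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The RBF kernel depends on distances between points, so an unscaled
# large-magnitude feature would dominate the decision boundary.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
]).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```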
Random Forest. Split Timing: After preprocessing (binarizing and encoding features). Split Ratio: 80/20. Preprocessing Note: Random forests do not require feature scaling because tree-based splits are unaffected by feature magnitude. Only binarization and encoding were performed before splitting. This setup allowed the model to train on a clean, properly formatted dataset and be tested on new examples to measure its real-world predictive strength.
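A final sketch for the forest, on the same synthetic stand-in data; the n_estimators value is an assumption. As with the single tree, no scaler appears anywhere in the flow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No scaling step: every tree in the ensemble splits on raw thresholds,
# so feature magnitude does not change the learned structure.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
acc = forest.score(X_test, y_test)
```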