A Decision Tree is a simple yet powerful machine learning algorithm used for both classification and regression tasks. It models decisions using a tree-like structure where each internal node represents a question or condition on a feature, each branch represents an outcome of that condition, and each leaf node represents a final prediction. The model learns by splitting the data based on the feature that best separates the classes, making it easy to interpret and visualize how predictions are made. Decision Trees are known for their transparency, speed, and ability to handle both numerical and categorical data.
The Decision Tree algorithm works like a series of yes/no questions that help make predictions. Imagine trying to decide whether a customer will subscribe to a term deposit. Instead of guessing, the algorithm creates a "tree" where each question narrows down the possibilities. These questions are based on patterns found in the data, such as whether the customer was contacted in a certain month or how they responded to a previous campaign.
Training the model means feeding it past data so it can learn which questions to ask and in what order. The algorithm looks through all the available information and finds the features that do the best job of separating customers who said "yes" from those who said "no." At each step, it picks the most useful question, splits the data accordingly, and continues doing this until it reaches a decision. The final outcome appears at the bottom of the tree, known as the "leaf," where the model makes its prediction.
Once trained, the tree can take in new customer information and follow the same path of questions to reach a prediction. For example, if a new customer was contacted in August, had no housing loan, and responded positively in the past, the tree might follow those branches and conclude that this customer is likely to subscribe. This makes the prediction process fast, transparent, and easy to understand.
What makes Decision Trees especially useful is their ability to show how a decision is made. Each branch of the tree tells a story of why a certain outcome was predicted. This makes the model not just accurate but also explainable, an important feature when decisions affect real people, like in banking or marketing campaigns.
This image visually explains the basic structure of a Decision Tree. At the top is the Root Node, which represents the first decision the model makes based on a feature. It branches into Internal Nodes, each representing further decision points based on different feature values. These eventually lead to Leaf Nodes, which are the final outputs or predictions made by the model. Leaf nodes are not split any further and usually represent class labels (e.g., "yes" or "no"). This hierarchical structure reflects how a Decision Tree breaks down complex decisions into a series of simpler, interpretable rules.
Decision Trees are widely used in real-world applications where interpretability and decision logic matter, such as credit scoring, marketing campaign targeting, and customer churn prediction.
In theory, it is possible to construct a practically unlimited number of different Decision Trees for a dataset, especially if we allow flexibility in which features are split on, at which thresholds, and in what order.
This flexibility is powerful, but it also makes Decision Trees prone to overfitting. That's why techniques like pruning, depth limiting, and ensemble methods (e.g., Random Forests) are commonly used to help trees generalize better to unseen data.
In this project, Decision Trees were used to uncover the most important factors influencing a customer's likelihood to subscribe to a term deposit, offering both predictions and actionable insights.
Imagine a dataset of 10 customers: 6 did not subscribe ("no"), and 4 did subscribe ("yes"). The entropy of this parent node is:

\[ \text{Entropy(parent)} = -\left( \frac{6}{10} \log_2 \frac{6}{10} + \frac{4}{10} \log_2 \frac{4}{10} \right) \approx 0.97 \]
Suppose the candidate split is Contact Month = March, producing one group of 4 customers with entropy 0.81 and another of 6 customers with entropy 0.65. The weighted entropy after the split is:

\[ \text{Weighted Entropy} = \left( \frac{4}{10} \cdot 0.81 \right) + \left( \frac{6}{10} \cdot 0.65 \right) \approx 0.715 \]
\[ \text{Information Gain} = 0.97 - 0.715 = 0.255 \]
Splitting on Contact Month = March reduces uncertainty by about 0.255 bits (roughly a 26% relative reduction), making it a strong candidate for the root or next node.
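The calculation above can be reproduced in a few lines of Python. The child class counts (3/1 and 1/5) are assumptions chosen purely so the child entropies match the 0.81 and 0.65 quoted in the worked example:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Parent node: 6 "no" and 4 "yes" out of 10 customers
parent = entropy([6, 4])                 # ~0.97

# Hypothetical split on Contact Month = March; class counts are illustrative,
# picked to reproduce the child entropies from the worked example
march = entropy([3, 1])                  # 4 customers, ~0.81
other = entropy([1, 5])                  # 6 customers, ~0.65

weighted = (4 / 10) * march + (6 / 10) * other   # ~0.715
gain = parent - weighted                 # ~0.26 (0.255 in the text rounds first)
print(round(parent, 2), round(weighted, 3), round(gain, 3))
```

Using the unrounded child entropies gives a gain of about 0.256; the 0.255 in the text comes from subtracting the rounded values 0.97 and 0.715.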
The dataset used in this project originates from a cleaned version of a Portuguese bank marketing campaign, and several preprocessing steps were applied to prepare it for machine learning.
This preprocessing step ensured the dataset was clean, consistent, and fully compatible with supervised learning models.
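While the report does not enumerate the exact cleaning steps here, a typical encoding step for this kind of dataset looks like the following sketch. The column names are assumptions based on the dataset description, not the project's actual code:

```python
import pandas as pd

# Toy frame with assumed column names from the bank marketing dataset
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin."],
    "housing": ["yes", "no", "yes"],
    "age": [30, 45, 52],
})

# Convert categorical columns to numeric dummy variables so that
# scikit-learn models can consume them; numeric columns pass through
encoded = pd.get_dummies(df, columns=["job", "housing"])
print(list(encoded.columns))
```

Each categorical column becomes one indicator column per category, while `age` is kept as-is.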
To evaluate model performance fairly, the data was split into two disjoint sets: a training set used to fit the models and a testing set held out for evaluation.
This separation is crucial to avoid data leakage and to ensure that the model is assessed on unseen data, providing a reliable estimate of how well it will generalize in real-world scenarios.
Using the same data for both training and testing would lead to data leakage, causing the model to "memorize" answers and artificially inflate performance. Disjoint datasets simulate real-world scenarios by testing how well the model performs on unseen customers, which is critical for determining how it would behave in actual deployment.
This separation guarantees that performance metrics like accuracy, precision, and recall are trustworthy and reflect the model's ability to generalize.
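A minimal sketch of such a split with scikit-learn, using synthetic stand-in data. The 80/20 ratio and stratification are assumptions for illustration; the report does not state the exact split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # stand-in for the customer features
y = rng.integers(0, 2, size=100)       # stand-in for the "yes"/"no" target

# Disjoint train/test sets; stratify keeps the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```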
Three Decision Tree classifiers were trained using different configurations to compare performance and behavior: a baseline tree with default parameters, a tree restricted to max_features='sqrt', and a tree constrained to max_depth=3 with criterion='entropy'.
Each tree was visualized to demonstrate how decisions are made based on customer attributes, such as job type, marital status, and previous campaign outcomes. These visualizations make it easier to understand the logic behind each prediction, which is especially valuable for business stakeholders seeking transparency in automated decision-making.
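Based on the configurations described later in this section, the three classifiers can be sketched with scikit-learn as follows (synthetic data stands in for the customer features; `random_state=42` is an assumption for reproducibility):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data shaped like the campaign features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree1 = DecisionTreeClassifier(random_state=42)                       # default parameters
tree2 = DecisionTreeClassifier(max_features='sqrt', random_state=42)  # feature sampling
tree3 = DecisionTreeClassifier(max_depth=3, criterion='entropy',
                               random_state=42)                       # shallow, entropy splits

for model in (tree1, tree2, tree3):
    model.fit(X, y)
print(tree3.get_depth())
```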
The image above displays the structure of a Decision Tree trained with default parameters, visualized up to a depth of 3 for clarity.
At the top is the root node, which splits the dataset based on the feature poutcome (previous campaign outcome). This indicates that whether or not a previous campaign was successful plays a significant role in predicting future subscription behavior.
From there, the tree branches out based on features such as pdays (e.g., pdays ≤ 217.5) and age.

Each node shows the splitting condition, its impurity, the number of samples it covers, and the predicted class.

This tree shows a combination of business logic (e.g., recent follow-up, age groups) and campaign history as influential factors, and serves as a highly interpretable way to understand model decisions.
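Tree structures like the one in the figure can also be rendered as text with scikit-learn's export_text. The sketch below uses synthetic data and placeholder feature names, so the specific splits it prints are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; feature names are placeholders, not the real columns
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Text rendering of the fitted tree; plot_tree() produces the graphical version
rules = export_text(model, feature_names=["poutcome", "pdays", "age", "month"])
print(rules)
```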
This tree was trained by limiting the number of features considered at each split using the square root of total features. This technique is often used to reduce overfitting and increase diversity in ensemble methods like Random Forests.
In this visualization:
- The root node splits on previous, representing the number of times a client was contacted before. This suggests that past engagement history influences subscription behavior.
- Other prominent splits involve balance, marital status, and month, which relate to a client's financial condition, social status, and timing of contact.

This tree helps emphasize which limited combinations of attributes can still yield reasonably good predictions, making it efficient and less prone to noise.
This model is trained with two constraints: a maximum depth of 3 and entropy as the splitting criterion.
Despite being shallow, the model captures key decision-making paths:
- The root split uses poutcome, again reinforcing the importance of previous campaign success.
- Subsequent splits involve contact type and month, indicating that how and when a client was contacted significantly affects outcomes.
- Lower branches consider housing loan status and pdays, reflecting the client's current commitments and the time since last contact.

By prioritizing purity of splits using entropy, this tree ensures each decision brings maximum clarity. Its limited depth also makes it ideal for business presentations and quick interpretation.
The confusion matrices below display the performance of all three Decision Tree models on the test data. Each matrix shows how many predictions the model got right and where it made errors, helping to visually compare their effectiveness.
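Each such matrix can be computed with scikit-learn's confusion_matrix. The labels below are made-up stand-ins for test-set results, not the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and model predictions
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["no", "yes", "yes", "no", "no", "yes"]

# Rows are actual classes, columns are predicted classes,
# in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
print(cm)
```

The diagonal counts correct predictions; the off-diagonal cells are the false positives and false negatives the text refers to.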
Understanding which features contributed the most to model decisions is key for making data-driven business strategies:
The feature importance chart for Tree 1 reveals which variables the default decision tree relied on most, with poutcome, contact, and month ranked at the top.
Tree 2's feature importance plot, generated with limited feature sampling, highlights a slightly different hierarchy, ranking balance, age, and day_of_week highly.
The bar chart above shows the relative importance of each feature used in Decision Tree 3, which was trained using a maximum depth of 3 and entropy as the splitting criterion.
Attributes like job, education, loan, marital status, and balance contributed very little to Tree 3âs decisionsâlikely due to either strong correlation with more dominant features or limited influence under a constrained tree depth.
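Charts like these are typically built from the fitted tree's feature_importances_ attribute. A minimal sketch on synthetic data, with placeholder feature names standing in for the real columns:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; "f0".."f5" are placeholders for job, balance, etc.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
feature_names = [f"f{i}" for i in range(6)]

model = DecisionTreeClassifier(max_depth=3, criterion='entropy',
                               random_state=42).fit(X, y)

# Importances sum to 1; sort descending to mirror the bar charts
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```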
The Receiver Operating Characteristic (ROC) curve is a graphical representation of a model's ability to distinguish between classes. The Area Under the Curve (AUC) gives a single-value summary of this ability.
Although all models perform above chance, Tree 3 stands out in ROC-AUC performance, reflecting a stronger trade-off between true positive rate and false positive rate. It's especially useful when correctly identifying the "yes" class (subscribers) is important, even at the cost of missing a few non-subscribers.
For Tree 2 (max_features='sqrt'), discriminative power remains modest.
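The AUC itself comes from comparing predicted scores against true labels, e.g. with scikit-learn's roc_auc_score. A tiny worked example with made-up probabilities (1 means "yes"):

```python
from sklearn.metrics import roc_auc_score

# Toy test labels and predicted subscription probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC = probability that a random "yes" outscores a random "no";
# here 3 of the 4 yes/no pairs are ranked correctly
auc = roc_auc_score(y_true, scores)
print(round(auc, 2))
```

An AUC of 0.5 is chance level; 1.0 is perfect ranking.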
The bar chart below presents a side-by-side comparison of accuracy scores for three Decision Tree models trained on the marketing campaign dataset.
Tree 1 (default parameters): Accuracy 82.65%. A baseline tree that evaluates all features without restrictions. While simple, it offers decent predictive power and a good starting point for comparison.
Tree 2 (max_features='sqrt'): Accuracy 83.32%. Slightly improves performance by limiting the number of features considered at each split. This adds randomness and reduces the risk of overfitting.
Tree 3 (max_depth=3, criterion='entropy'): Accuracy 89.07%. The best-performing model. Limiting depth helps it generalize better on unseen data, while using entropy ensures more informative splits. This balance between simplicity and precision allows Tree 3 to make the most accurate predictions.
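Each accuracy figure is simply the fraction of correct test-set predictions, computable with scikit-learn's accuracy_score. A toy illustration with stand-in labels:

```python
from sklearn.metrics import accuracy_score

# Toy ground truth and predictions: 4 of 5 correct
y_true = ["no", "yes", "no", "no", "yes"]
y_pred = ["no", "yes", "yes", "no", "yes"]

acc = accuracy_score(y_true, y_pred)
print(acc)
```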
Key takeaways:
- Tree 3 (max_depth=3, criterion='entropy') delivered the best performance with an accuracy of 89.07%, indicating strong generalization with minimal overfitting.
- Tree 2 (max_features='sqrt') achieved a moderate accuracy of 83.32%, slightly outperforming Tree 1 by introducing controlled feature selection at each split.
- Trees 1 and 3 identified poutcome, contact, and month as the most influential factors, indicating that previous campaign results and timing are critical.
- Tree 2 ranked balance, age, and day_of_week highly, showing a shift in feature relevance based on model configuration.

The analysis helped uncover what truly influences a customer's decision to subscribe to a term deposit. One of the most important discoveries was the impact of previous campaign outcomes. Customers who had a positive experience or showed interest during earlier campaigns were much more likely to subscribe again. This shows that past behavior is a powerful indicator of future decisions and can help identify customers who are more open to offers.
Another valuable insight was the significance of when and how the customer was contacted. Certain months showed better results than others, and the method of contact, whether through a call or another channel, also made a difference. This suggests that timing and communication style are just as important as the message itself. These patterns can help tailor future campaigns by choosing the most effective times and channels to reach potential customers, ultimately leading to better responses.
Interestingly, the analysis showed that some commonly assumed important factors, like job title, marital status, or education level, played a much smaller role in predicting outcomes. This shifts the focus toward behavioral and engagement-based attributes rather than demographic profiles. It suggests that customers should not be grouped only by who they are, but more importantly by how they interact with the bank and how they've responded in the past.
In the end, the approach used in this project allowed for simple, easy-to-understand decision paths that explain exactly how a conclusion was reached. This makes the results not only accurate but also actionable. The findings can be used to design more focused and cost-effective campaigns by identifying the right people to contact and the best way to engage them. Overall, this approach supports smarter, more efficient decision-making in marketing without relying on overly complex or hidden methods.