Naïve Bayes is a supervised machine learning algorithm commonly used for classification tasks. It is based on Bayes’ Theorem, a fundamental rule in probability that describes how to update the probability of a hypothesis based on new evidence. The "naïve" aspect comes from the assumption that all features are independent of each other, which simplifies computation. Despite this assumption rarely holding true in real-world data, Naïve Bayes often delivers competitive performance.
Bayes' Theorem allows us to combine prior knowledge (P(H)) with new evidence (P(E|H)) to make predictions. In classification, this means estimating which class label is most likely given the features of a data point. Naïve Bayes applies this by assuming feature independence and using it to efficiently compute these probabilities even in high-dimensional datasets.
The image above illustrates Bayes’ Theorem, which is the foundational concept behind the Naïve Bayes algorithm. It provides a mathematical framework to calculate the probability of a hypothesis (H) given some observed evidence (E).
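In symbols, Bayes' Theorem is:

$$
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
$$

where $P(H)$ is the prior probability of the hypothesis, $P(E \mid H)$ is the likelihood of the evidence under that hypothesis, $P(E)$ is the overall probability of the evidence, and $P(H \mid E)$ is the updated (posterior) probability. In classification, a Naïve Bayes model predicts the class $\hat{c} = \arg\max_{c} \; P(c) \prod_{i} P(x_i \mid c)$, where the product over individual features $x_i$ reflects the independence assumption.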
Naïve Bayes calculates the probability that a given input belongs to a certain class based on its features. It evaluates each feature independently and multiplies their probabilities to find the most likely class. The algorithm is probabilistic, meaning it not only classifies the input but also provides the probability behind the classification decision.
In simple terms: “Given these input features, which class is most likely?”
Naïve Bayes is frequently applied in domains that involve classification of structured or text-based data, and it is particularly suitable for tasks such as spam detection, document classification, and predicting customer responses to marketing campaigns.
Scikit-learn offers four major variants of the Naïve Bayes classifier. Each is optimized for a specific data distribution and feature type.
| Variant | Feature Type | Data | Typical Use Case |
|---|---|---|---|
| Gaussian NB | Continuous numeric | Normal (Gaussian) | Medical or sensor data |
| Multinomial NB | Count/Frequency | Multinomial distribution | Text classification, spam detection |
| Bernoulli NB | Binary (0/1) | Bernoulli distribution | Binary text features, presence detection |
| Categorical NB | Categorical labels | Categorical distribution | Demographic or survey data |
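As a quick illustration of the table above, the four scikit-learn classes can each be fitted on a small toy matrix of the matching feature type (the data below is invented purely for demonstration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB

# Toy matrices matched to each variant's expected feature type (invented data).
X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.9, 6.8]])  # continuous values
X_counts = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 3]])    # counts / frequencies
X_bin = (X_counts > 0).astype(int)                                   # presence / absence
X_cat = np.array([[0, 1], [1, 0], [2, 1], [2, 0]])                   # integer category codes
y = np.array([0, 0, 1, 1])

fitted = {
    "GaussianNB": (GaussianNB().fit(X_cont, y), X_cont),
    "MultinomialNB": (MultinomialNB().fit(X_counts, y), X_counts),
    "BernoulliNB": (BernoulliNB().fit(X_bin, y), X_bin),
    "CategoricalNB": (CategoricalNB().fit(X_cat, y), X_cat),
}
# Each model predicts on the data layout it was designed for.
preds = {name: model.predict(X) for name, (model, X) in fitted.items()}
```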
The Multinomial Naïve Bayes algorithm is a supervised classification technique based on Bayes' Theorem. It is especially well-suited for discrete feature data, such as word counts in text classification or frequency-based attributes. What makes it "naïve" is the assumption that all features are conditionally independent given the class label — a simplification that allows the model to perform surprisingly well, even when the assumption doesn’t hold perfectly.
In the Multinomial variant, the algorithm calculates the probability of a sample belonging to each class based on the frequency of features, and then predicts the class with the highest probability. It works particularly well in applications where input features represent counts or proportions, such as spam detection, document classification, or in this case, predicting customer responses to marketing efforts.
Despite its simplicity, the Multinomial Naïve Bayes model is computationally efficient, easy to implement, and often performs competitively with more complex models, especially when the dataset contains categorical or textual features.
This image provides a simplified visual representation of how the Multinomial Naïve Bayes algorithm works. It illustrates the core idea that the model calculates the probability of a data point belonging to a particular class based on the frequency of features, assuming that all features are conditionally independent. Each input feature contributes to the final prediction through a calculated likelihood, which is combined with the overall class probability (prior). This approach makes Multinomial Naïve Bayes particularly effective for categorical or count-based data, as used in this project.
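A minimal sketch of this idea on an invented four-message corpus (the documents and labels are hypothetical, not from this project's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented mini corpus: 1 = spam, 0 = not spam.
docs = ["win cash now", "cheap pills win", "meeting at noon", "lunch meeting today"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)            # each row holds word counts for one message
clf = MultinomialNB().fit(X_counts, labels)

# An unseen message is scored by the counts of the words it contains.
pred = clf.predict(vec.transform(["win cash"]))
```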
In Naïve Bayes models, particularly the Multinomial variant, smoothing is essential to handle cases where a particular feature-category combination does not appear in the training data. This scenario leads to a zero probability, which can cause the entire posterior probability of a class to become zero — effectively eliminating that class from consideration during prediction.
To prevent this, a technique called Laplace Smoothing (also known as add-one smoothing) is applied. It adjusts the probability estimates slightly to ensure that no probability is ever exactly zero, even for unseen combinations. This makes the model more robust, especially when working with sparse or limited datasets, and improves its ability to generalize to new or rare observations in the test set.
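In scikit-learn, Laplace smoothing is controlled by the `alpha` parameter (`alpha=1.0` is add-one smoothing). A small sketch with invented count data, where one feature never appears in class 0:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count data: feature 2 never occurs in class 0's training rows.
X = np.array([[3, 1, 0], [2, 2, 0], [0, 1, 4], [1, 0, 5]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)      # alpha=1.0 is Laplace (add-one) smoothing

# Without smoothing, P(feature 2 | class 0) would be exactly zero, eliminating
# class 0 outright; with smoothing the posterior stays strictly positive.
probs = clf.predict_proba(np.array([[0, 0, 1]]))
```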
The dataset was carefully preprocessed to align with the assumptions of the Multinomial Naive Bayes model, which requires non-negative, categorical or frequency-based features. The following steps were performed:
• The columns duration, pdays, and day_of_week were removed to prevent skewed or misleading results.
• The numeric features age, balance, campaign, and previous were discretized into four quantile-based bins using KBinsDiscretizer. This transformation converted raw numerical values into categorical groupings, ensuring compatibility with the Naïve Bayes algorithm.
• The categorical features job, marital, education, default, housing, loan, contact, month, and poutcome were transformed using one-hot encoding to create binary indicator variables.
• The target y (subscription status) was label-encoded as binary: 0 for "no" and 1 for "yes".
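The steps above can be sketched as follows. The miniature DataFrame is an invented stand-in for the bank-marketing data, with only a few of the real columns:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, LabelEncoder

# Invented miniature stand-in for the bank-marketing data.
df = pd.DataFrame({
    "age": [25, 34, 51, 62, 45, 29, 58, 38],
    "balance": [100, 1500, 300, 8000, 50, 2200, 600, 90],
    "job": ["student", "admin", "retired", "retired", "admin", "student", "admin", "services"],
    "duration": [120, 90, 300, 60, 45, 200, 10, 75],
    "y": ["no", "no", "yes", "yes", "no", "no", "yes", "no"],
})

# 1. Drop columns excluded from modeling.
df = df.drop(columns=["duration"])

# 2. Discretize numeric features into four quantile-based bins.
num_cols = ["age", "balance"]
kbd = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
df[num_cols] = kbd.fit_transform(df[num_cols])

# 3. One-hot encode categorical features into binary indicator columns.
df = pd.get_dummies(df, columns=["job"])

# 4. Label-encode the target: "no" -> 0, "yes" -> 1.
df["y"] = LabelEncoder().fit_transform(df["y"])
```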
It is essential that there is no overlap between the training and testing sets. If the model were tested on data it had already seen during training, the results would be misleading—it could appear to perform better than it actually does in real-world scenarios. A disjoint split ensures that the test results accurately reflect the model's true ability to generalize to new customers.
The dataset was successfully preprocessed for the Multinomial Naïve Bayes model, resulting in a feature matrix with 47 columns. The full dataset (prepared_bank_data_nb.csv) contains 45,211 records, which were split into training (train_data_nb.csv, 36,168 records) and testing (test_data_nb.csv, 9,043 records) sets. Importantly, no missing values were found in any of the files, ensuring the data is clean and ready for modeling.
The Multinomial Naïve Bayes model was trained on the preprocessed dataset using an 80-20 train-test split. After fitting the model, predictions were made on the test set.
Before training the Multinomial Naïve Bayes model, the dataset was divided into features (X) and the target variable (y). The y column, which indicates whether a customer subscribed to a term deposit ("yes" or "no"), was extracted as a separate variable and removed from the feature set. This step ensures that the model is trained only on the input features and does not accidentally learn from the outcome labels. The resulting feature matrix (X_mnb) was then used for model training, while the target series (y_mnb) was used for supervision during learning and evaluation.
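A sketch of this split-and-train workflow, using randomly generated stand-in data in place of the preprocessed feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_mnb = rng.integers(0, 4, size=(500, 10))    # stand-in for the binned/encoded features
y_mnb = rng.integers(0, 2, size=500)          # stand-in for the encoded target

# 80-20 disjoint split: test rows are never seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X_mnb, y_mnb, test_size=0.2, random_state=42, stratify=y_mnb)

clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
```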
=== MultinomialNB Results ===
Accuracy: 0.8865
Classification Report:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| no | 0.91 | 0.97 | 0.94 | 7952 |
| yes | 0.55 | 0.31 | 0.40 | 1091 |
| accuracy | | | 0.89 | 9043 |
| macro avg | 0.73 | 0.64 | 0.67 | 9043 |
| weighted avg | 0.87 | 0.89 | 0.87 | 9043 |
Bernoulli Naïve Bayes is a variant of the Naïve Bayes classification algorithm that is specifically designed for binary or boolean feature data. It is based on the assumption that features follow a Bernoulli distribution, meaning each feature is either present (1) or absent (0) in a given sample.
This model is particularly effective when the input data consists of binary indicators, such as in text classification tasks, where features often represent the presence or absence of specific keywords. Unlike Multinomial Naïve Bayes, which considers feature counts, Bernoulli Naïve Bayes only considers whether a feature appears or not.
The foundation of Bernoulli Naïve Bayes lies in Bayes’ Theorem, which is used to calculate the probability of a data point belonging to a certain class, given the features it contains. The algorithm makes a simplifying assumption that all features are conditionally independent of one another within each class — an assumption that rarely holds true in real-world data, but often still yields good results due to the model’s robustness and efficiency.
Before applying Bernoulli Naïve Bayes, all features must be converted into binary format. In this project, continuous numerical features such as age, balance, campaign, and previous were binarized using quantile-based discretization. This transformation split the data into high and low values. Categorical variables were one-hot encoded, converting each unique category into a separate binary column.
This image offers a clear visual explanation of how the Bernoulli Naïve Bayes algorithm works. It highlights the key idea that the model assumes features are binary—either present (1) or absent (0)—and uses this information to calculate the probability of a data point belonging to each class. The algorithm evaluates the presence or absence of each feature using the Bernoulli distribution, multiplying these probabilities along with the prior probability of each class. This makes it especially effective for datasets with one-hot encoded or binarized features, such as in text classification or, in this case, customer attributes.
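The two-bin quantile binarization described earlier can be sketched on an invented balance column: values below the median map to 0 and values above it to 1:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Invented balance values (one column, eight customers).
balance = np.array([[100], [1500], [300], [8000], [50], [2200], [600], [90]])

# Two quantile bins: below the median -> 0, above it -> 1.
binarizer = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
balance_bin = binarizer.fit_transform(balance)
```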
To prepare the dataset for the Bernoulli Naïve Bayes model, the following preprocessing steps were performed:
• The columns duration, pdays, and day_of_week were removed, as they are either continuous, not binary, or not relevant to the Bernoulli model.
• The numeric features age, balance, campaign, and previous were discretized into two bins using quantile-based binning, effectively converting them into binary values.
• Categorical features such as job, marital, and education were one-hot encoded to represent each category as a binary column.
• The target y was label-encoded for classification.
The binarized dataset was split into training and testing sets using an 80–20 split to evaluate the model’s performance on unseen data.
• A BernoulliNB classifier from sklearn.naive_bayes was initialized with default parameters.
• The model was trained on the binarized training features (Xb_train) and corresponding target labels (yb_train) using the .fit() method.
• Predictions were generated on the test features (Xb_test) using the .predict() method.
• Predictions were compared against the true labels (yb_test) using accuracy_score and classification_report to assess the model's performance.
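The training and evaluation steps above can be sketched end to end; the binary feature matrix here is randomly generated as a stand-in for the binarized dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(1)
Xb = rng.integers(0, 2, size=(400, 12))       # binary feature matrix (stand-in data)
yb = rng.integers(0, 2, size=400)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.2, random_state=42, stratify=yb)

bnb = BernoulliNB()                           # default parameters (alpha=1.0)
bnb.fit(Xb_train, yb_train)
yb_pred = bnb.predict(Xb_test)

acc = accuracy_score(yb_test, yb_pred)
report = classification_report(yb_test, yb_pred, zero_division=0)
```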
=== BernoulliNB Results ===
Accuracy: 0.8616
Classification Report:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| no | 0.92 | 0.92 | 0.92 | 7952 |
| yes | 0.42 | 0.42 | 0.42 | 1091 |
| accuracy | | | 0.86 | 9043 |
| macro avg | 0.67 | 0.67 | 0.67 | 9043 |
| weighted avg | 0.86 | 0.86 | 0.86 | 9043 |
The Categorical Naïve Bayes algorithm is a probabilistic classification model specifically designed for datasets in which all the input features are categorical in nature. Unlike other variants such as Gaussian Naïve Bayes (which assumes features are continuous and normally distributed) or Multinomial Naïve Bayes (which works with frequency or count-based features), Categorical Naïve Bayes operates on data where each feature can take one of a fixed number of discrete values, such as "married", "single", "unemployed", or "tertiary education".
At its core, Categorical Naïve Bayes uses Bayes’ Theorem to calculate the probability that a data point belongs to a particular class, based on the combination of its feature values. It works by learning the likelihood of each category value occurring within each class from the training data. When a new observation is encountered, the model combines these probabilities with prior class probabilities to make a prediction. The "naïve" assumption it makes—that all features are independent given the class—allows it to perform these calculations efficiently, even on high-dimensional data.
To ensure compatibility with the Categorical Naïve Bayes algorithm, the dataset was carefully preprocessed with the goal of representing all input features as discrete categorical values. Below is a detailed explanation of each step:
• The columns duration, pdays, and day_of_week were removed, as they were either not meaningful for modeling or not suitable for categorical encoding.
• The numeric features age, balance, campaign, and previous were discretized into four quantile-based bins using KBinsDiscretizer, converting them into ordinal categorical variables suitable for the Categorical Naïve Bayes model.
• Categorical features such as job, marital, education, loan, and contact were encoded using ordinal encoding, which converts each category into an integer label. Unlike one-hot encoding, this method preserves a single column per feature in a format directly usable by the classifier.
• The dataset was split into X_cat (feature matrix) and y_cat (target labels), preparing it for training with the Categorical Naïve Bayes algorithm. The target column y was label-encoded into binary form: "no" as 0 and "yes" as 1.
The fully preprocessed categorical dataset was divided into training and testing sets using an 80–20 split to evaluate the model’s generalization performance on unseen data.
• A CategoricalNB model from sklearn.naive_bayes was initialized with default parameters.
• The model was trained on the encoded training features (Xc_train) and corresponding target labels (yc_train) using the .fit() method.
• Predictions were generated on the test features (Xc_test) using the .predict() method.
• Predictions were compared against the true labels (yc_test) using accuracy_score and classification_report.
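A compact sketch of the ordinal-encoding and training steps above, on an invented miniature table (the column names follow the dataset, the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Invented miniature table; column names follow the bank-marketing dataset.
df = pd.DataFrame({
    "job": ["student", "admin", "retired", "admin", "student", "retired"],
    "marital": ["single", "married", "married", "single", "single", "married"],
    "y": ["no", "no", "yes", "no", "no", "yes"],
})

# Ordinal encoding maps each category to an integer code (one column per feature).
enc = OrdinalEncoder()
X_cat = enc.fit_transform(df[["job", "marital"]])
y_cat = (df["y"] == "yes").astype(int)        # "no" -> 0, "yes" -> 1

cnb = CategoricalNB().fit(X_cat, y_cat)       # default parameters
pred = cnb.predict(X_cat[:1])                 # first row: a "student"/"single" customer
```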
=== CategoricalNB Results ===
Accuracy: 0.8857
Classification Report:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| no | 0.91 | 0.96 | 0.94 | 7952 |
| yes | 0.54 | 0.32 | 0.41 | 1091 |
| accuracy | | | 0.89 | 9043 |
| macro avg | 0.73 | 0.64 | 0.67 | 9043 |
| weighted avg | 0.87 | 0.89 | 0.87 | 9043 |
The model correctly predicted 7,682 non-subscribers (true negatives) and 335 subscribers (true positives).
It misclassified 270 non-subscribers as subscribers (false positives) and missed 756 actual subscribers (false negatives).
This indicates the model performs well overall, especially for the "no" class, but has room for improvement in detecting "yes" responses — which is typical in class-imbalanced datasets.
The model performs reasonably well, identifying the majority of non-subscribers correctly. Compared to MultinomialNB:
• It improved recall for the "yes" class (456 vs. 335 true positives).
• But also introduced more false positives, misclassifying more "no" instances as "yes".
This trade-off shows that BernoulliNB is slightly better at detecting actual subscribers, though it sacrifices some precision on the "no" class.
Compared to MultinomialNB and BernoulliNB:
• It strikes a balance between precision and recall for the "yes" class.
• Fewer false positives than BernoulliNB (295 vs. 617).
• Better true positive count than MultinomialNB (352 vs. 335).
Overall, CategoricalNB offers more balanced performance, especially in handling class imbalance without heavily compromising on either side.
The bar chart illustrates the accuracy scores achieved by three different variants of the Naïve Bayes classification algorithm—MultinomialNB, BernoulliNB, and CategoricalNB—when applied to the same marketing dataset.
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate at different thresholds. The closer the curve is to the top-left corner, the better the model performs.
The Precision-Recall (PR) curve is ideal for evaluating imbalanced datasets. It shows how well a model identifies the minority class (“yes”).
This bar chart compares how well each Naïve Bayes model predicts the minority class ("yes") using three metrics: precision, recall, and F1-score.
The best model depends on campaign goals:
• BernoulliNB is best when maximizing outreach (high recall).
• MultinomialNB is best when minimizing waste (high precision).
• CategoricalNB is the most balanced across all three metrics.
SMOTE (Synthetic Minority Over-sampling Technique) is a popular method used to address class imbalance in classification problems. Instead of simply duplicating existing minority class samples, SMOTE generates new synthetic examples by interpolating between actual minority class instances. This helps the model learn more generalizable patterns and reduces bias toward the majority class.
In this project, the original dataset was imbalanced, with the "no" class (non-subscribers) significantly outnumbering the "yes" class (subscribers). To counter this, SMOTE was applied only to the training set, creating a balanced distribution of both classes. This ensured that the Multinomial Naïve Bayes model had equal exposure to examples from both classes during training, helping it better detect potential subscribers.
Although SMOTE improved the model's recall for the "yes" class, it also introduced more false positives, lowering precision. This trade-off was addressed further through threshold tuning, which aimed to find the optimal balance between precision and recall.
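In practice SMOTE is typically applied via `imblearn.over_sampling.SMOTE`; the sketch below instead implements the core interpolation idea directly in NumPy to make it visible. Each synthetic point lies on the segment between a minority sample and one of its k nearest minority-class neighbours; the data is an invented stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))              # minority-class samples (stand-in data)

def smote_like(X, n_new, k=5):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synthetic)

X_syn = smote_like(X_min, n_new=30)
```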
=== DEFAULT THRESHOLD RESULTS ===
Accuracy: 0.7030
Classification Report:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| no | 0.94 | 0.71 | 0.81 | 7952 |
| yes | 0.24 | 0.66 | 0.35 | 1091 |
| accuracy | | | 0.70 | 9043 |
| macro avg | 0.59 | 0.68 | 0.58 | 9043 |
| weighted avg | 0.85 | 0.70 | 0.75 | 9043 |
These results reflect the Multinomial Naïve Bayes model retrained on a training set balanced with SMOTE.
In binary classification, models typically use a default probability threshold of 0.5 to assign class labels—predicting a sample as positive ("yes") if the predicted probability is greater than or equal to 0.5. However, this default threshold may not yield the best results, especially in imbalanced datasets where one class is significantly underrepresented.
To improve the model’s performance on the minority class ("yes"), threshold tuning was performed. Instead of relying on the default threshold, the model’s predicted probabilities for the "yes" class were evaluated across a range of thresholds (from 0.0 to 1.0 in small increments).
The plot above illustrates how precision, recall, and F1-score for the "yes" class change across different classification thresholds. The goal was to identify the threshold that best balances precision and recall, as measured by the F1-score.
This fine-tuning enables the model to make more effective marketing predictions—improving its ability to identify potential subscribers while minimizing false alarms.
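The threshold sweep described above can be sketched as follows, again with randomly generated stand-in data. The essential pattern is scoring `predict_proba` outputs against a grid of candidate thresholds and keeping the threshold with the best F1 for the "yes" class:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=(600, 8))          # stand-in features
y = (rng.random(600) < 0.15).astype(int)       # imbalanced stand-in target (~15% "yes")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
clf = MultinomialNB().fit(X_tr, y_tr)
proba_yes = clf.predict_proba(X_te)[:, 1]      # P(yes) for each test sample

# Sweep candidate thresholds and keep the one with the best F1 for "yes".
thresholds = np.arange(0.05, 0.95, 0.01)
f1s = [f1_score(y_te, (proba_yes >= t).astype(int), zero_division=0) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
```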
After applying threshold tuning (best threshold ≈ 0.735), the Multinomial Naïve Bayes model demonstrated improved balance between precision and recall, particularly for the minority "yes" class.
• Recall for the "yes" class improved (from 0.31 at the original default threshold to 0.46).
• The F1-score for the "yes" class also rose (from 0.35 to 0.43), indicating more balanced performance.
=== FINAL MODEL EVALUATION WITH TUNED THRESHOLD ===
Accuracy: 0.8543
Classification Report:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| no | 0.92 | 0.91 | 0.92 | 7952 |
| yes | 0.41 | 0.46 | 0.43 | 1091 |
| accuracy | | | 0.85 | 9043 |
| macro avg | 0.67 | 0.69 | 0.67 | 9043 |
| weighted avg | 0.86 | 0.85 | 0.86 | 9043 |
Confusion Matrix:

| | Predicted "no" | Predicted "yes" |
|---|---|---|
| Actual "no" | 7221 | 731 |
| Actual "yes" | 587 | 504 |
The analysis identified the top 10 features most strongly associated with a customer subscribing to a term deposit ("yes"). These were ranked based on their influence in the model, with higher values indicating stronger positive association with the target class.
=== FEATURE EXPLORATION ===

Top 10 features most indicative of "yes":

| Feature | Influence score |
|---|---|
| poutcome_success | 2.6608 |
| month_mar | 2.3094 |
| month_sep | 2.1556 |
| month_dec | 2.1045 |
| month_oct | 1.9884 |
| job_student | 1.2429 |
| month_apr | 0.8889 |
| job_retired | 0.8885 |
| month_feb | 0.6030 |
| job_unemployed | 0.5270 |
• A successful outcome in a previous campaign (poutcome_success) is the strongest predictor of future subscriptions.
• Contacts made in certain months (March, September, December, and October) align with higher engagement, highlighting seasonal trends in customer responsiveness.
• Students and retired individuals are more likely to subscribe, indicating that targeted campaigns for these groups could enhance marketing effectiveness.
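A ranking like the one reported above can be derived from a fitted model's per-class log-probabilities (`feature_log_prob_`). The sketch below uses hypothetical feature names and random stand-in data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(3)
feature_names = [f"feat_{i}" for i in range(6)]   # hypothetical names
X = rng.integers(0, 3, size=(200, 6))
y = rng.integers(0, 2, size=200)

clf = MultinomialNB().fit(X, y)

# Log-probability gap between the "yes" class (1) and the "no" class (0);
# larger values mean the feature is relatively more likely under "yes".
influence = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
ranking = sorted(zip(feature_names, influence), key=lambda p: p[1], reverse=True)
```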
This study applied multiple variants of the Naïve Bayes algorithm to predict whether a customer is likely to subscribe to a term deposit based on their personal, financial, and interaction history with the bank. Despite the simplicity of Naïve Bayes models, they proved to be surprisingly effective for this marketing prediction task—particularly when paired with thoughtful preprocessing, class balancing, and threshold tuning.
One of the most valuable takeaways is that even basic models can yield meaningful insights when properly tuned. By addressing the imbalance between subscribers and non-subscribers using techniques like SMOTE, and by fine-tuning decision thresholds, we were able to significantly improve the model's ability to detect potential subscribers—without sacrificing overall accuracy.
From a business standpoint, the findings revealed that past campaign success, certain contact months (like March, September, and December), and customer demographics such as students and retirees are strong indicators of future subscription interest. This knowledge can directly inform how and when to engage different segments of customers, helping the bank personalize its outreach strategy for better results.
While Naïve Bayes may not always outperform more complex models, its transparency, efficiency, and interpretability make it a reliable starting point for predicting customer behavior in marketing campaigns. The models built here not only help identify which customers are most likely to respond positively—but also offer clear, actionable insights that can drive smarter and more strategic marketing decisions.