Import Libraries¶

In [1]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN
import warnings
warnings.filterwarnings("ignore") 

Read the dataset¶

In [2]:
df = pd.read_csv("cleandata.csv")
In [3]:
df.head()
Out[3]:
Unnamed: 0 SeniorCitizen MonthlyCharges TotalCharges Churn gender_Male Partner_Yes Dependents_Yes PhoneService_Yes MultipleLines_No phone service ... PaperlessBilling_Yes PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check tenure_group_12 - 23 tenure_group_24 - 35 tenure_group_36 - 47 tenure_group_48 - 59 tenure_group_60 - 71 tenure_group_72 - 72
0 0 0 29.85 29.85 0 0 1 0 0 1 ... 1 0 1 0 0 0 0 0 0 0
1 1 0 56.95 1889.50 0 1 0 0 1 0 ... 0 0 0 1 0 1 0 0 0 0
2 2 0 53.85 108.15 1 1 0 0 1 0 ... 1 0 0 1 0 0 0 0 0 0
3 3 0 42.30 1840.75 0 1 0 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
4 4 0 70.70 151.65 1 0 0 0 1 0 ... 1 0 1 0 0 0 0 0 0 0

5 rows × 37 columns

Drop the 'Unnamed: 0' column (not required in the analysis)¶

In [4]:
df.drop('Unnamed: 0', axis=1, inplace=True)
In [5]:
df.head()
Out[5]:
SeniorCitizen MonthlyCharges TotalCharges Churn gender_Male Partner_Yes Dependents_Yes PhoneService_Yes MultipleLines_No phone service MultipleLines_Yes ... PaperlessBilling_Yes PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check tenure_group_12 - 23 tenure_group_24 - 35 tenure_group_36 - 47 tenure_group_48 - 59 tenure_group_60 - 71 tenure_group_72 - 72
0 0 29.85 29.85 0 0 1 0 0 1 0 ... 1 0 1 0 0 0 0 0 0 0
1 0 56.95 1889.50 0 1 0 0 1 0 0 ... 0 0 0 1 0 1 0 0 0 0
2 0 53.85 108.15 1 1 0 0 1 0 0 ... 1 0 0 1 0 0 0 0 0 0
3 0 42.30 1840.75 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
4 0 70.70 151.65 1 0 0 0 1 0 0 ... 1 0 1 0 0 0 0 0 0 0

5 rows × 36 columns

Separate features (x) and target (y)¶

In [6]:
x = df.drop('Churn', axis=1)
y = df['Churn']

Split the data into training and testing sets:¶

  • Split the data into 70% training and 30% testing sets
In [7]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)
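Note: since churned customers are a minority, an unstratified split can give the test set a slightly different churn ratio than the training set. A stratified split is an optional variation (a minimal sketch, not part of the original run; the fixed random_state is my own assumption for reproducibility):

In [ ]:
# Optional variation: stratified 70/30 split that preserves the churn ratio
# in both sets (illustrative only; the cell above is the split actually used)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42
)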

Decision Tree classifier¶

In [8]:
model = DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)

Train the Decision Tree classifier on the training data¶

In [9]:
model.fit(x_train,y_train)
Out[9]:
DecisionTreeClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

Make predictions on the test set:¶

In [10]:
predict = model.predict(x_test)
predict
Out[10]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

Print classification report¶

In [11]:
# Print accuracy score
accuracy = accuracy_score(y_test, predict)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:\n", classification_report(y_test, predict))

# Print confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, predict))
Accuracy: 0.7742546142924751
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.87      0.85      1564
           1       0.57      0.51      0.54       549

    accuracy                           0.77      2113
   macro avg       0.70      0.69      0.69      2113
weighted avg       0.77      0.77      0.77      2113

Confusion Matrix:
 [[1358  206]
 [ 271  278]]
Notice:
The accuracy is fairly low (about 77%).

However, since the dataset is imbalanced, accuracy alone does not give a complete picture. It is necessary to look at the minority class, i.e., churned customers, to understand the model's performance better.
Because churned customers are a minority in the dataset (imbalanced data), I need to balance the data so that both churned and non-churned customers are fairly represented.
But first, let's test some other classifiers.
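A quick way to see the imbalance before balancing it is to count the target classes (a minimal sketch, not run in the original notebook; the printed counts depend on the data):

In [ ]:
# Inspect the class balance of the target variable
print(y.value_counts())
print(y.value_counts(normalize=True).round(2))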

Test Different Classifiers¶

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

Random Forest¶

In [13]:
model_rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=6, min_samples_leaf=8)
model_rf.fit(x_train, y_train)
Out[13]:
RandomForestClassifier(max_depth=6, min_samples_leaf=8)

Gradient Boosting Machine¶

In [14]:
model_gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model_gbm.fit(x_train, y_train)
Out[14]:
GradientBoostingClassifier()

K-Nearest Neighbors¶

In [15]:
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(x_train, y_train)
Out[15]:
KNeighborsClassifier()

SVM¶

In [16]:
model_svm = SVC(kernel='rbf', C=1.0, probability=True)
model_svm.fit(x_train, y_train)
Out[16]:
SVC(probability=True)

Evaluate Classifiers:¶

In [17]:
models = [model_rf, model_gbm, model_knn, model_svm]
model_names = ['Random Forest', 'GBM',  'KNN', 'SVM']

for model, name in zip(models, model_names):
    y_pred = model.predict(x_test)
    accuracy = model.score(x_test, y_test)
    print(f"Classifier: {name}")
    print(f"Accuracy: {accuracy:.2f}")
    print(metrics.classification_report(y_test, y_pred))
    print("------------")
Classifier: Random Forest
Accuracy: 0.79
              precision    recall  f1-score   support

           0       0.81      0.93      0.87      1564
           1       0.68      0.40      0.50       549

    accuracy                           0.79      2113
   macro avg       0.75      0.66      0.68      2113
weighted avg       0.78      0.79      0.77      2113

------------
Classifier: GBM
Accuracy: 0.80
              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1564
           1       0.66      0.49      0.56       549

    accuracy                           0.80      2113
   macro avg       0.75      0.70      0.72      2113
weighted avg       0.79      0.80      0.79      2113

------------
Classifier: KNN
Accuracy: 0.77
              precision    recall  f1-score   support

           0       0.82      0.89      0.85      1564
           1       0.58      0.44      0.50       549

    accuracy                           0.77      2113
   macro avg       0.70      0.66      0.68      2113
weighted avg       0.76      0.77      0.76      2113

------------
Classifier: SVM
Accuracy: 0.74
              precision    recall  f1-score   support

           0       0.74      1.00      0.85      1564
           1       0.00      0.00      0.00       549

    accuracy                           0.74      2113
   macro avg       0.37      0.50      0.43      2113
weighted avg       0.55      0.74      0.63      2113

------------
Notice:
Among the classifiers, Random Forest, Gradient Boosting, and K-Nearest Neighbors (KNN) performed better. However, since the dataset is imbalanced, accuracy alone may not provide a complete picture. To address this, I use SMOTEENN, a combined resampling technique (SMOTE oversampling followed by Edited Nearest Neighbours cleaning), to balance the data and create a more representative training set.

By resampling the data, I aim to improve the models' performance and make better predictions for customer churn.
After resampling, I will re-evaluate the Decision Tree, Random Forest, Gradient Boosting, and KNN classifiers. This will allow me to choose the most effective model for customer churn prediction, helping to make more informed decisions on customer retention strategies.

Apply SMOTEENN¶

In [18]:
sm = SMOTEENN()
x_resampled, y_resampled = sm.fit_resample(x_train, y_train)
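To confirm that SMOTEENN produced a roughly balanced training set, the class counts can be compared before and after resampling (a minimal sketch, not part of the original run):

In [ ]:
# Compare class counts before and after SMOTEENN resampling
print("Before:", y_train.value_counts().to_dict())
print("After: ", pd.Series(y_resampled).value_counts().to_dict())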

Split the resampled data¶

In [19]:
xr_train,xr_test,yr_train,yr_test=train_test_split(x_resampled, y_resampled,test_size=0.3)

Decision Tree after resampling¶

In [20]:
model_smote = DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)
In [21]:
model_smote.fit(xr_train,yr_train)

yr_pred_smote = model_smote.predict(xr_test)

model_score_r = model_smote.score(xr_test, yr_test)

print(round(model_score_r, 2))
print(metrics.classification_report(yr_test, yr_pred_smote))
print(metrics.confusion_matrix(yr_test, yr_pred_smote))
0.9
              precision    recall  f1-score   support

           0       0.86      0.95      0.90       585
           1       0.95      0.85      0.90       632

    accuracy                           0.90      1217
   macro avg       0.90      0.90      0.90      1217
weighted avg       0.90      0.90      0.90      1217

[[554  31]
 [ 93 539]]
Report: With SMOTEENN resampling in place, the Decision Tree's results improve significantly. The accuracy reaches about 90%, and recall, precision, and F1 score for the minority (churn) class are now much stronger.

As we saw above, Random Forest performed better even without SMOTEENN, so let's see how it performs after resampling.

Random Forest after resampling data¶

In [22]:
model_rf_smote = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=6, min_samples_leaf=8)
In [23]:
model_rf_smote.fit(xr_train,yr_train)

yr_pred_smote = model_rf_smote.predict(xr_test)

model_score_r = model_rf_smote.score(xr_test, yr_test)

print(round(model_score_r, 2))
print(metrics.classification_report(yr_test, yr_pred_smote))
print(metrics.confusion_matrix(yr_test, yr_pred_smote))
0.94
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       585
           1       0.93      0.96      0.94       632

    accuracy                           0.94      1217
   macro avg       0.94      0.94      0.94      1217
weighted avg       0.94      0.94      0.94      1217

[[537  48]
 [ 25 607]]
Report: The Random Forest (RF) classifier performed better (94%) than the Decision Tree.

Gradient Boosting Classifier after resampling data¶

In [24]:
model_gbm_smote = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)
In [25]:
model_gbm_smote.fit(xr_train, yr_train)

yr_pred_smote = model_gbm_smote.predict(xr_test)

model_score_r = model_gbm_smote.score(xr_test, yr_test)

print(round(model_score_r, 2))
print(metrics.classification_report(yr_test, yr_pred_smote))
print(metrics.confusion_matrix(yr_test, yr_pred_smote))
0.96
              precision    recall  f1-score   support

           0       0.96      0.95      0.96       585
           1       0.96      0.96      0.96       632

    accuracy                           0.96      1217
   macro avg       0.96      0.96      0.96      1217
weighted avg       0.96      0.96      0.96      1217

[[558  27]
 [ 25 607]]
Report: The Gradient Boosting classifier has demonstrated superior performance (96%) compared to the Decision Tree and Random Forest models.

K-Nearest Neighbors after resampling¶

In [26]:
model_knn_smote = KNeighborsClassifier(n_neighbors=5)
In [27]:
model_knn_smote.fit(xr_train,yr_train)

yr_pred_smote = model_knn_smote.predict(xr_test)

model_score_r = model_knn_smote.score(xr_test, yr_test)

print(round(model_score_r, 2))
print(metrics.classification_report(yr_test, yr_pred_smote))
print(metrics.confusion_matrix(yr_test, yr_pred_smote))
0.95
              precision    recall  f1-score   support

           0       0.94      0.95      0.95       585
           1       0.96      0.94      0.95       632

    accuracy                           0.95      1217
   macro avg       0.95      0.95      0.95      1217
weighted avg       0.95      0.95      0.95      1217

[[558  27]
 [ 36 596]]
Report: K-Nearest Neighbors has also performed very well (95%), indicating its effectiveness in predicting customer churn.

Final result¶

Result:
K-Nearest Neighbors achieved an accuracy of 95% and demonstrated strong recall, precision, and F1 score for churned customers. The balanced dataset and the selected model provide more accurate predictions of customer churn.

Save the model¶

In [28]:
import pickle
In [29]:
knnmodel = 'knnchurnmodel.sav'
In [30]:
pickle.dump(model_knn_smote, open(knnmodel, 'wb'))
In [31]:
rfmodel = 'rfchurnmodel.sav'
In [32]:
pickle.dump(model_rf_smote, open(rfmodel, 'wb'))
In [33]:
gbm_model = 'gbmchurnmodel.sav'
In [34]:
pickle.dump(model_gbm_smote, open(gbm_model, 'wb'))

Checking the model¶

GBM model¶

In [35]:
load_model = pickle.load(open(gbm_model, 'rb'))
In [36]:
gbm_model_score = load_model.score(xr_test, yr_test)
In [37]:
gbm_model_score
Out[37]:
0.9572719802793755

Random forest model¶

In [38]:
load_model = pickle.load(open(rfmodel, 'rb'))
In [39]:
rf_model_score = load_model.score(xr_test, yr_test)
In [40]:
rf_model_score
Out[40]:
0.9400164338537387

KNN model¶

In [41]:
load_model = pickle.load(open(knnmodel, 'rb'))
In [42]:
knn_model_score = load_model.score(xr_test, yr_test)
In [43]:
knn_model_score
Out[43]:
0.9482333607230896

I have saved three models: "rfchurnmodel.sav" (Random Forest), "gbmchurnmodel.sav" (Gradient Boosting), and "knnchurnmodel.sav" (K-Nearest Neighbors).

Now, I will use knnchurnmodel.sav as my final model and create APIs for accessing it from the UI.

With this implementation, users can efficiently utilize the predictive power of the model through the user interface.
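As a rough illustration of how such an API could look, here is a minimal Flask sketch under my own assumptions: the endpoint name, payload format, and port are hypothetical, and the incoming JSON keys must match the feature columns used during training.

In [ ]:
# app.py -- illustrative Flask service around the saved KNN model (assumed setup)
import pickle
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('knnchurnmodel.sav', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object whose keys match the training feature columns
    features = pd.DataFrame([request.get_json()])
    prediction = int(model.predict(features)[0])
    return jsonify({'churn': prediction})

if __name__ == '__main__':
    app.run(port=5000)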