Logistic Regression in Python using Pandas and Seaborn (For Beginners in ML)

Surmayi
Published in Analytics Vidhya
4 min read · Oct 31, 2020


Data Set and Problem Statement

We will be working with an advertising data set that indicates whether or not a particular internet user clicked on an advertisement. We will try to build a model that predicts whether a user will click on an ad based on that user's features. This data set contains the following features:

  • ‘Daily Time Spent on Site’: consumer time on site in minutes
  • ‘Age’: customer age in years
  • ‘Area Income’: Avg. Income of geographical area of consumer
  • ‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
  • ‘Ad Topic Line’: Headline of the advertisement
  • ‘City’: City of consumer
  • ‘Male’: Whether or not consumer was male
  • ‘Country’: Country of consumer
  • ‘Timestamp’: Time at which consumer clicked on Ad or closed window
  • ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad

Let's go step by step through analyzing, visualizing, and modeling the data with a logistic regression fit in Python.

# First, let's import all the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Read the data into a data frame

ad_data = pd.read_csv('advertising.csv')

# Check out the data using the head, info, and describe methods provided by pandas

ad_data.head()
ad_data.info()
ad_data.describe()
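
Beyond head(), info(), and describe(), two quick checks are often worth running before modeling: counting missing values and seeing how balanced the target is. The sketch below uses a tiny made-up frame named toy (the values are invented for illustration and are not taken from advertising.csv) that borrows two of the column names listed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ad_data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [25, 41, np.nan, 33, 52, 29],
    "Clicked on Ad": [0, 1, 1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Share of each class in the target column
class_share = toy["Clicked on Ad"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two calls on ad_data would show whether any cleaning is needed and whether accuracy is a fair metric for the classes.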

Exploratory Data Analysis

Exploratory analysis helps us understand the data, how individual features behave, and any dependencies between them.

There are many plots we could draw, but here we stick to a few common ones that reveal the key features.

Let's explore the data using the Python visualization libraries Matplotlib and Seaborn.

# Plot a histogram of age

plt.hist(ad_data['Age'], bins=40)

# Creating a jointplot showing Area Income versus Age.

sns.jointplot(x='Age', y='Area Income', data=ad_data)

#Creating a jointplot showing the kde distributions of Daily Time spent on site vs. Age

sns.jointplot(x='Age', y='Daily Time Spent on Site', data=ad_data, kind='kde', color='red')

# Creating a jointplot of ‘Daily Time Spent on Site’ vs. ‘Daily Internet Usage’

sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=ad_data, color='green')

# Finally, creating a pairplot with the hue defined by the ‘Clicked on Ad’ column feature to analyze the relationship between each and every variable

sns.pairplot(ad_data, hue='Clicked on Ad')

# We can see that the blue and orange data points are clearly separated, which is a good sign that the two classes can be told apart.
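
The separation seen in the pairplot can also be checked numerically by comparing feature means within each class. Here is a minimal sketch on synthetic data; the frame toy and the distributions it is drawn from are assumptions chosen for illustration, not properties of the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: class 1 is drawn with a lower time on site
# (an assumption made purely for illustration).
clicked = np.repeat([0, 1], n // 2)
toy = pd.DataFrame({
    "Daily Time Spent on Site": np.where(clicked == 1,
                                         rng.normal(55, 8, n),
                                         rng.normal(75, 8, n)),
    "Age": rng.normal(36, 9, n),
    "Clicked on Ad": clicked,
})

# Mean of every feature within each class; a large gap between the
# two rows suggests the feature is a useful predictor.
class_means = toy.groupby("Clicked on Ad").mean()
print(class_means)
```

Running ad_data.groupby('Clicked on Ad').mean(numeric_only=True) on the real frame would give the equivalent table for the numeric advertising features.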

Logistic Regression

Split Data into Training and Test set

from sklearn.model_selection import train_test_split

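The variable X contains the explanatory columns that we will use to train the model to predict whether the ad is clicked. We drop the text columns and the timestamp, since the model needs numeric inputs, along with the target column itself, which goes into y.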

X = ad_data.drop(['Ad Topic Line', 'City', 'Timestamp', 'Clicked on Ad', 'Country'], axis=1)
y = ad_data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
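
To see what the split produces, the following sketch runs train_test_split on a synthetic 1,000-row frame (a stand-in with made-up values, since advertising.csv is not bundled here) and inspects the resulting sizes. The stratify=y argument is an optional addition that the article's code does not use; it keeps the class proportions the same in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix and target (values are made up).
X = pd.DataFrame({
    "Age": rng.integers(19, 62, size=1000),
    "Male": rng.integers(0, 2, size=1000),
})
y = pd.Series(np.repeat([0, 1], 500), name="Clicked on Ad")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # rows and columns in each subset
print(y_test.mean())                # share of class 1 in the test set
```

Fixing random_state makes the split reproducible, so the evaluation numbers do not change from run to run.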

Training the model

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

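Calling fit() returns the fitted estimator, whose printed representation shows the parameter values in use. These are the defaults, since we passed no arguments to LogisticRegression(); the exact output depends on the scikit-learn version: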

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
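
Under the hood, a fitted binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function, p = 1 / (1 + exp(-(w·x + b))). The sketch below checks this against scikit-learn's predict_proba on synthetic data generated with make_classification, which is an assumption standing in for the advertising features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for the advertising features).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Linear score w.x + b for every row, built from the learned parameters
scores = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps each score to the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = model.predict_proba(X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

predict() then labels a row as class 1 when this probability exceeds 0.5, which is the same as the linear score being positive.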

Predictions and Evaluation of our model

Let's now predict values for the test data:

prediction = logreg.predict(X_test)

# We create a classification report for the logistic regression model, which compares the actual and predicted values

from sklearn.metrics import classification_report

print(classification_report(y_test,prediction))

             precision    recall  f1-score   support

          0       0.87      0.96      0.91       162
          1       0.96      0.86      0.91       168

avg / total       0.91      0.91      0.91       330

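The averaged precision, recall, and F1-score are all about 0.91, and the per-class recall values imply that roughly 91% of the 330 test rows were classified correctly.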
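
The 91% figure can be cross-checked from the report itself. Recall for a class is the fraction of that class predicted correctly, so recall times support approximates the number of correct predictions per class. The numbers below are copied from the report and are rounded to two decimals, so the result is approximate.

```python
# Recall and support per class, copied from the classification report
recall = {0: 0.96, 1: 0.86}
support = {0: 162, 1: 168}

# Approximate number of correctly classified test rows in each class
correct = {label: recall[label] * support[label] for label in recall}

# Accuracy = correct predictions / all test rows
accuracy = sum(correct.values()) / sum(support.values())
print(round(accuracy, 3))
```

With the arrays at hand, sklearn.metrics.accuracy_score(y_test, prediction) gives the exact value directly.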

All done! We have just completed a logistic regression in Python using scikit-learn.
