Logistic Regression with R and Python

Ruma Sinha
10 min read · Dec 7, 2020

The objective of this article is to explore and understand logistic regression. We will work with two datasets: one in Python and the other in R.

The datasets: (a) The NBA Rookie stats dataset from data.world: https://data.world/exercises/logistic-regression-exercise-1

The objective is to predict, based on the NBA rookie stats, which players will have a career longevity of five years or more.

(b) The other dataset is from UCI http://archive.ics.uci.edu/ml/machine-learning-databases/adult/

The objective is to predict whether the income is > 50000 or <= 50000 based on various feature variables.

We will explore all the steps involved in a typical machine learning task: reading the data => exploratory data analysis => scaling / converting categorical features to numeric => model training => testing => feature selection/importance => grid search.

A brief overview of Logistic Regression

Logistic regression is a supervised classification technique that predicts the probability that an observation belongs to a certain class. It works very well on linearly separable classes and is mainly used for binary classification. With multinomial logistic regression, we can train a model where the response variable has more than two outcomes. Here, we work on binary classification problems.

A few concepts around logistic regression as a probabilistic model:

The odds ratio: the odds in favor of a particular event, written as p/(1-p) where p is the probability of the positive event. By positive event we mean, for example, in the two scenarios above: if the income is >50000, it is a positive event and the class label is “1”; if the career longevity of an NBA player is ≥5 years, it is a positive event and the class label is “1”.

Let's say a player has won 5 matches and lost 2. The odds of winning are 5/2, while the probability of winning is 5/(5+2).
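As a quick illustration, here is a minimal Python sketch of the odds/probability relationship using the toy numbers above:

# Toy example: 5 wins, 2 losses
wins, losses = 5, 2
p = wins / (wins + losses)   # probability of winning = 5/7 ≈ 0.714
odds = p / (1 - p)           # odds in favor = 5/2 = 2.5
print(f"probability = {p:.3f}, odds = {odds:.3f}")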

Next is the logit function, or the log of the odds ratio: logit(p) = log(p/(1-p)). Based on a threshold value, the predicted probability is mapped to a class label. For example, with a threshold t of 0.5: if p >= 0.5, the predicted label is 1 (positive outcome); if p < 0.5, the predicted label is 0 (negative outcome).

P(y=1|x) is the conditional probability that a particular sample belongs to class 1 given its features x.

The inverse of the logit function is the sigmoid function, which gives the probability that a certain sample belongs to a particular class. The sigmoid function always outputs values between 0 and 1. Training a logistic regression model is about finding the best-fitting sigmoid function for the given data points.
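For illustration (this snippet is not part of the original analysis), the sigmoid and its relation to the logit can be sketched in a few lines of Python:

import numpy as np

def sigmoid(z):
    # Inverse of the logit: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Log-odds: maps a probability in (0, 1) onto the real line
    return np.log(p / (1 - p))

z = np.array([-4.0, 0.0, 4.0])
p = sigmoid(z)
print(p)          # [0.018 0.5   0.982] -> always between 0 and 1
print(logit(p))   # recovers [-4.  0.  4.]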

The predictor variables need to be independent of each other to avoid multicollinearity.

The cost function is the log loss, or binary cross-entropy. A gradient descent optimizer minimizes this cost function and learns the optimal weights, thereby finding the best decision boundary that correctly separates the data points of the two classes.
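For intuition, here is a minimal sketch of the binary cross-entropy that gradient descent minimizes; the labels and predicted probabilities below are made up purely for illustration:

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    # Average log loss over all samples; clip to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.4])       # predicted P(y=1|x)
print(binary_cross_entropy(y_true, y_prob))   # lower is better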

A logistic model builds on a generalized linear function of the form y = b0 + b1*x1 + b2*x2. The range of y in the linear model is -infinity to infinity, but in logistic regression the response can only be {0,1}, so we model the log-odds instead.

The odds p/(1-p) approach 0 as p approaches 0, and approach infinity as p approaches 1.

log(p/(1-p)) = b0 + b1*x1 + b2*x2
p is the probability of the event happening.
x1, x2 are the independent variables that determine the occurrence of the event.

The sigmoid curve

For the first ML problem, the objective is to predict the 5-Year Career Longevity for NBA Rookies
the target y = 0 if career years played < 5
the target y = 1 if career years played >= 5

We read the data into a pandas dataframe, check the rows and columns, and check for null rows. Since very few rows have null values in the 3P% field, we drop those records. That leaves 1329 samples to work with and 21 features, including the response variable.

The first few rows of data:

Renamed all the columns as:

The Name column holds the players' names, so we will not include this field in the analysis. TargetFiveYears is the response variable, taking the value 1 or 0. How many rows belong to each class? 826 rows belong to class 1 and 503 rows belong to class 0.
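A minimal sketch of these loading and inspection steps (the CSV filename is an assumption, and TargetFiveYears is the renamed response column referred to above):

import pandas as pd

# Filename is an assumption; use the NBA rookie stats CSV downloaded from data.world
nba_data_df = pd.read_csv("nba_rookies.csv")

print(nba_data_df.shape)
print(nba_data_df.isnull().sum())    # only the 3P% column has missing values

nba_data_df = nba_data_df.dropna()   # drop the few rows with missing 3P%

# Class balance of the response variable (after renaming the columns)
print(nba_data_df["TargetFiveYears"].value_counts())   # 826 ones vs 503 zeros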

Exploring the data points further with boxplots and scatterplots, it is clear that the more years in a career, the higher GamesPlayed, MinutesPlayed, PointsPerGame, and Turnovers tend to be. The scatterplots show strong correlations between GamesPlayed and MinutesPlayed, GamesPlayed and PointsPerGame, MinutesPlayed and PointsPerGame, and OffensiveRebounds and DefensiveRebounds.

All the feature variables are numeric, so we check for multicollinearity.

Pairwise collinearity where correlation coefficient value > 0.8:
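A sketch of that multicollinearity check, flagging feature pairs whose absolute correlation exceeds 0.8 (column names as renamed above):

import numpy as np

# Pairwise correlations among the numeric features (excluding Name and the target)
corr = nba_data_df.drop(columns=["Name", "TargetFiveYears"]).corr()

# Report every feature pair with |correlation| > 0.8 (upper triangle only, once per pair)
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
for i, j in zip(*np.where(np.triu(high.values, k=1))):
    print(corr.index[i], corr.columns[j], round(corr.iloc[i, j], 2))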

We will carry out the feature selection process and model training side by side.

Training the model with all the variables as is, we get an accuracy of 0.70. For class 0, 72 samples are correctly classified, and for class 1, 206.
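A sketch of this baseline training step; the train/test split parameters are assumptions (the article does not state them), so the exact counts of 72 and 206 depend on the split used:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X = nba_data_df.drop(columns=["Name", "TargetFiveYears"])
y = nba_data_df["TargetFiveYears"]

# Split parameters below are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

modellr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = modellr.predict(X_test)

print(accuracy_score(y_test, y_pred))    # ~0.70 reported in the article
print(confusion_matrix(y_test, y_pred))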

For feature selection, we use scikit-learn's SelectFromModel class. We get the top six features, shown below.

['PointsPerGame', 'FieldGoalAttempts', 'FreeThrowAttempt',
'OffensiveRebounds', 'Blocks', 'Turnovers']
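A sketch of this step with scikit-learn's SelectFromModel; max_features=6 and threshold=-np.inf are assumptions used here to force exactly a top-six selection by absolute coefficient size:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Keep the six features with the largest absolute coefficients
selector = SelectFromModel(LogisticRegression(max_iter=1000),
                           threshold=-np.inf, max_features=6)
selector.fit(X_train, y_train)

selected_features = X_train.columns[selector.get_support()]
print(list(selected_features))   # e.g. the six features listed above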

Training the model with only these features, we get an accuracy of 0.69. For class 1, 204 samples are correctly classified, down from 206. For class 0, the count remains 72.

The next feature selection approach uses L1 regularization via LassoCV. Lasso shrinks the coefficients of less important features to zero, effectively removing some features altogether, which makes it useful for feature selection. The important features are OffensiveRebounds → GamesPlayed → FieldGoalPercent → Assists → PointsPerGame → FreeThrowPercent → ThreePointPercent.
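A sketch of the LassoCV-based selection; standardizing the features first is an assumption on my part, since L1 shrinkage is sensitive to feature scale:

import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Put all features on the same scale before applying the L1 penalty
X_scaled = StandardScaler().fit_transform(X_train)

lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y_train)

# Features with non-zero coefficients survive the L1 shrinkage
coefs = pd.Series(np.abs(lasso.coef_), index=X_train.columns)
print(coefs[coefs > 0].sort_values(ascending=False))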

Keeping only these features, we retrain the model. We see some improvement: 81 samples are correctly classified for class 0 and 207 for class 1. The accuracy is 0.72.

Next, we try k-fold cross-validation with k = 5, which gives an accuracy of 70.7%.

# five-fold cross validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

seed = 42   # example value; the original seed is not shown
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)   # random_state takes effect only with shuffle=True
modellr = LogisticRegression()
results = cross_val_score(modellr, X, y, cv=kfold)
print(results)
print("Accuracy: ", results.mean() * 100)

[0.65789474 0.73684211 0.7481203  0.69548872 0.69811321]
Accuracy: 70.72918144417649

With randomized grid search, the accuracy achieved is 71.13%.
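A sketch of that randomized search step; the hyperparameter space below is an assumption and not necessarily the grid behind the 71.13% figure:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical search space over regularization strength and penalty type
param_dist = {
    "C": np.logspace(-3, 3, 20),
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],   # liblinear supports both l1 and l2
}

search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)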

With this final model, 82 class 0 observations and 207 class 1 observations are correctly classified.

In a logistic regression model, each observation is given an explicit probability of belonging to every class. Let's look at the first row in the test dataset: there is a 33% probability of being in the negative class (0) and a 67% probability of being in the positive class (1). For this particular row, we can fetch the full record from the original nba_data_df and see that the player's name is Corie Blount and the TargetFiveYears column has the value 1. The model predicted this row correctly, since the probability of belonging to class 1 is 67%.

By default, scikit-learn predicts that an observation belongs to the positive class if the probability is greater than 0.5 (called the threshold).
Hence, depending on the trade-off between the true positive rate and the false positive rate that the business objective requires, we can change the threshold accordingly.
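A sketch of reading the per-class probabilities and applying a custom threshold; the 0.6 cut-off is only an example:

import numpy as np

# Column 0 is P(class 0), column 1 is P(class 1) for each test observation
proba = modellr.predict_proba(X_test)
print(proba[0])   # e.g. [0.33, 0.67] for the first test row

# Default behaviour: predict class 1 when P(class 1) > 0.5
default_pred = (proba[:, 1] > 0.5).astype(int)

# Custom threshold (0.6 is only an example) to trade TPR against FPR
custom_pred = (proba[:, 1] > 0.6).astype(int)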

The code is available at https://github.com/rumsinha/LogisticRegression

The next ML problem is to predict whether the income will be > 50000 or <= 50000 based on the available feature variables.

We will be working with R and the basic steps remain the same.

As a first step, once we have framed the ML problem, we read the data into R. The dataset contains 32561 observations and 15 variables, of which 14 are predictors and one is the target, or response. The dataset has no headers, so we will create the column headers.

The features and target variables are:

age…the age of an individual
workClass…the employment status of an individual
fnlwgt…the number of people the census believes the entry represents
education…the highest level of education achieved
education_num…the highest level of education achieved, in numeric form
marital_status…the marital status
occupation…the generic type of occupation
relationshipStatus…how the individual is related
race…the race
gender…the gender
capital_gain…the capital gain of an individual
capital_loss…the capital loss of an individual
hours_per_week…the number of hours per week an individual has reported for work
native_country…country of origin
salaryRange…the response or target variable

In the first problem, all the predictor variables were numeric, but in this case they are a mix of numeric and character variables.

We will not include the fnlwgt and education_num columns. The five-number summary of the other numeric columns is:

We can clearly observe different scales among the four variables. For age, the minimum is 17 and the maximum is 90. For hours_per_week, the minimum is 1 and the maximum is 99. capital_gain and capital_loss are mostly 0.

Binning the age variable into equal-width bins of 20, the 40–60 age group has more individuals with income > 50000.

Similarly, binning hours_per_week into equal-width bins of 20, the 20–40 and 40–60 bins have more individuals in the income group above 50,000.

Analyzing the factor variables.

(a) workclass

combining “Federal-gov”,”Local-gov”,”State-gov” into “gov”
combining “Never-worked”, “Without-pay” into “no_work_no_pay”
combining “Self-emp-inc”, “Self-emp-not-inc” into “Self-emp-inc_and_not-inc”
“?” to “unknown”

(b) education

combining “11th”, “9th”, “7th-8th”, “5th-6th”, “10th”, “1st-4th”, “12th”, “Preschool”, “HS-grad” as “schoolDropouts”
combining “Assoc-acdm”,”Assoc-voc” as “assoc”

(c) marital_status

combining “Divorced”, “Widowed”, “Separated” as “Separated”
combining “Married-civ-spouse”,”Married-AF-spouse”,”Married-spouse-absent” as “Married”

(d) relationshipStatus

combining “Not-in-family”, “Unmarried”, “Other-relative” as “Other relative”

Exploring with different plots: by gender, more data points represent males; by race, more data points represent whites. Among the married, more individuals have income > 50000.

Converting the response variable into a factor with values 1 or 0.

There are 24720 individuals with income <= 50000 and 7841 individuals with income > 50000.

AdultData_df$salaryRange <- str_trim(AdultData_df$salaryRange)
AdultData_df$salaryRange <- as.factor(ifelse(AdultData_df$salaryRange == ">50K", 1, 0))

We will create dummy variables from the factor variables using the caret package's dummyVars function. Next, we split the dataset into training and test sets.

The variable importance plot:

The model summary

We will remove the race and workclass features, since they do not look significant, and then train again.

Predictions on the test data as:

predictions <- predict.train(object = modellr, testData_df[, predictors], type = "raw")

The accuracy is 0.85.

Precision = the proportion of positive identifications that are actually correct = TP/(TP+FP) = 1385/(1385+479) = 0.74

Recall (sensitivity) = the proportion of actual positives identified correctly = TP/(TP+FN) = 1385/(1385+967) = 0.588

The default threshold is 0.5; based on the business objective, we can increase or decrease it to trade off the true positive rate against the false positive rate.

When we tested the model on the adult.test dataset, we got an accuracy of 0.84.

This is an imbalanced dataset; in a future article we will work on it with imbalanced-data techniques such as upsampling, downsampling, or SMOTE.

The code is available at https://github.com/rumsinha/LogisticRegression

Conclusion

In this article, we explored logistic regression with two datasets, NBA rookie stats and Census Income. Both have a binary target variable with values 0 or 1.

Checked for null values.

Checked for multicollinearity.

Performed EDA with various plots.

Performed feature selection.

Observed how some variables have more influence on the target than others, as shown by the coefficients and p-values in the model summary.

Also performed hyperparameter tuning with grid search and randomized search using k-fold cross-validation.
