Logistic Regression — General Overview

Bardh Rushiti
5 min read · Jan 21, 2022

In the early twentieth century, logistic regression was used mainly in the social and biological sciences. It is used as a classifier, to distinguish whether a certain observation is true or false. Logistic regression is used when the dependent (target) variable is categorical.

In other words, in the case of:

  • Grades — distinguish whether a student will pass (1) v. not pass (0)
  • Email — distinguish whether an email is spam (1) v. not spam (0)
  • Cancer — distinguish whether a newly discovered tumor is malignant (1) v. benign (0)

As you can see, this method is mainly used to distinguish between two classes.

Simple Logistic Regression

Simple logistic regression distinguishes between two classes: 0 and 1. It uses the output of a linear regression model as the input to the sigmoid function: if y goes to infinity, then p (the prediction) approaches 1, and if y goes to negative infinity, then p approaches 0.

Courtesy of https://www.saedsayad.com/logistic_regression.htm
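To make this concrete, here is a minimal sketch in Python (NumPy), where the parameters theta and the feature vector x are made-up values just for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number z into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and features (first entry is the bias term).
theta = np.array([-4.0, 8.0])   # intercept and weight
x = np.array([1.0, 0.7])        # 1 for the bias, 0.7 as the e-mail's spam score

y = theta @ x                   # linear regression part: y = theta^T x
p = sigmoid(y)                  # probability that the e-mail is spam
print(p)                        # ~0.83; the larger y gets, the closer p is to 1
```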

Can linear regression be used?

Short answer: yes. Long answer: let’s consider a scenario where email classification is the problem at hand. An algorithm is created where 0.1 is added to an e-mail’s score every time a spam word is used in it. Every e-mail is assigned a score and a class (spam v. not spam) and plotted as you see below:

The e-mails in green are spam (1) and the e-mails in red are not spam (0).

We fit a linear regression model to the data and it produces this output. We can set a threshold and still force it to work; however, this is not optimal. Since this problem has only two classes, how would we interpret the cases where the fitted line gives h(x) < 0 or h(x) > 1 (h(x) being the fitted linear regression line)?

Now consider the following case, where logistic regression is fitted to the data. First and foremost, every prediction falls within the range 0 ≤ h(x) ≤ 1. This makes more sense, right?

  • For all values where h(x) ≥ 0.5, y = 1 (meaning it’s a spam e-mail)
  • For all values where h(x) < 0.5, y = 0 (meaning it’s not a spam e-mail)

The 0.5 value you see there is called the decision boundary, which serves as a rigid boundary for distinguishing one class from the other. In the literature you will often see it referred to as a threshold as well.
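As a small illustration (again only a sketch, with made-up linear outputs), turning the probability into a class label is just a comparison against that boundary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(h):
    # Apply the 0.5 decision boundary: spam (1) if h >= 0.5, not spam (0) otherwise.
    return (h >= 0.5).astype(int)

scores = np.array([-2.0, -0.3, 0.1, 1.6, 3.0])   # hypothetical linear outputs y
probs = sigmoid(scores)                           # h(x) for each e-mail
print(probs.round(2))                             # [0.12 0.43 0.52 0.83 0.95]
print(predict(probs))                             # [0 0 1 1 1]
```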

Mathematically this can be written as:

h(x) = P(Y=1|x; θ), i.e., the probability that Y=1 given x, parameterized by θ.

P(Y=1|x; θ) + P(Y=0|x; θ) = 1 → This simply says that the probabilities that an e-mail is spam and that it is not spam sum to 100%.

P(Y=0|x; θ) = 1 - P(Y=1|x; θ)

Cost Function

This might look complicated, but it actually is not. Let’s walk through it…

We know that if our model predicts that an e-mail is spam (y=1) when it actually isn’t (y=0), we want the model’s error to be high, and likewise the other way around. Conversely, when the model predicts that an e-mail is spam and it actually is spam, we want the error to be low. There’s a mathematical function that does exactly that, and it is shown down below:
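That function is the per-example logistic (log) loss, which can be written as:

Cost(h(x), y) = -log(h(x)) if y = 1
Cost(h(x), y) = -log(1 - h(x)) if y = 0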

If we try to merge the above mathematical expressions into one, we get the formula down below.
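Averaged over m training examples, the merged formula (often called binary cross-entropy) reads:

J(θ) = -(1/m) Σ [ y·log(h(x)) + (1 - y)·log(1 - h(x)) ]

Since y is either 0 or 1, only one of the two terms inside the brackets is active for any given example.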

What does this mean in English? Let me tell you…

That mathematical expression covers both cases, when the actual label is y=0 and when it is y=1. When the ground truth (actual label) is y=0 (not spam) and the prediction is 0, we want the error to be zero, and when the prediction is 1, we want the error to be very high. That’s exactly what you observe in the first part of the graph. The opposite happens when the actual label is y=1.

The reason we do this is that, while we train our model, we want to maximize the probability the model assigns to the correct class, and we do this by minimizing the error (loss, cost — many names, but they essentially refer to the same thing) function.
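Here is a minimal sketch of that cost in Python (NumPy), with made-up labels and predictions just to show how confident wrong answers get punished:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy: -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).
    # eps keeps log() away from 0, which would blow up to infinity.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])          # ground-truth labels
good   = np.array([0.9, 0.1, 0.8, 0.2])  # confident, correct predictions
bad    = np.array([0.1, 0.9, 0.2, 0.8])  # confident, wrong predictions

print(log_loss(y_true, good))   # small loss (~0.16)
print(log_loss(y_true, bad))    # large loss (~1.96)
```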

Here’s something for you to play around with: https://www.desmos.com/calculator/klq5shgvd3

Multi-class Classification

Courtesy of Andrew Ng’s Machine Learning class

Yes, you can use what you learned above for multi-class classification. There’s a cool technique called One v. All classification (used when there are more than 2 classes), where you choose one class, lump all the other classes into a single second class, and apply binary classification.

Courtesy of Anton Haugen

Instead of y = {0, 1}, you now have y = {0, 1, 2, ..., n} — this way we divide the problem into n+1 binary classification problems (+1 because the index starts from 0), and for each classifier we maximize the probability by minimizing the loss function, just like before. For the final prediction, we return the class whose classifier outputs the highest probability, as in the sketch below.
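Here is a compact sketch of the One v. All idea (plain NumPy, gradient descent on the same log loss as before; the tiny dataset and the three classes are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.1, steps=2000):
    # One logistic regression classifier, trained with plain gradient descent.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)
        theta -= lr * X.T @ (h - y) / len(y)   # gradient of the cross-entropy loss
    return theta

def one_vs_all(X, y, n_classes):
    # Train one classifier per class: class k vs. "everything else".
    return np.array([fit_binary(X, (y == k).astype(float)) for k in range(n_classes)])

def predict(thetas, X):
    # Return the class whose classifier outputs the highest probability.
    return sigmoid(X @ thetas.T).argmax(axis=1)

# Tiny made-up dataset: a bias column plus one feature, three classes.
X = np.array([[1, 0.1], [1, 0.2], [1, 1.0], [1, 1.1], [1, 2.0], [1, 2.1]])
y = np.array([0, 0, 1, 1, 2, 2])

thetas = one_vs_all(X, y, n_classes=3)
print(predict(thetas, X))   # ideally recovers [0 0 1 1 2 2]
```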

Sum up

Hope this article helped you get a clear general idea of how logistic regression works.

I’d love to hear your ideas on what you’d like to read next — let me know down below in the comment section!

You can always connect with me via LinkedIn.
