Logistic regression models the probability of a binary dependent variable \(y\) as a non-linear (sigmoid) function of the independent variable(s) \(x\), and is typically used for binary classification problems. This relationship is expressed as:
\(P(y=1|x) = \sigma(\beta_0 + \beta_1 x)\)
Here, \(\beta_0\) represents the intercept, \(\beta_1\) represents the slope, and \(\sigma\) represents the sigmoid function. To find these parameters, we use the Maximum Likelihood method.
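As a minimal sketch in plain Python (the function names here are illustrative, not from any particular library), this probability can be computed directly:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """P(y = 1 | x) under the simple logistic regression model."""
    return sigmoid(beta0 + beta1 * x)

# Example: with beta0 = -3.0 and beta1 = 1.0 (the values found in the worked
# example later in this section), an observation at x = 4 gets probability
# sigmoid(-3.0 + 1.0 * 4) = sigmoid(1) ≈ 0.73.
print(predict_proba(4, -3.0, 1.0))
```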
While linear regression uses the least squares method, logistic regression uses the Maximum Likelihood method. This method finds the parameters that maximize the probability of observing the given data, i.e. the likelihood:
\(L(\beta_0, \beta_1) = \prod_{i=1}^{n} P(y_i \mid x_i)\)
For each data point:
\(P(y_i = 1 \mid x_i) = p_i\) and \(P(y_i = 0 \mid x_i) = 1 - p_i\), where \(p_i = \sigma(\beta_0 + \beta_1 x_i)\).
This can be expressed with a single formula:
\(P(y_i \mid x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}\)
To simplify mathematical calculations, we take the logarithm of the likelihood function to use sums instead of products:
\(\ell(\beta_0, \beta_1) = \log L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]\)
Using the sigmoid function:
\(p_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}\)
Where \(z_i = \beta_0 + \beta_1 x_i\).
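As a sketch, the log-likelihood above can be evaluated with a few lines of Python (names are illustrative):

```python
import math

def log_likelihood(xs, ys, beta0, beta1):
    """ell(beta0, beta1) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta0 + beta1 * x              # z_i = beta0 + beta1 * x_i
        p = 1.0 / (1.0 + math.exp(-z))     # p_i = sigma(z_i)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# With all-zero parameters every p_i is 0.5, so each term contributes log(0.5).
# Using the small dataset from the worked example below:
print(log_likelihood([1, 2, 3, 4, 5], [0, 0, 0, 1, 1], 0.0, 0.0))  # 5*log(0.5) ≈ -3.47
```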
To maximize the log-likelihood function, we take partial derivatives with respect to \(\beta_0\) and \(\beta_1\) and set them equal to zero:
\(\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i - \sigma(z_i) \right]\), \(\quad \frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} x_i \left[ y_i - \sigma(z_i) \right]\)
At the maximum point, these derivatives must be zero:
\(\sum_{i=1}^{n} \left[ y_i - \sigma(z_i) \right] = 0, \quad \sum_{i=1}^{n} x_i \left[ y_i - \sigma(z_i) \right] = 0\)
Since these equations are non-linear, they don't have a closed-form solution. Therefore, numerical optimization methods are typically used.
We can use the Gradient Ascent method to solve these equations. This is an iterative algorithm that updates the parameter values at each step:
\(\beta_0^{(t+1)} = \beta_0^{(t)} + \alpha \frac{\partial \ell}{\partial \beta_0}, \quad \beta_1^{(t+1)} = \beta_1^{(t)} + \alpha \frac{\partial \ell}{\partial \beta_1}\)
Here \(\alpha\) represents the learning rate.
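A sketch of this procedure in plain Python (illustrative names; a fixed iteration count stands in for a proper convergence check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(xs, ys, alpha=0.1, n_iter=1000):
    """Move (beta0, beta1) in the direction of the log-likelihood gradient."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient components from the derivatives above:
        #   d ell / d beta0 = sum_i (y_i - sigma(z_i))
        #   d ell / d beta1 = sum_i x_i * (y_i - sigma(z_i))
        grad0 = sum(y - sigmoid(beta0 + beta1 * x) for x, y in zip(xs, ys))
        grad1 = sum(x * (y - sigmoid(beta0 + beta1 * x)) for x, y in zip(xs, ys))
        beta0 += alpha * grad0
        beta1 += alpha * grad1
    return beta0, beta1
```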
For faster convergence, the Newton-Raphson method can be used. This method also uses second-order derivatives (the Hessian matrix):
Second-order derivatives:
\(\frac{\partial^2 \ell}{\partial \beta_0^2} = -\sum_{i=1}^{n} \sigma(z_i)\left[1 - \sigma(z_i)\right]\), \(\quad \frac{\partial^2 \ell}{\partial \beta_0 \partial \beta_1} = -\sum_{i=1}^{n} x_i \, \sigma(z_i)\left[1 - \sigma(z_i)\right]\), \(\quad \frac{\partial^2 \ell}{\partial \beta_1^2} = -\sum_{i=1}^{n} x_i^2 \, \sigma(z_i)\left[1 - \sigma(z_i)\right]\)
Newton-Raphson update:
\(\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}^{(t+1)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}^{(t)} - \mathbf{H}^{-1} \nabla \ell\)
where \(\nabla \ell\) is the gradient vector of first derivatives and \(\mathbf{H}\) is the Hessian matrix of the second derivatives above.
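A sketch of the Newton-Raphson iteration, assuming NumPy is available (the design-matrix formulation is an implementation choice, not something prescribed by the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson(x, y, n_iter=10):
    """Newton-Raphson for simple logistic regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])     # rows are [1, x_i]
    beta = np.zeros(2)                            # [beta0, beta1]
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                     # sigma(z_i) for every i
        grad = X.T @ (y - p)                      # gradient of the log-likelihood
        W = p * (1.0 - p)                         # sigma(z_i) * (1 - sigma(z_i))
        H = -(X.T * W) @ X                        # Hessian of second derivatives
        beta = beta - np.linalg.solve(H, grad)    # beta(t+1) = beta(t) - H^{-1} * grad
    return beta
```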
When there are multiple independent variables, the model is extended as follows:
\(P(y=1|\mathbf{x}) = \sigma(\beta_0 + \boldsymbol{\beta}^T \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^T \mathbf{x})}}\)
Here \(\textbf{x} = [x_1, x_2, \ldots, x_p]^T\) and \(\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_p]^T\) are vectors.
The log-likelihood function and its derivatives can be extended similarly.
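In vector form, the log-likelihood and gradient look like this (a sketch assuming NumPy; X is a design matrix whose first column is all ones so that \(\beta_0\) is handled together with the other coefficients):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_and_gradient(X, y, beta):
    """Log-likelihood and its gradient for the multivariate model.

    X    : (n, p+1) design matrix with a leading column of ones
    y    : length-n vector of 0/1 labels
    beta : length-(p+1) parameter vector [beta0, beta1, ..., betap]
    """
    probs = sigmoid(X @ beta)                  # sigma(beta0 + beta^T x_i) for each row
    ll = np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
    grad = X.T @ (y - probs)                   # generalizes sum_i x_i (y_i - sigma(z_i))
    return ll, grad
```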
Let's perform a step-by-step calculation with a small dataset: \(x = (1, 2, 3, 4, 5)\) with labels \(y = (0, 0, 0, 1, 1)\). We start from the initial values
\(\beta_0^{(0)} = 0\), \(\beta_1^{(0)} = 0\)
and use a learning rate of \(\alpha = 0.1\).
For each \(x_i\), calculate \(z_i = \beta_0 + \beta_1 x_i\) and \(P(y_i=1|x_i) = \sigma(z_i)\):
\(z_1 = 0 + 0 \cdot 1 = 0\), \(P(y_1=1|x_1) = \sigma(0) = 0.5\)
\(z_2 = 0 + 0 \cdot 2 = 0\), \(P(y_2=1|x_2) = \sigma(0) = 0.5\)
\(z_3 = 0 + 0 \cdot 3 = 0\), \(P(y_3=1|x_3) = \sigma(0) = 0.5\)
\(z_4 = 0 + 0 \cdot 4 = 0\), \(P(y_4=1|x_4) = \sigma(0) = 0.5\)
\(z_5 = 0 + 0 \cdot 5 = 0\), \(P(y_5=1|x_5) = \sigma(0) = 0.5\)
\(\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{5} [y_i - \sigma(z_i)] = (0-0.5) + (0-0.5) + (0-0.5) + (1-0.5) + (1-0.5) = -0.5\)
\(\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{5} [x_i(y_i - \sigma(z_i))] = 1(0-0.5) + 2(0-0.5) + 3(0-0.5) + 4(1-0.5) + 5(1-0.5) = -0.5 - 1 - 1.5 + 2 + 2.5 = 1.5\)
\(\beta_0^{(1)} = \beta_0^{(0)} + 0.1 \cdot \frac{\partial \ell}{\partial \beta_0} = 0 + 0.1 \cdot (-0.5) = -0.05\)
\(\beta_1^{(1)} = \beta_1^{(0)} + 0.1 \cdot \frac{\partial \ell}{\partial \beta_1} = 0 + 0.1 \cdot 1.5 = 0.15\)
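A few lines of Python reproduce this first iteration (the printed values match the hand calculation above):

```python
import math

xs = [1, 2, 3, 4, 5]          # data from the worked example
ys = [0, 0, 0, 1, 1]
beta0, beta1 = 0.0, 0.0       # initial values
alpha = 0.1                   # learning rate

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

grad0 = sum(y - sigmoid(beta0 + beta1 * x) for x, y in zip(xs, ys))
grad1 = sum(x * (y - sigmoid(beta0 + beta1 * x)) for x, y in zip(xs, ys))
print(grad0, grad1)                                   # -0.5 and 1.5
print(beta0 + alpha * grad0, beta1 + alpha * grad1)   # approx. -0.05 and 0.15
```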
This process is repeated until convergence is achieved.
After convergence, the estimated parameters are approximately
\(\beta_0 \approx -3.0\), \(\beta_1 \approx 1.0\),
giving the fitted model
\(P(y=1|x) = \frac{1}{1 + e^{-(-3.0 + 1.0x)}}\)
This model predicts low probabilities for small \(x\) values and high probabilities for large \(x\) values. The predicted probability crosses 0.5 at approximately \(x = 3\), which is consistent with our dataset.
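Evaluating the fitted model at the observed \(x\) values makes this concrete (a quick Python check):

```python
import math

beta0, beta1 = -3.0, 1.0      # fitted values from the worked example
for x in [1, 2, 3, 4, 5]:
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    print(f"x = {x}: P(y=1|x) = {p:.3f}")
# x = 1: 0.119, x = 2: 0.269, x = 3: 0.500, x = 4: 0.731, x = 5: 0.881
```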
Logistic regression divides the feature space with a decision boundary that separates the two classes. In the single-variable case, this boundary consists of the points where the predicted probability equals 0.5, i.e. the points satisfying:
\(\beta_0 + \beta_1 x = 0\)
In the multivariate case:
\(\beta_0 + \boldsymbol{\beta}^T \mathbf{x} = 0\)
This is a hyperplane that linearly divides the feature space.
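As a concrete instance, for the fitted model from the worked example (\(\beta_0 \approx -3.0\), \(\beta_1 \approx 1.0\)) the boundary is the point where \(-3.0 + 1.0\,x = 0\), i.e. \(x = -\beta_0/\beta_1 = 3\): observations with \(x > 3\) are assigned to class 1 and those with \(x < 3\) to class 0.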
To prevent overfitting, a penalty term can be added to the log-likelihood function; with an L2 (ridge) penalty, for example, the objective becomes:
\(\ell_{\text{penalized}}(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} \beta_j^2\)
Here \(\lambda\) is the regularization parameter, and larger values provide more regularization.
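As a sketch of how the penalty changes the gradient-ascent update (this assumes the L2 penalty written above and NumPy; leaving the intercept unpenalized is a common convention rather than something stated here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_gradient_ascent(X, y, lam=1.0, alpha=0.01, n_iter=5000):
    """Gradient ascent on ell(beta) - lam * sum_j beta_j^2 (beta0 left unpenalized)."""
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        grad = Xb.T @ (y - p)
        grad[1:] -= 2.0 * lam * beta[1:]   # derivative of the penalty term: -2*lambda*beta_j
        beta += alpha * grad
    return beta
```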
Logistic regression is a powerful statistical method for classification problems. Unlike linear regression, it models class probabilities directly through the sigmoid function and estimates its parameters with the Maximum Likelihood method rather than least squares.
Despite its simplicity, this method forms a fundamental building block in the field of machine learning and is an important step in understanding more complex algorithms.