Mathematical Derivation of Logistic Regression Coefficients

Logistic regression models the probability that a binary dependent variable \(y\) equals 1 as a non-linear (sigmoid) function of the independent variable(s) \(x\), and is typically used for binary classification problems. This relationship is expressed as:

\[ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} = \sigma(\beta_0 + \beta_1 x) \]

Here, \(\beta_0\) represents the intercept, \(\beta_1\) represents the slope (on the log-odds scale), and \(\sigma\) represents the sigmoid function. To find these parameters, we use the Maximum Likelihood method.
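As a concrete reference for the formulas that follow, here is a minimal Python sketch of the sigmoid and the resulting model probability (the names `sigmoid` and `predict_proba` are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """P(y = 1 | x) for a single-feature logistic regression model."""
    return sigmoid(beta0 + beta1 * x)

# Example: with beta0 = -3 and beta1 = 1, the model gives P(y=1 | x=3) = sigmoid(0) = 0.5
print(predict_proba(3.0, -3.0, 1.0))  # 0.5
```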

1. The Foundation of Logistic Regression: Likelihood Function

While linear regression uses the least squares method, logistic regression uses the Maximum Likelihood method. This method finds parameters that maximize the probability of observing the given data:

\[ L(\beta_0, \beta_1) = \prod_{i=1}^{n} P(y_i|x_i; \beta_0, \beta_1) \]

For each data point:

\[ P(y_i|x_i; \beta_0, \beta_1) = \begin{cases} P(y_i=1|x_i) & \text{if } y_i = 1 \\ 1 - P(y_i=1|x_i) & \text{if } y_i = 0 \end{cases} \]

This can be expressed with a single formula:

\[ P(y_i|x_i; \beta_0, \beta_1) = P(y_i=1|x_i)^{y_i} \cdot (1-P(y_i=1|x_i))^{1-y_i} \]

2. Log-Likelihood

To simplify mathematical calculations, we take the logarithm of the likelihood function to use sums instead of products:

\begin{align} \ell(\beta_0, \beta_1) &= \log L(\beta_0, \beta_1) \\ &= \sum_{i=1}^{n} \log P(y_i|x_i; \beta_0, \beta_1) \\ &= \sum_{i=1}^{n} \left[ y_i \log P(y_i=1|x_i) + (1-y_i) \log(1-P(y_i=1|x_i)) \right] \end{align}

Using the sigmoid function:

\begin{align} \ell(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left[ y_i \log \sigma(z_i) + (1-y_i) \log(1-\sigma(z_i)) \right] \end{align}

Where \(z_i = \beta_0 + \beta_1 x_i\).
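Translated into code, the log-likelihood is just this sum over the data; a minimal sketch, assuming plain Python lists `x` and `y` of equal length:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(x, y, beta0, beta1):
    """Sum of y_i*log(sigma(z_i)) + (1 - y_i)*log(1 - sigma(z_i)) with z_i = beta0 + beta1*x_i."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(beta0 + beta1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

print(log_likelihood([1, 2, 3, 4, 5], [0, 0, 0, 1, 1], 0.0, 0.0))  # 5*log(0.5) ≈ -3.466
```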

3. Finding the Maximum Point

To maximize the log-likelihood function, we take partial derivatives with respect to \(\beta_0\) and \(\beta_1\) and set them equal to zero. Applying the chain rule together with the identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), the derivatives simplify to:

Partial derivative with respect to \(\beta_0\):

\[ \frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i - \sigma(z_i) \right] \]

Partial derivative with respect to \(\beta_1\):

\[ \frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} \left[ x_i(y_i - \sigma(z_i)) \right] \]
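Both derivatives share the same "residual times feature" structure, which makes them straightforward to compute; a sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, y, beta0, beta1):
    """Return (dl/dbeta0, dl/dbeta1) of the log-likelihood at the given parameters."""
    g0 = sum(yi - sigmoid(beta0 + beta1 * xi) for xi, yi in zip(x, y))
    g1 = sum(xi * (yi - sigmoid(beta0 + beta1 * xi)) for xi, yi in zip(x, y))
    return g0, g1
```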

4. Setting Derivatives to Zero

At the maximum point, these derivatives must be zero:

\[ \frac{\partial \ell}{\partial \beta_0} = 0 \Rightarrow \sum_{i=1}^{n} \left[ y_i - \sigma(z_i) \right] = 0 \]
\[ \frac{\partial \ell}{\partial \beta_1} = 0 \Rightarrow \sum_{i=1}^{n} \left[ x_i(y_i - \sigma(z_i)) \right] = 0 \]

Since these equations are non-linear, they don't have a closed-form solution. Therefore, numerical optimization methods are typically used.

5. Numerical Optimization: Gradient Ascent

We can use the Gradient Ascent method to solve these equations. This is an iterative algorithm that updates parameter values at each step:

\[ \beta_0^{(t+1)} = \beta_0^{(t)} + \alpha \cdot \frac{\partial \ell}{\partial \beta_0} \] \[ \beta_1^{(t+1)} = \beta_1^{(t)} + \alpha \cdot \frac{\partial \ell}{\partial \beta_1} \]

Here \(\alpha\) represents the learning rate.
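A minimal gradient-ascent loop, assuming a fixed learning rate and a fixed iteration count rather than a formal convergence test (`fit_gradient_ascent` is an illustrative name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_gradient_ascent(x, y, alpha=0.1, n_iter=10_000):
    """Maximize the log-likelihood by repeatedly stepping along its gradient."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        residuals = [yi - sigmoid(beta0 + beta1 * xi) for xi, yi in zip(x, y)]
        beta0 += alpha * sum(residuals)
        beta1 += alpha * sum(xi * r for xi, r in zip(x, residuals))
    return beta0, beta1
```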

6. Newton-Raphson Method

For faster convergence, the Newton-Raphson method can be used. This method also uses second-order derivatives (the Hessian matrix):

\[ H = \begin{bmatrix} \frac{\partial^2 \ell}{\partial \beta_0^2} & \frac{\partial^2 \ell}{\partial \beta_0 \partial \beta_1} \\ \frac{\partial^2 \ell}{\partial \beta_1 \partial \beta_0} & \frac{\partial^2 \ell}{\partial \beta_1^2} \end{bmatrix} \]

Second-order derivatives:

\[ \frac{\partial^2 \ell}{\partial \beta_0^2} = -\sum_{i=1}^{n} \sigma(z_i)(1-\sigma(z_i)) \] \[ \frac{\partial^2 \ell}{\partial \beta_1^2} = -\sum_{i=1}^{n} x_i^2 \sigma(z_i)(1-\sigma(z_i)) \] \[ \frac{\partial^2 \ell}{\partial \beta_0 \partial \beta_1} = \frac{\partial^2 \ell}{\partial \beta_1 \partial \beta_0} = -\sum_{i=1}^{n} x_i \sigma(z_i)(1-\sigma(z_i)) \]

Newton-Raphson update:

\[ \begin{bmatrix} \beta_0^{(t+1)} \\ \beta_1^{(t+1)} \end{bmatrix} = \begin{bmatrix} \beta_0^{(t)} \\ \beta_1^{(t)} \end{bmatrix} - H^{-1} \begin{bmatrix} \frac{\partial \ell}{\partial \beta_0} \\ \frac{\partial \ell}{\partial \beta_1} \end{bmatrix} \]
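For the two-parameter case, one Newton-Raphson step can be written out explicitly by inverting the 2x2 Hessian by hand; a sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(x, y, beta0, beta1):
    """One Newton-Raphson update of (beta0, beta1) using the gradient and 2x2 Hessian above."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(beta0 + beta1 * xi)
        w = p * (1 - p)                  # sigma(z_i) * (1 - sigma(z_i))
        g0 += yi - p
        g1 += xi * (yi - p)
        h00 -= w
        h01 -= xi * w
        h11 -= xi * xi * w
    det = h00 * h11 - h01 * h01          # determinant of the 2x2 Hessian
    # beta_new = beta_old - H^{-1} * gradient, using the explicit 2x2 inverse
    step0 = (h11 * g0 - h01 * g1) / det
    step1 = (h00 * g1 - h01 * g0) / det
    return beta0 - step0, beta1 - step1
```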

7. Multivariate Logistic Regression

When there are multiple independent variables, the model is extended as follows:

\[ P(y=1|\textbf{x}) = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p) = \sigma(\beta_0 + \textbf{x}^T \boldsymbol{\beta}) \]

Here \(\textbf{x} = [x_1, x_2, \ldots, x_p]^T\) and \(\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_p]^T\) are vectors.

The log-likelihood function and its derivatives can be extended similarly.
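In matrix form the gradient keeps the same structure; a vectorized NumPy sketch, assuming the intercept is absorbed by prepending a column of ones to the design matrix `X` (so `beta[0]` plays the role of \(\beta_0\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_grad(X, y, beta):
    """Gradient of the log-likelihood for design matrix X of shape (n, p+1) and labels y of shape (n,)."""
    p = sigmoid(X @ beta)    # predicted probabilities for all n observations
    return X.T @ (y - p)     # one component per coefficient, including the intercept
```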

8. Step-by-Step Calculation with an Example

Let's perform a step-by-step gradient-ascent calculation with a small dataset of five observations, \(x = (1, 2, 3, 4, 5)\) with labels \(y = (0, 0, 0, 1, 1)\):

Step 1: Set initial parameters

\(\beta_0^{(0)} = 0\), \(\beta_1^{(0)} = 0\)

Step 2: Calculate probabilities for the first iteration

For each \(x_i\), calculate \(z_i = \beta_0 + \beta_1 x_i\) and \(P(y_i=1|x_i) = \sigma(z_i)\):

\(z_1 = 0 + 0 \cdot 1 = 0\), \(P(y_1=1|x_1) = \sigma(0) = 0.5\)

\(z_2 = 0 + 0 \cdot 2 = 0\), \(P(y_2=1|x_2) = \sigma(0) = 0.5\)

\(z_3 = 0 + 0 \cdot 3 = 0\), \(P(y_3=1|x_3) = \sigma(0) = 0.5\)

\(z_4 = 0 + 0 \cdot 4 = 0\), \(P(y_4=1|x_4) = \sigma(0) = 0.5\)

\(z_5 = 0 + 0 \cdot 5 = 0\), \(P(y_5=1|x_5) = \sigma(0) = 0.5\)

Step 3: Calculate derivatives

\(\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{5} [y_i - \sigma(z_i)] = (0-0.5) + (0-0.5) + (0-0.5) + (1-0.5) + (1-0.5) = -0.5\)

\(\frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{5} [x_i(y_i - \sigma(z_i))] = 1(0-0.5) + 2(0-0.5) + 3(0-0.5) + 4(1-0.5) + 5(1-0.5) = -0.5 - 1 - 1.5 + 2 + 2.5 = 1.5\)

Step 4: Update parameters (with α = 0.1)

\(\beta_0^{(1)} = \beta_0^{(0)} + 0.1 \cdot \frac{\partial \ell}{\partial \beta_0} = 0 + 0.1 \cdot (-0.5) = -0.05\)

\(\beta_1^{(1)} = \beta_1^{(0)} + 0.1 \cdot \frac{\partial \ell}{\partial \beta_1} = 0 + 0.1 \cdot 1.5 = 0.15\)

This process is repeated until convergence is achieved.
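The first iteration above can be reproduced directly in code with the dataset \(x = (1, \dots, 5)\), \(y = (0, 0, 0, 1, 1)\) used in this example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1, 2, 3, 4, 5]
y = [0, 0, 0, 1, 1]
beta0, beta1, alpha = 0.0, 0.0, 0.1

# First-iteration gradients: every sigma(z_i) = sigma(0) = 0.5
g0 = sum(yi - sigmoid(beta0 + beta1 * xi) for xi, yi in zip(x, y))         # -0.5
g1 = sum(xi * (yi - sigmoid(beta0 + beta1 * xi)) for xi, yi in zip(x, y))  # 1.5

beta0, beta1 = beta0 + alpha * g0, beta1 + alpha * g1
print(round(g0, 4), round(g1, 4), round(beta0, 4), round(beta1, 4))  # -0.5 1.5 -0.05 0.15
```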

Final approximate values (after many more iterations)

\(\beta_0 \approx -3.0\), \(\beta_1 \approx 1.0\)

Logistic regression model

\(P(y=1|x) = \frac{1}{1 + e^{-(-3.0 + 1.0x)}}\)

This model predicts a low probability (close to 0) for small \(x\) values and a high probability (close to 1) for large \(x\) values. The decision threshold, where the predicted probability equals 0.5, lies at \(x = -\beta_0/\beta_1 \approx 3\), which aligns with our dataset: the labels switch from 0 to 1 between \(x = 3\) and \(x = 4\).
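A quick numerical check of this behavior with the approximate coefficients:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for x in [1, 2, 3, 4, 5]:
    print(x, round(sigmoid(-3.0 + 1.0 * x), 3))
# Prints 0.119, 0.269, 0.5, 0.731, 0.881 for x = 1..5
```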

9. Geometric Interpretation of Logistic Regression

Logistic regression divides the feature space with a decision boundary that separates the two classes. In the single-variable case, this boundary consists of the points that satisfy the equation:

\[ \beta_0 + \beta_1 x = 0 \]

In the multivariate case:

\[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p = 0 \]

This is a hyperplane that linearly divides the feature space.
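For the single-variable model from the worked example, the boundary is simply the root of \(\beta_0 + \beta_1 x = 0\):

```python
beta0, beta1 = -3.0, 1.0      # approximate coefficients from the worked example
x_boundary = -beta0 / beta1   # the point where P(y = 1 | x) = 0.5
print(x_boundary)             # 3.0
```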

10. Regularization

To prevent overfitting, a penalty term can be added to the log-likelihood function:

L2 Regularization (Ridge):

\[ \ell_{ridge}(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} \beta_j^2 \]

L1 Regularization (Lasso):

\[ \ell_{lasso}(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} |\beta_j| \]

Here \(\lambda\) is the regularization parameter, and larger values provide more regularization.
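In gradient ascent, the penalty simply adds a term to each coefficient's update. A sketch of the L2 (ridge) gradient for the single-feature model; note that the penalty sum starts at \(j = 1\), so the intercept \(\beta_0\) is not penalized:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ridge_gradient(x, y, beta0, beta1, lam):
    """Gradient of the penalized log-likelihood l(beta0, beta1) - lam * beta1**2."""
    g0 = sum(yi - sigmoid(beta0 + beta1 * xi) for xi, yi in zip(x, y))
    g1 = sum(xi * (yi - sigmoid(beta0 + beta1 * xi)) for xi, yi in zip(x, y))
    return g0, g1 - 2 * lam * beta1   # d/dbeta1 of -lam*beta1^2 is -2*lam*beta1
```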

Conclusion

Logistic regression is a powerful statistical method for classification problems. Unlike linear regression, it works within a probability framework and estimates parameters using the Maximum Likelihood method.

Despite its simplicity, this method forms a fundamental building block in the field of machine learning and is an important step in understanding more complex algorithms.