Logistic Regression in 20 Lines of Python
Logistic regression is a classification algorithm widely used in industry. Its structure is simple, with the following main advantages and disadvantages:
Pros:
- Training and inference are both fast
- Easy to implement
- Low memory usage
- Good interpretability
Cons:
- Because it cannot fit nonlinear relationships, it places higher demands on feature engineering.
- It is relatively sensitive to multicollinearity
Structurally, the only difference between logistic regression and linear regression is the addition of the sigmoid function, also called the logistic function, which converts the output into a normalized value.
Linear regression
$$ Y=X \cdot W+B $$
Logistic regression
$$ Y=\sigma (X \cdot W+B) $$
where
$$ \sigma(t)=\frac{1}{1+e^{-t}} $$
is the sigmoid function. Thresholding its output gives the prediction rule
$$ \hat y=\left\{\begin{array}{ll} 0, & z<0 \\ 0.5, & z=0 \\ 1, & z>0 \end{array}\right., \quad z=w^{T} x+b $$
The sigmoid function is introduced because the output of linear regression is not confined to [0, 1] and cannot model a discrete target directly. The ideal classifier, a step function, is not differentiable, whereas the logistic function is smooth, differentiable to arbitrary order, and leads to a convex log-loss with good mathematical properties. In addition, the sigmoid output can be read as a continuous measure of likelihood, although it is not a "probability" in the strict mathematical sense.
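As a quick numerical check of these properties, the sigmoid maps any real input into the open interval (0, 1), with σ(0) = 0.5 (a minimal sketch; the function name is ours):

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t)); maps all of R into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))                            # 0.5 by symmetry
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # extremes approach 0 and 1
```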
Once the model is fixed, the next step is to estimate the parameters by maximum likelihood.
Under a Bernoulli distribution, maximum likelihood estimation leads to the cross-entropy loss function.
Assume
$$ \begin{array}{l} P(Y=1 \mid x)=p(x) \\ P(Y=0 \mid x)=1-p(x) \end{array} $$
The probability mass function of the Bernoulli distribution is
$$ f_{X}(x)=p^{x}(1-p)^{1-x}=\left\{\begin{array}{ll} p & \text { if } x=1 \\ 1-p & \text { if } x=0 \end{array}\right. $$
The likelihood function is
$$ L(w)=\prod\left[p\left(x_{i}\right)\right]^{y_{i}}\left[1-p\left(x_{i}\right)\right]^{1-y_{i}} $$
Rewriting it in logarithmic form for easier computation, we find that its negative is exactly the cross-entropy:
$$ \ln L(w)=\sum\left[y_{i} \ln p\left(x_{i}\right)+\left(1-y_{i}\right) \ln \left(1-p\left(x_{i}\right)\right)\right] $$
Its negative can be used as the loss function, so maximizing the likelihood is equivalent to minimizing the loss:$$ J(w)=-\frac{1}{N} \ln L(w) $$
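This loss can be computed directly from predicted probabilities and labels (a sketch; the clipping guard against log(0) is our addition, not part of the derivation):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    # J(w) = -(1/N) * sum(y*ln(p) + (1-y)*ln(1-p))
    p = np.clip(p, eps, 1 - eps)  # avoid log(0) for saturated predictions
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident, correct predictions give a small loss:
print(cross_entropy(np.array([0.9, 0.1]), np.array([1, 0])))
```

Note that a confident wrong prediction is penalized much more heavily than an uncertain one, which is exactly what drives the weights toward separating the classes.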
This article uses gradient descent to search for the optimal parameters.
The derivation of the partial derivative of the loss with respect to the weights is omitted here:
$$ \frac{\partial J(w)}{\partial w_{j}}=\frac{1}{N} \sum_{i}\left(p\left(x_{i}\right)-y_{i}\right) x_{i j} $$
The detailed derivation can be found in [1]. Update the parameters:
$$ w_{j}^{k+1}=w_{j}^{k}-\alpha \frac{\partial J(w)}{\partial w_{j}} $$
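The analytic gradient above can be sanity-checked against a central finite-difference approximation of the loss (a sketch; all names and the random test data are ours):

```python
import numpy as np

def loss(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # dJ/dw = (1/N) * sum_i (p(x_i) - y_i) * x_i
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y) @ X / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
y = (rng.random(5) < 0.5).astype(float)

eps = 1e-6
num = np.array([
    (loss(w + eps * np.eye(3)[j], X, y) - loss(w - eps * np.eye(3)[j], X, y)) / (2 * eps)
    for j in range(3)
])
print(np.max(np.abs(num - grad(w, X, y))))  # should be tiny
```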
The math is done. The following is the implementation.
The complete Python code needed to define a usable logistic regression model is under 20 lines:
import numpy as np
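The original listing appears truncated after the import. A minimal sketch of such a model, following the sigmoid, cross-entropy, and gradient-descent update derived above, and fitting in under 20 lines (the class name and hyperparameter defaults are our own choices, assuming plain full-batch gradient descent):

```python
import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.1, n_iters=1000):
        self.lr, self.n_iters = lr, n_iters

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])  # weights W
        self.b = 0.0                   # bias B
        for _ in range(self.n_iters):
            p = 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))  # sigma(X.W + B)
            self.w -= self.lr * (p - y) @ X / len(y)  # dJ/dw = (1/N) sum (p - y) x
            self.b -= self.lr * np.mean(p - y)        # same update for the bias
        return self

    def predict(self, X):
        # threshold the sigmoid output at 0.5
        return (1.0 / (1.0 + np.exp(-(X @ self.w + self.b))) >= 0.5).astype(int)
```

For example, fitting on a small linearly separable set such as `X = [[0], [1], [2], [3]]`, `y = [0, 0, 1, 1]` recovers a decision boundary between 1 and 2.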
References
[1] https://blog.csdn.net/jasonzzj/article/details/52017438