Logistic Regression

Sigmoid Function

A sigmoid function is a bounded, differentiable, real function that is defined for all real input values, has a non-negative derivative at each point, and has exactly one inflection point. Some such functions are listed below; this Wikipedia page has good examples.

Logistic Function \begin{equation} f(x)=\frac{1}{1+e^{-x}} \end{equation}

Hyperbolic Tangent \begin{equation} f(x)=\tanh(x) \end{equation}

Generalized Logistic Function \begin{equation} f(x)=(1+e^{-x})^{-\alpha}, \quad \alpha > 0 \end{equation}
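As a quick illustration, here is a minimal NumPy sketch of these three functions (the function names and the `alpha` parameter are illustrative):

```python
import numpy as np

def logistic(x):
    # Standard logistic function: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def hyperbolic_tangent(x):
    # tanh, a sigmoid bounded between -1 and 1
    return np.tanh(x)

def generalized_logistic(x, alpha=1.0):
    # Generalized logistic: (1 + exp(-x))^(-alpha), alpha > 0
    return (1.0 + np.exp(-x)) ** (-alpha)

x = np.linspace(-6, 6, 5)
print(logistic(x))
print(hyperbolic_tangent(x))
print(generalized_logistic(x, alpha=2.0))
```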

\begin{equation} \text{Error Function}=-\frac{1}{m}\sum_{i=1}^m\left[(1-y_i)\ln(1-p_i)+y_i\ln(p_i)\right] \end{equation}

Here we divide by m for convenience when taking derivatives, and m is the number of rows (training examples) in the dataset.
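A minimal NumPy sketch of this error computation, assuming `y` holds the labels and `p` the predicted probabilities:

```python
import numpy as np

def error_function(y, p):
    # -(1/m) * sum[(1 - y_i) ln(1 - p_i) + y_i ln(p_i)]
    return -np.mean((1 - y) * np.log(1 - p) + y * np.log(p))

y = np.array([0, 1, 1, 0])          # labels
p = np.array([0.1, 0.8, 0.6, 0.3])  # predicted probabilities
print(error_function(y, p))
```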

Gradient of the Error Function.

Since we use the sigmoid function to calculate the probabilities, $p_i=\text{sigmoid}(Wx^{(i)}+b)$, where $x^{(i)}$ is the $i^{\text{th}}$ row of the dataset and W is the corresponding vector of weights/coefficients in the linear equation.
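For concreteness, a small sketch of computing these probabilities for a whole toy dataset at once (the arrays `X`, `W`, `b` are illustrative; `X` holds one example per row):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dataset: 4 rows (examples), 2 columns (features)
X = np.array([[0.5, 1.2],
              [1.0, -0.3],
              [-0.7, 0.8],
              [0.2, 0.1]])
W = np.array([0.4, -0.6])  # one weight per feature
b = 0.1                    # bias

# p_i = sigmoid(W x^(i) + b), computed for all rows at once
p = sigmoid(X @ W + b)
print(p)
```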

To minimize the error we need the gradient of the error function, which is a straightforward application of derivatives.

\begin{align} \text{sigmoid}'(x)&=\frac{e^{-x}}{(1+e^{-x})^2}\\ &=\text{sigmoid}(x)\,(1-\text{sigmoid}(x)) \end{align}
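A quick numerical sanity check of this identity, using a central finite difference with an arbitrary small step `h`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-6

# Central finite-difference approximation of sigmoid'(x)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# Closed form: sigmoid(x) * (1 - sigmoid(x))
analytic = sigmoid(x) * (1 - sigmoid(x))

print(np.max(np.abs(numeric - analytic)))  # close to zero
```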

The error function depends on the weights and the bias; that is, it is a scalar-valued function from $\mathbb{R}^{n+1}$ to $\mathbb{R}$. \begin{equation} E(W, b)=-\frac{1}{m}\sum_{i=1}^m\left[(1-y_i)\ln(1-p_i)+y_i\ln(p_i)\right] \end{equation} Where $W=(w_1, w_2, \cdots, w_n)$.

Note that, by the chain rule, $$\frac{\partial p_i}{\partial w_j}=\frac{\partial (\text{sigmoid}(Wx^{(i)}+b))}{\partial w_j}=\text{sigmoid}'(Wx^{(i)}+b)\cdot x^{(i)}_j=p_i(1-p_i)\cdot x^{(i)}_j$$

To evaluate the gradient of E we differentiate the term $(1-y_i)\ln(1-p_i)+y_i\ln(p_i)$, call it $E_i$, with respect to $w_j$. \begin{align} \frac{\partial E_i}{\partial w_j}&=\frac{\partial}{\partial w_j}\left[(1-y_i)\ln(1-p_i)+y_i\ln(p_i)\right]\\ &=-(1-y_i)p_i\cdot x^{(i)}_j+y_i(1-p_i)\cdot x^{(i)}_j\\ &= (y_i-p_i)\,x^{(i)}_j \end{align}

Note: $x^{(i)}_j$ is the entry in the $i^{\text{th}}$ row and $j^{\text{th}}$ column of the dataset.

Finally, the $j^{\text{th}}$ component of the gradient is \begin{equation} \frac{\partial E}{\partial w_j}=-\frac{1}{m}\sum_{i=1}^m(y_i-p_i)\cdot x^{(i)}_j \end{equation} and, by the same calculation, $\frac{\partial E}{\partial b}=-\frac{1}{m}\sum_{i=1}^m(y_i-p_i)$.

We use this in the gradient descent algorithm (which is also used in quadratic minimization problems).

The above gradient can be written in matrix form \begin{equation} \nabla_W E=-\frac{1}{m}X^{T}(Y-P) \end{equation} where $X$ is the $m\times n$ data matrix and $Y$, $P$ are the column vectors of labels and predicted probabilities.
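As a hedged NumPy sketch on random toy data (all variable names are illustrative), the matrix form and the explicit sum over rows give the same gradient:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # m = 5 examples, n = 3 features
Y = rng.integers(0, 2, size=5)    # binary labels
W = rng.normal(size=3)
b = 0.0

P = sigmoid(X @ W + b)            # predicted probabilities

# Matrix form: grad_W E = -(1/m) X^T (Y - P); bias derivative is the mean residual
grad_W = -X.T @ (Y - P) / len(Y)
grad_b = -np.mean(Y - P)

# The same gradient written as the sum over rows, for comparison
grad_W_sum = -sum((Y[i] - P[i]) * X[i] for i in range(len(Y))) / len(Y)
print(np.allclose(grad_W, grad_W_sum))  # True
```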

Let W denote the initialised weights.

Algorithm

W = W - learnrate*(grad(E))
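Written out with the gradient derived above and a learning rate $\alpha$, one such update step for each weight, and for the bias, is:

\begin{align} w_j &\leftarrow w_j - \alpha\frac{\partial E}{\partial w_j} = w_j + \frac{\alpha}{m}\sum_{i=1}^m(y_i-p_i)\,x^{(i)}_j\\ b &\leftarrow b + \frac{\alpha}{m}\sum_{i=1}^m(y_i-p_i) \end{align}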

Functions for regression.

Let us define the functions for the regression. We will require a sigmoid function, a probability function, an error function, and finally a function which updates the weights. We also add two functions for plotting the error and the boundary line.
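The original code cell is not reproduced here; the following is a minimal sketch of what such helpers could look like, assuming NumPy arrays, Matplotlib for the plots, and the batch gradient derived above (the names `output_formula`, `error_formula`, `update_weights`, `plot_error`, and `plot_boundary` are illustrative, not necessarily the author's):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def output_formula(X, weights, bias):
    # Predicted probabilities p_i = sigmoid(W x^(i) + b) for every row of X
    return sigmoid(np.dot(X, weights) + bias)

def error_formula(y, output):
    # Mean cross-entropy error
    return -np.mean(y * np.log(output) + (1 - y) * np.log(1 - output))

def update_weights(X, y, weights, bias, learnrate):
    # One batch gradient-descent step: W <- W - learnrate * grad_W(E), and similarly for b
    output = output_formula(X, weights, bias)
    weights = weights + learnrate * np.dot(X.T, y - output) / len(y)
    bias = bias + learnrate * np.mean(y - output)
    return weights, bias

def plot_error(errors):
    # Error curve over the training epochs
    plt.plot(errors)
    plt.xlabel("Epoch")
    plt.ylabel("Error")
    plt.show()

def plot_boundary(X, y, weights, bias):
    # Scatter the 2-D data and draw the line w1*x1 + w2*x2 + b = 0
    plt.scatter(X[:, 0], X[:, 1], c=y)
    x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
    x2 = -(weights[0] * x1 + bias) / weights[1]
    plt.plot(x1, x2, "k-")
    plt.show()
```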

Training the dataset

The function below trains the model on the dataset and plots the error and the boundary line for two-dimensional data.
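Again as a sketch, reusing the helper functions defined above (the default epoch count and learning rate are arbitrary):

```python
import numpy as np

def train(X, y, learnrate=0.01, epochs=1000):
    # Uses output_formula, error_formula, update_weights,
    # plot_error and plot_boundary from the sketch above.
    n_features = X.shape[1]
    weights = np.random.normal(scale=1 / n_features ** 0.5, size=n_features)
    bias = 0.0
    errors = []
    for _ in range(epochs):
        weights, bias = update_weights(X, y, weights, bias, learnrate)
        out = output_formula(X, weights, bias)
        errors.append(error_formula(y, out))
    plot_error(errors)
    if n_features == 2:
        plot_boundary(X, y, weights, bias)
    return weights, bias
```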

Final Stage