
The perceptron is a simple but limited linear model. A perceptron separates space with a hyperplane (in two dimensions this is a line). Points on one side of the plane are classified as positive instances, while points on the other side are classified as negative instances. On its own this model is of limited use because many two-class datasets are not linearly separable.

Any vector \(\Theta \in \mathbb{R}^{n + 1}\) defines a hyperplane: the set of points \(x\) with \(\Theta^T x = 0\), i.e. the points perpendicular to \(\Theta\). For a perceptron classifier the decision boundary is a plane defined by such a vector. The direction of \(\Theta\) determines the "up" side of the plane (i.e. which side of the plane contains instances of the positive class).

Given a position vector \(x\) that defines a point, we want to determine which side of the plane it is on in order to classify it. To do so we look at the sign of \(\Theta^T x\).

In \(\mathbb{R}^2\), \(\Theta^T x = |\Theta||x|\cos(\phi)\) where \(\phi\) is the angle between the vectors. When \(x\) is a positive example it lies on the side of the boundary that \(\Theta\) points toward, so \(\phi\) is less than 90 degrees and the dot product is positive. This case is illustrated by the following images, where the thick line is the decision boundary, the purple line is \(\Theta\), the grey line points to \(x\), and \(\phi\) is the white angle. Note that \(\Theta\) is perpendicular to the boundary.

In the case of a negative sample point, \(x\) lies on the side of the boundary that \(\Theta\) points away from, so \(\phi\) is between 90 and 180 degrees. This means the dot product is negative. Note that just as in the previous case, the sign of the dot product matches the point's class.

For the general case of a perceptron parameterized by \(\Theta\), the function $$h_\Theta(x) = \text{sign}(\Theta^T x)$$ where $$ \text{sign}(x) = \begin{cases} 1 & \text{if } x > 0\\ -1 & \text{otherwise} \end{cases} $$ determines which side of the decision boundary \(x\) is on.
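As a concrete sketch of this decision rule (the function names and the example vectors are illustrative, not from the original post):

```python
import numpy as np

def sign(v):
    # Map positive values to +1 and everything else to -1,
    # matching the sign() definition above.
    return 1 if v > 0 else -1

def predict(theta, x):
    # h_theta(x) = sign(theta^T x)
    return sign(np.dot(theta, x))

# theta defines a boundary; points with a positive dot product
# are classified as +1, the rest as -1.
theta = np.array([0.0, 1.0, 1.0])            # bias component first
print(predict(theta, np.array([1.0, 2.0, 3.0])))    # prints 1
print(predict(theta, np.array([1.0, -2.0, -3.0])))  # prints -1
```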

Given a set of \(m\) points \(D = \{(x_i, y_i) | x_i \in \mathbb{R}^{n}, y_i \in \{1, -1\}\} \) we want to find a plane in \(\mathbb{R}^n\) that separates the \(x_i\) so that all the points with class \(y_i = 1\) are on the opposite side from those with \(y_i = -1\). Note: the data must actually be linearly separable in order for the algorithm to converge.

Proceed as follows:

- Augment each \(x_i\) by adding a bias component with value 1 (this handles data where the separating plane does not pass through the origin).
- Randomly pick an initial \(\Theta\)
- Compare \(h_\Theta(x_i)\) with \(y_i\) for each sample
- While there is some \(x_i\) that is misclassified (i.e. \(h_\Theta(x_i) \neq y_i\)) make the following update: $$ \Theta = \Theta + \alpha y_i x_i $$ The value \(\alpha\) is the learning rate of the algorithm; it controls how much \(\Theta\) changes with each iteration. In order for the algorithm to work \(\alpha\) must be positive.
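The steps above can be sketched as follows (a minimal implementation; the function name and the tiny dataset are my own illustrations, not from the original post):

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_iters=1000):
    """Perceptron learning algorithm on linearly separable data.

    X: (m, n) array of samples; y: length-m array of labels in {1, -1}.
    Returns the learned theta (length n + 1, bias component first).
    """
    # Augment each x_i with a bias component of 1.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    # Randomly pick an initial theta (seeded for reproducibility).
    rng = np.random.default_rng(0)
    theta = rng.standard_normal(X_aug.shape[1])

    for _ in range(max_iters):
        predictions = np.where(X_aug @ theta > 0, 1, -1)
        misclassified = np.flatnonzero(predictions != y)
        if misclassified.size == 0:
            return theta  # every sample classified correctly
        i = misclassified[0]
        theta = theta + alpha * y[i] * X_aug[i]  # the update rule above
    return theta

# Tiny linearly separable dataset.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1, 1, -1, -1])
theta = train_perceptron(X, y)
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
print(np.where(X_aug @ theta > 0, 1, -1))  # matches y once converged
```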

The first question is: does the algorithm actually work? To answer this question, consider a sample \((x_i, y_i)\) that is misclassified.

Suppose the sample is positive (\(y_i = 1\)) but was misclassified as negative. This means \(\text{sign}(\Theta^T x_i) = h_\Theta(x_i) = -1\). To flip the sign, \(\Theta\) must change so that \(\Theta^T x_i > 0\). After the update \(\Theta_{new} = \Theta + \alpha y_i x_i\), we have \(\Theta_{new}^T x_i = \Theta^T x_i + \alpha \cdot 1 \cdot x_i^T x_i > \Theta^T x_i\), since \(\alpha > 0\) and \(x_i^T x_i > 0\) (the augmented \(x_i\) cannot be the zero vector because its bias component is 1). This means the parameter is moving in the right direction, at least for this sample instance.

The case for misclassification of a negative instance is similar.
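A quick numeric check (the values are chosen purely for illustration) shows the update moving \(\Theta^T x\) toward the correct sign in both cases:

```python
import numpy as np

alpha = 0.5
x = np.array([1.0, 2.0, 1.0])  # augmented sample, x^T x = 6

# Positive sample misclassified as negative: theta^T x < 0.
theta = np.array([0.0, -1.0, 0.0])
before = theta @ x                    # -2.0
theta_new = theta + alpha * 1 * x     # update with y_i = 1
after = theta_new @ x                 # -2.0 + 0.5 * 6 = 1.0
print(before, after)                  # after > before

# Negative sample misclassified as positive: theta^T x > 0.
theta = np.array([0.0, 1.0, 0.0])
before = theta @ x                    # 2.0
theta_new = theta + alpha * (-1) * x  # update with y_i = -1
after = theta_new @ x                 # 2.0 - 0.5 * 6 = -1.0
print(before, after)                  # after < before
```

A single update need not fix the sample outright (here the first one does), but each update strictly moves \(\Theta^T x_i\) toward the correct sign.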

Theorem: if the data are linearly separable, then the perceptron learning algorithm (PLA) converges in a finite number of updates. Proof of perceptron convergence (mirror).

Written by Adrian Stoll on 18 Jul 2017