- Email: spam / not spam
- Online transactions: fraudulent (yes/no)?
- Tumor: malignant / benign?
y ∈ {0, 1}, where 0 is the "negative class" (e.g. benign tumor) and 1 is the "positive class" (e.g. malignant tumor)
For multiclass problems, y can take more values, e.g. y ∈ {0, 1, 2, 3, 4}
Linear regression for classification problems is not a good idea.
In order to get 0 <= h(x) <= 1, we use:
h(x) = g(Θ' * x)
g(z) = 1 / (1 + e^(-z)) # sigmoid function or logistic function
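A minimal Octave sketch of this hypothesis (the function names sigmoid and hypothesis are my own):
# logistic (sigmoid) function; works element-wise on vectors and matrices
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
# hypothesis h(x) = g(Θ' * x) for a single example x (theta and x are column vectors)
function h = hypothesis(theta, x)
  h = sigmoid(theta' * x);
end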
Since 0 <= h(x) <= 1, h(x) can be interpreted as the probability of y = 1 on input x.
x = [x0 x1] = [1 tumorSize]
h(x) = 0.7
# then tell the patient there is a 70% chance of the tumor being malignant
Because of this interpretation, we have that:
P(y=0|x;Θ) + P(y=1|x;Θ) = 1
P(y=0|x;Θ) = 1 - P(y=1|x;Θ)
Suppose there are two possible classes: y ∈ {0, 1}. Based on the function g(z), h(x) will be 0.5 when z = 0, where z = Θ' * x. Thus we will:
predict "y=1" if h(x) ≥ 0.5
predict "y=0" if h(x) < 0.5
Decision boundary is the name given to the line/plane that divides the categories we are trying to classify. It is a property of the hypothesis function h(x), not of the data set: it is the surface generated by the hypothesis function. Whenever Θ' * x >= 0, we are predicting y=1; on the other hand, when Θ' * x < 0, we are predicting y=0.
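As a small Octave sketch of this decision rule (assuming a design matrix X whose rows are examples with a leading column of ones, and the sigmoid function above):
# predict y for every row of X using the 0.5 threshold,
# which is equivalent to checking whether X * theta >= 0
predictions = sigmoid(X * theta) >= 0.5;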
More complex decision boundaries can be obtained, just as before, by adding new features according to how we want the boundary to look and computing those terms from the original features. For example:
h(x) = g(Θ0 + Θ1x1 + Θ2x2 + Θ3x1² + Θ4x2²)
# if we choose Θ = [-1 0 0 1 1], then we predict:
"y = 1" if -1 + x1² + x2² >= 0
x1² + x2² >= 1
# so the decision boundary (where Θ' * x = 0, i.e. h(x) = 0.5) is the circle
x1² + x2² = 1
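A small Octave sketch of this particular boundary (the point (0.8, 0.9) is just an illustrative input):
theta = [-1; 0; 0; 1; 1];
x1 = 0.8; x2 = 0.9;                        # example point
x  = [1; x1; x2; x1^2; x2^2];              # feature mapping [1 x1 x2 x1^2 x2^2]
prediction = sigmoid(theta' * x) >= 0.5;   # 1 exactly when x1^2 + x2^2 >= 1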
The cost function used for linear regression is not a good choice for logistic regression because, with the sigmoid function inside h(x), the resulting squared-error cost would be non-convex. Running gradient descent on a non-convex function can converge to a local minimum that is different from the global minimum.
A good cost function J that is guaranteed to be convex (and also non-negative) is:
Cost(h(x), y) = -log(h(x))      if y=1
Cost(h(x), y) = -log(1 - h(x))  if y=0
When plotting the two functions, we can see that for y=1 the cost grows very large as h(x) approaches 0, and for y=0 the cost grows very large as h(x) approaches 1. This is exactly the behavior we want.
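A quick Octave sketch to reproduce those two plots:
h = linspace(0.001, 0.999, 200);       # range of possible h(x) values
plot(h, -log(h), h, -log(1 - h));      # cost for y=1 and cost for y=0
legend('y = 1: -log(h)', 'y = 0: -log(1 - h)');
xlabel('h(x)'); ylabel('cost');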
We can further simplify both Cost(h(x), y) and J(Θ):
Cost(h(x), y) = -y * log(h(x)) - (1-y) * log(1 - h(x))
# substituting this into our original cost function J, we have:
J(Θ) = -(1/m) * ∑( y * log(h(x)) + (1-y) * log(1 - h(x)) )
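A minimal vectorized Octave sketch of J(Θ) (assuming X is the m×(n+1) design matrix with a leading column of ones and y is an m×1 vector of 0/1 labels; the function name is my own):
function J = logisticCost(theta, X, y)
  m = length(y);              # number of training examples
  h = sigmoid(X * theta);     # m x 1 vector of predictions
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
end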
To use gradient descent to minimize the cost function J, we will use the same template used for linear regression.
repeat {
# simultaneously update all Θj
Θj = Θj - α * ∂/∂Θj (J(Θ))
}
# and the partial derivative of J(Θ) is described as:
∂/∂Θj (J(Θ)) = (1/m) * ∑( (h(x) - y) * xj )
# keeping in mind that the hypothesis function is:
h(x) = 1 / (1 + e^(-Θ'*x))
To apply logistic regression, it is convenient to use vectorized notation when updating all values of theta. In vectorized form, the update is:
Θ = Θ - (α/m) * X' * (g(XΘ) - y)
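A sketch of this update loop in Octave (alpha and num_iters are illustrative choices):
alpha = 0.01;                                      # learning rate
num_iters = 1000;
m = length(y);
for iter = 1:num_iters
  grad  = (1/m) * X' * (sigmoid(X * theta) - y);   # vector of partial derivatives
  theta = theta - alpha * grad;                    # simultaneous update of all Θj
end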
Optimization algorithms:
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
Advantages (of the last three):
- No need to manually pick α
- Often faster than gradient descent
Disadvantages:
- More complex
Octave implements a minimization function called fminunc
(function minimization, unconstrained) that, given an initial guess for theta and a cost function that computes both J(Θ) and its partial derivatives with respect to each theta, returns the optimum theta and the corresponding value of J(Θ).
# Example
# theta = [Θ1; Θ2; ...; Θn+1]
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(Θ)];
  gradient(1) = [code to compute ∂/∂Θ1 (J(Θ))];
  ...
  gradient(n+1) = [code to compute ∂/∂Θn+1 (J(Θ))];
end
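A concrete sketch of how this can be wired up for logistic regression (logRegCostFunction is my own naming; optimset and fminunc are the standard Octave calls):
# cost function returning both J(Θ) and its gradient, vectorized
function [jVal, gradient] = logRegCostFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
  gradient = (1/m) * X' * (h - y);
end

options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, jValMin] = fminunc(@(t) logRegCostFunction(t, X, y), initialTheta, options);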
- Email foldering/tagging: Work, Friends, Family, Hobby (y = [1,2,3,4])
- Medical diagnosis: Not ill, Cold, Flu (y = [1,2,3])
- Weather: Sunny, Cloudy, Rain, Snow (y = [1,2,3,4])
The one-vs-all (one-vs-rest) approach uses logistic regression to create n classifiers, each of which learns to separate one class from all the others (turning each into a regular binary logistic regression problem).
For each class i, train a logistic regression classifier hi(x) to predict the probability that y=i. On a new input x, pick the class i that maximizes hi(x), because that is the most confident prediction.
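A minimal one-vs-all sketch in Octave (numClasses, allTheta and the reuse of logRegCostFunction from the sketch above are assumptions of this example):
# train one classifier per class: column i of allTheta separates class i from the rest
numClasses = 4;
n = size(X, 2);
allTheta = zeros(n, numClasses);
options = optimset('GradObj', 'on', 'MaxIter', 400);
for i = 1:numClasses
  yi = (y == i);                                   # binary labels: class i vs the rest
  allTheta(:, i) = fminunc(@(t) logRegCostFunction(t, X, yi), zeros(n, 1), options);
end
# predict: pick the class whose classifier gives the highest probability
[~, predictions] = max(sigmoid(X * allTheta), [], 2);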