- Email: spam / not spam
- Online transactions: fraudulent (yes/no)?
- Tumor: malignant / benign?
y ∈ {0, 1}, where 0 is the "negative class" (e.g. benign tumor) and 1 is the "positive class" (e.g. malignant tumor)
For multiclass problems, y can take more values, e.g. y ∈ {0, 1, 2, 3, 4}
Linear regression for classification problems is not a good idea.
In order to get 0 <= h(x) <= 1, we use:
h(x) = g(Θ' * x)
g(z) = 1 / (1 + e^(-z)) # sigmoid function or logistic function
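A minimal Octave sketch of this hypothesis (the function names sigmoid and hypothesis are my own):
# logistic (sigmoid) function; works element-wise on vectors and matrices
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
# hypothesis h(x) = g(Θ' * x) for a single example x (theta and x are column vectors)
function h = hypothesis(theta, x)
  h = sigmoid(theta' * x);
end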
Since 0 <= h(x) <= 1, h(x) can be interpreted as the probability of y = 1 on input x.
x = [x0 x1] = [1 tumorSize]
h(x) = 0.7
# then tell the patient there is a 70% chance of the tumor being malignant
Because of this interpretation, we have that:
P(y=0|x;Θ) + P(y=1|x;Θ) = 1
P(y=0|x;Θ) = 1 - P(y=1|x;Θ)
Suppose there are two possible classes: y ∈ {0, 1}. Based on the function g(z), h(x) will be 0.5 when z = 0, where z = Θ' * x. Thus we will:
predict "y=1" if h(x) ≥ 0.5
predict "y=0" if h(x) < 0.5
Decision boundary is the name given to the line/plane that divides the categories we are trying to classify. It is a property of the hypothesis function h(x), not of the data set: it is the surface generated by the hypothesis function. Whenever Θ' * x >= 0, we are predicting y=1; on the other hand, when Θ' * x < 0, we are predicting y=0.
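As a small Octave sketch of this decision rule (assuming a design matrix X whose rows are examples with a leading column of ones, and the sigmoid function above):
# predict y for every row of X using the 0.5 threshold,
# which is equivalent to checking whether X * theta >= 0
predictions = sigmoid(X * theta) >= 0.5;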
More complex decision boundaries can be obtained, just as before, by adding new features according to how we want the boundary to look and computing those terms from the original features. For example:
h(x) = g(Θ0 + Θ1x1 + Θ2x2 + Θ3x1² + Θ4x2²)
# if we choose Θ = [-1 0 0 1 1], then we predict:
"y = 1" if -1 + x1² + x2² >= 0
x1² + x2² >= 1
# so the decision boundary (where Θ' * x = 0, i.e. h(x) = 0.5) is the circle
x1² + x2² = 1
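A small Octave sketch of this particular boundary (the point (0.8, 0.9) is just an illustrative input):
theta = [-1; 0; 0; 1; 1];
x1 = 0.8; x2 = 0.9;                        # example point
x  = [1; x1; x2; x1^2; x2^2];              # feature mapping [1 x1 x2 x1^2 x2^2]
prediction = sigmoid(theta' * x) >= 0.5;   # 1 exactly when x1^2 + x2^2 >= 1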
The cost function used for linear regression is not a good choice for logistic regression because, with the sigmoid function inside h(x), the resulting squared-error cost would be non-convex. Running gradient descent on a non-convex function can converge to a local minimum that is different from the global minimum.
A good cost function J that is guaranteed to be convex (and also non-negative) is:
Cost(h(x), y) = -log(h(x))      if y=1
Cost(h(x), y) = -log(1 - h(x))  if y=0
When plotting the two functions, we can see that for y=1 the cost grows very large as h(x) approaches 0, and for y=0 the cost grows very large as h(x) approaches 1. This is exactly the behavior we want.
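A quick Octave sketch to reproduce those two plots:
h = linspace(0.001, 0.999, 200);       # range of possible h(x) values
plot(h, -log(h), h, -log(1 - h));      # cost for y=1 and cost for y=0
legend('y = 1: -log(h)', 'y = 0: -log(1 - h)');
xlabel('h(x)'); ylabel('cost');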
We can further simplify both Cost(h(x), y) and J(Θ):
Cost(h(x), y) = -y * log(h(x)) - (1-y) * log(1 - h(x))
# substituting this into our original cost function J, we have:
J(Θ) = -(1/m) * ∑( y * log(h(x)) + (1-y) * log(1 - h(x)) )
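A minimal vectorized Octave sketch of J(Θ) (assuming X is the m×(n+1) design matrix with a leading column of ones and y is an m×1 vector of 0/1 labels; the function name is my own):
function J = logisticCost(theta, X, y)
  m = length(y);              # number of training examples
  h = sigmoid(X * theta);     # m x 1 vector of predictions
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
end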
To use gradient descent to minimize the cost function J, we will use the same template used for linear regression.
repeat {
# simultaneously update all Θj
Θj = Θj - α * ∂/∂Θj (J(Θ))
}
# and the partial derivative of J(Θ) is described as:
∂/∂Θj (J(Θ)) = (1/m) * ∑( (h(x) - y) * xj )
# keeping in mind that the hypothesis function is:
h(x) = 1 / (1 + e^(-Θ'*x))
To apply logistic regression, it is convenient to use vectorized notation when updating all values of theta. In vectorized form, the update is:
Θ = Θ - (α/m) * X' * (g(XΘ) - y)
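A sketch of this update loop in Octave (alpha and num_iters are illustrative choices):
alpha = 0.01;                                      # learning rate
num_iters = 1000;
m = length(y);
for iter = 1:num_iters
  grad  = (1/m) * X' * (sigmoid(X * theta) - y);   # vector of partial derivatives
  theta = theta - alpha * grad;                    # simultaneous update of all Θj
end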
Optimization algorithms:
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
Advantages (of the last three):
- No need to manually pick α
- Often faster than gradient descent
Disadvantages:
- More complex
Octave implements a minimization function called fminunc
(function minimization, unconstrained) that, given an initial guess for theta and a cost function that computes both J(Θ) and its partial derivatives with respect to each theta, returns the optimum theta and the corresponding value of J(Θ).
# Example
# theta = [Θ1; Θ2; ...; Θn+1]
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(Θ)];
  gradient(1) = [code to compute ∂/∂Θ1 (J(Θ))];
  ...
  gradient(n+1) = [code to compute ∂/∂Θn+1 (J(Θ))];
end
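A concrete sketch of how this can be wired up for logistic regression (logRegCostFunction is my own naming; optimset and fminunc are the standard Octave calls):
# cost function returning both J(Θ) and its gradient, vectorized
function [jVal, gradient] = logRegCostFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
  gradient = (1/m) * X' * (h - y);
end

options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, jValMin] = fminunc(@(t) logRegCostFunction(t, X, y), initialTheta, options);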
- Email foldering/tagging: Work, Friends, Family, Hobby (y = [1,2,3,4])
- Medical diagnosis: Not ill, Cold, Flu (y = [1,2,3])
- Weather: Sunny, Cloudy, Rain, Snow (y = [1,2,3,4])
The one-vs-all (one-vs-rest) approach uses logistic regression to create n classifiers, each of which learns to separate one class from all the others (turning each into a regular binary logistic regression problem).
For each class i, train a logistic regression classifier hi(x) to predict the probability that y=i. On a new input x, pick the class i that maximizes hi(x), because that is the most confident prediction.
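A minimal one-vs-all sketch in Octave (numClasses, allTheta and the reuse of logRegCostFunction from the sketch above are assumptions of this example):
# train one classifier per class: column i of allTheta separates class i from the rest
numClasses = 4;
n = size(X, 2);
allTheta = zeros(n, numClasses);
options = optimset('GradObj', 'on', 'MaxIter', 400);
for i = 1:numClasses
  yi = (y == i);                                   # binary labels: class i vs the rest
  allTheta(:, i) = fminunc(@(t) logRegCostFunction(t, X, yi), zeros(n, 1), options);
end
# predict: pick the class whose classifier gives the highest probability
[~, predictions] = max(sigmoid(X * allTheta), [], 2);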