Assume that we have one-dimensional data and y = 1.2*(x-2)^2 + 3.2. Because the functional form of y is known in closed form a priori, it is straightforward to obtain the first derivative, dy/dx = 2.4*(x-2), which vanishes at the minimizer x = 2, and to perform gradient descent numerically. Here is the R code for the above.
#create some values
xs <- seq(0, 4, len = 20)
#define the function we want to optimize
f <- function(x)
{
  1.2 * (x - 2)^2 + 3.2
}
#Plot the function
plot(xs, f(xs), type = "l", xlab = "x", ylab = expression(1.2 * (x - 2)^2 + 3.2))
#Calculate the gradient of the function
grad <- function(x)
{
  1.2 * 2 * (x - 2)
}
#Closed form solution
lines(c(2, 2), c(3, 8), col = "red", lty = 2)
text(2.1, 7, "Closed Form Solution", col = "red", pos = 4)
#Implementation of gradient descent
#Initialize the first guess
x <- 0.1
#store initial x values
xtrace <- x
#store the y-values
ftrace <- f(x)
#The learning rate alpha
alp <- 0.6
for (step in 1:100)
{
  #gradient descent step
  x <- x - alp * grad(x)
  xtrace <- c(xtrace, x)
  ftrace <- c(ftrace, f(x))
}
lines(xtrace, ftrace, type = "b", col = "blue")
text(0.5, 6, "Gradient Descent", col = "blue", pos = 4)
#print final value of x
print(x)
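As a quick sanity check on the hand-derived gradient, here is a small sketch (my own addition, not part of the original recipe) that compares grad() against a central-difference approximation; the step size eps and the test points are arbitrary choices:

#compare the analytic gradient with a central-difference approximation
eps <- 1e-6
num.grad <- function(x) (f(x + eps) - f(x - eps)) / (2 * eps)
test.pts <- c(0.1, 1, 2, 3.5)
#these differences should be close to zero if grad() is correct
print(grad(test.pts) - num.grad(test.pts))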
Now suppose we still have one-dimensional data, but the functional form of y is unknown. How would we do gradient descent in that case? Below is the R code to do just that.
#Fix the random seed so the run is reproducible
set.seed(1)
#Generate one dimensional data
X <- sample(x = 1:100, size = 20)
#Simulate labels
Y <- sample(x = c(-1, 1), size = 20, replace = TRUE)
#Here's how my data looks
data <- as.matrix(cbind(X, Y))
#I am going to use squared loss i.e. Loss = (y - h(x))^2 where h(x) = WX
#You could also do a slight variant where you have
#a bias term in the hypothesis i.e. h(x) = WX + b
#Estimate the gradient as follows
grad <- function(x, y, w)
{
  #per-example gradient of (y - xw)^2 with respect to w
  gradientPt <- -2 * x * (y - x * w)
  #sum over all examples
  gradient <- sum(gradientPt)
  return(gradient)
}
#Perform the descent
grad.descent <- function(X, Y, W, maxitr)
{
  #Set the number of examples
  nEG <- nrow(as.matrix(X))
  #Set the alpha parameter; it must be small relative to the scale of X
  #(here X runs up to 100), otherwise the updates diverge
  alpha <- 1e-4
  #Run the fixed number of iterations
  for (i in 1:maxitr)
  {
    W <- W - (alpha * grad(X, Y, W)) / nEG
    print(W)
  }
  return(W)
}
#Define the hypothesis
h <- function(X, W)
{
  return(X * W)
}
#Find the loss
Loss <- matrix(data = 0, nrow = 20, ncol = 1)
W <- grad.descent(X, Y, 1, 10)
for (j in 1:20)
{
  Loss[j] <- (Y[j] - h(X[j], W))^2
}
#Find the mean of the Squared Loss
mean(Loss)
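Because the hypothesis h(x) = W*x has a single parameter, this squared loss also has a closed-form minimizer, W* = sum(X*Y)/sum(X^2). Here is a short sketch (my addition) comparing it against the gradient-descent estimate obtained above:

#closed-form minimizer of sum((Y - W*X)^2) for the no-bias hypothesis
W.closed <- sum(X * Y) / sum(X^2)
#compare with the gradient descent estimate obtained above
print(c(gradient.descent = W, closed.form = W.closed))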
Now here are some variants to experiment with:
a. Replace the squared loss with a differentiable loss function of your choice and observe the impact on your favorite data set.
b. How do the parameters of the algorithm (alpha, the number of iterations, the choice of starting point) affect its convergence time?
c. How would you modify this code to implement the stochastic gradient descent algorithm? (One possible starting point is sketched after this list.)
d. Suppose we added a bias term to your hypothesis and asked you to repeat the experiments. What do you observe -- is the bias term beneficial?
e. Suppose we changed the hypothesis to be nonlinear, e.g. h(x) = w^2*x + w*x + b. Is the solution you find any better?
f. How would you modify the code above to implement Newton's method? (Clue: you need a Taylor expansion to represent the function with higher-order terms.)
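For variant (c), here is a minimal sketch of how stochastic gradient descent might look with this setup, updating W on a single randomly chosen example per step; the step size and iteration count below are my own illustrative choices, not values from the original code:

#Stochastic gradient descent: update W using one randomly chosen example per step
#(the step size and iteration count here are illustrative choices)
sgd <- function(X, Y, W, maxitr, alpha = 1e-5)
{
  n <- length(X)
  for (i in 1:maxitr)
  {
    #pick one example at random
    j <- sample(n, 1)
    #per-example gradient of (Y[j] - X[j]*W)^2 with respect to W
    W <- W - alpha * (-2 * X[j] * (Y[j] - X[j] * W))
  }
  return(W)
}
print(sgd(X, Y, W = 1, maxitr = 1000))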
If you can report the above results on your favorite data set, I'd like to hear from you!
Also, feel free to ask questions in the comments if you are wondering about something.