Wednesday, March 27, 2019

R code for Gradient Descent

Assume that we have one-dimensional data and that y = 1.2*(x-2)^2 + 3.2. Since the functional form of y is known a priori, the minimum has a closed-form solution (x = 2), and it is straightforward to compute the first derivative and perform gradient descent. Here is the R code for the above.

#create some values
xs <- seq(0,4,len=20)
#define the function we want to optimize
f <- function(x)
{
  1.2 * (x - 2)^2 + 3.2
}
#Plot the function
plot(xs, f(xs), type="l", xlab="x",ylab=expression(1.2 * (x-2)^2 + 3.2))
#Calculate the gradient of the function
grad <- function(x)
{
  1.2 * 2 * (x - 2)
}
#Closed form solution
lines(c(2,2), c(3,8), col="red",lty=2)
text(2.1, 7, "Closed Form Solution", col="red", pos=4)
#Implementation of gradient descent
#Initialize the first guess
x<-0.1
#store initial x values
xtrace<-x
#store the y-values
ftrace<-f(x)
# The learning rate alpha
alp <- 0.6
for (step in 1:100)
{
  #gradient descent step
  x <- x - alp * grad(x)
  xtrace <- c(xtrace, x)
  ftrace <- c(ftrace, f(x))
}
lines(xtrace, ftrace, type = "b", col="blue")
text(0.5, 6, "Gradient Descent", col = "blue", pos=4)
#print final value of x
print(x)
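As a quick sanity check (not part of the original gist), base R's optimize() can minimize f numerically over the same interval; its result should agree with both the closed-form minimizer x = 2 and the value printed by the gradient descent loop above.

#Sanity check: minimize f over [0, 4] with base R's optimize()
opt <- optimize(f, interval = c(0, 4))
print(opt$minimum)    # should be close to 2
print(opt$objective)  # should be close to 3.2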


Now suppose we still have one-dimensional data, but the functional form of y is unknown. How would we do gradient descent in that case? Below is the R code to do just that.

#Generate one dimensional data
X<-sample(x=1:100,size=20)
#Simulate labels
Y<-sample(x=c(-1,1),size=20,replace=TRUE)
#Here's how my data looks
data<-as.matrix(cbind(X,Y))
#I am going to use squared loss i.e. Loss=(y - h(x))^2 where h(x)=WX
#You could also do a slight variant where you have
#a bias term in the hypothesis i.e. h(x)=WX+b
#Estimate the gradient as follows
grad <- function(x, y, w)
{
  #per-example gradient of the squared loss, summed over all examples
  gradientPt <- -2 * x * (y - x * w)
  gradient <- sum(gradientPt)
  return(gradient)
}
#Perform the descent
grad.descent <- function(X, Y, W, maxitr)
{
  #Set the number of examples
  nEG <- nrow(as.matrix(X))
  #Set the alpha parameter (kept small because X ranges up to 100;
  #a large value such as 0.5 makes the updates diverge)
  alpha <- 1e-4
  #Run the descent for maxitr iterations
  for (i in 1:maxitr)
  {
    W <- W - (alpha * grad(X, Y, W)) / nEG
    print(W)
  }
  return(W)
}
#Define the hypothesis
h <- function(X, W)
{
  return(X * W)
}
#Find the loss
Loss<-matrix(data=0,nrow=20,ncol=1)
W<-grad.descent(X,Y,1,10)
for (j in 1:20)
{
  Loss[j] <- (Y[j] - h(X[j], W))^2
}
# Find the mean of the Squared Loss
mean(Loss)
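For the no-bias hypothesis h(x) = WX with squared loss, the minimizer also has a simple closed form, W* = sum(X*Y) / sum(X^2); comparing it with the W returned by grad.descent() is a useful check on the descent. The short sketch below reuses the X, Y, and h objects defined above.

#Closed-form least-squares solution for h(x) = W*X (no bias term)
W.star <- sum(X * Y) / sum(X^2)
print(W.star)
#Mean squared loss at the closed-form solution, for comparison with mean(Loss)
mean((Y - h(X, W.star))^2)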


Now here are some variants of these to experiment with:
a. Replace the squared loss with a differentiable loss function of your choice and observe the impact on your favorite data set.
b. How do the parameters of the algorithm (alpha, # of iterations, choice of starting point) affect the convergence time of the algorithm?
c. How will you modify this code to implement the stochastic gradient descent algorithm? (A minimal sketch is given after this list.)
d. Suppose we added a bias term to your hypothesis and asked you to repeat the experiments. What do you observe -- is the bias term beneficial?
e. Suppose we changed the hypothesis to be nonlinear, e.g. h(x) = w^2*x + w*x + b. Is the solution you find any better?
f. How will you modify the code above to implement Newton's algorithm? (Clue: you need to use a Taylor expansion to represent the function with higher-order terms.)
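For variant (c), here is a minimal sketch of a stochastic gradient descent version. It assumes the same squared loss and hypothesis h(x) = WX as above; the function name, learning rate, and number of epochs are illustrative choices that may need tuning for your data.

#Stochastic gradient descent sketch: update W one example at a time
#(the learning rate and number of epochs are illustrative, not tuned)
sgd.descent <- function(X, Y, W, epochs, alpha = 1e-5)
{
  n <- length(X)
  for (e in 1:epochs)
  {
    #visit the examples in a random order each epoch
    for (i in sample(1:n))
    {
      #gradient of the i-th example's squared loss
      g <- -2 * X[i] * (Y[i] - X[i] * W)
      W <- W - alpha * g
    }
  }
  return(W)
}
W.sgd <- sgd.descent(X, Y, W = 1, epochs = 10)
print(W.sgd)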

If you can report the above results on your favorite data set, I'd like to hear from you!

Also, feel free to ask questions in the comments if you are wondering about something.

Friday, March 15, 2019

Contour Plots of Loss Functions in R

In machine learning, loss functions are used to estimate how well learning algorithms perform. A loss function is often written as loss = L(y, y_hat), where y is the true label and y_hat is the predicted value. Commonly used loss functions include the Squared, Absolute (or Laplace), Huber, Hinge, and Logistic losses, among others.
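For concreteness, here is a minimal sketch of a few of these losses as R functions for a single (y, y_hat) pair; the parameterizations shown (for example, delta = 1 for the Huber loss) are one common convention rather than the only one.

#Common loss functions for a single (y, y_hat) pair; conventions vary
squared.loss <- function(y, y_hat) (y - y_hat)^2
absolute.loss <- function(y, y_hat) abs(y - y_hat)  # a.k.a. Laplace loss
huber.loss <- function(y, y_hat, delta = 1)
{
  #quadratic near zero, linear in the tails
  r <- abs(y - y_hat)
  ifelse(r <= delta, 0.5 * r^2, delta * (r - 0.5 * delta))
}
hinge.loss <- function(y, y_hat) pmax(0, 1 - y * y_hat)      # for labels y in {-1, +1}
logistic.loss <- function(y, y_hat) log(1 + exp(-y * y_hat)) # for labels y in {-1, +1}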

Using the Iris data set from UC Irvine, I demonstrate how contour plots of loss functions can be obtained using R.

Background Information: After being downloaded, the Iris data was pre-processed in the following manner: only two classes (100 examples) were selected so that the problem remained one of binary classification, and two attributes were selected to enable visualization via contour plots. The glmnet package was used to build a lasso model as shown below:

# Demonstration of the glmnet package
library(glmnet)
#load the locally saved copy of the Iris data (adjust the path for your machine)
A <- read.csv('/Users/Haimonti/Courses/MGS618/MLforITManagers2019/Labs/Lab2/iris.csv')
x<-as.matrix(A[,1:2])
y<-as.matrix(A[,3])
fit.lasso1<-glmnet(x,y, family="gaussian", alpha=1)
predict.lasso1 <- predict(fit.lasso1, newx = x, s = min(fit.lasso1$lambda))
loss1<-abs(y-predict.lasso1)
print(loss1[1:10])
#data.loess<-loess(y~x[,1]*x[,2], data=A)
xgrid <- seq(min(x[,1]), max(x[,1]), 0.2)
ygrid <- seq(min(x[,2]), max(x[,2]), 0.2)
newMat<-expand.grid(xgrid,ygrid)
newPred<-predict(fit.lasso1,newx=as.matrix(newMat),s=min(fit.lasso1$lambda))
#Assign synthetic "true" labels to the grid points: +1 for the first block
#and -1 for the rest (nrow(newMat) = length(xgrid) * length(ygrid))
truePred <- matrix(1, nrow(newMat), 1)
truePred[80:nrow(newMat), 1] <- -1
#truePred<-sample(x=c(1,-1),117,replace=TRUE)
#Absolute loss between the synthetic labels and the lasso predictions on the grid
loss2 <- abs(truePred - newPred)
p <- nrow(as.matrix(xgrid))
q <- nrow(as.matrix(ygrid))
#expand.grid() varies xgrid fastest, so a column-wise fill of the p x q matrix
#lines up each loss value with its (xgrid, ygrid) point
data.fit <- matrix(loss2, nrow = p, ncol = q)
print(data.fit)
#zNew<-matrix(predict(fit.lasso1,newx=as.matrix(data.fit)),nrow(xgrid),nrow(ygrid))
#print(nrow(as.matrix(xgrid)))
#print(nrow(as.matrix(ygrid)))
#print(nrow(zNew))
#contour(xgrid,ygrid,data.fit,xlab="Sepal",ylab="Petal")
filled.contour(xgrid,ygrid,data.fit,plot.title = title(main = "Absolute Loss"),xlab="Sepal",ylab="Petal", color =heat.colors)


The filled contour plot generated from this is shown below:
Add in your favorite loss function -- Huber or Hinge -- and see some nice contour plots with the Iris data! A sketch for the Huber loss is given below as a starting point.
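Here is a sketch of how the Huber loss (with an illustrative delta = 1) could be swapped in for the absolute loss in the code above; it reuses the xgrid, ygrid, newPred, and truePred objects already computed there.

#Huber loss (delta = 1 is illustrative) on the same prediction grid
delta <- 1
r <- abs(truePred - newPred)
lossHuber <- ifelse(r <= delta, 0.5 * r^2, delta * (r - 0.5 * delta))
#expand.grid() varies xgrid fastest, so a column-wise fill lines up with the grid
data.fit.huber <- matrix(lossHuber, nrow = length(xgrid), ncol = length(ygrid))
filled.contour(xgrid, ygrid, data.fit.huber, plot.title = title(main = "Huber Loss"), xlab = "Sepal", ylab = "Petal", color.palette = heat.colors)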