Assume that we have one-dimensional data and y = 1.2*(x - 2)^2 + 3.2. The functional form of y is known a priori in closed form, so it is straightforward to obtain the first derivative and perform gradient descent. Here is the R code for the above.
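A minimal sketch of such an implementation is given below; the learning rate alpha, the starting point, and the number of iterations are illustrative choices rather than tuned values.

# y = 1.2 * (x - 2)^2 + 3.2, so dy/dx = 2.4 * (x - 2)
f  <- function(x) 1.2 * (x - 2)^2 + 3.2
df <- function(x) 2.4 * (x - 2)

alpha <- 0.1   # learning rate (illustrative)
x     <- 0.1   # starting point (illustrative)
iters <- 100   # number of iterations (illustrative)

for (i in 1:iters) {
  x <- x - alpha * df(x)   # step against the gradient
}

x      # converges towards 2, the minimizer
f(x)   # converges towards 3.2, the minimum value

Because the derivative is available in closed form, each update costs only a single evaluation of df.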
Now suppose we still have one-dimensional data but the functional form of y is unknown. How would we do gradient descent in that case? Below is the R code to do just that.
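One plausible setup, consistent with the exercises below, is to posit a hypothesis h(x) = w*x and minimize the squared loss over observed (x, y) pairs by gradient descent on w. In the sketch below, the data, alpha, the starting value of w, and the iteration count are all illustrative.

# Observed data (illustrative); the true relationship is not used by the algorithm
set.seed(1)
x <- runif(50, 0, 5)
y <- 3 * x + rnorm(50, sd = 0.5)

# Hypothesis h(x) = w * x; squared loss L(w) = mean((w * x - y)^2)
loss <- function(w) mean((w * x - y)^2)
grad <- function(w) mean(2 * (w * x - y) * x)   # dL/dw

alpha <- 0.01   # learning rate (illustrative)
w     <- 0      # starting point (illustrative)
iters <- 200    # number of iterations (illustrative)

for (i in 1:iters) {
  w <- w - alpha * grad(w)   # step against the gradient of the loss
}

w        # estimated slope
loss(w)  # final squared loss

Note that the gradient is taken with respect to the parameter w of the hypothesis, not with respect to x; the only information about y comes from the observed pairs.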
Now here are some variants to experiment with:
a. Replace the squared loss with a differentiable loss function of your choice and observe the impact on your favorite data set.
b. How do the parameters of the algorithm (alpha, the number of iterations, the choice of starting point) affect its convergence time?
c. How will you modify this code to implement the stochastic gradient descent algorithm? (One possible starting point is sketched after this list.)
d. Suppose we add a bias term to the hypothesis and repeat the experiments. What do you observe -- is the bias term beneficial?
e. Suppose we change the hypothesis to be nonlinear, e.g. h(x) = w^2 x + wx + b. Is the solution you find any better?
f. How will you modify the code above to implement Newton's method? (Clue: you need to use a Taylor expansion to represent the function with higher-order terms.)
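For (c), one possible sketch reuses x, y, and alpha from the code above and updates w with a single randomly chosen observation per step; the iteration count is again illustrative.

# Stochastic gradient descent: one randomly chosen observation per update
w     <- 0       # starting point (illustrative)
alpha <- 0.01    # learning rate (illustrative)
iters <- 2000    # number of updates (illustrative)

for (i in 1:iters) {
  j <- sample(length(x), 1)                      # pick one data point at random
  w <- w - alpha * 2 * (w * x[j] - y[j]) * x[j]  # gradient of the single-point squared loss
}

w   # compare with the batch gradient descent estimate

Each update is much cheaper than a full pass over the data, at the cost of noisier steps.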
If you can report the above results on your favorite data set, I'd like to hear from you!
Also, feel free to ask questions in the comments if you are wondering about something.