Monday, July 29, 2019

Stanford's Machine Learning by Andrew Ng - Locally Weighted Regression



In addition to the size of the data set influencing how well the algorithm performs, there's also the number of parameters. One can underfit the data. The data might not follow a straight line while one parameter gives a
linear relation and so the fit isn't very good. If there are as many parameters as data points, the fit will go straight through every data point but might be way off outside of these. Which results in overfitting.


Instead of using to many parameters ,one might try to weight data points near values that you want to predict heavier then data points far away. Using a weight factor w so that data points that are likely to be more relevant to the prediction are counter heavier

$$ θ_j := θ_j - \alpha(\sum_{i=0}^{m}w_i(θ^Tx-y_i))x_j$$

$$ w_i = exp\left(\frac{(x_i-x)^2}{2T^2}\right) $$

Combining this with the batch stochastic gradiënt formula results in the following algorithm:

def locallyWeightedRegression(trainingset, learningRate, mean, iteration = 1000, width=1):     weights = len(trainingset.transpose())-1     theta = np.zeros(weights)     for i in range(iteration):         paras = theta         nextTheta = np.zeros(len(paras))         for index, parameter in enumerate(paras):             xi, yi = trainingset.transpose()[0:3], trainingset.transpose()[3]             x = np.full((len(trainingset),3), mean)             w = xi[index]-x.transpose()[index]             weightFactor = -(w**2)/(2*width**2)             #print('weight factor', weightFactor, type(weightFactor))             costForThisVector = learningRate*np.exp(weightFactor.astype(float))*(theta.dot(xi)-yi)*xi[index]             #print('cost vector', costForThisVector)             nextTheta[index] = parameter - np.sum(costForThisVector)/len(trainingset)         theta = np.array(nextTheta)     return theta
Until now, I calculated the root mean square error of every fourth data example and used the other three fourths as the training examples. This would not work very well here. Either I need to do a new regression for every test, which is computationally expensive, or I create a massive error trying to use the algorithm only once and fitting it to test points it wasn't meant to fit too.

Thursday, July 25, 2019

Stanford's Machine Learning by Andrew Ng - Stochastic Gradient Descent




Batch Gradient Descent is very slow for very large datasets. Luckily an alternative exist: Stochastic gradient descent or incremental gradient descent. Rather then going over the entire dataset, the algorithm just looks at one training example at a time and then takes a small step towards minimization. The mathematical formula is the same, but the loop is different.

$$ θ_j := θ_j - \alpha(\sum_{i=0}^{m}θ^Tx-y_i)x_j$$


With θ the different weights, x the different training examples variables and y the different training examples targets. I tried to put this into a python algorithm as well.


import Numpy as np import math def StochasticGradientDescent (trainingset, learningRate): weights = len(trainingset.transpose()) - 1 theta = np.zeros(weights) for trainingExample in trainingset: nextTheta = [0 for number in theta] for index, parameter in enumerate(theta): xi, yi = trainingExample[0:weights], trainingExample[weights] nextTheta[index] = parameter - learningRate*(theta.dot(xi)-yi)*xi[index] theta = np.array(nextTheta) return theta def RMS_Error (controlData, theta): parameters = max(theta.shape) print(parameters, theta, controlData) rms_error= (theta.dot(controlData.transpose()[0:parameters])-controlData.transpose()[parameters])**2 n = len(rms_error) error = math.sqrt(np.sum(rms_error)/n) return error
For my not too big dataset, stochastic gradient's root mean square error wasn't to bad compared to batch gradiënt descent. For cubic it actually got a lower one, weirdly. But it was higher for the squared and linear.

RMS_error for linear corruption perception = 24247.5 ; theta = (-2.7, 70.6, 45), Learning Rate : 1/1e4
RMS_error for quadratic corruption perception = 15570 ; theta = (-2.52e-4, 4.66, 9.449e-2), Learning Rate: 1/1e8
RMS_error for cubic corruption perception = 9895 ; theta = (-2.979e-5, 0.113, -8.34*e-4), Learning Rate: 1/1e11


He also talks about a non-iterative algorithm to optimize the theta's, Namely the Normal equations, I tried to implement it using Numpy, but my RMS error is off the chart compared to gradiënt descent, So I think there's a mistake somewhere.

$$ θ = (X^TX)^{-1}X^T\overrightarrow{y}$$

def normalEquation(trainingset): weights = len(trainingset.transpose()) - 1 x = trainingset.transpose()[:weights].transpose() y = trainingset.transpose()[weights].transpose() XTX = np.linalg.inv(np.matrix(np.dot(x.transpose(),x),dtype='float')) XTY = np.dot(x.transpose(),y) theta = XTY.dot(XTX).getA() return theta

Monday, July 22, 2019

Stanford's Machine Learning by Andrew Ng - Batch Gradient Descent




Lecture 1 doesn't contain much meat, so I am going to ignore that one.

In week 2, Batch Gradient Descent is the first algorithm introduced.

After watching the second lecture of Andrew Ng's Machine Learning , I am going to try a probably bad attempt at linear regression. What I am going to try is predict the GDP per capita of a country on the basis two parameters, the amount of corruption in the country and its oil exports.

I used the corruption perception index of transparency International (2018 numbers). On the least corrupt side (highest score), you see pretty much every rich country, while on the very corrupt side (lowest score), you see very poor countries like Afghanistan, Somalia and oil exporting countries like Iraq and Kazakhstan. I think there will be a correlation in there that every corrupt country is poor, unless it exports oil.

I took the GDP per capita from the worldbank (2018) and oil as percentage of total merchandise export (2018).

So, each training example consist of an x value that includes the corruption perception and export export, as well as an y value that includes the GDP per capita.
$$Training Example = (x_i , y_i )$$
The first you need to do is create a hypothesis of how much each parameter weights as well as a baseline. These weight parameters are θ .

$$ h_θ = θ_0 + θ_1 x_1 + θ_2 x_2 = \sum_{i=0}^{n} θ_i x_i = θ^Tx$$

The best fit would be if the avarage distance between the hypothesis and the actual value is minimized. This is done using the Cost Function that calculates how far off each hypothesis is for every training example m.

$$ J(θ) = \frac{1}{2} \sum_{i=0}^{m} (h_θ(x_i)-y_i)^2$$

The very first algorithm is Batch Gradient Descent. The cost function J(θ) is lowered every iteration using every training example. Using the derivatvie of j(θ), at every iteration a step towards a minima is token. The size of this step is depended on how far of the initial function was and the learning rate α.

$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}J(θ)$$
$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}\frac{1}{2} (\sum_{i=0}^{m}h_θ(x_i)-y_i)^2$$
$$ θ_j := θ_j - \alpha\frac{2}{2}(\sum_{i=0}^{m}h_θ(x_i)-y_i)\frac{∂}{∂θ_j}\sum_{i=0}^{m} (h_θ(x_i)-y_i)$$
$$ θ_j := θ_j - \alpha(\sum_{i=0}^{m} h_θ(x_i)-y_i)x_j = θ_j - \alpha(\sum_{i=0}^{m}θ^Tx-y_i)x_j$$

So, that's the mathematical expression of the batch gradiënt descent algorithm. I got CSV-files online from the sources stated above and lumped the data together in a dict as {country : [x1,x2,y]} and saved it as a pickle file. So first thing to do when using the data is loading the pickle file and getting the data into arays using Numpy and Pandas.

import numpy as np import pandas as pd import pickle import math # country data from pickle data = pickle.load(open("countries.p", 'rb')) # get the training sets : first column is x1, corruption percetion, # second column is x2, oil exports as percentage of total exports # third column is y, the gdp per Capita dataArray = pd.DataFrame.from_dict(data).transpose() # make training sets, x0 is 1, x1 , x2 and y x1 = pd.to_numeric(dataArray[0], errors='coerce').fillna(0).astype(np.int64)**1 x0,x2,y = np.ones(len(dataArray[0])), dataArray[1], dataArray[2] # prepare the data and the checkData TrainingData, checkData= np.array([x0,x1,x2,y]).transpose(), TrainingData[::4]

Then I tried the following function to get the parameters that minimize the cost function:

# prepare the trainingset mask = np.ones(len(TrainingData), dtype='bool') mask[range(0,len(mask),4)] = False # batch gradient descent def BatchGradientDescent (trainingset, learningRate, iteration): theta = np.array([0.0,0.0,0.0]) for i in range(iteration): paras = theta nextTheta = [0,0,0] for index, parameter in enumerate(paras): costFunction = 0 for trainingSet in trainingset: # split the trainingdata into one set xi, yi = trainingSet[0:3], trainingSet[3] # take the dot product and the distance from yi costForThis = learningRate*(theta.dot(xi) - yi)*xi[index] costFunction += costForThis nextTheta[index] = parameter - costFunction/len(trainingset) theta = np.array(nextTheta) return theta theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 30000)
This didn't work, as instead of minimizing the cost, all parameters went to infinity and the program errored out. It turns out that I made a mistake while assigning the "costForThis" variable. I accidently changed the algorithm too:

$$ θ_j := \alpha(\sum_{i=0}^{m}θ_j - θ^Tx-y_i)x_j$$


Notice the place of the θ, that Didn't work. Anyway, after this correction it did work. The next small fix is that I use both numpy and a for loop to loop over its data, which is less then ideal. So I tried to replace the for loop with a matrix operation.
# prepare the trainingset mask = np.ones(len(TrainingData), dtype='bool') mask[range(0,len(mask),4)] = False # batch gradient descent def BatchGradientDescent (trainingset, learningRate, iteration): theta = np.array([0.0,0.0,0.0]) for i in range(iteration): paras = theta nextTheta = [0,0,0] for index, parameter in enumerate(paras): xi = trainingset.transpose()[0:3] yi = trainingset.transpose()[3] costForThisVector = learningRate*(theta.dot(xi)-yi)*xi[index] nextTheta[index] = parameter - np.sum(costForThisVector)/len(trainingset) theta = np.array(nextTheta) return theta theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 500)

Lastly, I wanted to know if the fit for this data would be more accurate if i used the square of the corruption perception rather than the perception itself. So I used x1² this time and tried to run the program again, this time it the θ's went infinite again and eventually turned into Nan. As I couldn't find whats wrong this time, I went to stackoverflow for some aid, where they said that the learning rate is often a problem if it iterates to infinity, the step is too large and jumps far across the local minima each iteration. After using 1/1e7 as the learning rate, the error for the squared values were slightly lower then when the data is used lineraly. For corruption perception cubed and a learning rate of 1/1e11, the error is even smaller.

I check the roo mean square error using the following function:
rms_error = 0 for check in checkData: print("theta:", theta,"\ncheck", check) x_term = theta.dot(check[0:3]) y_term = check[3] error = (x_term - y_term)**2 rms_error += error print([x_term, y_term], error) length = checkData.shape[0] print(math.sqrt(rms_error/length), theta)

Yes, They could probably be done without the loop using Numpy a little bit better. Here is the result for linear, quadratic and cubic versions of the corruption perception:
RMS_error for linear corruption perception = 13.845 (-8755, 561, 81) Learing Rate: 1/1e4 , 30k iterations
RMS_error for quadratic corruption perception = 11.179 (-0.217, 7.34, 4.81) Learning Rate : 1/1e7, 10k iterations
RMS_error for cubic corruption perception = 10.223 (4.35e-6, 1.045e-1, 8.266e-4) Learning Rate: 1/1e11, 1k iterations


For the linear corruption perception, the program predicts a negative GDP per capita for Somalia, Thats not good.