Monday, July 22, 2019

Stanford's Machine Learning by Andrew Ng - Batch Gradient Descent




Lecture 1 doesn't contain much meat, so I am going to ignore that one.

In week 2, Batch Gradient Descent is the first algorithm introduced.

After watching the second lecture of Andrew Ng's Machine Learning, I am going to make a (probably bad) attempt at linear regression. The goal is to predict the GDP per capita of a country on the basis of two parameters: the amount of corruption in the country and its oil exports.

I used the Corruption Perceptions Index of Transparency International (2018 numbers). On the least corrupt side (highest score) you see pretty much every rich country, while on the most corrupt side (lowest score) you see very poor countries like Afghanistan and Somalia, as well as oil-exporting countries like Iraq and Kazakhstan. My expectation is that there is a correlation in there: corrupt countries are poor, unless they export oil.

I took the GDP per capita from the World Bank (2018) and oil exports as a percentage of total merchandise exports (2018).

So, each training example consists of an x value that contains the corruption perception score and the oil exports, and a y value that contains the GDP per capita.
$$\text{Training example } i = (x_i , y_i)$$
The first thing you need to do is create a hypothesis of how much weight each parameter gets, as well as a baseline. These weights are the parameters θ; the baseline θ_0 gets a constant input x_0 = 1, so everything fits in one vector product.

$$ h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 = \sum_{j=0}^{n} θ_j x_j = θ^Tx$$
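
In NumPy that hypothesis is just a dot product. A minimal sketch with made-up numbers (the θ values and the input below are hypothetical, only to show the shape of the computation):

import numpy as np

# hypothetical weights: baseline, corruption perception, oil exports
theta = np.array([1000.0, 250.0, 50.0])

# one made-up input with x0 = 1, a corruption score of 40 and 10% oil exports
x = np.array([1.0, 40.0, 10.0])

# h_theta(x) = theta^T x
print(theta.dot(x))   # 1000 + 250*40 + 50*10 = 11500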

The best fit is the one where the average distance between the hypothesis and the actual value is minimized. This is measured by the cost function, which adds up how far off the hypothesis is over all m training examples:

$$ J(θ) = \frac{1}{2} \sum_{i=1}^{m} \left(h_θ(x_i)-y_i\right)^2$$
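
As a quick sketch of the same thing in NumPy (the cost function, X and y here are my own stand-ins, with one row per training example and x0 = 1):

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/2 * sum over all m examples of (h_theta(x_i) - y_i)^2
    errors = X.dot(theta) - y
    return 0.5 * np.sum(errors**2)

# two made-up training examples (x0, x1, x2) and their GDP per capita
X = np.array([[1.0, 40.0, 10.0],
              [1.0, 80.0,  0.0]])
y = np.array([10000.0, 45000.0])
print(cost(np.zeros(3), X, y))   # cost of the all-zero hypothesis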

The very first algorithm is Batch Gradient Descent. The cost function J(θ) is lowered every iteration using every training example. Using the derivative of J(θ), at every iteration a step is taken towards a minimum. The size of this step depends on how far off the current hypothesis is and on the learning rate α.

$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}J(θ)$$
$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}\frac{1}{2}\sum_{i=1}^{m}\left(h_θ(x_i)-y_i\right)^2$$
$$ θ_j := θ_j - \alpha\frac{2}{2}\sum_{i=1}^{m}\left(h_θ(x_i)-y_i\right)\frac{∂}{∂θ_j}\left(h_θ(x_i)-y_i\right)$$
$$ θ_j := θ_j - \alpha\sum_{i=1}^{m}\left(h_θ(x_i)-y_i\right)x_{i,j} = θ_j - \alpha\sum_{i=1}^{m}\left(θ^Tx_i-y_i\right)x_{i,j}$$

In the last step the derivative of h_θ(x_i) with respect to θ_j is simply x_{i,j}, the j-th input of training example i.
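
To make one step concrete with made-up numbers: take a single training example x = (1, 40, 10) with y = 10000, start all θ's at 0, and use α = 0.0001. The update for θ_1 is then

$$ θ_1 := 0 - 0.0001 \cdot (0 - 10000) \cdot 40 = 40,$$

so θ_1 gets nudged upwards, towards a hypothesis that predicts a higher GDP for a higher (less corrupt) score.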

So, that's the mathematical expression of the batch gradient descent algorithm. I got CSV files online from the sources stated above, lumped the data together in a dict as {country : [x1, x2, y]} and saved it as a pickle file. So the first thing to do when using the data is loading the pickle file and getting the data into arrays using NumPy and Pandas.

import numpy as np
import pandas as pd
import pickle
import math

# country data from pickle
data = pickle.load(open("countries.p", 'rb'))

# get the training sets: first column is x1, corruption perception,
# second column is x2, oil exports as percentage of total exports,
# third column is y, the GDP per capita
dataArray = pd.DataFrame.from_dict(data).transpose()

# make training sets, x0 is 1, then x1, x2 and y
x1 = pd.to_numeric(dataArray[0], errors='coerce').fillna(0).astype(np.int64)**1
x0, x2, y = np.ones(len(dataArray[0])), dataArray[1], dataArray[2]

# prepare the data and the checkData (every fourth row is kept aside as a check set)
TrainingData = np.array([x0, x1, x2, y]).transpose()
checkData = TrainingData[::4]

Then I tried the following function to get the parameters that minimize the cost function:

# prepare the trainingset
mask = np.ones(len(TrainingData), dtype='bool')
mask[range(0, len(mask), 4)] = False

# batch gradient descent
def BatchGradientDescent(trainingset, learningRate, iteration):
    theta = np.array([0.0, 0.0, 0.0])
    for i in range(iteration):
        paras = theta
        nextTheta = [0, 0, 0]
        for index, parameter in enumerate(paras):
            costFunction = 0
            for trainingSet in trainingset:
                # split the trainingdata into one set
                xi, yi = trainingSet[0:3], trainingSet[3]
                # take the dot product and the distance from yi
                costForThis = learningRate*(theta.dot(xi) - yi)*xi[index]
                costFunction += costForThis
            nextTheta[index] = parameter - costFunction/len(trainingset)
        theta = np.array(nextTheta)
    return theta

theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 30000)
This didn't work: instead of minimizing the cost, all the parameters went to infinity and the program errored out. It turned out that I had made a mistake while assigning the "costForThis" variable, accidentally changing the algorithm to:

$$ θ_j := \alpha\sum_{i=1}^{m}\left(θ_j - θ^Tx_i-y_i\right)x_{i,j}$$


Notice the place of the θ; that didn't work. Anyway, after this correction it did work. The next small fix: I was using both NumPy and a plain for loop to iterate over the data, which is less than ideal. So I tried to replace the for loop over the training examples with a matrix operation.
# prepare the trainingset
mask = np.ones(len(TrainingData), dtype='bool')
mask[range(0, len(mask), 4)] = False

# batch gradient descent, with the loop over training examples replaced by a matrix operation
def BatchGradientDescent(trainingset, learningRate, iteration):
    theta = np.array([0.0, 0.0, 0.0])
    for i in range(iteration):
        paras = theta
        nextTheta = [0, 0, 0]
        for index, parameter in enumerate(paras):
            # all inputs and all targets at once: xi is 3 x m, yi is a vector of length m
            xi = trainingset.transpose()[0:3]
            yi = trainingset.transpose()[3]
            costForThisVector = learningRate*(theta.dot(xi) - yi)*xi[index]
            nextTheta[index] = parameter - np.sum(costForThisVector)/len(trainingset)
        theta = np.array(nextTheta)
    return theta

theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 500)
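
The remaining loop over the three parameters could presumably be folded into a matrix operation as well, since the whole gradient is Xᵀ(Xθ − y). A minimal sketch of that idea (my own variation, assuming TrainingData is a plain numeric NumPy array with the columns x0, x1, x2, y as built above):

# fully vectorized batch gradient descent (sketch)
def BatchGradientDescentMatrix(trainingset, learningRate, iteration):
    X = trainingset[:, 0:3]   # m x 3 matrix of inputs (x0, x1, x2)
    y = trainingset[:, 3]     # vector of GDP per capita values
    theta = np.zeros(3)
    for i in range(iteration):
        # gradient of J(theta) over all m examples: X^T (X theta - y)
        gradient = X.transpose().dot(X.dot(theta) - y)
        theta = theta - learningRate*gradient/len(trainingset)
    return theta

# should land on roughly the same theta as the looped version
theta = BatchGradientDescentMatrix(TrainingData[mask], 1/(1e4), 500)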

Lastly, I wanted to know if the fit would be more accurate if I used the square of the corruption perception rather than the perception itself. So I used x1² this time and ran the program again. The θ's went to infinity again and eventually turned into NaN. As I couldn't find what was wrong this time, I went to Stack Overflow for some aid, where I learned that the learning rate is often the problem when the parameters iterate off to infinity: the step is too large and jumps right across the minimum on every iteration. After using 1/1e7 as the learning rate, the error for the squared values was slightly lower than when the data was used linearly. For the corruption perception cubed and a learning rate of 1/1e11, the error is even smaller.
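
Concretely, the squared version only needs x1 raised to the second power when the training data is built, plus the smaller learning rate; roughly like this (a sketch reusing the variables from the earlier snippets, not the exact code I ran):

# square the corruption perception this time (the cubed version would use **3 and 1/1e11)
x1 = pd.to_numeric(dataArray[0], errors='coerce').fillna(0).astype(np.int64)**2
TrainingData = np.array([x0, x1, x2, y]).transpose()
checkData = TrainingData[::4]

# a much smaller learning rate keeps the steps from overshooting
theta = BatchGradientDescent(TrainingData[mask], 1/(1e7), 10000)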

I check the root mean square error on the held-out check set using the following code:
rms_error = 0
for check in checkData:
    print("theta:", theta, "\ncheck", check)
    x_term = theta.dot(check[0:3])
    y_term = check[3]
    error = (x_term - y_term)**2
    rms_error += error
    print([x_term, y_term], error)

length = checkData.shape[0]
print(math.sqrt(rms_error/length), theta)
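
The same check can likely be written without the explicit loop; a minimal sketch, assuming checkData is a plain numeric NumPy array with the columns x0, x1, x2 and y:

# loop-free root mean square error over the check set (sketch)
predictions = checkData[:, 0:3].dot(theta)   # hypothesis for every check example
actuals = checkData[:, 3]                    # the actual GDP per capita values
print(np.sqrt(np.mean((predictions - actuals)**2)), theta)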

Yes, it could probably be done without the loop, using NumPy a little better (roughly as in the sketch above). Here are the results for the linear, quadratic and cubic versions of the corruption perception:
RMS error for linear corruption perception = 13.845, θ = (-8755, 561, 81), learning rate 1/1e4, 30k iterations
RMS error for quadratic corruption perception = 11.179, θ = (-0.217, 7.34, 4.81), learning rate 1/1e7, 10k iterations
RMS error for cubic corruption perception = 10.223, θ = (4.35e-6, 1.045e-1, 8.266e-4), learning rate 1/1e11, 1k iterations


For the linear corruption perception, the program predicts a negative GDP per capita for Somalia. That's not good.
