Monday, July 29, 2019

Stanford's Machine Learning by Andrew Ng - Locally Weighted Regression



In addition to the size of the data set influencing how well the algorithm performs, there's also the number of parameters. One can underfit the data: if the data doesn't follow a straight line but the hypothesis has only one parameter, the fit is a linear relation and so it isn't very good. If there are as many parameters as data points, the fit will go straight through every data point but might be way off anywhere outside of those points, which is overfitting.


Instead of using too many parameters, one might try to weight data points near the values you want to predict more heavily than data points far away, using a weight factor w so that data points that are likely to be more relevant to the prediction count more.

$$ θ_j := θ_j - \alpha\sum_{i=0}^{m}w_i(θ^Tx_i-y_i)x_{i,j}$$

$$ w_i = \exp\left(-\frac{(x_i-x)^2}{2T^2}\right) $$
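To get a feel for this weight factor: training points close to the query point x get a weight near one, while points a few bandwidths away contribute essentially nothing. A quick made-up 1-D sketch (not the actual dataset):

import numpy as np

# made-up 1-D example: query point x = 50, bandwidth T = 10
xi = np.array([45.0, 50.0, 55.0, 80.0, 100.0])   # training inputs
x, T = 50.0, 10.0
w = np.exp(-(xi - x)**2 / (2 * T**2))
print(w)   # roughly [0.88, 1.0, 0.88, 0.011, 3.7e-6]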

Combining this with the batch gradient descent formula results in the following algorithm:

def locallyWeightedRegression(trainingset, learningRate, mean, iteration=1000, width=1):
    weights = len(trainingset.transpose()) - 1
    theta = np.zeros(weights)
    for i in range(iteration):
        paras = theta
        nextTheta = np.zeros(len(paras))
        for index, parameter in enumerate(paras):
            xi, yi = trainingset.transpose()[0:3], trainingset.transpose()[3]
            x = np.full((len(trainingset), 3), mean)
            # weight every training example by how close it is to the query point
            w = xi[index] - x.transpose()[index]
            weightFactor = -(w**2) / (2 * width**2)
            costForThisVector = learningRate * np.exp(weightFactor.astype(float)) * (theta.dot(xi) - yi) * xi[index]
            nextTheta[index] = parameter - np.sum(costForThisVector) / len(trainingset)
        theta = np.array(nextTheta)
    return theta
Until now, I calculated the root mean square error on every fourth data example and used the other three fourths as the training examples. That would not work very well here: either I need to do a new regression for every test point, which is computationally expensive, or I create a massive error by running the algorithm only once and testing it on points it wasn't meant to fit to.
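If I ever do go the expensive route, the check would look roughly like this: a sketch only, fitting a fresh locally weighted regression around every held-out point, reusing the function above and the checkData/mask setup from my batch gradient descent post (the learning rate here is a guess):

# sketch: fit a new locally weighted regression centred on every check point
rms_error = 0
for check in checkData:
    theta = locallyWeightedRegression(TrainingData[mask], 1/(1e4), check[0:3])
    prediction = theta.dot(check[0:3])
    rms_error += (prediction - check[3])**2
print(math.sqrt(rms_error / len(checkData)))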

Thursday, July 25, 2019

Stanford's Machine Learning by Andrew Ng - Stochastic Gradient Descent




Batch Gradient Descent is very slow for very large datasets. Luckily an alternative exists: stochastic gradient descent, also called incremental gradient descent. Rather than going over the entire dataset for every update, the algorithm just looks at one training example at a time and then takes a small step towards the minimum. The update rule per parameter is the same, but the loop is different: the sum over all training examples disappears and θ is updated after every single example.

$$ θ_j := θ_j - \alpha(θ^Tx_i-y_i)x_{i,j}$$


With θ the weights, x_i the variables of training example i and y_i its target. I tried to put this into a Python algorithm as well.


import numpy as np
import math

def StochasticGradientDescent(trainingset, learningRate):
    weights = len(trainingset.transpose()) - 1
    theta = np.zeros(weights)
    # one small update per training example instead of one big update per pass
    for trainingExample in trainingset:
        nextTheta = [0 for number in theta]
        for index, parameter in enumerate(theta):
            xi, yi = trainingExample[0:weights], trainingExample[weights]
            nextTheta[index] = parameter - learningRate * (theta.dot(xi) - yi) * xi[index]
        theta = np.array(nextTheta)
    return theta

def RMS_Error(controlData, theta):
    parameters = max(theta.shape)
    rms_error = (theta.dot(controlData.transpose()[0:parameters]) - controlData.transpose()[parameters])**2
    n = len(rms_error)
    error = math.sqrt(np.sum(rms_error) / n)
    return error
For my not too big dataset, stochastic gradient descent's root mean square error wasn't too bad compared to batch gradient descent. For the cubic version it actually got a lower one, weirdly, but it was higher for the squared and linear versions.

RMS_error for linear corruption perception = 24247.5 ; theta = (-2.7, 70.6, 45), Learning Rate: 1/1e4
RMS_error for quadratic corruption perception = 15570 ; theta = (-2.52e-4, 4.66, 9.449e-2), Learning Rate: 1/1e8
RMS_error for cubic corruption perception = 9895 ; theta = (-2.979e-5, 0.113, -8.34e-4), Learning Rate: 1/1e11


He also talks about a non-iterative way to optimize the θ's, namely the normal equations. I tried to implement it using NumPy, but my RMS error is off the chart compared to gradient descent, so I think there's a mistake somewhere.

$$ θ = (X^TX)^{-1}X^T\overrightarrow{y}$$

def normalEquation(trainingset):
    weights = len(trainingset.transpose()) - 1
    x = trainingset.transpose()[:weights].transpose()
    y = trainingset.transpose()[weights].transpose()
    XTX = np.linalg.inv(np.matrix(np.dot(x.transpose(), x), dtype='float'))
    XTY = np.dot(x.transpose(), y)
    theta = XTY.dot(XTX).getA()
    return theta
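To rule out my own matrix plumbing as the source of the mistake, the same least-squares solution can also be obtained with np.linalg.lstsq, which avoids the explicit inverse. Just a sketch, assuming the same training-set layout as elsewhere in these posts:

# sketch: let NumPy solve the least-squares problem directly instead of inverting X^T X
def normalEquationLstsq(trainingset):
    weights = len(trainingset.transpose()) - 1
    x = trainingset.transpose()[:weights].transpose().astype(float)
    y = trainingset.transpose()[weights].astype(float)
    theta, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)
    return theta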

Monday, July 22, 2019

Stanford's Machine Learning by Andrew Ng - Batch Gradient Descent




Lecture 1 doesn't contain much meat, so I am going to ignore that one.

In week 2, Batch Gradient Descent is the first algorithm introduced.

After watching the second lecture of Andrew Ng's Machine Learning, I am going to try a probably bad attempt at linear regression. What I am going to try is to predict the GDP per capita of a country on the basis of two parameters: the amount of corruption in the country and its oil exports.

I used the Corruption Perceptions Index of Transparency International (2018 numbers). On the least corrupt side (highest score), you see pretty much every rich country, while on the very corrupt side (lowest score), you see very poor countries like Afghanistan and Somalia, and oil-exporting countries like Iraq and Kazakhstan. So I expect a correlation along the lines of: every corrupt country is poor, unless it exports oil.

I took the GDP per capita from the World Bank (2018) and oil as a percentage of total merchandise exports (2018).

So, each training example consists of an x value that includes the corruption perception and the oil exports, as well as a y value, the GDP per capita.
$$Training Example = (x_i , y_i )$$
The first thing you need to do is create a hypothesis of how much each parameter weighs, plus a baseline. These weight parameters are θ.

$$ h_θ = θ_0 + θ_1 x_1 + θ_2 x_2 = \sum_{i=0}^{n} θ_i x_i = θ^Tx$$

The best fit is the one where the average distance between the hypothesis and the actual value is minimized. This is done using the cost function, which adds up how far off the hypothesis is over all m training examples.

$$ J(θ) = \frac{1}{2} \sum_{i=0}^{m} (h_θ(x_i)-y_i)^2$$
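To make that concrete, J(θ) can be computed directly from the data. A small sketch, assuming the layout I use further down (the first three columns of a training row are x, the last one is y):

import numpy as np

def cost(theta, trainingset):
    # J(theta) = 1/2 * sum over all examples of (h_theta(x_i) - y_i)^2
    x = trainingset[:, 0:3].astype(float)
    y = trainingset[:, 3].astype(float)
    return 0.5 * np.sum((x.dot(theta) - y)**2)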

The very first algorithm is Batch Gradient Descent. The cost function J(θ) is lowered every iteration using every training example. Using the derivative of J(θ), at every iteration a step towards a minimum is taken. The size of this step depends on how far off the current hypothesis is and on the learning rate α.

$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}J(θ)$$
$$ θ_j := θ_j - \alpha\frac{∂}{∂θ_j}\frac{1}{2}\sum_{i=0}^{m} (h_θ(x_i)-y_i)^2$$
$$ θ_j := θ_j - \alpha\frac{2}{2}\sum_{i=0}^{m}(h_θ(x_i)-y_i)\frac{∂}{∂θ_j}(h_θ(x_i)-y_i)$$
$$ θ_j := θ_j - \alpha\sum_{i=0}^{m} (h_θ(x_i)-y_i)x_{i,j} = θ_j - \alpha\sum_{i=0}^{m}(θ^Tx_i-y_i)x_{i,j}$$

So, that's the mathematical expression of the batch gradient descent algorithm. I got CSV files online from the sources stated above, lumped the data together in a dict as {country : [x1, x2, y]} and saved it as a pickle file. So the first thing to do when using the data is loading the pickle file and getting the data into arrays using NumPy and Pandas.

import numpy as np
import pandas as pd
import pickle
import math

# country data from pickle
data = pickle.load(open("countries.p", 'rb'))

# get the training sets: first column is x1, corruption perception,
# second column is x2, oil exports as percentage of total exports,
# third column is y, the GDP per capita
dataArray = pd.DataFrame.from_dict(data).transpose()

# make training sets: x0 is 1, then x1, x2 and y
x1 = pd.to_numeric(dataArray[0], errors='coerce').fillna(0).astype(np.int64)**1
x0, x2, y = np.ones(len(dataArray[0])), dataArray[1], dataArray[2]

# prepare the data and the checkData (every fourth example)
TrainingData = np.array([x0, x1, x2, y]).transpose()
checkData = TrainingData[::4]

Then I tried the following function to get the parameters that minimize the cost function:

# prepare the trainingset
mask = np.ones(len(TrainingData), dtype='bool')
mask[range(0, len(mask), 4)] = False

# batch gradient descent
def BatchGradientDescent(trainingset, learningRate, iteration):
    theta = np.array([0.0, 0.0, 0.0])
    for i in range(iteration):
        paras = theta
        nextTheta = [0, 0, 0]
        for index, parameter in enumerate(paras):
            costFunction = 0
            for trainingSet in trainingset:
                # split the trainingdata into one set
                xi, yi = trainingSet[0:3], trainingSet[3]
                # take the dot product and the distance from yi
                costForThis = learningRate * (theta.dot(xi) - yi) * xi[index]
                costFunction += costForThis
            nextTheta[index] = parameter - costFunction / len(trainingset)
        theta = np.array(nextTheta)
    return theta

theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 30000)
This didn't work: instead of minimizing the cost, all parameters went to infinity and the program errored out. It turns out that I made a mistake while assigning the "costForThis" variable. I had accidentally changed the algorithm to:

$$ θ_j := \alpha\sum_{i=0}^{m}(θ_j - θ^Tx_i-y_i)x_{i,j}$$


Notice the place of the θ_j. That didn't work. Anyway, after this correction it did work. The next small fix is that I used both NumPy and a for loop to loop over the data, which is less than ideal. So I tried to replace the for loop with a matrix operation.
# prepare the trainingset
mask = np.ones(len(TrainingData), dtype='bool')
mask[range(0, len(mask), 4)] = False

# batch gradient descent
def BatchGradientDescent(trainingset, learningRate, iteration):
    theta = np.array([0.0, 0.0, 0.0])
    for i in range(iteration):
        paras = theta
        nextTheta = [0, 0, 0]
        for index, parameter in enumerate(paras):
            xi = trainingset.transpose()[0:3]
            yi = trainingset.transpose()[3]
            costForThisVector = learningRate * (theta.dot(xi) - yi) * xi[index]
            nextTheta[index] = parameter - np.sum(costForThisVector) / len(trainingset)
        theta = np.array(nextTheta)
    return theta

theta = BatchGradientDescent(TrainingData[mask], 1/(1e4), 500)

Lastly, I wanted to know if the fit for this data would be more accurate if I used the square of the corruption perception rather than the perception itself. So I used x1² this time and tried to run the program again. This time the θ's went to infinity again and eventually turned into NaN. As I couldn't find what was wrong this time, I went to Stack Overflow for some aid, where they said that the learning rate is often the problem if the parameters iterate to infinity: the step is too large and jumps far across the local minimum each iteration. After using 1/1e7 as the learning rate, the error for the squared values was slightly lower than when the data is used linearly. For corruption perception cubed and a learning rate of 1/1e11, the error is even smaller.
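An alternative I didn't try for the numbers below (so this is just a sketch of the idea, not what produced them) is feature scaling: dividing each input column by its standard deviation so that x1, x1² or x1³ end up on comparable scales, which usually lets one keep a single learning rate.

# sketch: scale the feature columns so their magnitudes are comparable
scaled = TrainingData.astype(float).copy()
for col in range(3):                      # columns 0..2 are x0, x1, x2; column 3 is y
    std = scaled[:, col].std()
    if std > 0:                           # x0 is all ones, its std is 0, so leave it alone
        scaled[:, col] = scaled[:, col] / std
theta_scaled = BatchGradientDescent(scaled[mask], 1/(1e4), 10000)

The resulting θ's would of course be in the scaled units and need to be converted back before comparing them with the ones below.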

I check the root mean square error using the following bit of code:
rms_error = 0
for check in checkData:
    print("theta:", theta, "\ncheck", check)
    x_term = theta.dot(check[0:3])
    y_term = check[3]
    error = (x_term - y_term)**2
    rms_error += error
    print([x_term, y_term], error)
length = checkData.shape[0]
print(math.sqrt(rms_error/length), theta)
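The same check can be done without the explicit loop; a small sketch, reusing checkData and theta from above:

# sketch: vectorised version of the RMS check
x_terms = checkData[:, 0:3].astype(float).dot(theta)   # predicted GDP per capita
y_terms = checkData[:, 3].astype(float)                # actual GDP per capita
print(math.sqrt(np.mean((x_terms - y_terms)**2)), theta)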

Yes, this could probably be done a bit better without the loop using NumPy, as sketched above. Here are the results for the linear, quadratic and cubic versions of the corruption perception:
RMS_error for linear corruption perception = 13.845 ; theta = (-8755, 561, 81), Learning Rate: 1/1e4, 30k iterations
RMS_error for quadratic corruption perception = 11.179 ; theta = (-0.217, 7.34, 4.81), Learning Rate: 1/1e7, 10k iterations
RMS_error for cubic corruption perception = 10.223 ; theta = (4.35e-6, 1.045e-1, 8.266e-4), Learning Rate: 1/1e11, 1k iterations


For the linear corruption perception, the program predicts a negative GDP per capita for Somalia. That's not good.

Monday, March 18, 2019

CS50 - week 3 - Pointers

lecture 4
This lecture is about pointers in C, which is honestly a really confusing subject. The concept of a pointer as David explains it is easy enough: it points to where some value is stored in memory. But using pointers is another level of hard, especially since you must also allocate memory to store stuff in and later free it.

After introducing pointers, David next explains that strings actually don't exist in C. A string is a char*: a pointer that points to a series of chars in memory.

During the forthcoming problem set, image manipulation is the subject. So two more concepts are introduced: data persistence (files) and RGB pixels (values for how red, green and blue a certain pixel must be).

problem set 4

WhoDunIt?
There's a message hidden in an image. You start with some code to copy the file. Just adding some code to change one color out of the RGB trio to 0 or 255 eventually reveals the message. Not so hard.

Resize, easy
The objective here is to resize an image, multiplying its size by a natural number. You again start with the same code to copy the file. The BMP files that get copied consist of two parts, a header and the pixels. During copying, the header must be changed slightly to reflect the new image size. The bitmap consists of lines of pixels, and each line must be a multiple of four bytes long; if it is not, some "padding" is included. This has to be accounted for after the resize. Each line of pixels has to be stored in memory and then output X times, where X is the multiplying factor, with padding added after, which is not necessarily the same amount for the output as for the input. Took just 4 hours.

Resize, hard
Same objective, except this time the image can be multiplied by a rational number, so you have to account for the fact that it can shrink the image instead of growing it. For shrinking, I first multiplied by the rational number times 10, then copied every tenth pixel. This was a difficult task that took 6 hours to complete.

Recover
You have one file full of deleted JPG files. The task is to recover all of them. The beginning and end of these files have to be detected and their content copied and written into a new JPG file. This task was another 4 hours.

This entire problem set took 15 hours and was the longest Problem Set of them all.

Thursday, February 28, 2019

Something About Pitches

CS50's problem set 3 was music-themed. The program played the pitches described in a .txt file. In that file, a combination of letters (A-G) and numbers (2-5) is used to describe the frequencies of musical notes. But honestly, I am not much of a music connoisseur. It doesn't go much beyond recognizing the latest songs that were played way too much on the radio lately. I don't know much about the concept "music".

So apparently, the very first step in learning anything about music involves learning to recognize the relation between pitches. I found an app that plays some music for like 2 seconds and then plays a single pitch. The app insists it should be possible to recognize that single pitch after hearing the music for 2 seconds as "reference". In the very first level of this app, it only plays 4 pitches: C4, D4, E4 and F4.

First Attempt: Yeah, this is not going anywhere. Every single attempt of the twenty was a guess and I guessed right 25% of the time. With only 4 options and no repeats, this is actually worse than random guessing, as that should be right 33% of the time. The music played in front of that one pitch doesn't seem to help very much 😖.

Second Attempt: Still not much pitch recognition going on here. But because there are only 4 pitches, there is quite some repetition among them. I noticed three things that made me able to perform slightly better than guessing:
  1. If it plays pitch A, then pitch B, then pitch A again, it's sometimes possible to tell that it was played two pitches ago, and therefore to guess correctly. 
  2. I became reasonably accurate this time around at guessing whether pitch B was lower/higher than pitch A. So if pitch A was D4 and pitch B appeared to be higher, I only had to guess between E4 and F4 and was able to eliminate C4. 
  3. Once three different pitches were played in a single session, when the fourth came I could usually tell that this one wasn't played before and was the missing fourth pitch. 

On my second attempt I had 55% right; basically I was able to eliminate one option because it never repeated, and on average one more because of the three things I noticed. 😋

Next Few Attempts: The highest and lowest of the pitches, C4 and F4, were slightly easier to get right than those in the middle. I averaged somewhere between 50% and 80% most of the time now, except for one lucky streak where I got 95% (😲). As time went on, C4 (which is called the tonic) came to stand out more and more. After about 10 games of twenty attempts, C4 became so distinct that I only got it wrong when I misclicked, and after a few more I also stopped having false positives for C4.

As The Attempts Kept Going: After every attempt, the app plays a resolution. That is, it plays every note from the pitch I should have guessed down to the tonic. So if I had to guess D4, it plays D4-C4; if I had to guess F4, it plays F4-E4-D4-C4. So the D4-C4 was repeated almost every time (only when the answer was C4 itself it wasn't), and after a while, whenever the single pitch was D4, I almost heard a phantom C4 right after it, making it possible to identify it as indeed D4. Once both C4 and D4 were (almost) guaranteed to be guessed right, scores below 65% didn't happen anymore and most were 80% correct or higher.


Monday, February 25, 2019

CS50 - week 2 - Algorithms

lecture 3
Lecture 3 is in week 2, and the lecture-week numbering is probably not going to match for quite a while. This lecture introduces some more information about strings, which are arrays of chars of length one character larger than the number of chars (due to an ending "\0" char), which we must keep in mind.

Most of the rest of this lecture was on the topic of algorithms. More precisely: searching algorithms and sorting algorithms. Some properties of these were discussed, such as how long they take as the number of things to search/sort increases. In the very zeroth lecture (as they started at lecture 0), David introduced binary search (go to the middle of the set you want to search, discard half, then go to the middle of the remainder; sorted sets only), and here he discusses it again and compares it to linear search (just look at everything in order). Binary search is much faster than linear search, by the way, since, as David states, you effectively halve your problem each step.
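Roughly, the idea behind binary search looks like this (a sketch in Python rather than the course's C, since I'm not reproducing lecture code here):

def binary_search(needle, haystack):
    # haystack must already be sorted
    low, high = 0, len(haystack) - 1
    while low <= high:
        middle = (low + high) // 2
        if haystack[middle] == needle:
            return middle
        elif haystack[middle] < needle:
            low = middle + 1      # discard the lower half
        else:
            high = middle - 1     # discard the upper half
    return -1                     # not found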

Then, discussing sorting only, he first talks about three equally efficient sorting methods (bubble sort, insertion sort and selection sort). For each of these, doubling the amount of stuff you sort quadruples the amount of time it takes. Then he introduces a fourth sorting method: merge sort, which is just weird, but quicker.

problem set 3

There is one problem set, split into three "helper functions". This problem set is removed now. I thought it was a nice introduction to code that is spread over multiple files, but it was easier than previous problems, taking less than 3 hours to complete.

Helper function 1: duration
All three functions were meant to read instructions to play music. The first function must get the duration of a certain note. Notes are either 1/1, 1/2, 1/4 or 1/8, so turning their chars into ints and multiplying by 8 gave their relative duration.
int duration(string dur)
{
    if (!dur)
    {
        return 1;
    }

    // dur looks like "1/4"; subtracting 48 ('0' in ASCII) turns the digit chars into ints
    int number = 8 * (dur[0] - 48) / (dur[2] - 48);
    return number;

}

Helper function 2: is rest
The second function determines whether a note was meant to be played at a specific time. This wasn't very difficult either.
bool is_rest(string s)
{
    if (!s)
    {
        return 1;
    }
    if (strcmp(s, "") == 0)
    {
        return true;
    }
    else
    {
        return false;
    }

}

Helper function 3: pitch
This was pretty much the meat of the exercise: turning A4 into 440 Hz or C4 into roughly 262 Hz. It was given that A4 is 440 Hz, that going one octave higher means doubling the frequency, that adjacent notes are a factor 2^(1/12) apart, and lastly, that flat and sharp notes exist. Near the bottom I multiplied and later divided by 10000 because otherwise the floats would round in annoying ways.
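Written out as a single formula (my own summary of those rules, with n the number of semitones the note sits away from A within its octave and o the octave number):

$$ f = 440 \cdot 2^{\,n/12} \cdot 2^{\,o-4} $$

For C4 that gives n = -9 and o = 4, so f = 440 · 2^{-9/12} ≈ 262 Hz.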
int frequency(string notes)
{
    // error
    if (!notes)
    {
        return 1;
    }
    // check length
    bool three = false;
    if (strlen(notes) == 3)
    {
        three = true;
    }
    char note = notes[0];
    int ndis = 0; // for note A
    // determine the note

    switch (note)
    {
        case 'B':
            ndis = 2;
            break;
        case 'G':
            ndis = -2;
            break;
        case 'F':
            ndis = -4;
            break;
        case 'E':
            ndis = -5;
            break;
        case 'D':
            ndis = -7;
            break;
        case 'C':
            ndis = -9;
            break;
    }

    // determine octave location (place 2 or 3)
    int i = 0;
    if (three == true)
    {
        i = 1;
    }
    int octave = notes[1 + i] - 48;

    // correct for sharp or flat
    if (i == 1)
    {
        char fs = notes[1];
        if (fs == 'b')
        {
            ndis--;
        }
        else
        {
            ndis++;
        }
    }

    // output frequency
    float a = 440;

    float exp = 10000 * ndis / 12;

    float c = a * pow(2, exp / 10000) ;

    // correct for octave
    c = c * pow(2, octave - 4) ;

    int d = round(c);
    return d;

}

Thursday, February 21, 2019

CS50 - week 1, continued - Some more C

lecture 2
So, this is still week 1, apparently, despite it being lecture 2 and problem set 2. Oh well... This lecture starts with some simple loop examples that are easier versions of problem set 1's Mario problems (which is why I watched lecture 2 before doing problem set 1, something I kept doing; in hindsight this is only useful up to problem set 4). Some debugging help is shown here as well, like some commands programmed by the CS50 team, but honestly it never gave me any useful feedback and I always debugged by overloading on printf statements.

Afterwards the lecture goes a bit deeper into the basic data types in C: int, float, char, string (which actually doesn't exist), and so on. Some functions made by the CS50 team were also shown to get user input, and it's also shown how strings are just a row of chars. The relation between int and char (ASCII) was also shown. The problem set involves toying with chars and strings in the form of cryptography.

problem set 2

three problems this week:

Caesar.c (easy)
Make a program that, given a number and a word, replaces every letter in the word with the letter that many places further in the alphabet.
So ab and 2 gives cd,
and idiot and 3 gives lglrw.
This wasn't too bad.

Vigenere.c ("easy")
The Caesar cipher is the oldest known way to encrypt a message, by "rotating" all letters across the alphabet. Once you are aware of it, it is of course easy to break, so a slightly upgraded encryption method came into use later (still centuries ago). The Vigenère cipher rotates the to-be-encrypted message by another word (the keyword). It is just a slight upgrade of the Caesar method and I already had the code from that one, so how hard could this be? Well, as the quotes around "easy" show, this was not easy. It took me 2 hours to do Caesar and another 3 to upgrade it to the Vigenère method, for a total of 5 hours.

Crypt.c (hard)

I admit, this was hard. So hard in fact that I skipped it for later as I was getting nowhere.