Gradient Descent Introduction

Discussion of chess software programming and technical issues.


Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Gradient Descent Introduction

Post by Desperado »

Hello everybody,

I am interested in a simple gradient descent implementation. Unfortunately I am not able to put the puzzle pieces together.

Here is what I think I understand and what I can do so far:

base model:

1. I have a sample set of positions, including results.
2. I have a parameter list with N elements.
3. I have a cost function (MSE).
a. To minimize the cost function, which is a squared function, I need the derivative, which leads to a linear model y = mx + b.
b. Having solved this, I can tune the parameters so that y = mx + b gets close to 0.

example:

1. SAMPLESIZE 10000
2. PARAMETERLIST 5
3. MSE = sum((result - computed_value)^2) / SAMPLESIZE

How do I have to iterate over my parameter list and the samples to compute m and b?
Do I have to compute m and b for each single sample? m and b for the whole batch? How do I get m and b for the batch?

I read some articles on the web, but I am interested in the dialogue and in the practical side of handling this in the context of chess parameter tuning.
So I think I have the idea, but I need to know how to do it.

Thanks a lot in advance...
tomitank
Posts: 276
Joined: Sat Mar 04, 2017 12:24 pm
Location: Hungary

Re: Gradient Descent Introduction

Post by tomitank »

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Gradient Descent Introduction

Post by matthewlai »

Desperado wrote: Sun Dec 09, 2018 3:32 pm How do I have to iterate over my parameter list and the samples to compute m and b? Do I have to compute m and b for each single sample? m and b for the whole batch? How do I get m and b for the batch?
Let's start very simple, and say each training example has an input x (let's say material balance, a single number) and an output y^ (the result).

And let's say you use a very simple model y(x) = wx + b. That is, you think y is linear in x. w and b would be your parameters. Note that here y(x) is the output of the model given x, and you want it to predict y^ (the actual result).

Then when you add MSE to the mix, you get L(x) = (1/2)(y(x) - y^)^2.

Now you want to figure out how to update w and b to make your function a better model. To do that, we take the partial derivatives of L with respect to w and b (remember that they are part of y), because they tell us how changing the parameters will affect L(x).

Code: Select all

dL(x)/dw = dL(x)/dy(x) * dy(x)/dw   # chain rule
         = (y(x) - y^) * x
dL(x)/db = (y(x) - y^)              # dy(x)/db = 1
So that gives you which direction you need to move w and b in to INCREASE the loss. Our goal is to decrease the loss, so we go in the opposite direction, and your updates are thus:

w' = w - dL(x)/dw = w - (y(x) - y^) * x
b' = b - dL(x)/db = b - (y(x) - y^)

Now this is for two parameters and a single scalar feature. In reality we use a lot more, but the idea remains the same. You just have to compute dL(x)/dw_i for all parameters w_i, and apply w_i' = w_i - dL(x)/dw_i. We usually write them in vector notation for brevity, but that's just notation. You can have your model in any form as long as it's differentiable with respect to all the parameters.

As for how often to apply the update - the theoretically correct way is to compute the update for all your samples, average them, and then apply. In reality no one does that because it's extremely computationally intensive, and usually you can get pretty good updates by just using a small set of randomly selected samples, and this allows you to get more updates given the same amount of compute. This is called stochastic gradient descent (well, technically pure SGD is using only one sample at a time, but in common usage people still call it SGD when applied to small batches).
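Putting the pieces together, here is a minimal sketch (in C++) of this procedure for the single-feature model, with an explicit learning rate added as a step size (the updates above use an implicit step size of 1). The sample data, learning rate, batch size, and iteration count are placeholder assumptions for illustration, not values from the post.

Code: Select all

#include <cstdio>
#include <cstdlib>
#include <vector>

// One training example: a feature x (e.g. material balance in pawns)
// and the game result y^ (1 = win, 0.5 = draw, 0 = loss).
struct Sample { double x; double target; };

int main() {
    // Made-up placeholder data: material balance -> expected score.
    std::vector<Sample> data = {{1.0, 0.6}, {0.0, 0.5}, {-2.0, 0.3}};

    double w = 0.0, b = 0.0;     // the two parameters being tuned
    const double lr = 0.01;      // learning rate (step size)
    const size_t batchSize = 2;

    for (int step = 0; step < 20000; ++step) {
        // SGD: estimate the gradient from a small random batch
        // instead of the whole sample set.
        double gw = 0.0, gb = 0.0;
        for (size_t i = 0; i < batchSize; ++i) {
            const Sample& s = data[std::rand() % data.size()];
            double err = (w * s.x + b) - s.target;  // y(x) - y^
            gw += err * s.x;   // dL(x)/dw = (y(x) - y^) * x
            gb += err;         // dL(x)/db = (y(x) - y^)
        }
        // Average over the batch and move against the gradient.
        w -= lr * gw / batchSize;
        b -= lr * gb / batchSize;
    }
    std::printf("w = %f, b = %f\n", w, b);  // approaches w = 0.1, b = 0.5
}

For this toy data the points happen to lie on the line 0.1x + 0.5, so w and b settle near those values; with real, noisy data they would settle near the least-squares fit instead.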
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Gradient Descent Introduction

Post by Desperado »

Thanks to both of you so far. Unfortunately my time is very limited, so there may be a delay before further posts.

Now, first, I would like to follow Matthew's description, and I would appreciate it if you could guide me through it.
matthewlai wrote: Sun Dec 09, 2018 3:57 pm Let's start very simple, and say each training example has an input x (let's say material balance - single number) and output y^ (result).
Already at this point my confusion starts; sounds funny, I know... :lol:

1. => e(x) = y^
For me, x is my list of feature parameters (say material p, n, b, r, q -> 1, 3, 3, 5, 9) used by my evaluation function e(), which produces an output y^.
2. error = given result in sample - evaluation output = y(x) - y^
That is how I get the error (and finally I sum all squared errors of the sample set).

But what you are saying is that x is based on one feature (e.g. the material delta, put into a single number). That would only work
if my evaluation were based on one feature (e.g. material balance). I simply do not have x0, x1, ..., xn this way.

Code: Select all

x    = feature-parameter-list
e(x) = evaluation using x
y^   = output of e(x)
y(x) = result given by sample
err  = y(x) - y^

Based on my explanation above, y(x) is constant, as x is constant too (my feature parameter list). That is why I would have only one x.
If I use the approach you described, would I need to break the features/parameters into separate x0, x1, ..., xn input values?
like..
delta material -> 200 = x0
delta mobility -> 40 = x1
delta passer -> 75 = x2
and so on...

Thanks a lot :!:

PS: I guess it does not make sense to go on with your description as long as this is not clarified. Maybe one additional note: the "parameters" m, b
are not the "parameters" I want to tune, but I need them to modify my x (feature parameters) rapidly. At least this is the way I think at the moment.
And let's say you use a very simple model y(x) = wx + b. That is, you think y is linear in x. w and b would be your parameters. Note that here y(x) is the output of the model given x, and you want it to predict y^ (the actual result).
Maybe you will catch me before i can catch myself... :)
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Gradient Descent Introduction

Post by matthewlai »

Desperado wrote: Mon Dec 10, 2018 9:21 pm But what you are saying is that x is based on one feature (e.g. the material delta, put into a single number). That would only work if my evaluation were based on one feature (e.g. material balance). I simply do not have x0, x1, ..., xn this way. [...] If I use the approach you described, would I need to break the features/parameters into separate x0, x1, ..., xn input values?
If you want to work with multiple features already, that's fine too, but the single-feature case is easier, and I just thought it would be best to get it sorted out first.

If you have multiple features (x0, x1, x2), the evaluation function may look something like this: y = w0*x0 + w1*x1 + w2*x2 + b

In this case you would have 4 parameters - w0, w1, w2, b.
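As a minimal sketch of how the single update generalizes (continuing in C++; the function name, fixed feature count, and learning rate parameter are illustrative assumptions): each weight w_i gets the gradient (y(x) - y^) * x_i, and b gets (y(x) - y^), exactly as in the single-feature case.

Code: Select all

#include <array>

// One gradient-descent update for the model
// y = w0*x0 + w1*x1 + w2*x2 + b with loss L = (1/2)(y(x) - y^)^2.
void update(std::array<double, 3>& w, double& b,
            const std::array<double, 3>& x, double target, double lr) {
    double y = b;
    for (int i = 0; i < 3; ++i) y += w[i] * x[i];  // model output y(x)
    double err = y - target;                       // y(x) - y^
    for (int i = 0; i < 3; ++i)
        w[i] -= lr * err * x[i];  // dL(x)/dw_i = (y(x) - y^) * x_i
    b -= lr * err;                // dL(x)/db   = (y(x) - y^)
}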
PS: I guess it does not make sense to go on with your description as long as this is not clarified. Maybe one additional note: the "parameters" m, b
are not the "parameters" I want to tune, but I need them to modify my x (feature parameters) rapidly. At least this is the way I think at the moment.
In general, the goal of training is not to change your features. You are trying to find a function that, given a set of features, will give you a useful output, and you achieve that by changing the weights (w0, w1, w2, b). Features are, in general, not something you control, but something you get from the board.

Think of a very simple example first. Let's say your training set is:
1 => 3
2 => 5
3 => 7

And you are trying to find a function f(x) = mx + b that models that relationship; your goal is to find the m and b that do so. In this case the optimal parameters are m = 2, b = 1. You don't really control x; the user of the function can give you any x.

Now imagine x is your feature list, and f(x) is the evaluation output. You want the set of parameters that, given any feature, will give you a prediction of the outcome of the game.
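To make this concrete, here is a sketch that recovers m = 2, b = 1 from the toy training set above by plain full-batch gradient descent (the learning rate and iteration count are arbitrary illustrative choices):

Code: Select all

#include <cstdio>

int main() {
    // The toy training set from above: f(1)=3, f(2)=5, f(3)=7.
    const double xs[] = {1.0, 2.0, 3.0};
    const double ys[] = {3.0, 5.0, 7.0};

    double m = 0.0, b = 0.0;
    const double lr = 0.05;  // learning rate

    for (int step = 0; step < 5000; ++step) {
        // Full batch: average the gradient over all three samples.
        double gm = 0.0, gb = 0.0;
        for (int i = 0; i < 3; ++i) {
            double err = (m * xs[i] + b) - ys[i];  // f(x) - target
            gm += err * xs[i];
            gb += err;
        }
        m -= lr * gm / 3.0;
        b -= lr * gb / 3.0;
    }
    std::printf("m = %f, b = %f\n", m, b);  // converges to m = 2, b = 1
}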
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.