An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

hristo wrote:I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).
You believe, you suspect...

Can't you be a bit more concrete?

You claim that there is a pair of engines that plays 54-46 at 40 moves/40 minutes, and that that result will change to 42-58 at 50 moves/45 min?

If so (or something similar), which engines are these?
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:I completely agree with that. Searching to some fixed depth or fixed number of nodes removes that issue, yet it influences games in significant ways that I want to observe. If I miss a move, and study shows that a little more time would help, then can I find a way to use more time there? I have two examples where I have done so.

1. Fail low triggers more time. I can recall Slate/Atkin bouncing in their chairs wondering "we just discovered this move sucks... are we going to have enough time to find that move Y is better, as Levy pointed out?" I don't have that problem; when X fails low, I am going to search _hard_ to try to find Y, and will use a lot more time than usual to do so if possible.

2. I have an OK move, but my program is about to change its mind to a new, better move. I can tell this because I display the current move being searched once every 15 seconds, just to let the operator know how long I have been searching and what I am searching on. Normally the ply-1 moves go by quickly (after the first one), and the farther down the list I get, the faster they go by, unless I run into a move that looks more promising, where it can take a lot of time to dismiss it (but it requires more nodes, which triggers my ply-1 move reordering to move it near the top for the next iteration, so it gets checked more carefully earlier in the search). But once I get into searching a move that will become the new best move, I don't worry about having enough time to complete the search, because I never do a time abort in the middle of an iteration until all currently active ply-1 moves are completed (since I do a parallel search and also split at the root, I can have N ply-1 moves in various stages of completion).
This still does not explain in any way why you expect one version of your program to suffer _more_ from this than the other version, if you switch it off in both. So it doesn't seem relevant to the point. Why are you telling this at all? (For the second time...)

And you don't have the slightest idea how much Elo difference this actually causes between the same version that does or doesn't do it. It is just meaningless talk, like people saying "smoking can't be bad for you, because my uncle, who smoked all his life, lived to the age of 95".
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:
hristo wrote:I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).
You believe, you suspect...

Can't you be a bit more concrete?

You claim that there is a pair of engines that plays 54-46 at 40 moves/40 minutes, and that that result will change to 42-58 at 50 moves/45 min?
This is basically what I'm saying, indeed.
(for instance score of 60-40 at 40/5min and score 55-45 at 40/60min)
hgm wrote: If so (or something similar), which engines are these?
The engines (most of them) traverse completely different trees (different topology). It is unreasonable to expect that different engines would require the same amount of time, whether it is measured in nodes (evaluated positions) or in seconds, to find the "proper" move from any given position.
If you agree with the above, then it naturally follows that the relative strength between engines will fluctuate with respect to time.

I don't have a direct example of a match where the same group of engines were pitted against each other at different time controls. (If someone has such data please post it so I can be shown to be wrong.)

Regards,
Hristo
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

hristo wrote:This is basically what I'm saying, indeed.
(for instance score of 60-40 at 40/5min and score 55-45 at 40/60min)
Sorry, but I was not asking about 5 min vs 60 min (which is a factor of 12), but about a difference of 10%; 12 ≈ 1.1^26.
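To spell out the arithmetic behind that remark (a small worked example, nothing engine-specific), the number of 10% time steps needed to span a factor of 12 is, in LaTeX notation:

n = \frac{\ln 12}{\ln 1.1} \approx \frac{2.485}{0.0953} \approx 26,
\qquad 1.1^{26} \approx 11.9 \approx 12 .

So even if a 60-40 vs 55-45 shift between 40/5 min and 40/60 min were real, spread smoothly over those 26 steps it would amount to well under one percentage point per 10% of extra time.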

Indeed there can be a different scaling (Elo vs log time) for engines with different branching ratios, which gives a notable strength difference between blitz and long games. But one really has to change the time by orders of magnitude before this effect shows up.

In addition, we were not talking about engines with a different search tree (branching factor), but with a different evaluation function. There I have never heard of such an effect, not even for 1 min/game vs 40 moves/2 hr.

And we are not talking about two engines with wildly different evaluations, but about two versions of the same engine with nearly identical evaluations, one of which was tweaked a little bit or had a new term added. Out of the question.

Unless you can show actual facts, I will assume you are just bullshitting me! This is even worse than the usual creationism debate! :shock:
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:
hristo wrote:This is basically what I'm saying, indeed.
(for instance score of 60-40 at 40/5min and score 55-45 at 40/60min)
Sorry, but I was not asking about 5 min vs 60 min (which is a factor of 12), but about a difference of 10%; 12 ≈ 1.1^26.

Indeed there can be a different scaling (Elo vs log time) for engines with different branching ratios, which gives a notable strength difference between blitz and long games. But one really has to change the time by orders of magnitude before this effect shows up.
So, you agree that time can influence the relative strength between engines, but you just don't think that there is "a lot of fluctuation" at 10% time difference.
hgm wrote: In addition, we were not talking about engines with a different search tree (branching factor), but with a different evaluation function. There I have never heard of such an effect, not even for 1 min/game vs 40 moves/2 hr.
A different evaluation function is precisely what changes the search tree ... I have no idea what you are talking about.
hgm wrote: And we are not talking about two engines with wildly different evaluations, but about two versions of the same engine with nearly identical evaluations, one of which was tweaked a little bit or had a new term added. Out of the question.
If the changes result in a different search tree, which would be the case 90% of the time, then I'm not sure what is "out of the question". The search tree doesn't have to be drastically different to require a different amount of time (nodes) to reach a particular point.

H.G.,
claiming that "time limits" have no perceptible influence on the relative strength between engines is incorrect and, IMO, grossly overlooks some of the very basic assumptions about how chess engines work.

You have not offered any evidence, nor proposed a sound argument, as to why changes in the evaluation function would not result in a different time to find the "proper" move.

Obviously a test can be performed, if it hasn't been done already ... but for some reason you are leaving the area of scientific discourse and are attempting to enter into a "sling fest".
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:I completely agree with that. Searching to some fixed depth or fixed number of nodes removes that issue, yet it influences games in significant ways that I want to observe. If I miss a move, and study shows that a little more time would help, then can I find a way to use more time there? I have two examples where I have done so.

1. Fail low triggers more time. I can recall Slate/Atkin bouncing in their chairs wondering "we just discovered this move sucks... are we going to have enough time to find that move Y is better, as Levy pointed out?" I don't have that problem; when X fails low, I am going to search _hard_ to try to find Y, and will use a lot more time than usual to do so if possible.

2. I have an OK move, but my program is about to change its mind to a new, better move. I can tell this because I display the current move being searched once every 15 seconds, just to let the operator know how long I have been searching and what I am searching on. Normally the ply-1 moves go by quickly (after the first one), and the farther down the list I get, the faster they go by, unless I run into a move that looks more promising, where it can take a lot of time to dismiss it (but it requires more nodes, which triggers my ply-1 move reordering to move it near the top for the next iteration, so it gets checked more carefully earlier in the search). But once I get into searching a move that will become the new best move, I don't worry about having enough time to complete the search, because I never do a time abort in the middle of an iteration until all currently active ply-1 moves are completed (since I do a parallel search and also split at the root, I can have N ply-1 moves in various stages of completion).
This still does not explain in any way why you expect one version of your program to suffer _more_ from this than the other version, if you switch it off in both. So it doesn't seem relevant to the point. Why are you telling this at all? (For the second time...)
Because I make changes to this code from time to time as well, and I want to test those changes. I also want to exercise that code as frequently as possible, so that if it is going to break something it gets the chance to do so during testing. It is important enough that I want to make sure everything works well together.

It is also possible that a new version has a better evaluation that returns more accurate scores in situations where the older one was inaccurate, and these different scores can influence time usage. If I test without time usage, and just bias all scores in one direction, I would get one result. When I play real games, where every search fails low because of this biased score, I would see a completely different time-usage pattern and hence a different result.

Again, at least in my code, all of these things interact, and changing one thing over here can affect something way over _there_ unexpectedly, if I don't have that thing way over there turned off for testing.
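As a rough illustration of that kind of interaction, here is a minimal sketch of a fail-low-driven time extension (all names, margins, and numbers are invented for the example; this is not Crafty's actual code). A routine like this reacts to the level of the root score, so an evaluation tweak that merely shifts scores in one direction can still change how much time gets spent:

Code:
/* Minimal sketch of a fail-low-driven time extension, loosely following
 * the scheme described above.  All names are hypothetical; the point is
 * only to show how an evaluation bias can leak into time usage when
 * time management is left enabled during testing. */

#include <stdio.h>

typedef struct {
    int target_ms;      /* nominal time budget for this move            */
    int max_ms;         /* hard ceiling (never exceed this)             */
    int prev_score;     /* score remembered from the previous move      */
} TimeControl;

/* Decide whether to keep searching after an iteration finishes.
 * A root fail low (score well below the remembered score) triggers a
 * "search hard" extension, up to the hard ceiling. */
int keep_searching(const TimeControl *tc, int elapsed_ms, int root_score)
{
    int fail_low = (root_score < tc->prev_score - 30);   /* 30 cp margin */
    int budget   = fail_low ? tc->max_ms : tc->target_ms;
    return elapsed_ms < budget;
}

int main(void)
{
    TimeControl tc = { 3000, 12000, 10 };   /* 3 s target, 12 s ceiling  */

    /* Version A: new root score 5 -> small drop, no extension.          */
    printf("A keeps searching at 4s: %d\n", keep_searching(&tc, 4000, 5));

    /* Version B: same position, but suppose an eval tweak lowers the new
     * root score by 50 cp relative to the remembered score -> looks like
     * a fail low -> this move gets four times the budget.               */
    printf("B keeps searching at 4s: %d\n", keep_searching(&tc, 4000, -45));
    return 0;
}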



And you don't have the slightest idea how much Elo difference this actually causes between the same version that does or doesn't do it. It is just meaningless talk, like people saying "smoking can't be bad for you, because my uncle, who smoked all his life, lived to the age of 95".
I would say your statements are _exactly_ as meaningless, for exactly the same reason. I could probably disable something, run a big test, and tell you the Elo/error-bar improvement or reduction, but for some of this I really don't care. I don't care at all what the improvement is for null move, or reductions, or such; I just care that there _is_ an improvement. I am operating much like an alpha/beta search: I just need to prove A' is better (or worse) than A. I really don't give a hoot about how much better. Better is better, and that's good enough here.
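For what it is worth, the Elo/error-bar number is cheap to compute once the games are in. Below is a minimal, self-contained sketch (standard logistic Elo formula plus a normal-approximation error bar on the score fraction; it is not any particular test harness). Whether the resulting interval excludes zero is exactly the "better is better" test described above.

Code:
/* Convert a match result into an Elo difference with a rough error bar.
 * Uses the standard logistic model elo = -400*log10(1/s - 1) and a
 * normal approximation for the uncertainty of the score fraction.
 * Illustrative sketch only; compile with -lm. */

#include <math.h>
#include <stdio.h>

double score_to_elo(double s)            /* s = fraction of points won  */
{
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void)
{
    double wins = 540, losses = 460, draws = 0;      /* example result  */
    double games = wins + losses + draws;
    double s     = (wins + 0.5 * draws) / games;

    /* Per-game variance of the score around its mean; draws reduce it. */
    double var = (wins   * pow(1.0 - s, 2) +
                  draws  * pow(0.5 - s, 2) +
                  losses * pow(0.0 - s, 2)) / games;
    double se  = sqrt(var / games);                  /* std error of s  */

    double elo    = score_to_elo(s);
    double elo_lo = score_to_elo(s - 2.0 * se);      /* ~95% interval   */
    double elo_hi = score_to_elo(s + 2.0 * se);

    printf("score %.1f%%  elo %+.1f  (95%% range %+.1f .. %+.1f)\n",
           100.0 * s, elo, elo_lo, elo_hi);
    return 0;
}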
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
hristo wrote:I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).
You believe, you suspect...

Can't you be a bit more concrete?

You claim that there is a pair of engines that plays 54-46 at 40 moves/40 minutes, and that that result will change to 42-58 at 50 moves/45 min?

If so (or something similar), which engines are these?
That is certainly true. I don't know about those specific time controls, but I have run a bunch of different time tests on my cluster, to accurately answer the question "do I need longer games, or will short games give me as reliable an answer?" Short games are fine. But I have examples where Crafty will roll over a program at very fast time controls, and they even up as the time control becomes more reasonable. The opposite is true also.

So yes, changing the time control can change the result. If you play enough games, the change is predictable also. For small numbers of games, the results are quite random.
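A back-of-the-envelope on why small samples look random (simple binomial arithmetic, ignoring draws, which only shrink the variance): the score fraction s measured over N games has a standard error of roughly, in LaTeX notation:

\sigma_s \approx \sqrt{\frac{s(1-s)}{N}},
\qquad \sigma_s \approx 0.05 \text{ for } N = 100,
\qquad \sigma_s \approx 0.005 \text{ for } N = 10\,000 .

So a 54% score over 100 games sits less than one standard deviation from an even result, while the same percentage over 10,000 games is about eight standard deviations out.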
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

You are talking about one search tree in one particular situation. That has nothing to do with the strength of an engine. Strength is determined by the average, over billions of search trees, of move quality versus time used.

I have measured the strength of many engines over a wide range of time controls. I have played time-odds matches between engines to systematically measure how the strength varies with thinking time. I have made many small changes to my engines, and tested them extensively.

How much have you done of all this? Are your statements based on anything at all, other than that this sounds plausible to you?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hristo wrote:
hgm wrote:
hristo wrote:This is basically what I'm saying, indeed.
(for instance score of 60-40 at 40/5min and score 55-45 at 40/60min)
Sorry, but I was not asking about 5 min vs 60 min (which is a factor of 12), but about a difference of 10%; 12 ≈ 1.1^26.

Indeed there can be a different scaling (Elo vs log time) for engines with different branching ratios, which gives a notable strength difference between blitz and long games. But one really has to change the time by orders of magnitude before this effect shows up.
So, you agree that time can influence the relative strength between engines, but you just don't think that there is "a lot of fluctuation" at 10% time difference.
hgm wrote: In addition, we were not talking about engines with a different search tree (branching factor), but with a different evaluation function. There I have never heard of such an effect, not even for 1 min/game vs 40 moves/2 hr.
A different evaluation function is precisely what changes the search tree ... I have no idea what you are talking about.
hgm wrote: And we are not talking about two engines with wildly different evaluations, but about two versions of the same engine with nearly identical evaluations, one of which was tweaked a little bit or had a new term added. Out of the question.
If the changes result in a different search tree, which would be the case 90% of the time, then I'm not sure what is "out of the question". The search tree doesn't have to be drastically different to require a different amount of time (nodes) to reach a particular point.

H.G.,
claiming that "time limits" have no perceptible influence on the relative strength between engines is incorrect and, IMO, grossly overlooks some of the very basic assumptions about how chess engines work.

You have not offered any evidence, nor proposed a sound argument, as to why changes in the evaluation function would not result in a different time to find the "proper" move.

Obviously a test can be performed, if it hasn't been done already ... but for some reason you are leaving the area of scientific discourse and are attempting to enter into a "sling fest".
Somebody is missing something here. I have made _tiny_ evaluation changes that caused the program to take 2x longer to reach the same depth in some positions, or vice versa. In non-tactical positions, slight eval changes can greatly alter the shape of the tree being searched, which increases the variability of the results for such changes, or increases the difference between the trees the two programs are dealing with.
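To see how sensitive alpha-beta node counts can be to small score changes, here is a deliberately crude toy (not a chess engine: a fixed synthetic game tree searched with plain negamax alpha-beta, where the only difference between the two runs is a one-point bonus added to roughly an eighth of the leaves). Even such a tiny change can shift the node count, because it can flip which moves produce cutoffs:

Code:
/* Toy demonstration that a tiny evaluation tweak can change the size of
 * an alpha-beta tree: the same synthetic game tree is searched twice,
 * once with the base "evaluation" and once with a 1-point bonus on a
 * subset of leaves.  Purely illustrative; no chess involved. */

#include <stdio.h>

#define BRANCH 8
#define DEPTH  6

static long nodes;

/* Deterministic pseudo-random value for a leaf identified by its path. */
static int leaf_eval(unsigned path, int tweak)
{
    unsigned h = path * 2654435761u;          /* multiplicative hash      */
    int v = (int)(h >> 20) % 201 - 100;       /* base score in [-100,100] */
    if (tweak && (h & 7u) == 0)               /* tweak ~1/8 of the leaves */
        v += 1;                               /* ... by a single point    */
    return v;
}

static int alphabeta(unsigned path, int depth, int alpha, int beta, int tweak)
{
    nodes++;
    if (depth == 0)
        return leaf_eval(path, tweak);
    for (unsigned i = 0; i < BRANCH; i++) {
        int score = -alphabeta(path * BRANCH + i, depth - 1,
                               -beta, -alpha, tweak);
        if (score >= beta)
            return beta;                      /* cutoff */
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}

int main(void)
{
    for (int tweak = 0; tweak <= 1; tweak++) {
        nodes = 0;
        int v = alphabeta(1, DEPTH, -10000, 10000, tweak);
        printf("tweak=%d  value=%d  nodes searched=%ld\n", tweak, v, nodes);
    }
    return 0;
}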

This discussion is boiling down to the "I have never touched a moon rock, so man has never walked on the surface of the moon."

That's not getting anywhere.
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:That is certainly true. I don't know about those specific time controls, but I have run a bunch of different time tests on my cluster, to accurately answer the question "do I need longer games, or will short games give me as reliable an answer?" Short games are fine. But I have examples where Crafty will roll over a program at very fast time controls, and they even up as the time control becomes more reasonable. The opposite is true also.
And this happens for a mere 10% change in time control? What is the largest Elo difference per factor of two in time that you have ever seen between two programs?
So yes, changing the time control can change the result. If you play enough games, the change is predictable also. For small numbers of games, the results are quite random.