Bullet vs regular time control, say 40/4m CCRL/CEGT

Rebel · Post by **Rebel** » Sun Aug 30, 2015 8:58 pm

Laskos wrote:
Adam Hair wrote:Testing at a bullet time control is not as reliable as testing at a blitz time control when comparing the results from a fixed number of games. As you stated, there are sources of error that are not accounted for when estimating Elo difference that have a larger effect on bullet games. So, 1000 40/240s games are more reliable than 1000 40/15s games (assuming that other conditions such as the openings are the same).
I don't quite understand what you mean. If all factors contributing to engine strength behave the same at every time control, then the effect is an extremely mild one, on draw rate only: the error bars are slightly larger at a lower draw rate, thus at shorter time controls (for a fixed number of games with a balanced result). Is that what you mean? This effect is extremely mild, and it is easily offset by the higher number of games played in the same amount of time. The end result: if scaling is identical for all factors, the shortest time control is the best way to get the smallest error margins in the same amount of time. The problem arises when the scaling of some factor contributing to strength is different, for example one that scales well at longer time controls and so needs longer time control games, ideally going into the territory of rating game-play. This problem is unavoidable, so one has to be careful when using bullet time controls, but use them whenever it is safe.
However, I think that source of estimation error is dominated by the reduction of error due to the increased number of games that can be played in a fixed time period when using bullet time controls. Approximately 16,000 40/15s games can be played in the same amount of time as 1000 40/240s games (assuming an average of 80 moves per game). Ignoring the draw rate (I can not recall the error bar formula that includes draws at the moment), that would result in calculated error bars that are 1/4 of those for 1000 games.

If what you are saying is true, then why are CCRL and CEGT still playing at 40/4, 40/20 and 40/40? Why not play bullet with a considerable number of games?
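
As a side note on Adam Hair's "1/4" estimate quoted above, here is a minimal sketch of the usual error bar calculation with draws included. It assumes two equally strong engines, independent games, and a normal approximation; the function name `elo_error_margin` and the 30% draw rate are illustrative assumptions, not something taken from this thread.

```python
import math

def elo_error_margin(games, draw_rate=0.0, z=1.96):
    """Approximate error margin in Elo (95% for z=1.96) for a balanced match.

    Assumes equal engines and independent games (illustrative model only).
    """
    # Per-game score variance when wins and losses are equally likely:
    # E[x^2] - 0.25 = 0.25 * (1 - draw_rate)
    variance = 0.25 * (1.0 - draw_rate)
    std_error = math.sqrt(variance / games)
    # Slope of Elo(s) = 400 * log10(s / (1 - s)) at s = 0.5
    elo_per_score = 1600.0 / math.log(10)
    return z * std_error * elo_per_score

if __name__ == "__main__":
    for games in (1000, 16000, 30000):
        for draw_rate in (0.0, 0.3):
            margin = elo_error_margin(games, draw_rate)
            print(f"{games:6d} games, draw rate {draw_rate:.1f}: +/- {margin:.1f} Elo")
```

Under these assumptions the margin shrinks with the square root of the number of games, so 16 times as many games gives one quarter of the error bar, and a higher draw rate narrows it a little further. Nothing in the formula refers to the time control, which is exactly the point under discussion.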

Rebel · Post by **Rebel** » Sun Aug 30, 2015 9:00 pm

Laskos wrote:
Rebel wrote:
Laskos wrote:
Rebel wrote:I have good reasons to believe that the drawback of playing bullet games (say 40/15s) to test a change is that you have to play (a lot) more games than testing at CCRL/CEGT (40/240s).

Leaving out my considerations for the above statement for the moment I want to ask if research has been done to investigate that and maybe there is a formula (or factor) that you can use as base to lower the number of games when you (for instance) double the time control.

Say, you are playing 15,000 (40/15) bullet games. When you decide to double the time control, will 12,000 games (or so) give an equivalent result?

In general the rating lists give a pretty good indication of the strength of programs and they are not playing 10,000 games.
Bob is correct. Many factors, positional ones for example, scale differently with time control, so one goes to longer time controls only out of need, to check at reasonable game-play time controls. If all factors scaled equally, it would be preferable to test at the shortest possible time controls, limited by the clock tick and such.
Thanks Kai for moving in.

Are you saying that playing 10,000 (40/240s) games is as reliable as 10,000 (40/15s) bullet games?

This probably is true for positional changes, but for search changes?
If the rating lists at 40/15s are identical to the rating lists at 40/240s, and all changes to the engine scale the same with time, then yes. As we know, they are similar, but not identical. One has to work out which factors contribute almost identically and which very differently. Those suspected by "intuition" or whatever of contributing to differences are better tested at longer time controls too. It is a bit of an art, with no clear statistical rules to determine which changes need to be tested how.

Yep, that was my question in the OP.

Laskos · Post by **Laskos** » Sun Aug 30, 2015 9:04 pm

Rebel wrote:
Laskos wrote:
Adam Hair wrote:Testing at a bullet time control is not as reliable as testing at a blitz time control when comparing the results from a fixed number of games. As you stated, there are sources of error that are not accounted for when estimating Elo difference that have a larger effect on bullet games. So, 1000 40/240s games are more reliable than 1000 40/15s games (assuming that other conditions such as the openings are the same).
I don't quite understand what you mean. If all factors contributing to engine strength behave the same at every time control, then the effect is an extremely mild one, on draw rate only: the error bars are slightly larger at a lower draw rate, thus at shorter time controls (for a fixed number of games with a balanced result). Is that what you mean? This effect is extremely mild, and it is easily offset by the higher number of games played in the same amount of time. The end result: if scaling is identical for all factors, the shortest time control is the best way to get the smallest error margins in the same amount of time. The problem arises when the scaling of some factor contributing to strength is different, for example one that scales well at longer time controls and so needs longer time control games, ideally going into the territory of rating game-play. This problem is unavoidable, so one has to be careful when using bullet time controls, but use them whenever it is safe.
However, I think that source of estimation error is dominated by the reduction of error due to the increased number of games that can be played in a fixed time period when using bullet time controls. Approximately 16,000 40/15s games can be played in the same amount of time as 1000 40/240s games (assuming an average of 80 moves per game). Ignoring the draw rate (I can not recall the error bar formula that includes draws at the moment), that would result in calculated error bars that are 1/4 of those for 1000 games.
If what you are saying is true, then why are CCRL and CEGT still playing at 40/4, 40/20 and 40/40? Why not play bullet with a considerable number of games?

Because there is an "if they are scaling the same". That is usually not the case, though the scaling is not extremely different either. So you won't normally see a relative swing of 200 Elo points for an engine in rating lists going from 40/15s to 40/240s. The lists will look similar (not identical, though), and you need the artistry to decide how the factors in your engine behave relative to time control.

bob · Post by **bob** » Sun Aug 30, 2015 10:30 pm

Rebel wrote:
bob wrote: The formula has NO reference to time control.
Yep, that's why I started this topic. And it makes me wonder if that is correct, not saying it (the elo bar) is not correct.

Think about it. If it worked as you (we) would like, one could play one LONG game and get the same result. Except we know what one game would be: random noise...
Sure.

But I do think playing 1000 (40/240s) games is far more reliable than 1000 (40/15s) bullet games. Much less horizon effects, much less negative influence from the 1/18th of a second, CLOCKS_PER_SEC which is pretty dominant in bullet games.

I do lots of testing at different time controls, but the number of games to get +/-3 Elo remains at 30K or so.

I've even tested at 40moves/2 hours a couple of times, running over 700 games at a time. Takes forever, 30K still +/- 3.
Doing the math provided my interpretation of the above is correct would mean: 3 games a day on 1 PC, meaning 10,000 days to finish, divided by 700 cores is roughly 2 weeks.

I think it came out closer to a month. I don't always get (540 + 256) cores non-stop, other things run as well.

At 40/2 the games seem to average about 6 hours, although some are shorter and some are longer. Maybe closer to 4 games a day, particularly when most programs resign at reasonable thresholds, or offer draws well before the 50-move rule ends the game.
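
The back-of-the-envelope arithmetic in this exchange can be written out as a short sketch. It assumes one game per core and cores that are busy around the clock, which bob notes is not the case in practice; the helper name `days_to_finish` is made up for illustration.

```python
def days_to_finish(games, hours_per_game, cores):
    """Wall-clock days to finish a run, assuming one game per core
    and every core busy 24 hours a day (an idealisation)."""
    games_per_core_per_day = 24.0 / hours_per_game
    return games / (games_per_core_per_day * cores)

if __name__ == "__main__":
    # Rebel's estimate: 3 games a day per core (8 hours per game), 700 cores.
    print(f"Rebel: {days_to_finish(30000, 8.0, 700):.1f} days")      # about 14.3
    # bob's figures: about 6 hours per game (4 games a day), 540 + 256 cores.
    print(f"bob:   {days_to_finish(30000, 6.0, 540 + 256):.1f} days")  # about 9.4
```

With full availability bob's numbers would give roughly nine and a half days, so the reported "closer to a month" is consistent with the cluster being shared with other jobs.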

bob · Post by **bob** » Sun Aug 30, 2015 10:38 pm

Rebel wrote:
Laskos wrote:
Rebel wrote:
Laskos wrote:
Rebel wrote:I have good reasons to believe that the drawback of playing bullet games (say 40/15s) to test a change is that you have to play (a lot) more games than testing at CCRL/CEGT (40/240s).

Leaving out my considerations for the above statement for the moment I want to ask if research has been done to investigate that and maybe there is a formula (or factor) that you can use as base to lower the number of games when you (for instance) double the time control.

Say, you are playing 15,000 (40/15) bullet games. When you decide to double the time control, will 12,000 games (or so) give an equivalent result?

In general the rating lists give a pretty good indication of the strength of programs and they are not playing 10,000 games.
Bob is correct. Many factors, positional ones for example, scale differently with time control, so one goes to longer time controls only out of need, to check at reasonable game-play time controls. If all factors scaled equally, it would be preferable to test at the shortest possible time controls, limited by the clock tick and such.
Thanks Kai for moving in.

Are you saying that playing 10,000 (40/240s) games is as reliable as 10,000 (40/15s) bullet games?

This probably is true for positional changes, but for search changes?
If the rating lists at 40/15s are identical to the rating lists at 40/240s, and all changes to the engine scale the same with time, then yes. As we know, they are similar, but not identical. One has to work out which factors contribute almost identically and which very differently. Those suspected by "intuition" or whatever of contributing to differences are better tested at longer time controls too. It is a bit of an art, with no clear statistical rules to determine which changes need to be tested how.
Yep, that was my question in the OP.

I think that all you can do is use fast games, and when you do something you might think will affect deeper searches more than shallow searches (IE LMR tuning) then ALSO test at longer time controls here and there to see if there is any noticeable change.

For evaluation changes, I rarely resort to longer games. Unless it is something major like king safety where depth can play a part. For search changes, I usually start at fast, go to 1m+1s and then 5m+5s and if there is any trend I might go longer, but most of the time it is not needed.

Parallel search is another whole can of worms. A BIG can of worms.

cdani · Post by **cdani** » Sun Aug 30, 2015 10:40 pm

Rebel wrote: Are you saying that playing 10,000 (40/240s) games is as reliable as 10,000 (40/15s) bullet games?

If you can afford to validate changes by playing 10,000 games at 40/240s instead of 40/15s, your engine will be stronger for sure, because most things scale differently at longer time controls. Just note that every engine usually plays different moves at longer time controls. So obviously the results will differ, often a little, sometimes a lot.

An engine tested at, say, 40/15s today will be stronger than the same engine tested in the same way some years ago, just because computers are faster, so the results are different.

Eelco de Groot · Post by **Eelco de Groot** » Mon Aug 31, 2015 2:52 am

Adam Hair wrote:Testing at a bullet time control is not as reliable as testing at a blitz time control when comparing the results from a fixed number of games. As you stated, there are sources of error that are not accounted for when estimating Elo difference that have a larger effect on bullet games. So, 1000 40/240s games are more reliable than 1000 40/15s games (assuming that other conditions such as the openings are the same).

However, I think that source of estimation error is dominated by the reduction of error due to the increased number of games that can be played in a fixed time period when using bullet time controls. Approximately 16,000 40/15s games can be played in the same amount of time as 1000 40/240s games (assuming an average of 80 moves per game). Ignoring the draw rate (I can not recall the error bar formula that includes draws at the moment), that would result in calculated error bars that are 1/4 of those for 1000 games.

I like to think of this as a Galton Box

[Video: demonstration of a Galton board, illustrating a binomial or normal distribution]

This is like playing a test match. If the two engines are of equal strength, you get a normal distribution of possible results at the bottom. The highest columns are exactly in the middle, if you just repeat the matches (the balls) often enough. Each ball's trajectory is a match of n games, if there are n rows of pins. The final outcome is only statistically predictable. If the chance of going left or right when bouncing off a pin is not 50%, this is like an Elo difference. The distribution at the bottom still looks like a normal distribution (normal as in De Normaal verdeling), but it is shifted to the right or left.

Now what would happen if you don't have a closed box, but repeat the experiment with a lot of side wind? If the wind is constant, it changes the probability of jumping left or right everywhere equally. You will measure this as if there were a strength difference that is not real. This would be a systematic error.

You might also have a variable wind, shifting from left to right. The normal distribution from the balls jumping off the pins left or right with an equal chance (or shifted due to a difference in Elo simulated in the board as a difference in chance jumping to one side) will be superposed with another normal distribution, due to the wind. It is harder to pick up the signal of a strength difference, unless the wind is shifting so rapidly compared to the length of a match that in each game the ball has an equal chance of being blown to the left as being blown to the right.

It is these kinds of noise sources that you do not really want to measure in a chess match, and it is also these kinds of noise that might creep into a badly controlled bullet match. For instance, the two engines may not get equal time from the operating system. If the difference is on the order of milliseconds, it will not matter in 5-minute games, but it will in game lengths you measure in seconds. Of course, there is not much sense in putting this into the Elo formula; the possible sources of noise are too varied.
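
The Galton board picture above can be tried out with a small simulation. This is only a sketch under simplifying assumptions: games are win/loss only, the "wind" is a Gaussian shift of the win probability, and the names `simulate`, `wind_sd` and `per_game` are invented for the example.

```python
import random
import statistics

def simulate(n_matches, n_games, p_win, wind_sd, per_game, rng):
    """Return (mean, stdev) of match scores on a simulated Galton board.

    wind_sd  : spread of the random shift applied to p_win (0 = closed box)
    per_game : True  -> the wind changes at every game (rapidly shifting)
               False -> one gust lasts for a whole match (slowly shifting)
    """
    scores = []
    for _ in range(n_matches):
        match_wind = rng.gauss(0.0, wind_sd)
        wins = 0
        for _ in range(n_games):
            wind = rng.gauss(0.0, wind_sd) if per_game else match_wind
            p = min(1.0, max(0.0, p_win + wind))
            wins += rng.random() < p
        scores.append(wins / n_games)
    return statistics.mean(scores), statistics.stdev(scores)

if __name__ == "__main__":
    rng = random.Random(2015)
    cases = {
        "closed box":         (0.50, 0.00, False),
        "constant wind":      (0.52, 0.00, False),  # systematic error
        "slow variable wind": (0.50, 0.03, False),  # extra spread
        "fast variable wind": (0.50, 0.03, True),   # averages out
    }
    for name, (p, sd, per_game) in cases.items():
        mean, spread = simulate(500, 400, p, sd, per_game, rng)
        print(f"{name:20s} mean={mean:.3f} stdev={spread:.3f}")
```

In this toy model a constant wind shifts the mean (a strength difference that is not real), a wind that changes slowly compared to the length of a match widens the distribution, and a wind that changes at every game leaves the spread close to that of the closed box.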

Rebel · Post by **Rebel** » Mon Aug 31, 2015 11:05 am

bob wrote:
Rebel wrote:
Laskos wrote:
Rebel wrote:
Laskos wrote:
Rebel wrote:I have good reasons to believe that the drawback of playing bullet games (say 40/15s) to test a change is that you have to play (a lot) more games than testing at CCRL/CEGT (40/240s).

Leaving out my considerations for the above statement for the moment I want to ask if research has been done to investigate that and maybe there is a formula (or factor) that you can use as base to lower the number of games when you (for instance) double the time control.

Say, you are playing 15,000 (40/15) bullet games. When you decide to double the time control, will 12,000 games (or so) give an equivalent result?

In general the rating lists give a pretty good indication of the strength of programs and they are not playing 10,000 games.
Bob is correct. Many factors, positional ones for example, scale differently with time control, so one goes to longer time controls only out of need, to check at reasonable game-play time controls. If all factors scaled equally, it would be preferable to test at the shortest possible time controls, limited by the clock tick and such.
Thanks Kai for moving in.

Are you saying that playing 10,000 (40/240s) games is as reliable as 10,000 (40/15s) bullet games?

This probably is true for positional changes, but for search changes?
If the rating lists at 40/15s are identical to the rating lists at 40/240s, and all changes to the engine scale the same with time, then yes. As we know, they are similar, but not identical. One has to work out which factors contribute almost identically and which very differently. Those suspected by "intuition" or whatever of contributing to differences are better tested at longer time controls too. It is a bit of an art, with no clear statistical rules to determine which changes need to be tested how.
Yep, that was my question in the OP.
I think that all you can do is use fast games, and when you do something you might think will affect deeper searches more than shallow searches (IE LMR tuning) then ALSO test at longer time controls here and there to see if there is any noticeable change.

For evaluation changes, I rarely resort to longer games. Unless it is something major like king safety where depth can play a part. For search changes, I usually start at fast, go to 1m+1s and then 5m+5s and if there is any trend I might go longer, but most of the time it is not needed.

That's how I see it too.

lkaufman · Post by **lkaufman** » Mon Aug 31, 2015 3:35 pm

I think the question needs to be restated. It is obvious that more time per game is better for a given number of games. Let's assume for now that scaling is not an issue (although it certainly is), and that results would be the same at any time control if everything was perfectly fair. Then the question becomes, how fast does the game have to be played for the advantage of playing more games with less time per game to be balanced out by the advantage of less error due to extraneous circumstances like operating system time and granularity of measuring time? Presumably playing game in one second will have too much granularity/system error, and probably beyond game in one minute the loss of games with more time is more of a problem than the benefit of further reducing random error. Somewhere in that one second to one minute range is the ideal time (I think), but where? It is probably lower with Linux than with Windows and may depend on the GUI. This is not a mathematical question but one for experts on computers, so I don't know the answer, but would like to. Does anyone have any data on this?

Laskos · Post by **Laskos** » Mon Aug 31, 2015 4:44 pm

lkaufman wrote:I think the question needs to be restated. It is obvious that more time per game is better for a given number of games. Let's assume for now that scaling is not an issue (although it certainly is), and that results would be the same at any time control if everything was perfectly fair. Then the question becomes, how fast does the game have to be played for the advantage of playing more games with less time per game to be balanced out by the advantage of less error due to extraneous circumstances like operating system time and granularity of measuring time? Presumably playing game in one second will have too much granularity/system error, and probably beyond game in one minute the loss of games with more time is more of a problem than the benefit of further reducing random error. Somewhere in that one second to one minute range is the ideal time (I think), but where? It is probably lower with Linux than with Windows and may depend on the GUI. This is not a mathematical question but one for experts on computers, so I don't know the answer, but would like to. Does anyone have any data on this?

Empirically, I found that the time allocation with respect to the desired time control suffers badly below 5 seconds per game. The time granularity on Windows is about 15 ms. The results can be strongly affected by noise, and a nasty, systematic one at that. Above 15 seconds per game most of the bias is from scaling. Fixed-depth or fixed-node games can be extremely fast, but they have limited applications.
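
To get a rough feel for why a 15 ms granularity hurts at very short time controls, here is a small sketch that compares the timer tick with the average time available per move. It assumes the time is spread evenly over 40 moves per side, which real engines do not do exactly; `tick_fraction` is a hypothetical helper, and the 15 ms default is the figure quoted in the post above.

```python
def tick_fraction(seconds_per_side, moves_per_side=40, tick_ms=15.0):
    """Timer tick as a fraction of the average time per move.

    Assumes the clock time is divided evenly over the moves (a simplification).
    """
    seconds_per_move = seconds_per_side / moves_per_side
    return (tick_ms / 1000.0) / seconds_per_move

if __name__ == "__main__":
    for label, seconds in (("5s game", 5.0), ("40/15s", 15.0), ("40/240s", 240.0)):
        print(f"{label:8s} one tick = {100.0 * tick_fraction(seconds):.2f}% of a move's time")
```

Under this assumption one tick is about 12% of a move's budget at 5 seconds per 40 moves, about 4% at 40/15s and about 0.25% at 40/240s, so a one-tick error in time measurement is proportionally far larger in the shortest games.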
