Some thoughts on QS

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

diep wrote:
Don wrote:
diep wrote: We seldom agree, but in this case we do.

Rybka3 is on CCRL 3134 and Houdini which is like 7 plies deeper searching yet having the same eval, it's 3208.

That's 70 elopoints for 7 plies.

However in case of Komodo he's winning 2 ply already at a depth of 10, whereas he's heavy forward pruning last few plies.

It means Komodo without LMR is hopelessly inefficient, so that changes the equation as i assume Don didn't cut'n paste the evaluation of rybka.

The point is that Don's search is totally inefficient *without* LMR.
Komodo without LMR is still stronger than most programs. But what does that matter? This would be like me saying Diep is totally dependent on alpha/beta pruning or else it's hopelessly inefficient. That would be a true but meaningless statement because what matters is how you put everything together.

This basically means don gets more out of LMR than rybka - which could be very true as rybka of course forward prunes a lot last few plies.

Also i advise Don, to quote Johan de Koning, not to do incest testing, as incest testing is never a good idea.

Vincent
I have no idea what you are talking about - but we do all our testing against other programs, not komodo vs komodo. Is that what you are talking about?

Don
The most stupid way of forward pruning, if i enable it in Diep, it's 100 elo stronger for Diep in blitz. It's 300 elo stronger in diep-diep. It's 60 elo weaker if i play against other programs.

If i enable multicut. It's a lot of elo stronger at fast time controls single core and especially in diep - diep, and at slower time controls it's a LOT weaker against other programs. Most importantly i noticed that if i enable it, it searches a ply deeper. That matters most 10 to 12 ply.

When diep already gets above 14 ply search depth as a minimum, then multicut no longer gives elo.
I see multi-cut as a drop-in replacement for null move pruning but probably not quite as good. We did a lot of experimenting with it and the multi-cut version was about 20 ELO weaker if I remember correctly. But I'm not one to pass judgment too hastily, it could be that we never stumbled on the right formula for doing it.

Now you claim super-bullet time controls, and something that gives you 2 ply at 10 ply search depth, this for an engine that's easily getting 20-25 ply,
and you test komodo versus komodo.

That's not science.
Science must first start with facts and you apparently don't have any idea about what we do because you have this all wrong.

We are very concerned with scalability and if you look at the rating lists you will notice that the longer the time control the better Komodo does. We have put a significant amount of time into understanding what works and how it's affected by depth.

We don't test Komodo versions against each other. We test in gauntlet fashion where each new candidate plays several foreign programs and not other Komodo versions.

So you are just saying stupid things that are not factual.

I do agree on one thing, some types of changes can help at fast time controls and hurt at long time controls - there is no doubt about that. There are also things that help only at long time controls and those things sometimes don't make it into most programs because they are too hard to test where you need 20,000 games to prove an idea.
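
As a rough illustration of why small improvements need game counts of this size, here is a minimal sketch of the error bar on a match result. It assumes two equally strong engines, a 40% draw ratio and a 95% interval (z = 1.96); all three are assumptions chosen for illustration, not figures from anyone's testing.

```python
import math

def elo_error_bar(games, draw_ratio=0.4, z=1.96):
    """Approximate error bar (in Elo) for a match between two equal engines.

    draw_ratio and z are illustrative assumptions, not measured values.
    """
    # Per-game variance of the score around 0.5: wins and losses each
    # deviate by 0.5, draws do not deviate at all.
    variance = (1.0 - draw_ratio) * 0.25
    se_score = math.sqrt(variance / games)
    # Slope of Elo = -400 * log10(1/s - 1) evaluated at s = 0.5.
    elo_per_score = 4.0 * 400.0 / math.log(10.0)
    return z * se_score * elo_per_score

for n in (500, 1000, 20000):
    print(n, round(elo_error_bar(n), 1))
```

Under these assumptions the error bar shrinks with the square root of the game count, so 20,000 games resolve differences of a few Elo while 1,000 games only resolve differences well above 10 Elo.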

I personally think you are too quick to draw conclusions about what works and what doesn't work. You talk about science but you don't use science at all, everything about you is intuition driven. Even your program has strengths and weaknesses that are driven by whether your intuition was good or bad.

The pattern I see over and over again in computer chess is that it is full of superstition and conjecture. The only way to have a really strong program today (other than copying someone else and calling it yours) is to leave your superstitions at the door and open up your mind and never take yourself (or your opinions) too seriously or you will just end up painting yourself into a corner.

Here is a thought experiment. Imagine that computers continue to get faster and faster until they are again 100 times faster than today. Are we still going to have the argument that things that work at 1 minute will not work at 1 hour? Because 1 hour now will be like 36 seconds then. If we had had this discussion 20 years ago what would we have concluded?

It seems to me that in 10 or 20 years we will have to "adjust" our programs over and over again to be strong at the only levels we can reasonably test. When we work on scalability we take the cognitive shortcut of assuming that there are only 2 search depths, anything below a few ply which are "fast" and everything above that, which is "slow" and that beyond that an idea either "works" or "does not work." So you get language such as, "it does not work at long time controls." But long and short are highly relative concepts. Komodo at game in 1 second would be like Sargon on the z80 playing a correspondence game.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 10300
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Some thoughts on QS

Post by Uri Blass »

I do not notice that komodo does better at longer time control based on rating lists.

It may do better in blitz relative to bullet but when I compare between
CCRL 40/40 and CCRL 40/4 it is not clear.
Even for Komodo3 there may be a statistical error but
at least it seems to be the case and the best example that I could find when Komodo3 is better than something at 40/40 and worse at 40/4 is
Rybka3 64 bit 2 cpu.

long time control
Komodo 3 64-bit 3114 +15 −15 61.3% −75.1 42.5% 1435
blitz
Komodo 3 64-bit 3164 +10 −10 57.6% −68.9 36.1% 4535

Even here Rybka3 2 cpu may be weaker than komodo3 in 40/4
because the statistical error is too high.

long time control
Rybka 3 64-bit 2CPU 3076 +21 −21 67.0% −112.2 42.9% 769
blitz
Rybka 3 64-bit 2CPU 3181 +14 −14 78.5% −247.4 28.5% 2596
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

Uri Blass wrote:I do not notice that komodo does better at longer time control based on rating lists.
Here are the current numbers using Houdini 2.0 and Komodo 5:

CCRL 40/4 - Houdini up 60 ELO
CCRL 40/40 - Houdini up 23 ELO

CEGT 40/4 - Houdini up 44 ELO
CEGT 40/20 - Houdini 2 ELO weaker than Komodo 5

In both cases you see we do pretty badly at short time controls and improve substantially at longer time controls, which refutes Vincent's idea that Komodo is optimized only to play fast time controls.

If you look at the CEGT list at 40/4 you will also see that Komodo 5 is coming out weaker than several other programs but at 40/20 it comes out as better than all other programs except for a virtual tie with Houdini 2.0c.

I think some of what we are observing is that Houdini is the wild card here. Most programs have ELO curves similar to Komodo, but Houdini and the other clones tend to play really well at fast time controls - so that is where you see most of the difference.


It may do better in blitz relative to bullet but when I compare between
CCRL 40/40 and CCRL 40/4 it is not clear.
Even for Komodo3 there may be a statistical error but
at least it seems to be the case and the best example that I could find when Komodo3 is better than something at 40/40 and worse at 40/4 is
Rybka3 64 bit 2 cpu.

long time control
Komodo 3 64-bit 3114 +15 −15 61.3% −75.1 42.5% 1435
blitz
Komodo 3 64-bit 3164 +10 −10 57.6% −68.9 36.1% 4535

Even here Rybka3 2 cpu may be weaker than komodo3 in 40/4
because the statistical error is too high.

Rybka 3 64-bit 2CPU 3076 +21 −21 67.0% −112.2 42.9% 769
Rybka 3 64-bit 2CPU 3181 +14 −14 78.5% −247.4 28.5% 2596
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Some thoughts on QS

Post by Houdini »

Let's do some more rating list cherry-picking. :lol:

CCRL 40/4:
- Komodo 3 64-bit 3164
- Komodo 4 64-bit 3191 (+27)
- Komodo 5 64-bit 3217 (+53)

CCRL 40/40
- Komodo 3 64-bit 3114
- Komodo 4 64-bit 3117 (+3)
- Komodo 5 64-bit 3120 (+6)

Between Komodo 3 and Komodo 5 you've managed to add 53 Elo at fast 40/4, translating into 6 Elo at slow 40/40.
Timo's matches at 120+3 strongly suggest the same, as Komodo 5 performed worse than Komodo 4.

"Lies, damned lies, and statistics".
Uri Blass
Posts: 10300
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Some thoughts on QS

Post by Uri Blass »

Don wrote:
Uri Blass wrote:I do not notice that komodo does better at longer time control based on rating lists.
Here are the current numbers using Houdini 2.0 and Komodo 5:

CCRL 40/4 - Houdini up 60 ELO
CCRL 40/40 - Houdini up 23 ELO

CEGT 40/4 - Houdini up 44 ELO
CEGT 40/20 - Houdini 2 ELO weaker than Komodo 5

In both cases you see we do pretty badly at short time controls and improve substantially at longer time controls, which refutes Vincent's idea that Komodo is optimized only to play fast time controls.

If you look at the CEGT list at 40/4 you will also see that Komodo 5 is coming out weaker than several other programs but at 40/20 it comes out as better than all other programs except for a virtual tie with Houdini 2.0c.

I think some of what we are observing is that Houdini is the wild card here. Most programs have ELO curves similar to Komodo, but Houdini and the other clones tend to play really well at fast time controls - so that is where you see most of the difference.
some comments when I look at the CCRL list

1)I think that usually you see smaller difference at longer time control
so comparing elo difference is not going to help here.

CCRL 40/40
Komodo 5 64-bit 3120 +24 −24
Movei 00.8.438 (10 10 10) 2672 +12 −12

difference 448 elo

CCRL 40/4

Komodo 5 64-bit 3217 +24 −23
Movei 00.8.438 (10 10 10) 2623 +9 −8

difference 594 elo

Note that Movei does relatively better than programs of similar strength at long time control, but I do not claim that it does better than Komodo at long time control, because I think that a difference in rating is meaningful only if one program is better at long time control and worse at blitz.

2)Even if we look only at Komodo5 and Houdini
I think that it is better to choose houdini1.5 and not houdini2 because 1.5 seems to be better than 2 at long time control.


If I use the CCRL 40/40 and 40/4 lists I get:
40/4

Houdini 2.0c 64-bit 3276 +18 −18 69.3% −166.5 31.5%
Houdini 1.5a 64-bit 3235 +12 −12 68.7% −162.4 31.0%
Komodo 5 64-bit 3217 +24 −23 71.6% −188.4 28.6%

40/40

Houdini 1.5a 64-bit 3154 +16 −16 63.5% −89.9 43.4%
Houdini 2.0c 64-bit 3143 +16 −16 66.8% −117.9 36.8%
Komodo 5 64-bit 3120 +24 −24 61.5% −73.5 53.4%

The difference between houdini1.5a and Komodo5
is bigger at 40/40 (34 elo) than at 40/4 (18 elo).

3)even if we ignore houdini1.5a and look only at komodo5 and houdini2c the result is not significant statistically.

The 59 elo difference between houdini2c and komodo5 at 40/4 has a statistical error near 30 elo.

The 23 elo difference between houdini2c and komodo5 at 40/40 has a similar statistical error.

The difference between the differences has a statistical error of more than 36 elo.
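
The error estimates above can be reproduced with a short sketch. It assumes the listed CCRL error bars can be combined in quadrature, i.e. that the errors are independent; that is a simplifying assumption, since ratings drawn from one pool are correlated to some degree.

```python
import math

def diff_error(err_a, err_b):
    """Error bar of a rating difference, assuming independent errors."""
    return math.hypot(err_a, err_b)

# Ratings and error bars as listed above for Houdini 2.0c and Komodo 5.
gap_blitz, err_blitz = 3276 - 3217, diff_error(18, 24)   # CCRL 40/4
gap_long, err_long = 3143 - 3120, diff_error(16, 24)     # CCRL 40/40

# How much the gap shrinks from 40/4 to 40/40, and how uncertain that is.
change = gap_blitz - gap_long
err_change = diff_error(err_blitz, err_long)
print(change, round(err_change, 1))
```

If the error on the change is larger than the change itself, the two lists cannot separate "Komodo scales better" from noise.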
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some thoughts on QS

Post by diep »

Don wrote:
diep wrote:
Don wrote:
diep wrote: We seldom agree, but in this case we do.

Rybka3 is on CCRL 3134 and Houdini which is like 7 plies deeper searching yet having the same eval, it's 3208.

That's 70 elopoints for 7 plies.

However in case of Komodo he's winning 2 ply already at a depth of 10, whereas he's heavy forward pruning last few plies.

It means Komodo without LMR is hopelessly inefficient, so that changes the equation as i assume Don didn't cut'n paste the evaluation of rybka.

The point is that Don's search is totally inefficient *without* LMR.
Komodo without LMR is still stronger than most programs. But what does that matter? This would be like me saying Diep is totally dependent on alpha/beta pruning or else it's hopelessly inefficient. That would be a true but meaningless statement because what matters is how you put everything together.

This basically means don gets more out of LMR than rybka - which could be very true as rybka of course forward prunes a lot last few plies.

Also i advise Don, to quote Johan de Koning, not to do incest testing, as incest testing is never a good idea.

Vincent
I have no idea what you are talking about - but we do all our testing against other programs, not komodo vs komodo. Is that what you are talking about?

Don
The most stupid way of forward pruning, if i enable it in Diep, it's 100 elo stronger for Diep in blitz. It's 300 elo stronger in diep-diep. It's 60 elo weaker if i play against other programs.

If i enable multicut. It's a lot of elo stronger at fast time controls single core and especially in diep - diep, and at slower time controls it's a LOT weaker against other programs. Most importantly i noticed that if i enable it, it searches a ply deeper. That matters most 10 to 12 ply.

When diep already gets above 14 ply search depth as a minimum, then multicut no longer gives elo.
I see multi-cut as a drop-in replacement for null move pruning but probably not quite as good. We did a lot of experimenting with it and the multi-cut version was about 20 ELO weaker if I remember correctly. But I'm not one to pass judgment too hastily, it could be that we never stumbled on the right formula for doing it.
We can mathematically prove nullmove to be a stronger algorithm than multicut for computerchess. Don't get me wrong - multicut is a real invention.

It's a big thing to invent - all credit to Yngvi, who invented it in the 90s.

It shares a weakness with fail-high reductions, and has another one besides. Both have to do with the hashtable.

The weakness with hashtable is that there is no cheap way of verifying that all 3 paths transpose into the 100% same refutation.

In chess this is a big issue as it's very common for paths to converge into the same refutation.

Say we can do Bxh6+ here, Qxg6+ or Qd8+.

All 3 transpose, further away from the root, into the same position, upon which we base our entire multicut.

So the algorithm is really transposition-sensitive. There will undoubtedly be ways to undo this, but that obviously means your transpositiontable will work less well.

If we nullmove in the same position we tend to reduce with a bigger R than we can do with multicut. Giving the opponent the move is a very strong assumption.

We do not suffer from real ugly transpositions then unlike with multicut.

So nullmove in general is far more powerful than multicut as we can reduce more and we do not suffer from the transpositiontable effect.

QED.

So with drop in i assume you mean that they try to reduce basically the same search space. This is entirely correct - yet nullmove is doing this in a far more correct manner than multicut.

Also where nullmove doesn't work because the opponent can capture a piece of us, yet where normal search gives a cutoff, there multicut has usually a problem to give a cutoff, as it needs 3 cutoffs. It doesn't happen a lot that you can capture a piece in 3 manners. AND IF YOU CAN THEN WE HAVE THE TRANSPOSITION BUG AGAIN IN SEVERAL CASES.

I experimented a lot as well with multicut and i'm sure with some tricks you can eliminate the transpositiontable problem a tad.

Yet even then the reduction factor you need for multicut is rather huge. You need to reduce at least a ply or 3 (on top of the normal reduction) to get any benefit at all out of multicut.

In principle you reduce your position P, without giving the opponent some sort of benefit like with nullmove, with a bunch of plies.

That's very tricky at least.

Doing that i can win 1 ply search depth with Diep. Is it worth to win 1 ply meanwhile having a reduction of in total 3 plies in what is possible a critical line?

For some who just test bullet i'm sure it might work, and some will find it working at all levels for them, but i guarantee you, nullmove is a far stronger assumption than multicut.
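
To make the "3 cutoffs" and the extra reduction concrete, here is a minimal sketch of a multi-cut probe inside a plain negamax alpha-beta search. It runs on a synthetic game tree, not on chess positions: the tree shape, the `evaluate` hash and the constants `MC_R`, `MC_M` and `MC_C` are all assumptions made for illustration, and it is not Diep's or Komodo's code. There is no transposition table and no null move here, so the transposition problem described above is not modelled.

```python
INF = 10**9
BRANCH = 4   # moves per position in the synthetic tree (assumption)
MC_R = 2     # extra depth reduction for the multi-cut probe searches
MC_M = 4     # number of moves the probe examines
MC_C = 3     # number of probe cutoffs required to prune the node

def evaluate(path):
    """Deterministic pseudo-random score in [-100, 100], side to move's view."""
    h = 17
    for move in path:
        h = (h * 31 + move + 1) % (1 << 32)
    return (h * 2654435761 >> 7) % 201 - 100

def minimax(path, depth):
    """Reference search without any pruning."""
    if depth == 0:
        return evaluate(path)
    return max(-minimax(path + (m,), depth - 1) for m in range(BRANCH))

def search(path, depth, alpha, beta, multicut, stats):
    """Fail-soft negamax alpha-beta with an optional multi-cut probe."""
    stats["nodes"] += 1
    if depth == 0:
        return evaluate(path)
    if multicut and depth - 1 - MC_R >= 0 and beta < INF:
        cutoffs = 0
        for m in range(min(MC_M, BRANCH)):
            # Reduced-depth null-window search: does this move beat beta?
            score = -search(path + (m,), depth - 1 - MC_R,
                            -beta, -beta + 1, multicut, stats)
            if score >= beta:
                cutoffs += 1
                if cutoffs >= MC_C:
                    return beta  # speculative cutoff, not verified at full depth
    best = -INF
    for m in range(BRANCH):
        score = -search(path + (m,), depth - 1, -beta, -alpha, multicut, stats)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best

plain, mc = {"nodes": 0}, {"nodes": 0}
print(search((), 6, -INF, INF, False, plain), plain["nodes"])
print(search((), 6, -INF, INF, True, mc), mc["nodes"])
```

The probe only ever prunes on the strength of shallower searches, which is the risk being discussed: the node is cut without any compensation to the opponent, unlike a null move.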


Now you claim super-bullet time controls, and something that gives you 2 ply at 10 ply search depth, this for an engine that's easily getting 20-25 ply,
and you test komodo versus komodo.

That's not science.
Science must first start with facts and you apparently don't have any idea about what we do because you have this all wrong.

We are very concerned with scalability and if you look at the rating lists you will notice that the longer the time control the better Komodo does. We have put a significant amount of time into understanding what works and how it's affected by depth.

We don't test Komodo versions against each other. We test in gauntlet fashion where each new candidate plays several foreign programs and not other Komodo versions.

So you are just saying stupid things that are not factual.
We agree that you have to start somewhere.

what i simply see is that you played komodo without lmr against komodo with lmr at a small search depth.

Now There is a ton of algorithms that will give diep BIG ELO at bullet/blitz single core.

I gave a few examples.

Especially if we consider you also forward prune heavily last few plies, practical your 10 ply search depth is equal to Diep search depth 7 or so, from tactical viewpoint.

At that depth of course *anything* that gets me searching deeper will give elo.
I do agree on one thing, some types of changes can help at fast time controls and hurt at long time controls - there is no doubt about that. There are also things that help only at long time controls and those things sometimes don't make it into most programs because they are too hard to test where you need 20,000 games to prove an idea.
I know better than anyone else that with just 1 machine at home you can't do much.

In this case: more than 150 elopoints are rather easy to prove.

Additionally it's you who in 90s already posted the indication that at bigger search depths the elowin for basically anything is smaller.

Claiming the opposite now contradicts that, and is a rather naive claim.

I personally think you are too quick to draw conclusions about what works and what doesn't work. You talk about science but you don't use science at all, everything about you is intuition driven. Even your program has strengths and weaknesses that are driven by whether your intuition was good or bad.
I'm not doing a claim that LMR is 150 elopoints at super-bullet of say 0.1 seconds a move and THEREFORE it gives more elo at slower time controls.

That's not only counter-intuitive, it's also dead wrong science.

Furthermore much of the forward pruning experiments i did do, they are at a BIGGER search depth actually than your superbullet. And that already around 11-12 years ago.

Sure it was done at 36 computers and you didn't have 36 computers. Usually it took me a whole day to setup the 36 computers, besides the driving time to Jan Louwman. He then reported back after a week or 3.

So each experiment also took way longer than what you posted here.

Usually around a 500 games with and 500 games without X it was.

Obviously i couldn't repeat this too much and it layed a big stress on Jan.

Calling that intuition driven is dead wrong. I probably had more accurate data than anyone, except a few chess programmers with machines at home, to prove or disprove specific algorithms.

Also another hard fact is that you need far less games in slower time control games than you'd need in bullet to prove anything.

About all chessprogrammers that i know of confirmed this in the tournaments they showed up at.

Some algorithms require overhead. Even at superbullet you won't have enough system time for that overhead.

There will be many algorithms that super-bullet will never discover.

Yet many simple type algorithms, no matter how big of an invention, they just do not work very well in computerchess, yet in many programs they work in superbullet.
The pattern I see over and over again in computer chess is that it is full of superstition and conjecture. The only way to have a really strong program today (other than copying someone else and calling it yours) is to leave your superstitions at the door and open up your mind and never take yourself (or your opinions) too seriously or you will just end up painting yourself into a corner.
I'm not here to impose what sort of social behaviour others must follow. I do notice a lot of bad science though.
Here is a thought experiment. Imagine that computers continue to get faster and faster until they are again 100 times faster than today. Are we still going to have the argument that things that work at 1 minute will not work at 1 hour? Because 1 hour now is like 36 seconds will be then. If
A single core today is not factor 100 faster than 12 years ago.
The cores i have here are 2.5Ghz core2 Xeons. And 64 cores of them :)

Back in 2001 a forward pruning experiment at Jan Louwmans computer with 1000 games in total (500 with and around 500 without), which scored 20% worse against the world top in those days, at slow level time controls,

we used mostly k7 machines as well as p3's. The slower P3's at 800Mhz i put at a time control of 9 hours a game. The 1.x Ghz k7's were put at 40 in 2.

That's 3 minutes a move. So in Ghz minute a move that's 1.6Ghz * 3 = 4.8Ghz minute a move.

Your superbullet, i don't know what your hardware is, could be a high clocked i7 or something, but if you have 2.5Ghz core2/i7 based hardware that's:

2.5Ghz * 0.1 seconds / 60 = 0.004167 Ghz minute a move.

Even with factor 100 faster hardware, you won't do better than the testing i did 12 years ago, in 2001, on single core machines.

Yet you know just as well as i do, that a single core in 10 years from now won't be factor 100 faster. You'll have to scale SMP.
we had had this discussion 20 years ago what would we have concluded?
We would've concluded that searching at 2 ply search depths would be pretty stupid.

What most here have forgotten is my statements from end 90s, which is that you first need to get through a tactical barrier.

That barrier for todays programs lies somewhere 17+ plies or so. For software from a year or 10 ago it was somewhere around a 12-14 ply.

Above that testing becomes easier as a lot of the tactical noise you stumble upon gets away and positional and strategical factors tend to become more important.

The huge reduction factors used for nullmove and even LMR nowadays definitely give indication of that tactical barrier.

If it wouldn't be there, you would not be able to use such huge reduction factors.
It seems to me that in 10 or 20 years we will have to "adjust" our programs over and over again to be strong at the only levels we can reasonably test. When we work on scalability we take the cognitive shortcut of assuming that there are only 2 search depths, anything below a few ply which are "fast" and everything above that, which is "slow" and that beyond that an idea either "works" or "does not work." So you get language such as, "it does not work at long time controls." But long and short are highly relative concepts. Komodo at game in 1 second would be like Sargon on the z80 playing a correspondence game.
With factor 100 faster hardware you would first need to get a speedup of factor 100.0, and even then you barely have the accuracy of the algorithmic experiments i did in 2001: testing on factor 100 faster hardware at 0.1 seconds a move is a lot slower than game in 1 second, and 2.5Ghz * 0.1 seconds / 60 * 100 = 0.4167 Ghz minute a move, which is still slower than what i did in 2001.

entire game in 1 second at factor 100 faster hardware, probably by then games are 100 moves if not more, means you can take at most 10 milliseconds a move.

So at multiple socket machines you'll get dicked as the fastest timing you can do there is gonna eat 33 milliseconds or so from the runqueue.

But let's continue the thought experiment.

10 milliseconds at the virtual 250Ghz core2 hardware that's in Ghz minute a move:

250Ghz * 0.01 seconds / 60 = 0.04167 Ghz minute a move.

That is only about 10 times the 2.5Ghz * 0.1 seconds a move you test at now, and still a factor 100 slower than what i tested at in 2001.
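
The "Ghz minute a move" figures can be re-derived with a few lines. This is only a sketch of the unit used in this post (clock speed times thinking time per move, in minutes); the function name is made up for illustration, and the unit ignores differences in work per clock between a K7, a Core 2 and any future core.

```python
def ghz_minutes_per_move(clock_ghz, seconds_per_move):
    """Clock speed (GHz) times thinking time per move (minutes)."""
    return clock_ghz * seconds_per_move / 60.0

# 2001 testing: K7 at roughly 1.6 GHz, 40 moves in 2 hours = 3 minutes a move.
test_2001 = ghz_minutes_per_move(1.6, 180.0)
# Super-bullet today: 2.5 GHz core at 0.1 seconds a move.
superbullet = ghz_minutes_per_move(2.5, 0.1)
# Thought experiment: 100x faster core, game in 1 second over about 100 moves.
future = ghz_minutes_per_move(250.0, 0.01)

for name, value in (("2001", test_2001),
                    ("superbullet", superbullet),
                    ("future", future)):
    print(name, round(value, 5), "ratio to 2001:", round(test_2001 / value, 1))
```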

Thanks,
Vincent
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

Uri Blass wrote:
Don wrote:
Uri Blass wrote:I do not notice that komodo does better at longer time control based on rating lists.
Here are the current numbers using Houdini 2.0 and Komodo 5:

CCRL 40/4 - Houdini up 60 ELO
CCRL 40/40 - Houdini up 23 ELO

CEGT 40/4 - Houdini up 44 ELO
CEGT 40/20 - Houdini 2 ELO weaker than Komodo 5

In both cases you see we do pretty badly at short time controls and improve substantially at longer time controls, which refutes Vincent's idea that Komodo is optimized only to play fast time controls.

If you look at the CEGT list at 40/4 you will also see that Komodo 5 is coming out weaker than several other programs but at 40/20 it comes out as better than all other programs except for a virtual tie with Houdini 2.0c.

I think some of what we are observing is that Houdini is the wild card here. Most programs have ELO curves similar to Komodo, but Houdini and the other clones tend to play really well at fast time controls - so that is where you see most of the difference.
some comments when I look at the CCRL list

1)I think that usually you see smaller difference at longer time control
so comparing elo difference is not going to help here.

CCRL 40/40
Komodo 5 64-bit 3120 +24 −24
Movei 00.8.438 (10 10 10) 2672 +12 −12

difference 448 elo

CCRL 40/4

Komodo 5 64-bit 3217 +24 −23
Movei 00.8.438 (10 10 10) 2623 +9 −8

difference 594 elo

Note that Movei does relatively better than programs at similiar strength at long time control but I do not claim that it does better than Komodo at long time control because I think that difference in rating is meaningful only if one program is better at long time control and worse at blitz.

2)Even if we look only at Komodo5 and Houdini
I think that it is better to choose houdini1.5 and not houdini2 because 1.5 seems to be better than 2 at long time control.


and
If I use the CCRL 40/40 and 40/4 I get
40/4

Houdini 2.0c 64-bit 3276 +18 −18 69.3% −166.5 31.5%
Houdini 1.5a 64-bit 3235 +12 −12 68.7% −162.4 31.0%
Komodo 5 64-bit 3217 +24 −23 71.6% −188.4 28.6%

40/40

Houdini 1.5a 64-bit 3154 +16 −16 63.5% −89.9 43.4%
Houdini 2.0c 64-bit 3143 +16 −16 66.8% −117.9 36.8%
Komodo 5 64-bit 3120 +24 −24 61.5% −73.5 53.4%

The difference between houdini1.5a and Komodo5
is bigger at 40/40 and

3)even if we ignore houdini1.5a and look only at komodo5 and houdini2c the result is not significant statistically.

The 59 elo difference between houdini2c and komodo5 at 40/4 have a statistical error near 30 elo.

The 23 elo difference between houdini2c and komodo5 at 40/40 have a similiar statistical error.

The difference between the differences has a statistical error of more than 36 elo.
Hi Uri,

You do in fact see smaller difference at longer time controls. However we have played hundreds of thousands of games over the past year or two at various time controls that make it absolutely clear that we do badly at fast time controls compared to longer time controls.

Houdini is a very bad example to compare against because unless you run it deep enough all you will see is Komodo approaching Houdini. As you say there is some doubt due to the factor you mentioned and also the fact that the rating agencies are almost useless (for things like this) due to their tiny samples and widely varying conditions.

This argument was not supposed to be about Houdini however, it was supposed to be an answer to Vincent who claims that Komodo is super optimized for lightning chess compared to other programs. There is no significant data anywhere which supports that claim.

I think there is an interesting experiment you can conduct that is a fair way to measure this effect between any 2 programs without being concerned about the rating compression effect you observed. Here is how it works:

Start with the assertion and then try to prove or disprove it. Let's say the assertion is that program A scales better than program B. Choose some fast time control and then play timed games where one program is handicapped such that program A is scoring slightly worse than program B with a statistically significant sample of games.

Now your starting assertion is that program A should win at longer time controls, so play a second match with the same handicap given to the same program but the time control increased by a factor of 10. If your assertion is correct, then you should see a crossover, program A should be winning the second match although it lost the first.

We have seen over and over again that we lose to Critter (and Ippo programs) if we test too fast, and we win if we test longer. So we are not chasing Critter with a gap that is gradually closing, we are passing it with an increasing gap. I suspect that if you go deep enough and draws become common the gap will then likely start to gradually close (even if the "chess superiority" continues to increase). But when you see an increase even though you know that there should be a decrease with depth, then you have proved your point, and we have seen that many times in our own testing.
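
The crossover experiment can be sketched as a small check on two match results. The win/draw/loss counts below are hypothetical, and the 95% threshold (z = 1.96) is an assumption; the sketch only shows the shape of the test, where program A must lose the fast match and win the slow one under the same handicap, both by a significant margin.

```python
import math

def score_and_error(wins, draws, losses):
    """Score fraction of program A and its standard error."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * score ** 2) / n
    return score, math.sqrt(variance / n)

def elo_from_score(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def crossover(fast, slow, z=1.96):
    """True if A significantly loses the fast match and wins the slow one."""
    s_fast, e_fast = score_and_error(*fast)
    s_slow, e_slow = score_and_error(*slow)
    lost_fast = s_fast + z * e_fast < 0.5
    won_slow = s_slow - z * e_slow > 0.5
    return lost_fast and won_slow

# Hypothetical results for program A as (wins, draws, losses).
fast_match = (900, 1000, 1100)   # fast time control, A handicapped
slow_match = (1100, 1000, 900)   # same handicap, time control x10
print(crossover(fast_match, slow_match))
print(round(elo_from_score(score_and_error(*slow_match)[0]), 1))
```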

Don
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

diep wrote: Additionally it's you who in 90s already posted the indication that at bigger search depths the elowin for basically anything is smaller.

Claiming the opposite now is contradicting that claim, and rather naive claim.
I think you misunderstood something about what I said either then or now. I believe that with depth any superiority is reduced generally - the ELO gap closes between a weak and strong program in general, but a terribly written unscalable program may actually lose ground with depth. You can easily write a program that does not scale well and loses ground to other programs with depth. That is not an absolute that it can never happen.

The programs of 30 years ago - play them against Komodo and handicap Komodo to be equal in strength - then keep doubling the time control for each program and you will see Komodo's ELO increase relative to them with each doubling.
Uri Blass
Posts: 10300
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Some thoughts on QS

Post by Uri Blass »

Don wrote:
Uri Blass wrote:
Don wrote:
Uri Blass wrote:I do not notice that komodo does better at longer time control based on rating lists.
Here are the current numbers using Houdini 2.0 and Komodo 5:

CCRL 40/4 - Houdini up 60 ELO
CCRL 40/40 - Houdini up 23 ELO

CEGT 40/4 - Houdini up 44 ELO
CEGT 40/20 - Houdini 2 ELO weaker than Komodo 5

In both cases you see we do pretty badly at short time controls and improve substantially at longer time controls, which refutes Vincent's idea that Komodo is optimized only to play fast time controls.

If you look at the CEGT list at 40/4 you will also see that Komodo 5 is coming out weaker than several other programs but at 40/20 it comes out as better than all other programs except for a virtual tie with Houdini 2.0c.

I think some of what we are observing is that Houdini is the wild card here. Most programs have ELO curves similar to Komodo, but Houdini and the other clones tend to play really well at fast time controls - so that is where you see most of the difference.
some comments when I look at the CCRL list

1) I think that you usually see a smaller difference at longer time controls, so comparing the elo difference is not going to help here.

CCRL 40/40
Komodo 5 64-bit 3120 +24 −24
Movei 00.8.438 (10 10 10) 2672 +12 −12

difference 448 elo

CCRL 40/4

Komodo 5 64-bit 3217 +24 −23
Movei 00.8.438 (10 10 10) 2623 +9 −8

difference 594 elo

Note that Movei does relatively better than programs of similar strength at long time control, but I do not claim that it does better than Komodo at long time control, because I think that a difference in rating is meaningful only if one program is better at long time control and worse at blitz.
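The compression in point 1 can be put in terms of expected score. The following is a minimal sketch that assumes the standard logistic Elo model; the function name `expected_score` is illustrative, and the ratings are the CCRL figures quoted above.

```python
def expected_score(elo_gap: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

# Komodo 5 minus Movei, using the CCRL ratings quoted above.
gaps = {
    "40/40": 3120 - 2672,  # 448
    "40/4": 3217 - 2623,   # 594
}

for label, gap in gaps.items():
    print(f"CCRL {label}: gap {gap} elo, expected score {expected_score(gap):.3f}")
```

Under that model the 448 elo gap corresponds to an expected score of about 0.93 and the 594 elo gap to about 0.97, so a smaller gap at the longer time control still means a very one-sided match.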

2) Even if we look only at Komodo 5 and Houdini, I think that it is better to choose Houdini 1.5 and not Houdini 2, because 1.5 seems to be better than 2 at long time control.

If I use the CCRL 40/40 and 40/4 lists I get:
40/4

Houdini 2.0c 64-bit 3276 +18 −18 69.3% −166.5 31.5%
Houdini 1.5a 64-bit 3235 +12 −12 68.7% −162.4 31.0%
Komodo 5 64-bit 3217 +24 −23 71.6% −188.4 28.6%

40/40

Houdini 1.5a 64-bit 3154 +16 −16 63.5% −89.9 43.4%
Houdini 2.0c 64-bit 3143 +16 −16 66.8% −117.9 36.8%
Komodo 5 64-bit 3120 +24 −24 61.5% −73.5 53.4%

The difference between Houdini 1.5a and Komodo 5 is bigger at 40/40 (34 elo) than at 40/4 (18 elo).

3) Even if we ignore Houdini 1.5a and look only at Komodo 5 and Houdini 2.0c, the result is not statistically significant.

The 59 elo difference between Houdini 2.0c and Komodo 5 at 40/4 has a statistical error near 30 elo.

The 23 elo difference between Houdini 2.0c and Komodo 5 at 40/40 has a similar statistical error.

The difference between the differences has a statistical error of more than 36 elo.
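The arithmetic behind point 3 can be checked with a short sketch. It assumes the error margins on the two lists are independent and can be combined in quadrature, which is only an approximation for rating-list data; the helper names and the symmetrised 23.5 margin for Komodo at 40/4 (quoted as +24 −23) are illustrative choices.

```python
import math

def combined_error(err_a: float, err_b: float) -> float:
    """Combine two independent error margins in quadrature."""
    return math.hypot(err_a, err_b)

def gap_with_error(entry_a, entry_b):
    """Return (rating gap, error margin of the gap) for two (rating, margin) pairs."""
    (rating_a, err_a), (rating_b, err_b) = entry_a, entry_b
    return rating_a - rating_b, combined_error(err_a, err_b)

# CCRL figures quoted above: (rating, error margin).
gap_blitz, err_blitz = gap_with_error((3276, 18.0), (3217, 23.5))  # 40/4
gap_long, err_long = gap_with_error((3143, 16.0), (3120, 24.0))    # 40/40

change = gap_blitz - gap_long
change_err = combined_error(err_blitz, err_long)

print(f"40/4 gap:  {gap_blitz} +/- {err_blitz:.1f}")
print(f"40/40 gap: {gap_long} +/- {err_long:.1f}")
print(f"change in gap: {change} +/- {change_err:.1f}")
```

With these inputs the gap shrinks by 36 elo while the combined margin is roughly 41 elo, which is consistent with the claim that the change is inside the noise.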
Hi Uri,

You do in fact see a smaller difference at longer time controls. However, we have played hundreds of thousands of games over the past year or two at various time controls that make it absolutely clear that we do badly at fast time controls compared to longer time controls.

Houdini is a very bad example to compare against because unless you run it deep enough all you will see is Komodo approaching Houdini. As you say there is some doubt due to the factor you mentioned and also the fact that the rating agencies are almost useless (for things like this) due to their tiny samples and widely varying conditions.

This argument was not supposed to be about Houdini however, it was supposed to be an answer to Vincent, who claims that Komodo is super optimized for lightning chess compared to other programs. There is no significant data anywhere which supports that claim.

I think there is an interesting experiment you can conduct that is a fair way to measure this effect between any 2 programs without being concerned about the rating compression effect you observed. Here is how it works:

Start with the assertion and then try to prove or disprove it. Let's say the assertion is that program A scales better than program B. Choose some fast time control and then play a time-handicap match in which one program is handicapped such that program A scores slightly worse than program B over a statistically significant sample of games.

Now your starting assertion is that program A should win at longer time controls, so play a second match with the same handicap given to the same program but the time control increased by a factor of 10. If your assertion is correct, then you should see a crossover: program A should be winning the second match although it lost the first.
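The crossover test described above could be evaluated along these lines. This is only a sketch under stated assumptions: results are given as win/draw/loss counts from program A's point of view, significance is judged with a normal approximation at z = 1.96, and all function names and the example counts are hypothetical placeholders, not real match data.

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_score(wins: int, draws: int, losses: int):
    """Return (score fraction, standard error of the score) for program A."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    return score, math.sqrt(var / n)

def verdict(wins: int, draws: int, losses: int, z: float = 1.96) -> int:
    """+1 if A is significantly ahead, -1 if significantly behind, else 0."""
    score, se = match_score(wins, draws, losses)
    if score - z * se > 0.5:
        return 1
    if score + z * se < 0.5:
        return -1
    return 0

def crossover(fast_match, slow_match) -> bool:
    """True if A significantly loses the fast match and significantly wins the slow one."""
    return verdict(*fast_match) == -1 and verdict(*slow_match) == 1

# Hypothetical placeholder counts (wins, draws, losses) for program A.
fast = (2700, 4000, 3300)   # handicapped match at the fast time control
slow = (3300, 4000, 2700)   # same handicap, time control multiplied by 10

print("fast match elo:", round(elo_diff(match_score(*fast)[0]), 1))
print("slow match elo:", round(elo_diff(match_score(*slow)[0]), 1))
print("crossover observed:", crossover(fast, slow))
```

Requiring both matches to be individually significant is a deliberately strict choice; a result where A merely closes the gap would not count, which matches the point that only a sign change avoids the rating compression problem.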

We have seen over and over again that we lose to Critter (and Ippo programs) if we test too fast, and we win if we test longer. So we are not chasing Critter with a gap that is gradually closing, we are passing it with an increasing gap. I suspect that if you go deep enough and draws become common the gap will then likely start to gradually close (even if the "chess superiority" continues to increase.) But when you see an increase even though you know that there should be a decrease with depth, then you have proved your point, and we have seen that many times in our own testing.

Don
Don, I believe you that Komodo scales better relative to Critter and Houdini when you go from bullet to blitz, and needs a smaller time advantage relative to them to score 50%.

It does not prove that it continues to scale better when you go from blitz to longer time control.

Of course we need more games but I will not be surprised if Houdini1.5 scales better than komodo when you go from 40/20 to longer time control.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

Uri Blass wrote: Don, I believe you that Komodo scales better relative to Critter and Houdini when you go from bullet to blitz, and needs a smaller time advantage relative to them to score 50%.

It does not prove that it continues to scale better when you go from blitz to longer time control.
This is an argument that is infinitely extendable and thus cannot be debated by reasonable people. If you refuse to use inference, then you have an infinite number of time controls that you have to prove.

In fact, the rating lists don't mean a thing. The program that is number 200 on the list may actually be the strongest program at 5 minutes + 4 seconds - nobody ever checked that exact time control, so how do you know for sure?

Don