Scaling from FGRL results with top 3 engines

JJJ
Posts: 1346
Joined: Sat Apr 19, 2014 1:47 pm

Re: Scaling from FGRL results with top 3 engines

Post by JJJ »

In fact, you could add that Houdini scales the worst with Stockfish. But to me, Stockfish is the favorite of this TCEC compared to Houdini 6 and Komodo 11.2, unless these engines come with a surprise.

Also, both Stockfish and Houdini are tested at micro time controls for their patches (Houdini at 8 sec + 80 ms, Stockfish at 10 sec + 100 ms), so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling from FGRL results with top 3 engines

Post by Laskos »

JJJ wrote:In fact, you could add that Houdini scales the worst with Stockfish. But to me, Stockfish is the favorite of this TCEC compared to Houdini 6 and Komodo 11.2, unless these engines come with a surprise.

Also, both Stockfish and Houdini are tested at micro time controls for their patches (Houdini at 8 sec + 80 ms, Stockfish at 10 sec + 100 ms), so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
Yes, and as one can see from the opening post to this LTC result, the scaling of Houdini and Stockfish from 60 seconds to 60 minutes is almost identical. So the results at 60s between the two are fairly representative of LTC. I tested the fastest compile of the current Stockfish dev that I have, Brainfish (without any book), against the latest Houdini, and Stockfish beats Houdini handily at 60'' + 0.6'':

Games Completed = 400 of 400 (Avg game length = 187.888 sec)
Settings = RR/64MB/60000ms+600ms/M 600cp for 3 moves, D 120 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 10873 sec elapsed, 0 sec remaining
 1.  Brainfish 021017 64 BMI2 	217.0/400	114-80-206  	(L: m=0 t=0 i=0 a=80)	(D: r=147 i=28 f=8 s=3 a=20)	(tpm=1385.4 d=23.59 nps=1714177)
 2.  Houdini 6.02 Pro x64-pext	183.0/400	80-114-206  	(L: m=0 t=1 i=0 a=113)	(D: r=147 i=28 f=8 s=3 a=20)	(tpm=1424.6 d=20.02 nps=1853284)
I will re-test with Contempt=0 for Houdini and some moderate Overhead value, as Houdini had 1 time loss.
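To put a number on "handily", the win-loss-draw counts from the table can be converted into an Elo difference with an approximate error margin. The following is a minimal sketch, assuming the usual logistic Elo model and a normal approximation with the trinomial per-game variance; the function names are illustrative and do not come from LittleBlitzer.

```python
import math

def elo_from_score(score):
    """Convert an expected score in (0, 1) to a logistic Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_interval(wins, losses, draws, z=1.96):
    """Return (elo, elo_low, elo_high) for a match result.

    Assumption: games are independent and the mean score is roughly
    normal, so this is only a rough guide for a few hundred games.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score for win/loss/draw outcomes.
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))

# Brainfish vs Houdini at 60''+0.6'': 114 wins, 80 losses, 206 draws.
elo, low, high = elo_with_interval(wins=114, losses=80, draws=206)
print(f"Elo difference: {elo:+.1f} (approx. 95% interval {low:+.1f} to {high:+.1f})")
```

With only 400 games the interval is wide, so the point estimate should be read together with its margin.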
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Scaling from FGRL results with top 3 engines

Post by cdani »

JJJ wrote: Also, both Stockfish and Houdini are tested at micro time controls for their patches (Houdini at 8 sec + 80 ms, Stockfish at 10 sec + 100 ms), so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
For Andscacs I use 15 seconds for stc, and maybe 30-80 seconds for ltc. Stockfish also uses 60 seconds for ltc, which is reasonable.

In my experience, good or bad scaling is not due to using such short time controls, which cannot be avoided anyway for lack of resources, but to the stopping rules that one uses.
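As one concrete example of what a stopping rule can look like, here is a minimal sketch of a sequential probability ratio test (SPRT), the kind of rule the Stockfish testing framework uses. It is not the rule used for Andscacs. The hypotheses `elo0=0` and `elo1=5`, the error rates, and the normal approximation of the log-likelihood ratio are all assumptions chosen for illustration.

```python
import math

def expected_score(elo):
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return (decision, llr) where decision is 'accept', 'reject' or 'continue'.

    H0: the patch is worth elo0; H1: the patch is worth elo1.
    The log-likelihood ratio uses a normal approximation with the
    observed per-game variance (an assumption, not an exact trinomial LLR).
    """
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + losses * mean ** 2
           + draws * (0.5 - mean) ** 2) / n
    if var == 0.0:
        return "continue", 0.0  # not enough information yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    llr = n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)
    lower = math.log(beta / (1.0 - alpha))   # stop and reject the patch
    upper = math.log((1.0 - beta) / alpha)   # stop and accept the patch
    if llr >= upper:
        return "accept", llr
    if llr <= lower:
        return "reject", llr
    return "continue", llr

decision, llr = sprt_decision(wins=114, losses=80, draws=206)
print(decision, round(llr, 2))
```

The choice of `elo0`, `elo1`, `alpha` and `beta` decides how many games a test needs and how often a weak or time-control-sensitive patch slips through, which is why the rule matters as much as the time control.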

I must say that I don't have any good mathematical knowledge, so every decision I have made related to this is more by intuition than anything, so pull your hair at your own taste :-)

As I don't have many resources, many times I accept little changes without having enough confidence, just by intuition. To have some safety, every few changes I do verification tests, most probably at longer time control. Seems to be enough.

If a change passes stc and ltc but scores a lower % at ltc, I either extend the tests (maybe both of them) or run a third test at an even longer time control; I don't accept a patch like this right away. The Stockfish framework would have accepted it without further testing. If in doubt, I most probably reject the change if the drop in % at ltc is clear.

Some tests that I think are sensitive to time control, like king safety ones, I run first at ltc or medium time control, as a result at stc alone will not give any real information. If they are clearly bad at ltc I reject them. If I have doubts, sometimes I do an even longer time control test. And sometimes the way to be more sure faster is to do an stc test. If it is good at stc, the change is bad. And if it is bad or near 0 at stc, then I decide what to do.

Of course the best changes are the ones that are bad at stc and good at ltc, undetectable to current testing methods of most people, I suppose.

Take all this with a grain of salt.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Scaling from FGRL results with top 3 engines

Post by cdani »

And another important point: testing 10 patches with somewhat lower confidence in 5 days tends to give more Elo than testing 3-5 patches with enough confidence in the same time.
Uri Blass
Posts: 10298
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Scaling from FGRL results with top 3 engines

Post by Uri Blass »

<snipped>
cdani wrote:
For Andscacs I use 15 seconds for stc, and maybe 30-80 seconds for ltc. Stockfish also uses 60 seconds for ltc, which is reasonable.

In my experience, good or bad scaling is not due to using such short time controls, which cannot be avoided anyway for lack of resources, but to the stopping rules that one uses.
You can clearly avoid using short time controls, at the price of making fewer changes.
The target does not have to be making progress as fast as possible. I wonder if there are programmers who prefer slower progress but verify that every change they accept works at a time control longer than 80 seconds, including testing ideas that do not work at short time control. (Testing at STC can still be done to verify that you do not lose more than 10 elo, because losing more than 10 elo is probably due to a bug, and to learn the value of the patch at STC.)

I think that knowing that some change is in the interval (-6 elo, -2 elo) at 10+0.1 time control and in the interval (2 elo, 6 elo) at 160+1.6 time control may be interesting information.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling from FGRL results with top 3 engines

Post by Laskos »

Laskos wrote:
JJJ wrote:In fact, you could add that Houdini scales the worst with Stockfish. But to me, Stockfish is the favorite of this TCEC compared to Houdini 6 and Komodo 11.2, unless these engines come with a surprise.

Also, both Stockfish and Houdini are tested at micro time controls for their patches (Houdini at 8 sec + 80 ms, Stockfish at 10 sec + 100 ms), so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
Yes, and as one can see from the opening post to this LTC result, the scaling of Houdini and Stockfish from 60 seconds to 60 minutes is almost identical. So the results at 60s between the two are fairly representative of LTC. I tested the fastest compile of the current Stockfish dev that I have, Brainfish (without any book), against the latest Houdini, and Stockfish beats Houdini handily at 60'' + 0.6'':

Games Completed = 400 of 400 (Avg game length = 187.888 sec)
Settings = RR/64MB/60000ms+600ms/M 600cp for 3 moves, D 120 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 10873 sec elapsed, 0 sec remaining
 1.  Brainfish 021017 64 BMI2 	217.0/400	114-80-206  	(L: m=0 t=0 i=0 a=80)	(D: r=147 i=28 f=8 s=3 a=20)	(tpm=1385.4 d=23.59 nps=1714177)
 2.  Houdini 6.02 Pro x64-pext	183.0/400	80-114-206  	(L: m=0 t=1 i=0 a=113)	(D: r=147 i=28 f=8 s=3 a=20)	(tpm=1424.6 d=20.02 nps=1853284)
I will re-test with Contempt=0 for Houdini and some moderate Overhead value, as Houdini had 1 time loss.
With Overhead set to 30ms for Houdini and Contempt set to 0, the result is no better: Stockfish wins convincingly. They scale almost identically from 60s to 60min, so Houdini's prospects of winning TCEC are not very good, unless its SMP implementation is better. But anyway, the top 3 will be close at TCEC.

Games Completed = 400 of 400 (Avg game length = 184.981 sec)
Settings = Gauntlet/64MB/60000ms+600ms/M 600cp for 3 moves, D 120 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 10719 sec elapsed, 0 sec remaining
 1.  Brainfish 021017 64 BMI2 	220.5/400	96-55-249  	(L: m=0 t=0 i=0 a=55)	(D: r=157 i=49 f=10 s=4 a=29)	(tpm=1372.2 d=24.18 nps=1732874)
 2.  Houdini 6.02 Pro x64-pext	179.5/400	55-96-249  	(L: m=0 t=0 i=0 a=96)	(D: r=157 i=49 f=10 s=4 a=29)	(tpm=1412.0 d=20.56 nps=1878006)
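
Whether a result like this is "convincing" can be checked with the likelihood of superiority (LOS), which depends only on wins and losses. The snippet below is a minimal sketch using the common normal approximation; the function name is illustrative, and draws are deliberately ignored because they do not favour either side.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Approximate probability that the first engine is the stronger one.

    Normal approximation to the win/loss difference; draws are ignored.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    z = (wins - losses) / math.sqrt(2.0 * decisive)
    return 0.5 * (1.0 + math.erf(z))

# First match (default settings) and re-test (Contempt=0, Overhead=30ms).
for wins, losses in [(114, 80), (96, 55)]:
    los = likelihood_of_superiority(wins, losses)
    print(f"+{wins} -{losses}: LOS = {los:.4f}")
```

LOS says only that one engine is probably stronger under these test conditions; it says nothing about the size of the gap or about how it changes at longer time controls.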
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Scaling from FGRL results with top 3 engines

Post by Milos »

Uri Blass wrote:With only material evaluation it can play stupid moves in the opening like 1.a4 because it does not lose material.

It can probably still beat weak humans, but I believe that if you want it to beat strong human players (let's say at the level of a 2300 FIDE rating), then you need at least some more evaluation, like piece-square tables.
Totally wrong. Mobility accounts for 40% of eval, material 30%, PST 20%, everything else 20% (it's more than 100% in total because mobility and material are quite correlated).
So if you have only material and mobility, you can play at GM level easily.
With high-end hardware you can even beat Carlsen. And no, you will never play a4 in the opening since it will give you much less mobility after a few plies.

And as usual Lyudmil is just performing his ignorant, trolling ramblings that are starting to get very boring.
Amazing how someone so ignorant can be so full of himself and be so delusional about the world around him...
Uri Blass
Posts: 10298
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Scaling from FGRL results with top 3 engines

Post by Uri Blass »

Milos wrote:
Uri Blass wrote:With only material evaluation it can play stupid moves in the opening like 1.a4 because it does not lose material.

It can probably still beat weak humans, but I believe that if you want it to beat strong human players (let's say at the level of a 2300 FIDE rating), then you need at least some more evaluation, like piece-square tables.
Totally wrong. Mobility accounts for 40% of eval, material 30%, PST 20%, everything else 20% (it's more than 100% in total because mobility and material are quite correlated).
So if you have only material and mobility, you can play at GM level easily.
With high-end hardware you can even beat Carlsen. And no, you will never play a4 in the opening since it will give you much less mobility after a few plies.

And as usual Lyudmil is just performing his ignorant, trolling ramblings that are starting to get very boring.
Amazing how someone so ignorant can be so full of himself and be so delusional about the world around him...
I guess that there is a misunderstanding.
I agree that with only material and mobility it can play at GM level.
I meant only material, without mobility or piece-square tables.

Edit: reading the thread again, I do not see where Alvaro Cardoso mentioned mobility; I was responding to Alvaro's claim that material alone is enough.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Scaling from FGRL results with top 3 engines

Post by Milos »

Uri Blass wrote:I guess that there is a misunderstanding.
I agree that with only material and mobility it can play at GM level.
I meant only material, without mobility or piece-square tables.

Edit: reading the thread again, I do not see where Alvaro Cardoso mentioned mobility; I was responding to Alvaro's claim that material alone is enough.
You are right, he didn't. I assumed it, since material and mobility are tightly connected and having only material doesn't make much sense, especially since both are extremely simple concepts, equally independent of the actual placement of pieces on the squares, and totally different from PST or any other pattern-based bonuses.