In fact you could add that Houdini scales the worst with Stockfish. But to me, Stockfish is the favorite of this TCEC compared to Houdini 6 and Komodo 11.2, unless these engines come with a surprise.
Also, both Stockfish and Houdini are tested at micro time controls for their patches, Houdini at 8 sec + 80 ms and Stockfish at 10 sec + 100 ms, so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
Scaling from FGRL results with top 3 engines
Moderators: hgm, Rebel, chrisw
-
- Posts: 1346
- Joined: Sat Apr 19, 2014 1:47 pm
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Scaling from FGRL results with top 3 engines
JJJ wrote: In fact you could add that Houdini scales the worst with Stockfish. But to me, Stockfish is the favorite of this TCEC compared to Houdini 6 and Komodo 11.2, unless these engines come with a surprise.
Also, both Stockfish and Houdini are tested at micro time controls for their patches, Houdini at 8 sec + 80 ms and Stockfish at 10 sec + 100 ms, so can we jump to the conclusion that the shorter the time control used for testing, the more an engine's strength decreases at long time control?
Yes, and as one can see from the opening post to this LTC result, the scaling of Houdini and Stockfish from 60 seconds to 60 minutes is almost identical. So, the results at 60s between the two are fairly representative for LTC. I tested the fastest compile of current Stockfish dev I have, Brainfish (without any book), against the latest Houdini, and Stockfish beats Houdini handily at 60''+ 0.6'':
Code: Select all
Games Completed = 400 of 400 (Avg game length = 187.888 sec)
Settings = RR/64MB/60000ms+600ms/M 600cp for 3 moves, D 120 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 10873 sec elapsed, 0 sec remaining
1. Brainfish 021017 64 BMI2 217.0/400 114-80-206 (L: m=0 t=0 i=0 a=80) (D: r=147 i=28 f=8 s=3 a=20) (tpm=1385.4 d=23.59 nps=1714177)
2. Houdini 6.02 Pro x64-pext 183.0/400 80-114-206 (L: m=0 t=1 i=0 a=113) (D: r=147 i=28 f=8 s=3 a=20) (tpm=1424.6 d=20.02 nps=1853284)
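For readers who want to turn a W-L-D line like the one above into an Elo figure, here is a minimal sketch. It assumes the usual logistic Elo model and a normal approximation for the error margin; the function names are made up for illustration and this is not what LittleBlitzer itself computes.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (elo, low, high) for a match result.

    Assumption: games are independent and the mean score is roughly
    normal, so z=1.96 gives an approximate 95% interval.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

# Brainfish vs Houdini result quoted above: 114 wins, 80 losses, 206 draws.
elo, low, high = elo_with_margin(114, 80, 206)
print(f"Elo {elo:+.1f} (approx. 95% interval {low:+.1f} to {high:+.1f})")
```

With 400 games the point estimate is around +30 Elo for Brainfish, but the margin is more than 20 Elo on either side, so the exact size of the gap should be read with care.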
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Scaling from FGRL results with top 3 engines
JJJ wrote: Also, both Stockfish and Houdini are tested at micro time control for their patches, Houdini 8 sec + 80 ms and Stockfish 10 sec and 100 ms, so can we jump to the conclusion that the lower the time control for test, the most his strenght decrease in long time control ?
For Andscacs I use 15 seconds for stc, and maybe 30-80 seconds for ltc. Stockfish also uses 60 seconds for ltc, which is reasonable.
The good/bad scaling, in my experience, is not due to using such short time controls, which cannot be avoided anyway for lack of resources, but to the stopping rules that one uses.
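To make "stopping rule" concrete, here is a minimal sketch of a sequential test (SPRT) on a running W-L-D count. It uses a normal approximation of the log-likelihood ratio; the Elo bounds (0 and 5) and the error rates (5% each) are assumptions chosen for illustration, not the settings of Andscacs or of the Stockfish framework.

```python
import math

def expected_score(elo):
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    n = wins + losses + draws
    if n == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - score ** 2  # per-game variance
    if var <= 0.0:
        return 0.0  # no information yet (e.g. all games had the same result)
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var / n)

def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept', 'reject' or 'continue' for the patch under test."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= upper:
        return "accept"
    if llr <= lower:
        return "reject"
    return "continue"

print(sprt_decision(114, 80, 206))    # 400 games: not enough yet
print(sprt_decision(600, 400, 1000))  # 2000 games with a clear plus score
```

Tightening or loosening the bounds changes how many games a test consumes and how often a marginal patch slips through, which is the trade-off being discussed here.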
I must say that I don't have any good mathematical knowledge, so every decision I have made related to this is more by intuition than anything, so pull your hair at your own taste
As I don't have many resources, many times I accept little changes without having enough confidence, just by intuition. To have some safety, every few changes I do verification tests, usually at a longer time control. It seems to be enough.
If a change passes stc and ltc but scores a lower % at ltc, I either extend the tests (maybe both) or run a third test at an even longer time control; I don't accept a patch like this at first. The Stockfish framework would have accepted it without further testing. If in doubt, I most probably reject the change if the drop in % at ltc is clear.
Some changes that I think are sensitive to time control, like king safety ones, I test first at ltc or medium time control, as a result at stc alone will not give any real information. If they are clearly bad at ltc I reject them. If I have doubts, sometimes I do a test at an even longer time control. And sometimes the way to be more sure faster is to do a stc test. If it is good at stc, the change is bad. And if it is bad or near 0 at stc, then I decide what to do.
Of course the best changes are the ones that are bad at stc and good at ltc, undetectable to current testing methods of most people, I suppose.
Take all this with tweezers.
Daniel José - http://www.andscacs.com
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Scaling from FGRL results with top 3 engines
And another important point. Testing 10 patches with somewhat lower confidence in 5 days tends to give more elo than testing 3-5 patches with enough confidence in the same time.
Daniel José - http://www.andscacs.com
-
- Posts: 10298
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Scaling from FGRL results with top 3 engines
<snipped>
cdani wrote: For Andscacs I use 15 seconds for stc, and maybe 30-80 seconds for ltc. Stockfish also uses 60 seconds for ltc, which is reasonable.
The good/bad scaling, in my experience is not due to using such short time controls, that anyway cannot be avoided due to lack of resources, but to the stop rules that one uses.
You can clearly avoid using short time controls, at the price of making fewer changes.
The target does not have to be making progress as fast as possible, and I wonder if there are programmers who prefer to make less progress but verify that every change they accept works at a time control longer than 80 seconds, including testing ideas that do not work at short time control. Testing at STC can still be done to verify that you do not lose more than 10 elo (losing more than 10 elo is probably because of a bug) and to learn the value of the patch at STC.
I think that knowing that some change is in the interval (-6 elo, -2 elo) at 10+0.1 time control and in the interval (2 elo, 6 elo) at 160+1.6 time control may be interesting information.
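As a rough idea of what pinning a patch down to a 4-elo-wide interval costs, here is a back-of-the-envelope sketch. It assumes two nearly equal engines, independent games, a normal approximation, and a draw ratio of 60% (an assumption, picked only because it is in the range of the matches posted in this thread); `games_needed` is a hypothetical helper.

```python
import math

# Slope of the logistic Elo curve at a 50% score: d(Elo)/d(score) = 1600 / ln(10).
ELO_PER_SCORE_AT_EQUALITY = 1600.0 / math.log(10)

def games_needed(half_width_elo, draw_ratio, z=1.96):
    """Games required so the ~95% interval is +/- half_width_elo.

    Assumption: engines are close to equal, so wins and losses each
    occur with probability (1 - draw_ratio) / 2.
    """
    var_per_game = (1.0 - draw_ratio) / 4.0
    half_width_score = half_width_elo / ELO_PER_SCORE_AT_EQUALITY
    return math.ceil(z * z * var_per_game / half_width_score ** 2)

# An interval like (2 elo, 6 elo) has a half-width of 2 elo.
print(games_needed(2.0, 0.60))
```

Under these assumptions the answer is on the order of tens of thousands of games per time control, which is why such measurements at 160+1.6 are expensive.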
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Scaling from FGRL results with top 3 engines
Laskos wrote: <snipped> I will re-test with Contempt=0 for Houdini and some moderate Overhead value, as Houdini had 1 time loss.
With Overhead set to 30ms for Houdini and Contempt to 0, the result is no better; Stockfish wins convincingly. They scale almost identically from 60s to 60min, so the prospects for Houdini to win TCEC are not very good, unless its SMP implementation is better. But anyway, the top 3 will be close at TCEC.
Code: Select all
Games Completed = 400 of 400 (Avg game length = 184.981 sec)
Settings = Gauntlet/64MB/60000ms+600ms/M 600cp for 3 moves, D 120 moves/EPD:C:\LittleBlitzer\2moves_v1.epd(32000)
Time = 10719 sec elapsed, 0 sec remaining
1. Brainfish 021017 64 BMI2 220.5/400 96-55-249 (L: m=0 t=0 i=0 a=55) (D: r=157 i=49 f=10 s=4 a=29) (tpm=1372.2 d=24.18 nps=1732874)
2. Houdini 6.02 Pro x64-pext 179.5/400 55-96-249 (L: m=0 t=0 i=0 a=96) (D: r=157 i=49 f=10 s=4 a=29) (tpm=1412.0 d=20.56 nps=1878006)
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Scaling from FGRL results with top 3 engines
Uri Blass wrote: With only material evaluation it can play stupid moves in the opening like 1.a4 because it does not lose material.
It can probably still beat weak humans, but I believe that if you want it to beat strong human players (let's say at the level of a 2300 FIDE rating) then you need at least some more evaluation, like a piece square table.
Totally wrong. Mobility accounts for 40% of eval, material 30%, PST 20%, everything else 20% (it's more than 100% in total because mobility and material are quite correlated).
So if you have only material and mobility, you can play at GM level easily.
With high-end hardware you can even beat Carlsen. And no, you will never play a4 in the opening, since it will give you much less mobility after a few plies.
And as usual Lyudmil is just performing his ignorant, trolling ramblings that are starting to get very boring.
Amazing how someone so ignorant can be so full of himself and be so delusional about the world around him...
-
- Posts: 10298
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Scaling from FGRL results with top 3 engines
Milos wrote: Totally wrong. <snipped> So if you have only material and mobility, you can play at GM level easily.
I guess that there is a misunderstanding.
I agree that with only material and mobility it can play at GM level.
I meant only material, without mobility or piece square table.
Edit: reading the thread again, I do not see where Alvaro Cardoso mentioned mobility, and I responded to Alvaro's claim that only material is enough.
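To illustrate the "only material" case, here is a toy sketch of a material-only evaluation read straight from a FEN string. The piece values are conventional round numbers assumed for illustration, and `material_eval` is a made-up helper, not code from any engine mentioned here.

```python
# Conventional centipawn values (an assumption for this sketch).
PIECE_VALUES = {"p": 100, "n": 300, "b": 300, "r": 500, "q": 900, "k": 0}

def material_eval(fen):
    """Material balance in centipawns from White's point of view."""
    placement = fen.split()[0]  # first FEN field: piece placement
    score = 0
    for ch in placement:
        if ch.isalpha():
            value = PIECE_VALUES[ch.lower()]
            # Uppercase letters are White pieces, lowercase are Black.
            score += value if ch.isupper() else -value
    return score

after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
after_a4 = "rnbqkbnr/pppppppp/8/8/P7/8/1PPPPPPP/RNBQKBNR b KQkq a3 0 1"

# Both positions get the same static score, so this evaluation alone
# cannot prefer 1.e4 over 1.a4; only the search could tell them apart.
print(material_eval(after_e4), material_eval(after_a4))
```

Adding a mobility term (for example, a count of legal moves per side) would need a move generator and is left out of this sketch.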
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Scaling from FGRL results with top 3 engines
Uri Blass wrote: <snipped> Edit: reading the thread again, I do not see where Alvaro Cardoso mentioned mobility, and I responded to Alvaro's claim that only material is enough.
You are right, he didn't. I assumed it, since material and mobility are tightly connected and having only material doesn't make much sense, especially since both are extremely simple concepts, equally independent of the actual placement of pieces on the squares, and totally different from PST or any other pattern-based bonuses.