konsolas wrote: ↑Tue Jan 08, 2019 10:03 pm
It's been a long while since I last updated Topple, but I've finally rewritten the evaluation and implemented Texel tuning with a simple linear search: self-play shows an Elo gain of about +150.
The release can be found here: https://github.com/konsolas/ToppleChess ... tag/v0.3.0
I've provided 3 builds for Windows, but it should be easy to build from source on Linux and macOS with CMake (you may need to remove the -static flag for the Clang build to work on macOS).
It would be awesome to find out how the new Topple performs against other engines.
I don't know what the others will get, but this might well be a case of "self-play Elo increase doesn't translate to real Elo increase".
I ran 2 gauntlets, one with Topple v0.2.1 and one with Topple v0.3.0: same opponents, same number of games, and the results were almost identical (v0.2.1 even got more points, well inside the error margins).
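For readers who haven't seen Texel tuning: the idea is to squash the static eval through a sigmoid so it predicts game results, then adjust the eval parameters to minimise the squared prediction error over a large set of positions. Below is a minimal sketch of the "simple linear search" (coordinate descent) variant konsolas mentions - an illustration only, not Topple's actual code, and evaluate() is a hypothetical stand-in for the engine's evaluation function:

Code: Select all

def sigmoid(score_cp, k=1.13):
    # Map a centipawn eval to an expected game score in [0, 1];
    # the scaling constant k is itself fitted to the data set.
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def mse(params, positions, evaluate):
    # positions is a list of (fen, result), result in {0.0, 0.5, 1.0}.
    total = sum((result - sigmoid(evaluate(fen, params))) ** 2
                for fen, result in positions)
    return total / len(positions)

def texel_tune(params, positions, evaluate, step=1):
    # Coordinate descent: nudge each parameter by +-step, keep the
    # change if the error drops, loop until a full pass improves nothing.
    best = mse(params, positions, evaluate)
    improved = True
    while improved:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                err = mse(params, positions, evaluate)
                if err < best:
                    best, improved = err, True
                    break  # keep this change, move to the next parameter
                params[i] -= delta  # revert
    return params, best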
konsolas wrote: ↑Tue Jan 08, 2019 10:03 pm
...
Ah, that's certainly disappointing. I suppose I should change how I test new versions so I don't get overinflated Elo estimates in the future. Thank you very much for testing. I've updated the release page on GitHub.
I'm quite surprised, since v0.3.0 shares very little evaluation code with v0.2.1.
konsolas wrote: ↑Thu Jan 10, 2019 5:29 pm
...
If I am not mistaken, Topple does not support ponder, right?
Carlos does some unusual asymmetric testing with ponder always on, so the error bars are higher, and the result also depends on how often an engine faces opponents which cannot ponder either.
This means the CCRL/CEGT result could be very different.
Post by CMCanavessi » Tue Jan 08, 2019 9:25 pm
TC is 1 minute + 1 second, ponder is ON for engines that support it, 1 thread, 2-move book, random openings with NO reversed games.
Post by xr_a_y » Wed Jan 09, 2019 7:05 am
OK, those are 2 domains where Minic is not good. I'll activate pondering soon (this is usually worth 40-60 Elo) and work on the sudden death TC, because the current heuristic isn't smart enough. I guess some "emergency time" management should be added as well.
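Since the last quoted post brings up sudden death TC and "emergency time": a common scheme is to spread the remaining clock over an assumed number of moves still to be played, and always keep a reserve that a single move can never eat into. A sketch under those assumptions (illustrative only - this is not Minic's code, and all the constants are made up):

Code: Select all

def allocate_time(remaining_ms, increment_ms, moves_to_go=None,
                  overhead_ms=50, horizon=30):
    # With no movestogo from the GUI (sudden death), assume the game
    # lasts roughly `horizon` more moves.
    if moves_to_go is None:
        moves_to_go = horizon
    # Emergency reserve: never plan to spend the last few time slices,
    # so GUI lag or a lost increment cannot flag the engine.
    usable = max(remaining_ms - 10 * overhead_ms, 0)
    budget = usable / moves_to_go + 0.8 * increment_ms
    # Hard cap: a single move may not use more than half the clock.
    return min(budget, remaining_ms / 2)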
konsolas wrote: ↑Thu Jan 10, 2019 5:29 pm
...
Yeah, but the conditions were exactly the same for both runs: same opponents, same number of games, same TC, same hash size, same everything.
konsolas wrote: ↑Thu Jan 10, 2019 5:29 pm
...
Same openings too? I don't know how many lines your 2_moves book contains, but you wrote they were randomly selected w/o repeating?
(Of course, 200 games would still give an error of maybe +-30.)
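That +-30 guess is easy to sanity-check: at a ~50% score the per-game score variance is 0.25*(1 - d) for draw ratio d, and the slope of the Elo curve at 50% is 400/(ln 10 * 0.25), about 695 Elo per unit of score. A quick check, where the 35% draw ratio is an assumed figure:

Code: Select all

import math

def elo_error(n_games, draw_ratio=0.35, z=1.96):
    # 95% confidence half-width of the Elo estimate for a ~50% result.
    var = 0.25 * (1.0 - draw_ratio)          # per-game score variance
    se_score = math.sqrt(var / n_games)      # standard error of the mean score
    slope = 400.0 / (math.log(10.0) * 0.25)  # ~695 Elo per unit of score at 50%
    return z * slope * se_score

print(round(elo_error(200)))  # ~39 Elo, the same ballpark as the +-30 above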
konsolas wrote: ↑Thu Jan 10, 2019 5:29 pm
...
Yep, openings can change things a bit. The book contains 200 openings plus 1 entry with no moves (bookless), so 201 entries in total. Still, the +150 Elo that konsolas saw would be way beyond the error margin introduced by the different openings: the engines were all well within +-150 Elo of each other, so if v0.2.1 scored almost 50%, a +150 Elo engine should have scored at least 60-65% if not more.
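That 60-65% figure is if anything conservative: under the logistic Elo model, a +150 Elo engine scores about 70% against equally rated opposition, as a two-line check shows:

Code: Select all

def expected_score(elo_diff):
    # Logistic Elo model: expected score for a player elo_diff ahead.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(expected_score(150))  # ~0.70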
CMCanavessi wrote: ↑Thu Jan 10, 2019 7:05 pm
...
Well, +150 does not seem realistic then, and I would also suspect a self-play rating test 'fata morgana'.
(I had not read the part of this thread with that high expectation before.)
Hi Konsolas,
Wow, indeed these results must be very disappointing...
Do you use the same openings in your tests?
I think it's necessary to use the same openings.
In Isa, I test every new version in a self-test against 2-3 other versions.
I play 500 games against each of the others: 250 different openings with colours reversed, TC 1 minute + 250 milliseconds.
I stop the test early if it is going badly after 200 games.
So, at the end, a dev version has played 1500 games, and the error bar is relatively small.
Good luck,
Dany
Thanks Daniel,
I've built up a small collection of engines now to run tournaments with, so hopefully I can get a more accurate picture of strength improvements in the future:
Topple2 = Topple v0.2.1, Topple2E = Topple v0.3.1, ToppleDebug = current dev build of Topple.
This reflects CMCanavessi's results (where v0.2.1 was very similar to v0.3.1), but I think there is sufficient evidence to suggest that the current development build is likely to be stronger.
konsolas wrote: ↑Sat Jan 12, 2019 1:50 pm
...
Hi Vincent, you should remove the opponents which are too strong or too weak from your pool; they just add random noise and an unnecessarily bigger error to the rating calculations.
BTW, Drosophila and Godel seem to have a problem in your environment, because 0% is unlikely in real games considering their strength.
(They are probably always crashing, or always losing on time - you should check for unusual result tags too.)
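That "0% is unlikely" point can be quantified with a rough binomial estimate. Ignoring draws (which only makes a zero score less likely, since a draw already yields half a point), the chance of 0/n shrinks quickly even for a badly outrated engine; the 400 Elo gap and 30 games below are illustrative numbers, not Carlos's actual data:

Code: Select all

def prob_zero_points(elo_gap, n_games):
    # Underdog's expected score when trailing by elo_gap.
    p = 1.0 / (1.0 + 10.0 ** (elo_gap / 400.0))
    # Probability of losing every single game (pure win/loss model).
    return (1.0 - p) ** n_games

print(prob_zero_points(400, 30))  # ~0.057: already improbable at -400 Elo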