STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Discussion of anything and everything relating to chess playing software and machines.


Max
Posts: 247
Joined: Tue Apr 13, 2010 10:41 am

STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Max »

Don't the 1500 positions from the Strategic Test Suite look promising for comparing several Lc0 networks without any search involved? Let's find out.

STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Code: Select all

STS(elo) 	network
--------------------------------
1540		11258-16x2-se
1690		11258-24x3-se  
1780		11258-32x4-se
1993		11258-48x5-se 
1999		net54023	
2116		11258-64x6-se	
2233		11258-80x7-se	
2239		32930-112x9-se 
2287		11258-96x8-se  
2292		11258-104x9-se
2333		11258-112x9-se
2334		net52340	
2353		11258-128x10-se
2367		net53316
2475		net40x256_130
2516		net33000
2529		11258-256x12-se
2536		net42610
2567		11258-200x20-se
2571		netT40.T8.610
2625		net11258
Interesting: the old network 11258 is in the lead by over 50 points. And the TCEC 15 choice T40.T8.610 scores ahead of the latest 42610 net. The new 40x256 network #130 may still be too "young"?!
Hope we're not just the biological boot loader for digital super intelligence. Unfortunately, that is increasingly probable - Elon Musk
Max
Posts: 247
Joined: Tue Apr 13, 2010 10:41 am

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Max »

How to run STS with nodes = 1?

Instead of "go movetime" change the lines 545 and 549 in file sts_rating_v13.1.py to "go nodes 1"

Code: Select all

545: p.stdin.write("go nodes 1" + "\n")
546: t1 = time.clock()  # Start time for sending stop if engine is not working
547:
548: if debug:
549:  	logfnFO.write("%s >> go nodes 1\n" %(datetime.datetime.now().isoformat()))
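For anyone who wants to see what the modified script now sends on the wire, here is a minimal stand-alone sketch of the same UCI exchange. It assumes an lc0 binary on the PATH; the network file name is just an illustration (WeightsFile is the standard Lc0 UCI option for selecting a net):

Code: Select all

import subprocess

# Illustrative names: substitute your own lc0 binary and network file.
ENGINE = "lc0"
NET = "weights_run1_11258.pb.gz"

p = subprocess.Popen([ENGINE], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     universal_newlines=True, bufsize=1)

def send(cmd):
    p.stdin.write(cmd + "\n")

def wait_for(token):
    # Read engine output until a line starting with `token` appears.
    while True:
        line = p.stdout.readline().strip()
        if line.startswith(token):
            return line

send("uci")
wait_for("uciok")
send("setoption name WeightsFile value " + NET)
send("isready")
wait_for("readyok")
send("position startpos")
send("go nodes 1")   # a single node: the raw policy head picks the move
print(wait_for("bestmove"))
send("quit")
Loop this over a list of network files and the EPD positions and you have the whole comparison harness.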
Hope we're not just the biological boot loader for digital super intelligence. Unfortunately, that is increasingly probable - Elon Musk
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Laskos »

Max wrote: Wed Jun 19, 2019 12:50 pm Don't the 1500 positions from the Strategic Test Suite look promising for comparing several Lc0 networks without any search involved? Let's find out.

STS is not a great positional test suite, and this became clear precisely with Leela. I have my own 3-year-old positional opening suite of 200 positions, which has withstood the Leela challenge: Leela outperforms all other engines by a wide margin. Here are the solved positions at 1 node for Leela and at depth=1 for the top regular engines:

Code: Select all

nodes=1

42611
score=128/200 

T40.T8.610
score=125/200

40b_131
score=118/200

11261
score=117/200

32930
score=116/200

--------------------
depth=1

Stockfish_dev
score=55/200

Komodo 13.02
score=46/200
A score of 128/200 is matched by SF only in some long-time-control tests (say 30 s per position on 4 cores, or more than 150 million nodes per position).
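If anyone wants to reproduce this kind of tally on their own suite, a minimal sketch of the scoring loop is below. It assumes an EPD file with bm fields and single-best-move scoring; engine commands are placeholders, and it spawns a fresh engine process per call for simplicity. Note that bm is usually SAN while UCI engines answer in coordinate notation, so a real script needs a notation conversion step (omitted here):

Code: Select all

import re
import subprocess

def best_move(engine_cmd, fen, go_cmd):
    # Ask a UCI engine for its move in `fen` using the given go command.
    # (Spawns a fresh engine per call for simplicity.)
    p = subprocess.Popen(engine_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True, bufsize=1)
    p.stdin.write("uci\nisready\n")
    while not p.stdout.readline().startswith("readyok"):
        pass
    p.stdin.write("position fen %s\n%s\n" % (fen, go_cmd))
    while True:
        line = p.stdout.readline()
        if line.startswith("bestmove"):
            p.stdin.write("quit\n")
            return line.split()[1]

def score_suite(epd_file, engine_cmd, go_cmd):
    # Count positions whose engine move matches one of the EPD bm moves.
    solved = total = 0
    for line in open(epd_file):
        fen = " ".join(line.split()[:4]) + " 0 1"   # EPD has 4 fields; engines want 6
        m = re.search(r"bm\s+([^;]+);", line)
        if not m:
            continue
        total += 1
        # NB: bm is SAN, bestmove is UCI; insert a notation conversion here in practice.
        if best_move(engine_cmd, fen, go_cmd) in m.group(1).split():
            solved += 1
    return solved, total

# e.g. score_suite("suite.epd", ["lc0"], "go nodes 1")
#  or  score_suite("suite.epd", ["stockfish"], "go depth 1")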
peter
Posts: 3186
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by peter »

Laskos wrote: Wed Jun 19, 2019 1:52 pm I have my own 3-year-old positional opening suite of 200 positions, which has withstood the Leela challenge: Leela outperforms all other engines by a wide margin.
So I'd say these 200 positions are great for Leela, which doesn't mean they are as good for testing other engines, nor even for proving one NN better or worse than another in comparison with other engines and other NNs.

As far as I could see from some of the positions in your test set, the solutions often seem doubtful as single best moves, and doubtful as the single accepted solution too; they are only counted as solved or not solved, aren't they?

STS often awards different numbers of points to solutions that are close to each other in value, and its 1300 to 1500 positions (version 15 has been around since 2014 already, hasn't it?) have given comparable results for very different engines over the years; that lets me stick with STS rather than with your 200 Leela-friendly positions.
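For readers who haven't seen it: an STS position credits several moves with graded points, usually 10 for the best move and less for close alternatives, encoded in the EPD c0 comment. A minimal sketch of that lookup, assuming the usual "move=points, ..." c0 convention (illustrative only, not the sts_rating code itself):

Code: Select all

import re

def sts_points(epd_line, engine_move_san):
    # Return the graded points for the engine's move, assuming a c0 comment
    # of the form: c0 "Qa4=10, Qd1=7, b4=4"
    m = re.search(r'c0\s+"([^"]+)"', epd_line)
    if not m:
        return 0
    table = dict(pair.split("=") for pair in m.group(1).split(", "))
    return int(table.get(engine_move_san, 0))

# A move matching the second-best entry earns 7 of 10 points (FEN elided):
line = '... w - - bm Qa4; c0 "Qa4=10, Qd1=7, b4=4"; id "STS.example";'
print(sts_points(line, "Qd1"))   # -> 7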

The only real weakness of STS to me (as with any other test suite, tactical or positional alike) is the attempt to equate Elo computed from suite results with Elo from engine-engine games; that doesn't work for any test suite. But it no longer works within eng-eng game playing itself either: results stop being comparable once you look at different hardware/TCs, different opening sets, and different pools of engines playing against each other.

Hard times anyhow to keep the old "Elosion" alive, with "Elo" as a single transitive measurement across different engines and different methods of testing them.
:)
Besides the lack of transitivity between NN and A-B engines playing each other with different hardware/TCs and openings, the draw death of eng-eng games, played on modern hardware-software combinations at anything but insanely short TCs, makes the measurement more and more biased and/or demanding in the number of games needed to get outside the error bar.
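That last point is easy to quantify. A small sketch of the usual normal-approximation error bar on a match score, converted to Elo with the standard logistic formula (the sample numbers are illustrative):

Code: Select all

import math

def elo_and_error(wins, draws, losses, z=1.96):
    # Elo difference and approximate 95% half-width from a match result,
    # using a normal approximation on the mean per-game score.
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / p - 1.0)
    # Variance of the per-game score, then standard error of the mean.
    var = (wins * (1 - p) ** 2 + draws * (0.5 - p) ** 2 + losses * p ** 2) / n
    se = math.sqrt(var / n)
    lo, hi = p - z * se, p + z * se
    err = 200.0 * (math.log10(1.0 / lo - 1.0) - math.log10(1.0 / hi - 1.0))
    return elo, err

# 1000 games at a 55% score, half of them drawn: about +35 Elo, +/- ~15 Elo.
print(elo_and_error(300, 500, 200))
More draws shrink the per-game variance, but the Elo differences to be resolved shrink too, so the number of games needed to clear the error bar keeps growing.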
Peter.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Laskos »

peter wrote: Wed Jun 19, 2019 3:55 pm
STS often awards different numbers of points to solutions that are close to each other in value, and its 1300 to 1500 positions (version 15 has been around since 2014 already, hasn't it?) have given comparable results for very different engines over the years; that lets me stick with STS rather than with your 200 Leela-friendly positions.
STS is clearly a flawed positional suite, built by over-analyzing with the traditional AB engines of its time, especially Rybka. It fails to show Leela's great positional superiority over traditional engines. Do you think that, if it fails even at this, it will show anything accurate about different Leela nets? My 200 positions, which correctly show Leela's vast superiority (the suite has nothing to do with Leela, having been built from databases of human games 3 years ago), still have a chance of discriminating between particular Leela nets.

Here is a nonsensical STS result:

Code: Select all

Stuck to the solution from 1s to 2s of thinking
4 i7 cores at 3.80GHz
RTX 2070 GPU

Houdini 1.5a        score=1339/1500 [averages on correct positions: depth=8.4 time=0.15 nodes=1320734]
Komodo 13.02        score=1319/1500 [averages on correct positions: depth=10.6 time=0.16 nodes=1064120]
Stockfish dev       score=1284/1500 [averages on correct positions: depth=10.8 time=0.18 nodes=1111414]  
Texel 1.07          score=1241/1500 [averages on correct positions: depth=8.9 time=0.22 nodes=1379498]
Arasan 21.0         score=1195/1500 [averages on correct positions: depth=9.5 time=0.24 nodes=1017178]
Lc0 v21.2 ID42524   score=1177/1500 [averages on correct positions: depth=3.8 time=0.14 nodes=1580]
Fruit 2.1           score= 993/1500 [averages on correct positions: depth=5.2 time=0.23 nodes=505132]
Houdini 1.5 scores so well only because it is closely related to Rybka, the engine used to build the suite. Building positional test suites by over-analyzing with engines is a bad habit. Even on a strong GPU, Leela scores abysmally on this so-called "Strategic Test Suite".
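The "stuck to the solution from 1s to 2s" criterion above can be approximated by two timed searches per position, counting a solve only when both agree with the key move. A rough sketch, with the engine command and move notation as placeholders:

Code: Select all

import subprocess

def timed_move(proc, fen, ms):
    # One timed search on an already-initialized UCI engine process.
    proc.stdin.write("position fen %s\ngo movetime %d\n" % (fen, ms))
    while True:
        line = proc.stdout.readline()
        if line.startswith("bestmove"):
            return line.split()[1]

def stuck_to_solution(proc, fen, key_moves):
    # Solved only if the engine's choice is a key move at BOTH 1s and 2s,
    # i.e. it found the solution early and stayed with it.
    return (timed_move(proc, fen, 1000) in key_moves and
            timed_move(proc, fen, 2000) in key_moves)

# Illustrative setup; any UCI engine is driven the same way.
p = subprocess.Popen(["stockfish"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     universal_newlines=True, bufsize=1)
p.stdin.write("uci\nisready\n")
while not p.stdout.readline().startswith("readyok"):
    pass
# e.g. stuck_to_solution(p, some_fen, {"d2d4"})  # key move in UCI notation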
peter
Posts: 3186
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by peter »

Laskos wrote: Wed Jun 19, 2019 7:17 pm
STS is clearly a flawed positional suite, built by over-analyzing with the traditional AB engines of its time, especially Rybka. It fails to show Leela's great positional superiority over traditional engines.
So you want to exchange the bias you believe you see in favor of the A-B engines developed over the years (the engines as well as the test suite) for the one you see with Leela and your suite, the two of which have found each other over the last year?

Or what else does it mean to you that NN engines are great at these 200 positions you chose arbitrarily, other than that these are 200 great positions for NN engines? Or, as the old saying goes: test sets test the test positions.
:)
Still, there is only one kind of engine your suite is great for (or the one engine the suite is great for), and from now on you're going to test NN engines only with these 200 positions, which you think are great for these engines, to show how well the development of the NNs progresses on these 200 positions.

I would find it more logical to look for progress in other positions too, ones not explicitly as favorable to NN engines but better suited to non-NN engines as well; otherwise I might as well let NN engines play only against other NN engines from now on, to test their progress by game playing.

That's what I really wanted to point out: "Elo" measured by games against different pools of engines, with different opening sets and different hardware/TCs, isn't a transitive measurement in computer chess anymore, if it ever was; and the same holds for test suites, of course, even more than for eng-eng game playing.

Have fun with your 200 positions and the very personal measurements you get out of them. But don't mind that I'm no more interested in your results than I am in mine.
:)
Peter.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Laskos »

peter wrote: Wed Jun 19, 2019 7:47 pm
So you want to exchange the bias you believe you see in favor of the A-B engines developed over the years (the engines as well as the test suite) for the one you see with Leela and your suite, the two of which have found each other over the last year?
I am not that big a fan of my suite; I guess some 10% of its unique solutions are wrong. But you seem not to acknowledge that Leela with the late 20b nets is vastly superior _positionally_ to regular AB engines. Time after time you come up with something against Leela and how it is not worth much in analysis. I repeat: I didn't build my positional opening suite 3 years ago waiting for Leela, and if Leela solves 145-155/200 positions in 1-2 seconds while top traditional engines solve only 95-115/200, it just so happens that databases of human games in the opening somehow agree with Leela and less so with Stockfish. Allow me to trust databases of human games, with hundreds to thousands of games for each position at an average FIDE level of about 2500, rather than take seriously Peter Martan with his grudges against Leela and his "vast experience with analyzing". Again, STS has shown itself a pretty useless tool for measuring "positional understanding". It is more like Rybka left to analyze each position for a couple of hours, a completely wrong methodology for building positional test suites. You would do much better to switch to tactical test suites like the Arasan suite; those are indeed analyzable with traditional engines, and there this "useless" Leela indeed performs very poorly. To me the deep tactical shots so many people love are boring. Luckily, they rarely occur in real games.
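The construction described here, deriving a unique "solution" from strong human praxis, can be sketched as a simple filter over a move-frequency database. Everything below is a toy illustration with made-up thresholds and data shape, not the actual procedure behind my 200 positions:

Code: Select all

def pick_suite_positions(stats, min_games=200, min_share=0.70):
    # From {fen: {move: count in strong human games}}, keep positions where
    # one move dominates clearly enough to serve as the unique test solution.
    suite = []
    for fen, counts in stats.items():
        total = sum(counts.values())
        if total < min_games:
            continue                      # too few games to trust the consensus
        move, n = max(counts.items(), key=lambda kv: kv[1])
        if n / total >= min_share:
            suite.append((fen, move))     # EPD output would be: fen + " bm " + move
    return suite

# Toy input: only the first position has a clear human consensus.
stats = {
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -": {"e4": 700, "d4": 150, "c4": 150},
    "<another FEN> w - -": {"Nf3": 90, "d4": 80, "c4": 60},
}
print(pick_suite_positions(stats, min_games=100))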
peter
Posts: 3186
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by peter »

Laskos wrote: Wed Jun 19, 2019 8:23 pm
Allow me to trust databases of human games, with hundreds to thousands of games for each position at an average FIDE level of about 2500, rather than take seriously Peter Martan with his grudges against Leela and his "vast experience with analyzing".
From my personal point of view, it's you who takes all these things much too emotionally and subjectively ad personam, much more than I did and do; otherwise I could just as well call your results those of a biased Leela fan, the way you call my criticism of your data (your "test suite") a grudge against Leela.
As a matter of fact, I said nothing more than this: if you call STS biased in a certain way, you have to admit the same for your positions too. There is a certain bias in any kind of test suite, as in every other method of testing, e.g. in game playing with particular opening sets, hardware/TCs, and pools of engines compared against each other.
No problem for me, sorry, if it is one for you.

But your psychological projections aren't my problem either, you see?
:)
Peter.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by Laskos »

peter wrote: Wed Jun 19, 2019 8:49 pm
From my personal point of view, it's you who takes all these things much too emotionally and subjectively ad personam, much more than I did and do; otherwise I could just as well call your results those of a biased Leela fan, the way you call my criticism of your data (your "test suite") a grudge against Leela.
You have expressed your scepticism about using Leela in analysis several times in the past. What I observe empirically on my hardware in Leela-SF games is that, endgames aside, in 80%-90% (or an even higher percentage) of moves Leela's choices are superior to SF's (sooner or later SF realizes what Leela saw earlier); SF is better only in endgames and when sharp combinations with an often unique line occur. I am not sure how one can say Leela is not of much use as an analysis tool without having a grudge against it.

Can you bluntly admit that Leela on an RTX GPU is objectively MUCH stronger positionally than any of these Stockfishes, Komodos, etc., even on a 64-core machine (no matter how many cores)? And that STS fails miserably to show that, while claiming to measure exactly that?
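"Sooner or later SF realizes what Leela saw earlier" can be turned into a concrete measurement: the first depth at which SF's best move matches Leela's one-node choice. A rough sketch, with engine commands as placeholders; this is not how I obtained the 80%-90% figure above:

Code: Select all

import subprocess

def uci_engine(cmd):
    # Start a UCI engine and wait until it is ready.
    p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True, bufsize=1)
    p.stdin.write("uci\nisready\n")
    while not p.stdout.readline().startswith("readyok"):
        pass
    return p

def move_at(p, fen, go_cmd):
    p.stdin.write("position fen %s\n%s\n" % (fen, go_cmd))
    while True:
        line = p.stdout.readline()
        if line.startswith("bestmove"):
            return line.split()[1]

def agreement_depth(fen, leela, sf, max_depth=30):
    # Smallest depth at which SF's choice matches Leela's 1-node choice.
    target = move_at(leela, fen, "go nodes 1")
    for d in range(1, max_depth + 1):
        if move_at(sf, fen, "go depth %d" % d) == target:
            return d
    return None   # SF never agreed within max_depth

# Illustrative usage (binaries and FEN are placeholders):
# leela, sf = uci_engine(["lc0"]), uci_engine(["stockfish"])
# print(agreement_depth("<some middlegame FEN>", leela, sf))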
peter
Posts: 3186
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: STS rating v13.1 for Lc0 0.21.2 with nodes = 1

Post by peter »

Laskos wrote: Wed Jun 19, 2019 9:19 pm Can you bluntly admit that Leela on an RTX GPU is objectively MUCH stronger positionally than any of these Stockfishes, Komodos, etc., even on a 64-core machine (no matter how many cores)? And that STS fails miserably to show that, while claiming to measure exactly that?
You don't want to understand that your definition of "positional strength", which I hope for your sake is more than the result of your "test suite", isn't mine; and mine, of course, isn't simply any other single test suite either, not any single one, not even the test suite that to me is still the better one, STS.

It wasn't me demanding from you a test suite that would replace eng-eng games for showing some particular kind of playing strength, or even overall playing strength (which is still harder to define, because it demands even more individual positions to be tested), beyond the strength reflected by the single test you run.

Every test, whether by game playing or by test suites, depends on positions: opening, middlegame, and endgame positions.

Game playing from early opening positions only always tests the opening positions, and the engines' opening strength relative to each other, three times as much as it tests endgame positions and one and a half times as much as middlegame positions, because the opening positions are carried through the opening, the middlegame, and the endgame as the game progresses (the opening influences all three phases, a middlegame position two, an endgame position only one).

So if you want better measurements, you have to have better and more test positions: opening, middlegame, and endgame positions.
By game playing from certain opening positions you test the engines' ability to deal with those opening positions; by game playing from middlegame positions you test their ability to deal with those middlegame positions; and the same goes for endgame positions.

If you think your positional test suite best represents your definition of positional strength, fine; so be it, for you and your definition.
What you must not expect is that it would be anybody else's definition, or their test suite of first and only choice, too.

If I find the positional qualities tested by STS a better fit for my definition of positional strength, you'll have to live with that as well, or call me whatever you want.
But remember, it wasn't and isn't me who claimed any single test suite to be a measurement of anybody else's definition of positional strength than the one given by the author of the suite, as one well-defined definition of its own, no more and no less.
Period.
Peter.