SPCC: Testrun of SF nnue gk200627 finished

Ron Langeveld
Posts: 140
Joined: Tue Jan 05, 2010 8:02 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Ron Langeveld »

pohl4711 wrote: Thu Jul 23, 2020 4:46 pm
Ron Langeveld wrote: Thu Jul 23, 2020 10:18 am You should not run 20 games at the same time on a 12-core (24-thread) laptop. The conditions won't be the same (unreliable results).
That is not true. All engines run slower if more than 12 threads are used; that is clear. But that is a slowdown, not a distortion.
I opened 20 instances of Stockfish 11 in console mode in Windows and started all of them with "go infinite". All of them ran smoothly and at a stable speed. As long as at least one thread is not in use (free for Windows operations), there is no distortion. And I keep 4 threads unused.
Every physical core has a second virtual core/thread that is a lot slower in nps. Your test is seriously flawed because it assumes that each of the 20 Stockfish instances stays fixed on the thread it started on, which is absolutely not true. Each instance gets moved across many threads over time, so over a long session the nps averages out and looks the same for every instance. In a test game, where a move is calculated in a much shorter time frame such as 1 second, there is considerably less thread switching, so the chance that some moves are computed at low nps and others at high nps rises considerably, with inconsistent move quality as a result. If you don't believe me, pick any pool of engines with known Elo differences and try to reproduce those differences, to specific error margins, in one test with 11 games in parallel and another with 20 games in parallel. You will notice that the latter test needs more games to reach the same level of certainty.
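One way to check the point about short searches would be to sample nps per one-second search instead of over a long "go infinite" run. The following is only a rough sketch under stated assumptions: the engine path is hypothetical, the helper names (parse_nps, sample_nps, spread) are made up for illustration, and it relies only on standard UCI commands (uci, isready, ucinewgame, position, go movetime, quit). Running several copies of it at once, for example 11 and then 20, and comparing the reported spread would show whether per-move speed really varies more under heavier load.

```python
import statistics
import subprocess


def parse_nps(info_line):
    """Return the nps value from a UCI 'info' line, or None if it has none."""
    tokens = info_line.split()
    if tokens[:1] != ["info"] or "nps" not in tokens:
        return None
    idx = tokens.index("nps")
    try:
        return int(tokens[idx + 1])
    except (IndexError, ValueError):
        return None


def sample_nps(engine_path, searches=20, movetime_ms=1000):
    """Run short fixed-time searches and collect the last nps of each one.

    engine_path is an assumption: point it at any UCI engine binary.
    """
    proc = subprocess.Popen(
        [engine_path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,
    )

    def send(command):
        proc.stdin.write(command + "\n")
        proc.stdin.flush()

    def wait_for(prefix):
        # Read engine output until a line starts with prefix; remember
        # the most recent nps value seen on the way.
        last_nps = None
        for line in proc.stdout:
            nps = parse_nps(line)
            if nps is not None:
                last_nps = nps
            if line.startswith(prefix):
                return last_nps
        raise RuntimeError("engine closed its output unexpectedly")

    send("uci")
    wait_for("uciok")
    samples = []
    for _ in range(searches):
        send("ucinewgame")  # reset engine state so searches are comparable
        send("isready")
        wait_for("readyok")
        send("position startpos")
        send(f"go movetime {movetime_ms}")
        nps = wait_for("bestmove")
        if nps is not None:
            samples.append(nps)
    send("quit")
    proc.wait()
    return samples


def spread(samples):
    """Return (mean nps, coefficient of variation); needs at least 2 samples."""
    mean = statistics.mean(samples)
    return mean, statistics.stdev(samples) / mean
```

A call such as spread(sample_nps("./stockfish")) (path hypothetical) returns the mean nps and its relative spread for one instance. A clearly larger spread at 20 parallel instances than at 11 would support the argument above; a similar spread would count against it.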
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by pohl4711 »

Ron Langeveld wrote: Fri Jul 24, 2020 8:39 am
Every physical core has a second virtual core/thread that is a lot slower in nps. Your test is seriously flawed because it assumes that each of the 20 Stockfish instances stays fixed on the thread it started on, which is absolutely not true. Each instance gets moved across many threads over time, so over a long session the nps averages out and looks the same for every instance. In a test game, where a move is calculated in a much shorter time frame such as 1 second, there is considerably less thread switching, so the chance that some moves are computed at low nps and others at high nps rises considerably, with inconsistent move quality as a result. If you don't believe me, pick any pool of engines with known Elo differences and try to reproduce those differences, to specific error margins, in one test with 11 games in parallel and another with 20 games in parallel. You will notice that the latter test needs more games to reach the same level of certainty.
All I can say is: my results are valid. Look at the latest one: Stockfish 200717 is +30 Elo better than Stockfish 11 in my rating list:
https://www.sp-cc.de
And look at the regression-test page of Stockfish:
https://github.com/glinscott/fishtest/w ... sion-Tests
Progress of Stockfish 200717 (single) to Stockfish 11: +30.7 Elo.
Ron Langeveld
Posts: 140
Joined: Tue Jan 05, 2010 8:02 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Ron Langeveld »

Of course your results can be valid. I never said they weren't. I was addressing another issue, though: your tests suffer from "noise" because a significant percentage of moves are weaker as a result of low nps. This means that you have to run many more games to reach the same accuracy. Put differently, if you measure a 30-point Elo difference, you could have reached the same error margins with fewer games by using just 11 physical cores.
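To make "same error margins" concrete, here is a minimal sketch of how an Elo difference and its approximate 95% interval can be derived from a match result. It assumes the usual logistic Elo model and a normal approximation for the score; the function names and the win/draw/loss numbers are hypothetical and not taken from any run discussed in this thread.

```python
import math


def score_to_elo(score):
    """Elo difference implied by an expected score under the logistic model."""
    eps = 1e-9
    score = min(max(score, eps), 1.0 - eps)  # keep the logarithm defined
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result; z=1.96 is roughly 95%."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the score, from the observed W/D/L frequencies.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    std_error = math.sqrt(variance / games)
    return (
        score_to_elo(score),
        score_to_elo(score - z * std_error),
        score_to_elo(score + z * std_error),
    )


# Hypothetical results: same W/D/L proportions, four times as many games.
for label, (w, d, l) in {"1000 games": (300, 500, 200),
                         "4000 games": (1200, 2000, 800)}.items():
    elo, low, high = elo_with_margin(w, d, l)
    print(f"{label}: {elo:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

The interval narrows roughly with the square root of the number of games. Comparing the intervals from an 11-game-parallel run and a 20-game-parallel run at equal game counts is one way to test the claim that the heavier load costs accuracy.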
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Rebel »

Ron Langeveld wrote: Fri Jul 24, 2020 2:18 pm Of course your results can be valid. I never said they weren't. I was addressing another issue, though: your tests suffer from "noise" because a significant percentage of moves are weaker as a result of low nps. This means that you have to run many more games to reach the same accuracy. Put differently, if you measure a 30-point Elo difference, you could have reached the same error margins with fewer games by using just 11 physical cores.
Assumptions without evidence. Show me one case.
90% of coding is debugging, the other 10% is writing bugs.
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by pohl4711 »

Rebel wrote: Fri Jul 24, 2020 4:44 pm
Assumptions without evidence. Show me one case.
I ran an experiment on my quad-core notebook (8 hyperthreading "cores"): I ran five Stockfish instances and then started two more simultaneously. As I expected, the two new instances ran at exactly the same speed, even though five others were already running, for a total of seven Stockfish instances on a quad-core.
QED
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Rebel »

pohl4711 wrote: Sat Jul 25, 2020 6:03 am
I ran an experiment on my quad-core notebook (8 hyperthreading "cores"): I ran five Stockfish instances and then started two more simultaneously. As I expected, the two new instances ran at exactly the same speed, even though five others were already running, for a total of seven Stockfish instances on a quad-core.
QED
I have done many tests using all the threads available and never noticed any problem. It's an important subject, the basis for measuring possible Elo improvements, so I ran several tests playing identical engines against each other at full speed and ran a tool to inspect the output after each match. It looked perfect every time.

I don't pretend to know the truth of the matter, and while I understand the philosophical assumption behind the "cores-1" rule, I haven't seen any proof of it, and it may well be a myth.
Modern Times
Posts: 3546
Joined: Thu Jun 07, 2012 11:02 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Modern Times »

Someone here commented a while ago that hyperthreading, and the AMD equivalent, has probably come a long way since Intel first introduced it all those years ago. Stefan knows what he is doing, so I'm inclined to think all is OK.
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by pohl4711 »

Rebel wrote: Sat Jul 25, 2020 8:35 am
I don't pretend to know the truth of the matter, and while I understand the philosophical assumption behind the "cores-1" rule, I haven't seen any proof of it, and it may well be a myth.
I do not believe that using all threads distorts the results by making some engines run faster or slower than others. But when Windows itself uses more or less of the hardware, the running engines can run a little faster or slower. That is no big deal, but when running notebooks 24/7 it is better not to use 100% of the hardware. So I use 20 of 24 threads and all is fine.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SPCC: Testrun of SF nnue gk200627 finished

Post by Rebel »

pohl4711 wrote: Sat Jul 25, 2020 10:33 am
I do not believe that using all threads distorts the results by making some engines run faster or slower than others. But when Windows itself uses more or less of the hardware, the running engines can run a little faster or slower. That is no big deal, but when running notebooks 24/7 it is better not to use 100% of the hardware. So I use 20 of 24 threads and all is fine.
If a PC has internet access and/or anti-virus software installed, it's a wise thing to do; on a clean PC (IMO) there is no need.