Stockfish randomicity

Discussion of chess software programming and technical issues.

Moderator: Ras

amchess
Posts: 356
Joined: Tue Dec 05, 2017 2:42 pm

Re: Stockfish randomicity

Post by amchess »

Chess knowledge is not only about playing the best (or avoiding the worst) moves, but also about finding test samples that are as comprehensive and significant as possible.
I tested Stockfish at a long time control (25+10) against an engine of similar strength over 200 games. The test took more than 10 days, and I ran it twice. In the first run, Stockfish won with three games to spare; in the second, the result was the opposite.
As for tests at ultra-long time controls over thousands of games, they favor patches that prune very hard. In fact, some derivatives such as Crystal and ShashChess show that they can resolve many more hard positions than stock Stockfish at non-ultrafast time controls, precisely because they do not rely so heavily on selectivity techniques.
So, in my opinion, testing over thousands of games is only good if you want to build a bullet monster.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Stockfish randomicity

Post by connor_mcmonigle »

amchess wrote: Tue Sep 26, 2023 2:00 pm Chess knowledge is not only about playing the best (or avoiding the worst) moves, but also about finding test samples that are as comprehensive and significant as possible.
I tested Stockfish at a long time control (25+10) against an engine of similar strength over 200 games. The test took more than 10 days, and I ran it twice. In the first run, Stockfish won with three games to spare; in the second, the result was the opposite.
As for tests at ultra-long time controls over thousands of games, they favor patches that prune very hard. In fact, some derivatives such as Crystal and ShashChess show that they can resolve many more hard positions than stock Stockfish at non-ultrafast time controls, precisely because they do not rely so heavily on selectivity techniques.
So, in my opinion, testing over thousands of games is only good if you want to build a bullet monster.
As has been explained to you countless times at this point, a 200-game test is meaningless garbage. Even single-threaded, you'll get different results between test runs due to slight timing variations.

Crystal shows that you can weaken Stockfish by >100 Elo and slightly improve its performance on tactical test positions. This doesn't mean that pruning and reducing less is better. On the contrary, from a game-playing perspective it is provably much worse, as it results in the engine spending time on lines which usually turn out to be bad.
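The run-to-run variance point can be illustrated numerically. A minimal sketch (not anyone's actual test setup; the 60% draw rate is an assumed figure for long time controls): simulate many 200-game matches between two literally identical engines and count how often one side ends up "winning" by a couple of points anyway.

```python
import random

random.seed(0)

def simulate_match(n_games=200, draw_rate=0.6):
    """Total score for engine A over n_games between two *identical* engines:
    each game is a win (1), draw (0.5) or loss (0) with symmetric probabilities."""
    win = (1.0 - draw_rate) / 2.0
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < win:
            score += 1.0
        elif r < win + draw_rate:
            score += 0.5
    return score

# Repeat the 200-game match many times: even with zero real strength
# difference, a lead of a couple of points is the norm, not the exception.
trials = 5000
lopsided = sum(abs(simulate_match() - 100.0) >= 2.0 for _ in range(trials))
print(lopsided / trials)  # a large fraction of runs produce a 2+ point "winner"
```

Under these assumptions, most runs of a perfectly symmetric 200-game match end with one side visibly "ahead", which is exactly the flip-flop the two 25+10 runs showed.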
amchess
Posts: 356
Joined: Tue Dec 05, 2017 2:42 pm

Re: Stockfish randomicity

Post by amchess »

I don't agree. As I, too, have explained many times: at VLTC, if you classify the positions beforehand, you achieve both goals. That is the ShashChess idea. Moreover, 200 games MUST be meaningful, because the samples are carefully chosen using chess knowledge and the time control is 25+10.
syzygy
Posts: 5687
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish randomicity

Post by syzygy »

amchess wrote: Tue Sep 26, 2023 2:00 pm Chess knowledge is not only about playing the best (or avoiding the worst) moves, but also about finding test samples that are as comprehensive and significant as possible.
I tested Stockfish at a long time control (25+10) against an engine of similar strength over 200 games. The test took more than 10 days, and I ran it twice. In the first run, Stockfish won with three games to spare; in the second, the result was the opposite.
So what you want to see is them playing the same two games 2x100 times, with the same two outcomes. Right.
amchess
Posts: 356
Joined: Tue Dec 05, 2017 2:42 pm

Re: Stockfish randomicity

Post by amchess »

I mean you cannot ignore the fact that, after 200 games at 25+10, Stockfish did not prevail.
My purpose with ShashChess is not to make a bullet monster, but a tool USEFUL to the practical player:
- for correspondence play (thus optimized for long time controls);
- for OTB play, with a handicap mode based not simply on a number of random errors, but on a model of a player's thinking, provided by the classical evaluation function with weights that can be turned on and off according to the level of play. With the neural network, which is a black box, this fine granularity would not be possible, and a handicap mode like the one in the MAIA project would be too approximate, because it is based on error-filled games between humans.
On the other hand, even Stockfish 11 (the last version without a net) is definitely stronger than the human world champion.
In my opinion, research in general is useless and terribly boring if it does not make people's lives better.
As for the testing strategy, I do not have a ready recipe; I was pondering how to improve it, given precisely the randomness introduced by the operating system and by the engine itself with Lazy SMP. Perhaps testing against weaker engines could also be considered.
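The handicap idea described above (classical evaluation terms that can be switched on and off per level) might be sketched roughly as follows. All term names, weights, feature values, and level definitions here are hypothetical illustrations, not ShashChess code:

```python
# Hypothetical sketch: a classical evaluation built from named terms,
# where each simulated skill level only "sees" a subset of the terms,
# instead of injecting random errors into the move choice.

TERMS = {
    "material":       1.0,
    "king_safety":    1.0,
    "pawn_structure": 1.0,
    "mobility":       1.0,
}

# Which terms are enabled for each simulated level (illustrative only).
LEVELS = {
    "beginner": {"material"},
    "club":     {"material", "mobility"},
    "expert":   {"material", "mobility", "pawn_structure"},
    "master":   set(TERMS),
}

def evaluate(features, level):
    """Weighted sum over only the terms enabled for this level."""
    enabled = LEVELS[level]
    return sum(TERMS[t] * features[t] for t in TERMS if t in enabled)

features = {"material": 3.0, "king_safety": -1.5,
            "pawn_structure": 0.5, "mobility": 0.8}
print(evaluate(features, "beginner"))  # material term only
print(evaluate(features, "master"))    # all terms contribute
```

The point of such a design is exactly the fine granularity mentioned above: each level is an interpretable subset of evaluation knowledge, which an end-to-end neural network cannot expose.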
Ciekce
Posts: 192
Joined: Sun Oct 30, 2022 5:26 pm
Full name: Conor Anstey

Re: Stockfish randomicity

Post by Ciekce »

amchess wrote: Tue Sep 26, 2023 8:32 pm I don't agree. As I, too, have explained many times: at VLTC, if you classify the positions beforehand, you achieve both goals. That is the ShashChess idea. Moreover, 200 games MUST be meaningful, because the samples are carefully chosen using chess knowledge and the time control is 25+10.
You can disagree all you want; you're arguing against basic statistics.

200 games to measure such a small strength difference is *never* representative; opening selection is irrelevant.
amchess
Posts: 356
Joined: Tue Dec 05, 2017 2:42 pm

Re: Stockfish randomicity

Post by amchess »

Ciekce wrote: Wed Sep 27, 2023 12:03 pm
amchess wrote: Tue Sep 26, 2023 8:32 pm I don't agree. As I, too, have explained many times: at VLTC, if you classify the positions beforehand, you achieve both goals. That is the ShashChess idea. Moreover, 200 games MUST be meaningful, because the samples are carefully chosen using chess knowledge and the time control is 25+10.
You can disagree all you want; you're arguing against basic statistics.

200 games to measure such a small strength difference is *never* representative; opening selection is irrelevant.
The basics of statistics say that the more significant the samples are, the smaller they can be, somewhat like the projections made for elections. In the case of chess, obviously, chess knowledge cannot be ignored. Apparently, this is not so obvious... I certainly passed my math exams... Anyway, no problem: everyone can enjoy themselves however they want. I'm not interested in competing over who has the longest one. I am much happier being useful to others. The discussion has become sterile, because there is no worse deaf person than the one who does not want to listen.
syzygy
Posts: 5687
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish randomicity

Post by syzygy »

amchess wrote: Wed Sep 27, 2023 9:07 pm The discussion has become sterile, because there is no worse deaf person than the one who does not want to listen.
Always good to witness a moment of self-reflection.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Stockfish randomicity

Post by connor_mcmonigle »

amchess wrote: Wed Sep 27, 2023 9:07 pm
Ciekce wrote: Wed Sep 27, 2023 12:03 pm
amchess wrote: Tue Sep 26, 2023 8:32 pm I don't agree. As I, too, have explained many times: at VLTC, if you classify the positions beforehand, you achieve both goals. That is the ShashChess idea. Moreover, 200 games MUST be meaningful, because the samples are carefully chosen using chess knowledge and the time control is 25+10.
You can disagree all you want; you're arguing against basic statistics.

200 games to measure such a small strength difference is *never* representative; opening selection is irrelevant.
The basics of statistics say that the more significant the samples are, the smaller they can be, somewhat like the projections made for elections. In the case of chess, obviously, chess knowledge cannot be ignored. Apparently, this is not so obvious... I certainly passed my math exams... Anyway, no problem: everyone can enjoy themselves however they want. I'm not interested in competing over who has the longest one. I am much happier being useful to others. The discussion has become sterile, because there is no worse deaf person than the one who does not want to listen.
It is exceedingly evident that, if you ever took your math exams as you claim, you are quite rusty at present.
In this very thread, you were painfully close to the realization that the randomness of game results invalidated your garbage testing methodology. Every game is a three-sided (weighted) coin toss. You need a lot of coin tosses to accurately estimate the weights of the three-sided coin. It doesn't matter if your 200 starting positions are somehow an immaculate embodiment of all of chess - 200 coin tosses will just never be enough.
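The three-sided-coin argument can be made concrete with a back-of-the-envelope error-bar calculation. This is a sketch under assumed numbers (a 60% draw rate, a normal approximation around a 50% score), not a description of fishtest's actual SPRT machinery:

```python
import math

def elo_margin(n_games, draw_rate, z=1.96):
    """95% confidence half-width, in Elo, for a match between two
    equal-strength engines (mean score 50%)."""
    var = (1.0 - draw_rate) / 4.0       # per-game score variance at mu = 0.5
    se = math.sqrt(var / n_games)       # standard error of the mean score
    k = 400.0 / (math.log(10) * 0.25)   # d(Elo)/d(score) at score = 0.5
    return z * se * k

def games_needed(margin_elo, draw_rate, z=1.96):
    """Games required before the 95% interval shrinks to +/- margin_elo."""
    var = (1.0 - draw_rate) / 4.0
    k = 400.0 / (math.log(10) * 0.25)
    return math.ceil((z * k / margin_elo) ** 2 * var)

print(round(elo_margin(200, 0.6), 1))  # roughly +/-30 Elo after 200 games
print(games_needed(2.0, 0.6))          # tens of thousands of games for 2 Elo
```

Under these assumptions, 200 games leave an uncertainty of roughly 30 Elo either way, so they cannot distinguish engines a handful of Elo apart no matter how the starting positions were chosen; resolving a 2 Elo difference takes on the order of tens of thousands of games.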
amchess
Posts: 356
Joined: Tue Dec 05, 2017 2:42 pm

Re: Stockfish randomicity

Post by amchess »

No! This is not a three-sided coin toss, because the outcome probabilities are not uniformly distributed.
Indeed, with NNUE, it is increasingly difficult to find positions that can swing between a decisive result for one color and a draw.
This is where chess knowledge comes in.
In general, business knowledge improves a testing strategy (software engineering exam).
I didn't allow myself to insult you, partly because I don't know you, but more importantly because I would never do that.
Since you don't see it that way, this is my last post in this thread,
because it has become totally unconstructive.
I ask the moderators to intervene, and I hope to be heard this time.
It should not be possible that a topic cannot be discussed civilly, without boorish personal attacks, without even knowing one's interlocutor.