Uri Blass wrote: ↑Sat Mar 11, 2023 9:32 am
1) I disagree about the information I need.
I think I also need some unbiased estimate of the value of a change, in order to have more knowledge.
You'll have to run your own tests to get the information you desire...
Not everything is about gaining Elo as fast as possible.
SF development is about improving the engine as well as possible with the finite resources that are available. The available resources may be significant, but they are finite.
I'm sure resource usage is not optimal, but the use of SPRT is not the problem.
I prefer to get less Elo and better understanding, because better understanding may help us make better decisions about what to test later.
There is no way a human can "understand" why, say, a parameter tweak improves play, other than just accepting the test results which show it.
There are patches that deserve special attention, and those usually get special attention.
2) I am always afraid there may be errors that are not related to statistical noise (for example, suppose that while stockfishA is being tested against stockfishB, in some game stockfishA runs 10 times slower in nodes per second than normal because a different process is running on the same computer at the same time; then the results are not reliable).
But those things are supposed to be averaged out by fishtest. Whether fishtest sufficiently randomizes things to average out such noise I do not know for sure, because I did not look into the fishtest code, but that is where one should look for this. It is much better than redoing all the tests and then still wondering whether there may have been some noise, redoing the tests again, still wondering about noise: an infinite loop.
Or do you think your "fixed number of games" are immune to noise?
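One reason such machine-speed noise tends to add variance rather than bias: a fishtest worker runs both engines of a game on the same machine, so a slowdown hits both sides at once and, to first order, cancels. A toy simulation to illustrate (logistic win model, draws omitted, slowdown figures invented):

```python
import random

def win(elo_edge):
    """Simulated win for side A under the logistic Elo model."""
    return random.random() < 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))

def average_score(games, shared_slowdown):
    """Average score of A vs B across machines with random slowdowns.

    If shared_slowdown is True, both engines suffer the same penalty
    (as when a worker plays both sides of a game), so it cancels.
    If False, only A suffers, which biases the result."""
    score = 0
    for _ in range(games):
        penalty = random.uniform(0, 50)  # hypothetical slowdown, in Elo
        a = -penalty
        b = -penalty if shared_slowdown else 0.0
        score += win(a - b)
    return score / games

random.seed(0)
print(average_score(20000, True))   # ~0.50: noise widens variance only
print(average_score(20000, False))  # ~0.46: one-sided noise is a bias
```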
I think it is possible to detect this type of problem automatically, but I do not know whether it is done.
Fishtest does check for machines that produce results that deviate too much from the norm and purges those games.
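I do not know the exact rule fishtest applies, but the idea is some residual check along these lines (a hypothetical sketch, not fishtest's code; the real check works on game-pair statistics):

```python
import math

def purge_outliers(workers, z_cutoff=3.0):
    """Flag machines whose results deviate too far from the pool.

    workers: list of (games, avg_score) per machine, where avg_score
    counts a win as 1, a draw as 0.5 and a loss as 0.
    z_cutoff: invented threshold; the real cutoff will differ."""
    total = sum(g for g, _ in workers)
    mean = sum(g * s for g, s in workers) / total
    # Crude pooled per-game variance; pentanomial game-pair statistics
    # would give a sharper estimate.
    var = sum(g * (s - mean) ** 2 for g, s in workers) / total
    kept, purged = [], []
    for g, s in workers:
        se = math.sqrt(var / g) if var > 0 else 0.0
        z = abs(s - mean) / se if se > 0 else 0.0
        (purged if z > z_cutoff else kept).append((g, s))
    return kept, purged
```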
3) It will always be the case that some of the patches are not good, but if we use a fixed number of games then we can get a better estimate of the percentage of good patches (out of accepted patches) and of the value of every patch.
And it will always be far more efficient to just tighten the SPRT error margin to whatever you are comfortable with.
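To make the error margin concrete: an SPRT accumulates a log-likelihood ratio and stops at bounds that depend only on the false-positive rate alpha and false-negative rate beta you choose. A minimal win/loss sketch (fishtest really uses a pentanomial model over game pairs; this only shows the mechanics):

```python
import math, random

def expected_score(elo):
    """Expected score for an Elo advantage under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt(elo0, elo1, alpha, beta, true_elo, max_games=10**6):
    """Run one simulated SPRT of H0: elo = elo0 vs H1: elo = elo1."""
    lower = math.log(beta / (1.0 - alpha))    # accept H0 at or below
    upper = math.log((1.0 - beta) / alpha)    # accept H1 at or above
    p0, p1 = expected_score(elo0), expected_score(elo1)
    p = expected_score(true_elo)              # the unknown truth
    llr, games = 0.0, 0
    while lower < llr < upper and games < max_games:
        games += 1
        if random.random() < p:               # new version wins
            llr += math.log(p1 / p0)
        else:                                 # new version loses
            llr += math.log((1.0 - p1) / (1.0 - p0))
    return ("H1" if llr >= upper else "H0"), games
```

Tightening alpha = beta from 0.05 to 0.01 moves the bounds from about ±2.94 to about ±4.60, so the test simply runs longer: you buy exactly as much confidence as you are willing to pay games for.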
What are you going to do if your fixed number of games doesn't clearly confirm the SPRT result? Right, you will rerun things. So the number of games will not be "fixed" at all. And when you're finally done, you will worry about noise and redo everything from scratch, and so on.
A patch applied to SF is never final. Almost everything gets revisited regularly. Patches that in reality lower Elo but somehow made it into SF will eventually be overwritten.
The system today will always find "improvements" even if there is no improvement and every patch reduces the level of the engine by 0.1 Elo, because after enough patches that reduce the rating by 0.1 Elo and fail SPRT, one test may be lucky enough to pass SPRT.
Once SF has reached the theoretical max Elo, then every patch will only lose Elo. But after some of these patches have been applied, there will be room for improvement again. So nothing to worry about. We'll just have to accept that a ceiling will be reached eventually. Maybe this year, maybe in 2187.
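As for the rate at which such lucky passes happen: it is bounded by alpha by construction, and you can estimate it by reusing the sprt() sketch above (hypotheses [0, 5] are chosen only to keep this toy run fast; actual fishtest bounds are much tighter, and the conclusion is the same):

```python
# Estimate how often a patch that actually loses 0.1 Elo slips past an
# SPRT with hypotheses [0, 5] and alpha = beta = 0.05; reuses sprt()
# and the imports from the sketch above.
random.seed(1)
trials = 100
passes = sum(sprt(0, 5, 0.05, 0.05, true_elo=-0.1)[0] == "H1"
             for _ in range(trials))
print(f"{passes}/{trials} false positives")
# The rate stays at or below alpha, so after enough 0.1-Elo regressions
# are submitted, one is eventually expected to pass, which is exactly
# why accepted patches keep being revisited.
```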