I'm disappointed with Stockfish dev.

Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: I'm disappointed with Stockfish dev.

Post by Sopel »

Uri Blass wrote: Sat Mar 11, 2023 9:32 am I prefer to get less Elo and better understanding, because better understanding may help with better decisions about what to test later.
How helpful would it be to have +0.7 Elo ±0.5 for every patch?
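As a rough illustration of what ±0.5 Elo per patch would cost (a toy calculation only, assuming independent games, a simple win/draw/loss model with a guessed draw ratio, and the normal approximation; not fishtest's methodology):

Code:

import math

# Toy estimate: games needed so the 95% error bar on a patch's measured Elo
# is +/- target_elo_error. Assumes independent games, symmetric wins/losses,
# a fixed guessed draw ratio, and the normal approximation -- illustrative only.
def games_for_elo_error(target_elo_error, confidence_z=1.96, draw_ratio=0.7):
    score_var = (1.0 - draw_ratio) / 4.0          # per-game variance of the score
    dscore_per_elo = math.log(10) / 1600.0        # d(expected score)/d(Elo) near equality
    score_se = (target_elo_error / confidence_z) * dscore_per_elo
    return score_var / score_se ** 2

for draws in (0.5, 0.7, 0.9):
    n = games_for_elo_error(0.5, draw_ratio=draws)
    print(f"draw ratio {draws:.0%}: ~{n:,.0f} games for +/-0.5 Elo at 95%")

Under these toy assumptions that ranges from roughly 185,000 games per patch at a 90% draw ratio to over 900,000 at 50%, which is why error bars that tight on every patch are not a realistic target.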
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

chrisw wrote: Sat Mar 11, 2023 12:35 am
syzygy wrote: Fri Mar 10, 2023 11:13 pm
CornfedForever wrote: Fri Mar 10, 2023 11:02 pm
syzygy wrote: Fri Mar 10, 2023 10:19 pm
CornfedForever wrote: Wed Mar 08, 2023 10:06 pm
syzygy wrote: Wed Mar 08, 2023 9:34 pm
CornfedForever wrote: Tue Mar 07, 2023 4:02 am
Eduard wrote: Mon Mar 06, 2023 12:25 pm

A total of 73 parameters were changed here. Known parameters that are constantly changing. Let's see how long it takes before one of these parameters is changed again; it won't take too long. :)

And they wonder why I question how they can know which changes actually resulted in a positive change and which in a negative one. :roll:
No, they see Dunning-Kruger at work.
Enough with what is essentially name-calling rather than an argument. I'm talking about the data and not knowing with any real certainty how you get to a +Elo or a -Elo (outside of tolerance), because so much is tested together. I mean... if every patch were a positive, SF would be increasing in strength every week. It is not.
Who says SF is not increasing in strength?

SF has increased hundreds of Elo because its development process works.
Of course ultimately there is a ceiling to what can be achieved.
Once again...intentionally misinterpreting my words...but we have come to expect that from you.

No one is saying it is 'not increasing in strength'. But there is a series of patches released each and every week... if every one of them were a positive, it would be increasing each week, and clearly it is not. Sometimes it is one step forward, two steps back... not a straight linear progression.

Maybe some remedial English for you is in order?
Ok, so you mean all is going perfectly fine with SF development. Then we can close the thread.

forum3/viewtopic.php?p=943314#p943314
syzygy wrote:The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
You can NEVER be 100% sure that a patch that seems to gain 1 Elo really does not lose Elo.
You can be 99.99% sure if you want, but it would be a waste of resources.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
That's a circular argument. If you only have a hammer then everything is hammering.
What argument?
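The quoted "game of statistics" point can be made concrete with a toy simulation. Everything below is invented for illustration (the patch-quality distribution, test length and acceptance threshold are made up, and gains are treated as additive, which is itself a simplification); it is not fishtest's actual process:

Code:

import math
import random

# Toy model: each candidate patch has an unknown true Elo value; it is accepted
# when a noisy measurement from a fixed number of games clears a threshold.
def simulate(num_patches=2000, games_per_test=60000, draw_ratio=0.7, threshold_elo=0.5):
    # Standard error (in Elo) of the measured strength difference after one test.
    score_sd_per_game = ((1 - draw_ratio) / 4) ** 0.5
    elo_per_score = 1600 / math.log(10)              # inverse slope of score vs Elo near equality
    measurement_sd = elo_per_score * score_sd_per_game / games_per_test ** 0.5
    total_gain, accepted, regressions = 0.0, 0, 0
    for _ in range(num_patches):
        true_elo = random.gauss(-0.2, 1.5)           # invented distribution of patch quality
        measured = random.gauss(true_elo, measurement_sd)
        if measured > threshold_elo:                 # simplistic fixed-length acceptance rule
            accepted += 1
            total_gain += true_elo
            regressions += true_elo < 0
    return accepted, regressions, total_gain

accepted, regressions, gain = simulate()
print(f"accepted {accepted} patches, {regressions} true regressions among them, "
      f"net gain of about {gain:.0f} Elo (treating gains as additive)")

A fraction of the accepted patches are genuine regressions, yet the accepted set as a whole gains Elo; that is the sense in which 100% certainty per patch is not needed.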
CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: I'm disappointed with Stockfish dev.

Post by CornfedForever »

syzygy wrote: Sat Mar 11, 2023 8:55 pm
What argument?
If I may... just google 'circular argument' or 'circular reasoning'... it's a logical fallacy... and you will see what he means.
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

CornfedForever wrote: Sat Mar 11, 2023 9:01 pm
syzygy wrote: Sat Mar 11, 2023 8:55 pm
What argument?
If I may... just google 'circular argument' or 'circular reasoning'... it's a logical fallacy... and you will see what he means.
Oh boy. You must be a fun person.

What argument did I make? Before an argument can be circular, there has to be an argument.
chrisw
Posts: 4617
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: I'm disappointed with Stockfish dev.

Post by chrisw »

syzygy wrote: Sat Mar 11, 2023 8:55 pm
chrisw wrote: Sat Mar 11, 2023 12:35 am
syzygy wrote: Fri Mar 10, 2023 11:13 pm
CornfedForever wrote: Fri Mar 10, 2023 11:02 pm
syzygy wrote: Fri Mar 10, 2023 10:19 pm
CornfedForever wrote: Wed Mar 08, 2023 10:06 pm
syzygy wrote: Wed Mar 08, 2023 9:34 pm
CornfedForever wrote: Tue Mar 07, 2023 4:02 am
Eduard wrote: Mon Mar 06, 2023 12:25 pm

A total of 73 parameters were changed here. Known parameters that are constantly changing. Let's see how long it takes before one of these parameters is changed again; it won't take too long. :)

And they wonder why I question how they can know which changes actually resulted in a positive change and which in a negative one. :roll:
No, they see Dunning-Kruger at work.
Enough with what is essentially name-calling rather than an argument. I'm talking about the data and not knowing with any real certainty how you get to a +Elo or a -Elo (outside of tolerance), because so much is tested together. I mean... if every patch were a positive, SF would be increasing in strength every week. It is not.
Who says SF is not increasing in strength?

SF has increased hundreds of Elo because its development process works.
Of course ultimately there is a ceiling to what can be achieved.
Once again...intentionally misinterpreting my words...but we have come to expect that from you.

No one is saying it is 'not increasing in strength'. But there is a series of patches released each and every week... if every one of them were a positive, it would be increasing each week, and clearly it is not. Sometimes it is one step forward, two steps back... not a straight linear progression.

Maybe some remedial English for you is in order?
Ok, so you mean all is going perfectly fine with SF development. Then we can close the thread.

forum3/viewtopic.php?p=943314#p943314
syzygy wrote:The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
You can NEVER be 100% sure that a patch that seems to gain 1 Elo really does not lose Elo.
You can be 99.99% sure if you want, but it would be a waste of resources.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
That's a circular argument. If you only have a hammer then everything is hammering.
What argument?
It's been reduced to a statistical game by reducing the game itself to 1, 0 and -1. If, however, you were to regard the chess engine/game/author/development thing as having more features than 1, 0, -1, it would not merely be a "game of statistics" but would have other properties too. It must have those other properties, otherwise you wouldn't be doing it and general interest would sink to zero. One assumes.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: I'm disappointed with Stockfish dev.

Post by connor_mcmonigle »

It's immediately telling that the only individuals criticizing the Stockfish project's testing methodology are those who are not remotely experienced when it comes to actual chess engine development. SPRT is both statistically principled and empirically demonstrated to be effective across a wide variety of disciplines. It is true that results at STC (short time control) aren't guaranteed to translate perfectly to results at LTC (long time control), but Stockfish's incredible progress in terms of Elo at a wide range of time controls over the last several years is a testament to the fact that the Stockfish project's testing methodology is effective.

This is a classic case of Dunning-Kruger at work. To have any ground to stand on in criticizing Stockfish's testing methodology, you have to both propose and demonstrate an effective alternative. Go develop your own engine from scratch, or start from a much weaker engine. Implement your own alternative testing methodology and see what kind of progress you make. You'll quickly find that practical considerations prevent VLTC (very long time control) SPRT testing. You'll quickly find that just using test positions (as was suggested earlier in this thread for some reason) as a proxy for engine strength will get you nowhere. Or you could try the approach that Eduard here has seemingly taken: make a few random changes, watch the engine play a handful of games on some random server and call it good enough. To no one's surprise, this also won't get you anywhere.

Starting with Stockfish as a base for experimenting with alternative testing methodologies is incredibly daft, as Stockfish is so strong that random garbage changes usually won't significantly harm its LTC performance. If you've weakened Stockfish by tens of Elo in STC testing and your changes seem roughly neutral in limited VLTC testing, that doesn't mean your changes are brilliant and will continue to scale better at increasingly longer time controls. Rather, it just means chess is pretty close to a draw for an engine as strong as Stockfish and, with sufficient time, even garbage patches won't significantly harm its performance.
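For readers unfamiliar with SPRT, here is a minimal sketch of the idea over win/draw/loss game results. This is a simplified trinomial model with an Elo-independent draw rate and illustrative bounds chosen here for the example; fishtest's actual implementation uses a pentanomial model over game pairs and its own parameters:

Code:

import math
import random

def expected_score(elo):
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def outcome_probs(elo, draw_prob=0.7):
    # Win/draw/loss probabilities, assuming the draw rate does not depend on Elo.
    s = expected_score(elo)
    return {"win": s - draw_prob / 2, "draw": draw_prob, "loss": 1.0 - s - draw_prob / 2}

def sprt(results, elo0=0.0, elo1=2.0, alpha=0.05, beta=0.05):
    """Sequentially accept H1 (patch gains >= elo1), accept H0, or keep testing."""
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1, commit the patch
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0, reject the patch
    p0, p1 = outcome_probs(elo0), outcome_probs(elo1)
    llr = 0.0
    for played, result in enumerate(results, start=1):
        llr += math.log(p1[result] / p0[result])
        if llr >= upper:
            return "accept", played
        if llr <= lower:
            return "reject", played
    return "inconclusive", len(results)

# Usage: simulate games from a patch truly worth about +3 Elo and run the test.
truth = outcome_probs(3.0)
sim_results = random.choices(list(truth), weights=list(truth.values()), k=200_000)
print(sprt(sim_results))   # typically accepts well before the 200k simulated games run out

The appeal of the sequential test is exactly the resource point quoted earlier in the thread: it stops as soon as the evidence is strong enough in either direction, instead of committing to a huge fixed number of games for every patch.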
chrisw
Posts: 4617
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: I'm disappointed with Stockfish dev.

Post by chrisw »

connor_mcmonigle wrote: Sun Mar 12, 2023 12:24 am It's immediately telling that the only individuals criticizing the Stockfish project's testing methodology are those who are not remotely experienced when it comes to actual chess engine development. SPRT is both statistically principled and empirically demonstrated to be effective across a wide variety of disciplines. It is true that results at STC (short time control) aren't guaranteed to translate perfectly to results at LTC (long time control), but Stockfish's incredible progress in terms of Elo at a wide range of time controls over the last several years is a testament to the fact that the Stockfish project's testing methodology is effective.

This is a classic case of Dunning-Kruger at work. To have any ground to stand on in criticizing Stockfish's testing methodology, you have to both propose and demonstrate an effective alternative. Go develop your own engine from scratch, or start from a much weaker engine. Implement your own alternative testing methodology and see what kind of progress you make. You'll quickly find that practical considerations prevent VLTC (very long time control) SPRT testing. You'll quickly find that just using test positions (as was suggested earlier in this thread for some reason) as a proxy for engine strength will get you nowhere. Or you could try the approach that Eduard here has seemingly taken: make a few random changes, watch the engine play a handful of games on some random server and call it good enough. To no one's surprise, this also won't get you anywhere.

Starting with Stockfish as a base for experimenting with alternative testing methodologies is incredibly daft, as Stockfish is so strong that random garbage changes usually won't significantly harm its LTC performance. If you've weakened Stockfish by tens of Elo in STC testing and your changes seem roughly neutral in limited VLTC testing, that doesn't mean your changes are brilliant and will continue to scale better at increasingly longer time controls. Rather, it just means chess is pretty close to a draw for an engine as strong as Stockfish and, with sufficient time, even garbage patches won't significantly harm its performance.
This is true if there is only one hill to climb.
DrEinstein
Posts: 75
Joined: Wed Sep 15, 2021 8:50 pm
Full name: Albert Einstein

Re: I'm disappointed with Stockfish dev.

Post by DrEinstein »

So we are all patiently waiting for the next big jump to the bigger hill. I believe, or want to believe, that Stockfish is not yet standing on top of the highest mountain.
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

chrisw wrote: Sun Mar 12, 2023 12:03 am
syzygy wrote: Sat Mar 11, 2023 8:55 pm
chrisw wrote: Sat Mar 11, 2023 12:35 am
syzygy wrote: Fri Mar 10, 2023 11:13 pm Ok, so you mean all is going perfectly fine with SF development. Then we can close the thread.

forum3/viewtopic.php?p=943314#p943314
syzygy wrote:The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
You can NEVER be 100% sure that a patch that seems to gain 1 Elo really does not lose Elo.
You can be 99.99% sure if you want, but it would be a waste of resources.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
That's a circular argument. If you only have a hammer then everything is hammering.
What argument?
It's been reduced to a statistical game by reducing the game itself to 1, 0 and -1. If, however, you were to regard the chess engine/game/author/development thing as having more features than 1, 0, -1, it would not merely be a "game of statistics" but would have other properties too. It must have those other properties, otherwise you wouldn't be doing it and general interest would sink to zero. One assumes.
I am not making an argument but stating a fact.

Sure, there is engine functionality not related to chess-playing strength, and I am not talking about that kind of functionality.
CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: I'm disappointed with Stockfish dev.

Post by CornfedForever »

DrEinstein wrote: Sun Mar 12, 2023 11:12 am So we are all patiently waiting for the next big jump to the bigger hill. I believe, or want to believe, that Stockfish is not yet standing on top of the highest mountain.
I wonder how one might define the next "big jump". All the 'big jumps' have likely come and gone, as engine strength is closer to topping out. What is left are likely 'little jumps'. The issue I (and, I think, others, though I do not speak for them) see is that those are harder to find... and, under the traditional testing framework, harder these days to actually know which tweaks are responsible for those really very little jumps, if only because they fall closer to the 'margin of error'. You get a '+' and presume you 'have it', when it is part of multiple 'patches' working together... then later we find something in those tweaks/patches being disregarded or at least changed. And some people... do not seem to want to admit to seeing this two-steps-forward, one-step-back / one-step-forward, two-steps-back thing happening. But it is a viable 'blind approach' that can work over time.

I like to think I know a little about quantum physics. There, reality is just so 'odd' that no one currently fully understands it... you just "follow the math" into the darkness. Chess, though, is a different animal: the number of legal positions is 'only' on the order of 10^40-something, you play it on only 64 squares, and knights do not move like bishops... etc.

Sure, you can see VERY slow, incremental progress with the path being taken (and steps backward...). However, being at a bit of a loss for exactly which tweak 'works' means it resembles 'wishcraft' more than science: throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what stuck or why). It's almost like blindly taking herbs to combat Covid-19 until your testing eventually finds a statistical 'hit' that seems to indicate 'something' in one of those herbs resulted in a tiny number of people not dying who otherwise might have... versus identifying 'what' specific thing in a given herb is actually responsible and using that... or looking at things differently, finding a spike protein and using it to alert the body's immune system to respond to something that looks like it... or viral vector technologies for other diseases, etc. Wishcraft vs. science. Both can work... but with one you tend to know 'why' it is working... which in theory should mean fewer steps back.