Hybrid replacement strategy worse than always-replace

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27867
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

It is the SPRT sequential testing you advocate that ignores such interaction. If A and B individually add Elo, but A and B combined lose Elo because they interact, you would accept the one you SPRT tested first, and then reject the one you tried to add afterwards. While the latter one could actually have gained you more Elo than the former.

Testing the changes simultaneously in the described way would actually reveal such interaction without any additional testing; you would just add the result of the A+B mod and the unchanged version, and compare it to the sum of the A-only and B-only results.

You can of course elaborate on this, and test even more than two changes simultaneously. With 5 changes you would have 32 versions, and you could let those play a round robin, which would have about 1000 games. You can then compare the total result of all versions that had a change with the total for those that did not have it.
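
To make the bookkeeping concrete, here is a minimal sketch of that comparison (the scores are placeholder numbers purely for illustration, and the helper names are my own; the only point is how the per-change totals and the interaction term are formed):

Code:

# Placeholder match scores (e.g. points out of the same number of games)
# for each combination of changes A and B. The numbers are illustrative only.
scores = {
    frozenset():           500.0,  # unchanged version
    frozenset({"A"}):      512.0,  # A only
    frozenset({"B"}):      509.0,  # B only
    frozenset({"A", "B"}): 506.0,  # A+B
}

def effect(change):
    """Total score of versions containing `change` minus the total of those without it."""
    with_it = sum(s for v, s in scores.items() if change in v)
    without_it = sum(s for v, s in scores.items() if change not in v)
    return with_it - without_it

# Interaction: (A+B result + unchanged result) compared to (A-only + B-only).
interaction = (scores[frozenset({"A", "B"})] + scores[frozenset()]) \
            - (scores[frozenset({"A"})] + scores[frozenset({"B"})])

print("effect of A:", effect("A"))
print("effect of B:", effect("B"))
print("interaction:", interaction)

With 5 changes the same `effect` comparison runs over all 32 versions; only the `scores` table grows.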

It is fine that you are set in your ways, and are not interested in learning anything new. But it is strange you want to prevent others from doing so, by making false claims...
JacquesRW
Posts: 61
Joined: Sat Jul 30, 2022 12:12 pm
Full name: Jamie Whiting

Re: Hybrid replacement strategy worse than always-replace

Post by JacquesRW »

hgm wrote: Thu Apr 25, 2024 8:02 pm It is the SPRT sequential testing you advocate that ignores such interaction. If A and B individually add Elo, but A and B combined lose Elo because they interact, you would accept the one you SPRT tested first, and then reject the one you tried to add afterwards. While the latter one could actually have gained you more Elo than the former.
You can just SPRT B without A against your new master (that contains A).

You can't soundly combine your A, B and A+B test results, but suppose we ignore that: could you give some solid examples of **how** you can practically select pairs of patches for which this method is actually beneficial over sequentially SPRTing stuff?
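
(For reference, "SPRTing" a patch here means running a sequential probability ratio test on game results until the evidence crosses an accept or reject bound. Below is a minimal sketch of such a rule, assuming a simplified trinomial model with a fixed draw ratio; real testing frameworks use more refined models, and the names are purely illustrative.)

Code:

import math

def elo_to_score(elo):
    """Expected score under a logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1, draw_ratio=0.4):
    """Log-likelihood ratio of H1 (elo = elo1) against H0 (elo = elo0).

    With an assumed fixed draw probability, draws cancel and only
    wins and losses contribute to the ratio.
    """
    def probs(elo):
        score = elo_to_score(elo)
        p_win = max(score - draw_ratio / 2.0, 1e-9)
        p_loss = max(1.0 - score - draw_ratio / 2.0, 1e-9)
        return p_win, p_loss

    w0, l0 = probs(elo0)
    w1, l1 = probs(elo1)
    return wins * math.log(w1 / w0) + losses * math.log(l1 / l0)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
    """Return 'accept', 'reject', or 'continue' for the patch under test."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= math.log((1.0 - beta) / alpha):
        return "accept"
    if llr <= math.log(beta / (1.0 - alpha)):
        return "reject"
    return "continue"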
hgm wrote: Thu Apr 25, 2024 8:02 pm It is fine that you are set in your ways, and are not interested in learning anything new. But it is strange you want to prevent others from doing so, by making false claims...
The call is coming from inside the house... don't forget that this whole discussion was sparked by
hgm wrote: Wed Apr 24, 2024 8:13 pm Playing games is a very inefficient method to measure this. Much better is to just take a couple of hundred positions from games, and measure average time to depth. (Again, with a very small hash table.)
Time-to-depth is not a usable metric for this, because results may not be consistent between depths.

As a trivial example of this, here are the results of running akimbo's bench (nodes to depth shown, because both versions run at the same speed) with always-replace vs its current ageing scheme at a few different depths:

Code:

| depth | ageing  | always-replace |
| 7     | 313435  | 302895         |
| 8     | 500308  | 484477         |
| 9     | 782542  | 873053         |
| 10    | 1549049 | 1591327        |
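
For what it's worth, the flip is easy to see from the ratios (a throwaway check on the numbers above):

Code:

# Nodes to depth from the bench above: depth -> (ageing, always-replace)
bench = {
    7:  (313435,  302895),
    8:  (500308,  484477),
    9:  (782542,  873053),
    10: (1549049, 1591327),
}
for depth, (ageing, always) in bench.items():
    print(f"depth {depth:2}: ageing / always-replace = {ageing / always:.3f}")
# roughly 1.03 at depths 7-8 (ageing searches more nodes) but 0.90-0.97 at
# depths 9-10 (ageing searches fewer), so the ranking depends on the depth chosen.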
Other serious issues I have with this testing methodology are
a) A couple hundred positions is a laughably small sample size.
b) You can't claim that time to depth is the only relevant metric. I would expect that the quality of cutoffs also matters a lot: a replacement scheme may produce fewer cutoffs and thus a higher time to depth, but the scores, when it does cut off, may be of sufficiently higher quality (in the same sense as a better eval, which may not reduce time to depth, but whose accuracy is what gains Elo) that it gains Elo anyway.
Pali
Posts: 29
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Thu Apr 25, 2024 8:02 pm It is the SPRT sequential testing you advocate that ignores such interaction. If A and B individually add Elo, but A and B combined lose Elo because they interact, you would accept the one you SPRT tested first, and then reject the one you tried to add afterwards. While the latter one could actually have gained you more Elo than the former.

Testing the changes simultaneously in the described way would actually reveal such interaction without any additional testing; you would just add the result of the A+B mod and the unchanged version, and compare it to the sum of the A-only and B-only results.

You can of course elaborate on this, and test even more than two changes simultaneously. With 5 changes you would have 32 versions, and you could let those play a round robin, which would have about 1000 games. You can then compare the total result of all versions that had a change with the total for those that did not have it.

It is fine that you are set in your ways, and are not interested in learning anything new. But it is strange you want to prevent others from doing so, by making false claims...
"But it is strange you want to prevent others from doing so, by making false claims..."
A false claim is that you would be able to measure transposition table replacement scheme efficiency based on time to depth.

Here is a comparison for Black Marlin (using nodes as NPS is the same for each):
Current replacement scheme takes 31981317 nodes to reach depth 30
Always replace takes 33255810 nodes to reach depth 30
Preferring lower depth entries takes 18510403 nodes to reach depth 30
As an extra, removing TT entirely takes 7805004 nodes to reach depth 30

Clearly, preferring lower-depth entries has a better time-to-depth than both, so by your logic I should merge it.
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Hybrid replacement strategy worse than always-replace

Post by pgg106 »

hgm wrote: Thu Apr 25, 2024 8:02 pm It is the SPRT sequential testing you advocate that ignores such interaction. If A and B individually add Elo, but A and B combined lose Elo because they interact, you would accept the one you SPRT tested first, and then reject the one you tried to add afterwards. While the latter one could actually have gained you more Elo than the former.

Testing the changes simultaneously in the described way would actually reveal such interaction without any additional testing; you would just add the result of the A+B mod and the unchanged version, and compare it to the sum of the A-only and B-only results.

You can of course elaborate on this, and test even more than two changes simultaneously. With 5 changes you would have 32 versions, and you could let those play a round robin, which would have about 1000 games. You can then compare the total result of all versions that had a change with the total for those that did not have it.

It is fine that you are set in your ways, and are not interested in learning anything new. But it is strange you want to prevent others from doing so, by making false claims...
I suggest you just stop spreading misinformation trying to meaninglessly one-up people helping newbies.
Spreading correct, efficient, widely accepted testing methodologies shouldn't be met with the resistance it usually meets on talkchess; mods in particular should stop claiming things they don't know anything about in such a matter-of-fact way.
To anyone reading this post in the future, please, please, please, don't ask for help on talkchess, it's a dead site where you'll only get led astray, the few people talking sense here come from the Stockfish discord server, just join it and actual devs will help you.
Viz
Posts: 78
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: Hybrid replacement strategy worse than always-replace

Post by Viz »

In Stockfish, "always replace" drops bench by something like 15-20%, so it is clearly superior to whatever scheme we currently have by the time-to-depth metric.
The only downside of this metric is that it's completely useless: "always replace" of course drops a measurable chunk of playing strength.
User avatar
hgm
Posts: 27867
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

JacquesRW wrote: Thu Apr 25, 2024 9:02 pm You can just SPRT B without A against your new master (that contains A).
Sure, you can. This was not what was presented as 'the ultimate method', though. It becomes even more interesting when you have to test 5 patches, and are allergic to the idea that any pair of those might interact.
You can't soundly combine your A, B and A+B test results, but suppose we ignore that: could you give some solid examples of **how** you can practically select pairs of patches for which this method is actually beneficial over sequentially SPRTing stuff?
Not sure why you say that. For patches that do not interact it seems straightforward enough. Details of course depend on how you would test individual patches. E.g. if this is by playing an N-game gauntlet against a set of opponents, and comparing the total score achieved by the versions that implement A with the total of those that didn't, you are effectively comparing two 2N-game gauntlet results. That half of the games also implemented B, and the other half didn't, does not matter if there is no interaction.

And randomly picking two patches from those you want to test would be a pretty easy method of selection.

Time-to-depth is not a usable metric for this, because results may not be consistent between depths.

As a trivial example of this, here are the results of running akimbo's bench (nodes to depth shown, because both versions run at the same speed) with always-replace vs its current ageing scheme at a few different depths:

Code:

| depth | ageing  | always-replace |
| 7     | 313435  | 302895         |
| 8     | 500308  | 484477         |
| 9     | 782542  | 873053         |
| 10    | 1549049 | 1591327        |
How many positions was this summed/averaged over? Time-to-depth of individual positions is very temperamental, because times are highly affected by whether the PV changed or not, and small differences in the search tree can swing that. So yes, there is noise, which has to be averaged out.

I don't know how many entries you had in your TT, but obviously lower depths with smaller trees will be less affected by replacement schemes. My guess here is that at d=7 or 8 replacement does not yet play a significant role, and that the observed difference is pure noise, which would put the noise level at ~3%.
Other serious issues I have with this testing methodology are
a) A couple hundred positions is a laughably small sample size.
Did you try that, or guess that? When you do it, it is easy enough to determine the standard deviation of the ratio between the different replacement methods, which then decreases like 1/sqrt(N) on averaging. Obviously the number of positions needed depends on how accurately you have to measure to distinguish the replacement schemes.
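
As a sketch of what that measurement looks like (assuming you have per-position node counts or times to a fixed depth for both schemes; the function names are just for illustration):

Code:

import math

def ratio_stats(times_a, times_b):
    """Mean and spread of the per-position time-to-depth ratio of scheme A vs scheme B."""
    ratios = [a / b for a, b in zip(times_a, times_b)]
    n = len(ratios)
    mean = sum(ratios) / n
    var = sum((r - mean) ** 2 for r in ratios) / (n - 1)
    sd = math.sqrt(var)
    sem = sd / math.sqrt(n)  # standard error of the mean: shrinks like 1/sqrt(N)
    return mean, sd, sem

def positions_needed(sd, target_sem):
    """Roughly how many positions are needed to pin the mean ratio down to target_sem."""
    return math.ceil((sd / target_sem) ** 2)

For example, a per-position standard deviation of 10% and a wanted resolution of 0.5% on the mean ratio would call for about 400 positions.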
b) You can't claim that time to depth is the only relevant metric. I would expect that the quality of cutoffs also matters a lot: a replacement scheme may produce fewer cutoffs and thus a higher time to depth, but the scores, when it does cut off, may be of sufficiently higher quality (in the same sense as a better eval, which may not reduce time to depth, but whose accuracy is what gains Elo) that it gains Elo anyway.
I expect it to be the most important metric, because apart from simple endgames hash grafting is pretty rare. If you are really concerned about this, it would be better to test the improvement of move quality separately. Playing games between a version of the engine that only accepts exact-depth hits and one that also accepts over-deep hits is a suitable method for this. You are only interested in determining an upper limit, to ascertain that the differences in search speed you measured are not significantly affected by it.
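
A sketch of the two probe conditions being compared (the entry fields are illustrative; the only point is the depth test):

Code:

def tt_hit_usable(entry, draft, accept_deeper):
    """Whether a stored TT entry may be used for a cutoff at the current draft.

    accept_deeper=False: only entries of exactly the requested depth,
                         which rules out grafting from deeper searches.
    accept_deeper=True : the usual rule, also accepting over-deep entries.
    """
    if entry is None:
        return False
    if accept_deeper:
        return entry.depth >= draft
    return entry.depth == draft

Playing the two variants against each other then bounds how much the over-deep (grafted) scores contribute, independently of time-to-depth.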
User avatar
hgm
Posts: 27867
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Thu Apr 25, 2024 9:34 pm Here is a comparison for Black Marlin (using nodes, as NPS is the same for each):
Current replacement scheme takes 31981317 nodes to reach depth 30
Always replace takes 33255810 nodes to reach depth 30
Preferring lower depth entries takes 18510403 nodes to reach depth 30
As an extra, removing TT entirely takes 7805004 nodes to reach depth 30
I would say you have an engine that lies very much about its depth, if it gets to that depth with fewer nodes without a TT. It cannot have searched all branches to the same depth with fewer nodes and no hash cutoffs to save nodes.
User avatar
hgm
Posts: 27867
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

pgg106 wrote: Thu Apr 25, 2024 9:42 pm I suggest you just stop spreading misinformation trying to meaninglessly one-up people helping newbies.
Well, that you don't understand it doesn't make it misinformation...

In mathematics truth is not established by who makes the loudest objections. :lol:
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Hybrid replacement strategy worse than always-replace

Post by pgg106 »

hgm wrote: Thu Apr 25, 2024 10:43 pm
JacquesRW wrote: Thu Apr 25, 2024 9:02 pm You can just SPRT B without A against your new master (that contains A).
Sure, you can. This was not what was presented as 'the ultimate method', though. It becomes even more interesting when you have to test 5 patches, and are allergic to the idea that any pair of those might interact.
You are wrong, it's exactly what I suggested: test A, figure out if you have to merge it, then test B (on top of your new master, which may now contain A). It's the only thing that makes sense, especially for devs who can't sustain the hardware to test multiple patches at the same time with decent throughput (you'd know this if you had actually tested an engine in the last two decades).
The rest of the message, and the other messages, are again fantastical methodologies no one uses because they don't work. There is no metric more important than Elo (for a change that is supposed to improve Elo). "Lying about depth" is a stupid concept; depth barely has a meaning in a modern engine with all the reductions and extensions that take place, it's just how it works.
Suggesting unproven, unsound testing methodologies when a state of the art already exists is misinformation.
To anyone reading this post in the future, don't ask for help on talkchess, it's a dead site where you'll only get led astray, the few people talking sense here come from the Stockfish discord server, just join it and actual devs will help you.
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Hybrid replacement strategy worse than always-replace

Post by pgg106 »

Time and time again, devs that have opted for alternative testing solutions (mostly because they didn't know SPRT existed) have ended up with feature-bloated engines that could barely play decent chess, hundreds if not thousands of Elo weaker than they should have been.
If you could prove your testing methodology was somehow as sound as SPRT and more efficient (and there are very, very serious doubts about this), it would still be far too error-prone (compared to just setting up OB and running SPRTs) to casually suggest using it.
This isn't even about mathematics; this is about the advice being given here on the regular being utter junk, something that should stop.

Edit: it's insane that a conclusion as obvious as "use the testing methodology every engine in the CCRL top 100 uses" needs 3 pages of arguments. This is why talkchess is a cesspit and the only useful interactions happen because actual devs take pity on people being misled.
To anyone reading this post in the future, don't ask for help on talkchess, it's a dead site where you'll only get led astray, the few people talking sense here come from the Stockfish discord server, just join it and actual devs will help you.