search extensions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: search extensions

Post by Joerg Oster »

mcostalba wrote:
mcostalba wrote:
bob wrote: Don't make the SAME mistake he did.
Sorry but there is no mistake. I have started with threshold = alpha (test still running and not going badly, btw) because it is the baseline.

I have already queued up a test with threshold = alpha -100

http://tests.stockfishchess.org/tests/v ... 55a3c87aab
A bit of explanation.

In SF we do a null move search only for non-PV nodes, and in case of a fail-high we perform a verification search.

I have used the same already existing code to:

- Extend null move search also to PV nodes
- In this case we use alpha - <some value> instead of beta as threshold
- In this case we check for a fail-low instead of a fail-high
- In this case, if the verification search fails high, flag the node to be extended instead of returning

So by reusing the existing null search + verification code, the patch turns out to be very simple.
If I understand correctly, Donninger only wanted to extend nodes near the horizon.
From chessprogramming.wiki:
Donninger's idea is to extend the search one ply if a null move near the horizon (e.g. at depths <= 3) does not fail high and the null move score plus a constant margin (e.g. minor piece value) is <= alpha while the static evaluation at the node is >= beta (i.e. fails high). In order to get meaningful results for the null move score, you need to do it with a full alpha-beta window instead of a zero window (this is a known error in Donninger's original article).
Maybe this is a further possible refinement.
Null-move reduction needs to be adjusted for those cases, of course.
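For concreteness, a rough sketch of that near-horizon test might look like the code below. This is only an illustration of the quoted description, not code from any engine; the constants and helper names (DEPTH_LIMIT, MARGIN, R, static_eval, make_null_move) are assumptions.

Code: Select all

// Sketch only: the quoted Donninger condition, with illustrative constants.
struct Position;
int search(Position &pos, int alpha, int beta, int depth, int ply);
int static_eval(const Position &pos);
void make_null_move(Position &pos);
void undo_null_move(Position &pos);

const int DEPTH_LIMIT = 3;    // "near the horizon"
const int MARGIN      = 300;  // roughly a minor piece
const int R           = 2;    // null-move reduction, would need adjusting as noted above

int threat_extension(Position &pos, int alpha, int beta, int depth, int ply) {
  if (depth > DEPTH_LIMIT || static_eval(pos) < beta)
    return 0;                                   // only horizon nodes whose static eval fails high
  make_null_move(pos);
  // full alpha-beta window, per the quoted correction to Donninger's article
  int nullScore = -search(pos, -beta, -alpha, depth - 1 - R, ply + 1);
  undo_null_move(pos);
  if (nullScore < beta && nullScore + MARGIN <= alpha)
    return 1;                                   // doing nothing loses badly: extend one ply
  return 0;
}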
Jörg Oster
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: search extensions

Post by Sven »

lucasart wrote:So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
It is obvious that the rating difference between two engines (or versions of the same engine) E1 and E2 depends on the set of opponents that you use to calculate it. If you obtain the rating difference from games between E1 and E2 only ("self-play") you get a difference D1. If you let E1 play against a gauntlet, then E2 against the same gauntlet, you get a difference D2 which may or may not be equal to D1. The reason for that is simply the non-transitivity of playing strength in chess.

For some engines E1/E2 practical tests might show that D2 is often very close or equal to D1, but for others (e.g. Crafty) this may be different. For SF as well as for other engines I'd say that you can't tell it without actually trying. Also the personal goals may differ: while someone wants to get testing results that resemble common rating lists as close as possible, someone else may be satisfied to find out whether a new version of his engine beats the previous version.
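As a concrete illustration of D1 vs D2 (my own toy example, not part of any testing framework), here is a tiny standalone program that converts score percentages into Elo differences; D1 comes from the E1-E2 head-to-head score, D2 from the two gauntlet scores.

Code: Select all

// Illustration only: Elo differences from score fractions (logistic model).
#include <cmath>
#include <cstdio>

// Elo difference implied by a score fraction s in (0,1)
double elo(double s) { return -400.0 * std::log10(1.0 / s - 1.0); }

int main() {
  // hypothetical numbers, just to show the computation
  double headToHead = 0.53;   // E1 scores 53% against E2 directly
  double e1Gauntlet = 0.55;   // E1 scores 55% against a fixed gauntlet
  double e2Gauntlet = 0.54;   // E2 scores 54% against the same gauntlet

  double D1 = elo(headToHead);                   // self-play difference
  double D2 = elo(e1Gauntlet) - elo(e2Gauntlet); // gauntlet difference
  std::printf("D1 = %.1f Elo, D2 = %.1f Elo\n", D1, D2);
  return 0;
}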
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: search extensions

Post by lucasart »

Sven Schüle wrote:
lucasart wrote:So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
It is obvious that the rating difference between two engines (or versions of the same engine) E1 and E2 depends on the set of opponents that you use to calculate it. If you obtain the rating difference from games between E1 and E2 only ("self-play") you get a difference D1. If you let E1 play against a gauntlet, then E2 against the same gauntlet, you get a difference D2 which may or may not be equal to D1. The reason for that is simply the non-transitivity of playing strength in chess.

For some engines E1/E2 practical tests might show that D2 is often very close or equal to D1, but for others (e.g. Crafty) this may be different. For SF as well as for other engines I'd say that you can't tell it without actually trying. Also the personal goals may differ: while someone wants to get testing results that resemble common rating lists as close as possible, someone else may be satisfied to find out whether a new version of his engine beats the previous version.
I don't argue the fact that D1 != D2. That's obvious, and typically self-play increases rating difference. That is an observation backed by evidence!

But as for D1 and D2 having opposite signs, I have never seen any valid evidence of that in my years of testing. Apart from hand waving, never seen any proof.

PS: I'm not talking about a theoretical possibility here. I'm talking about a real life scenario backed by evidence (accounting for compounded error bars etc.)
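For what it's worth, one way to make "accounting for compounded error bars" concrete (my own sketch, with made-up numbers): treat D1 and D2 as independent estimates and compare their difference against the combined error bar.

Code: Select all

// Illustration only: is D1 - D2 significant given both error bars?
#include <cmath>
#include <cstdio>

int main() {
  // hypothetical measurements with 95% error bars (in Elo)
  double D1 = 12.0, err1 = 5.0;   // self-play result
  double D2 = -3.0, err2 = 6.0;   // gauntlet result

  // independent tests: 95% bars combine in quadrature
  double combined = std::sqrt(err1 * err1 + err2 * err2);
  std::printf("D1 - D2 = %.1f +/- %.1f (95%%)\n", D1 - D2, combined);
  // only if |D1 - D2| clearly exceeds the combined bar (and the signs differ)
  // would this count as evidence of the opposite-sign case discussed here
  return 0;
}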
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: search extensions

Post by Sven »

lucasart wrote:
Sven Schüle wrote:
lucasart wrote:So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
It is obvious that the rating difference between two engines (or versions of the same engine) E1 and E2 depends on the set of opponents that you use to calculate it. If you obtain the rating difference from games between E1 and E2 only ("self-play") you get a difference D1. If you let E1 play against a gauntlet, then E2 against the same gauntlet, you get a difference D2 which may or may not be equal to D1. The reason for that is simply the non-transitivity of playing strength in chess.

For some engines E1/E2 practical tests might show that D2 is often very close or equal to D1, but for others (e.g. Crafty) this may be different. For SF as well as for other engines I'd say that you can't tell it without actually trying. Also the personal goals may differ: while someone wants to get testing results that resemble common rating lists as close as possible, someone else may be satisfied to find out whether a new version of his engine beats the previous version.
I don't argue the fact that D1 != D2. That's obvious, and typically self-play increases rating difference. That is an observation backed by evidence!

But as for D1 and D2 having opposite signs, I have never seen any valid evidence of that in my years of testing. Apart from hand waving, never seen any proof.

PS: I'm not talking about a theoretical possibility here. I'm talking about a real life scenario backed by evidence (accounting for compounded error bars etc.)
Since you seem to agree that the case (D1 * D2 < 0), i.e. opposite signs, is theoretically possible, I would indeed expect evidence from the SF team that for SF (D1 * D2 >= 0) is always true and the opposite does not occur. Of course you may claim that the SF strategy has proven to be very successful, but as long as only D1 is known and never a "D2" you can't say how many accepted patches would have been rejected due to negative D2 and how many rejected patches would have been accepted vice versa. I see no reason to believe that the case of opposite signs would practically never happen, and Crafty is one example where it has actually occurred as reported by Bob. Maybe other people can provide data to support this.

Practically, for (D1 * D2 < 0) to occur you need a change that performs worse against more than one half of the gauntlet but better against the remaining part of it, or exactly vice versa, and in each case the previous version of your engine belongs to the remaining part.

I have never looked into the technical details of the SF testing framework but I think it should be fairly easy to provide a modified version of it that is based on gauntlets with a fixed set of opponents (possibly adding the previous SF version as another reference engine). Bob can do it with his cluster, and I'm pretty sure the SF group can do that as well!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

pkumar wrote:
When you enter search with something like alpha=100, beta=101, the null-move search is normally done with the window beta, beta+1, or 101,102 here
Are you referring to some other code? The window seems to be 100,101 as per nullmove code in crafty24.1 below:

Code: Select all

      if (depth - tdepth - 1 > 0)
        value =
            -Search(tree, -beta, -beta + 1, Flip(wtm), depth - tdepth - 1,
            ply + 1, 0, NO_NULL);
      else
        value = -Quiesce(tree, -beta, -beta + 1, Flip(wtm), ply + 1, 1);
That code does not change. Hsu/Donninger did ANOTHER null-move search, but only after verifying that this is not an ALL node, which you learn when any move returns a score > alpha. You don't want to use the same search, because the more you lower the window, the greater the probability that the null-move search will fail high, which violates the principle of "the null move observation". The above code has not changed in years, other than in how "tdepth" is computed. The null-move window has always been beta, beta+1.
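A rough sketch of that second, separate null search, as I read the description above (the names, signature and margin are made up here; this is not Crafty or Deep Thought code):

Code: Select all

// Sketch only: a separate threat-test null search, done once it is known the
// node is not an ALL node (some real move has already returned a score > alpha).
struct TREE;
int Search(TREE *tree, int alpha, int beta, int wtm, int depth, int ply, int do_null);
void MakeNullMove(TREE *tree, int ply);
void UndoNullMove(TREE *tree, int ply);
int Flip(int wtm);

const int THREAT_MARGIN = 150;   // roughly the value Hsu recommended
const int R = 3;

int ThreatTest(TREE *tree, int alpha, int wtm, int depth, int ply) {
  int t = alpha - THREAT_MARGIN;        // lowered threshold, NOT beta
  MakeNullMove(tree, ply);
  // zero-window test against t: does doing nothing drop us well below alpha?
  int value = -Search(tree, -t, -t + 1, Flip(wtm), depth - R - 1, ply + 1, 0);
  UndoNullMove(tree, ply);
  return value < t;                     // fail low here => big threat => extend
}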
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

lucasart wrote:
bob wrote: (1) simple singular extensions as found in robbolito/stockfish/who knows what other program. The idea is that (a) a move from the hash table is a candidate singular move. Ignoring a few details, you search every other move (except for the hash move) using an offset window, and if all other moves fail low against that window, this move gets extended. Never thought much of it, and at one point I removed it from Stockfish and my cluster testing (gauntlet NOT self-testing) suggested it was a zero gain/loss idea. I've tested this extensively on Crafty and reached the same conclusion, it doesn't help me at all. When I do self-testing, there were tunings that seemed to gain 5-10 Elo, but when testing against a gauntlet, they were either Elo losing or break-even. I gave up on this.
I also never managed to gain anything by SE in my engine. But in SF the gain is prodigious. More than 20 elo IIRC.

So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
I tested the SE in stockfish a couple of years back. I found no significant gain, and reported same here. I did this experiment because I had tried testing the same idea in Crafty and could not do better than break-even. I did notice this last time that if I played CraftySE vs Crafty, CraftySE looked a bit better in some of the tests, with gains in the 10 to 20 Elo range at most. But when I played CraftySE against the gauntlet testing I use, it was ALWAYS a little (or a lot, depending on tuning) worse than the non-SE version. I don't believe there is any "prodigious gain" here. Even in the days of 10 plies, where it seemed to work better, Deep Thought measured +9 Elo for using it.

I have seen several cases (beyond SE) where self-testing looks better but gauntlet-testing shows the opposite. This was particularly true for the SE tests AND the threat extension testing. And I have not seen any reference yet to the "sticky" idea, so that the testing doesn't have to be repeated each time the same position is searched. (There are some interesting implementation details as well, dealing with being careful with the ttable, since the same positions are getting searched with significantly different depths.)
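For readers who haven't seen it, a minimal sketch of the robbolito/Stockfish-style test described in the quote above (hypothetical names, margin and depth; not the actual Stockfish code):

Code: Select all

// Sketch only: "is the hash move singular?" in the robbolito/Stockfish style.
struct Position;
struct Move;
// assumed helper: searches all moves EXCEPT `excluded` at the given depth
int search_excluding(Position &pos, int alpha, int beta, int depth, const Move &excluded);

bool hash_move_is_singular(Position &pos, const Move &ttMove, int ttValue, int depth) {
  int margin = 2 * depth;               // offset below the hash score (illustrative)
  int rBeta  = ttValue - margin;
  // reduced-depth, zero-window search of every other move against the offset value
  int v = search_excluding(pos, rBeta - 1, rBeta, depth / 2, ttMove);
  return v < rBeta;                     // every alternative fails low: extend ttMove
}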
Uri Blass
Posts: 10281
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: search extensions

Post by Uri Blass »

bob wrote:
lucasart wrote:
bob wrote: (1) simple singular extensions as found in robbolito/stockfish/who knows what other program. The idea is that (a) a move from the hash table is a candidate singular move. Ignoring a few details, you search every other move (except for the hash move) using an offset window, and if all other moves fail low against that window, this move gets extended. Never thought much of it, and at one point I removed it from Stockfish and my cluster testing (gauntlet NOT self-testing) suggested it was a zero gain/loss idea. I've tested this extensively on Crafty and reached the same conclusion, it doesn't help me at all. When I do self-testing, there were tunings that seemed to gain 5-10 Elo, but when testing against a gauntlet, they were either Elo losing or break-even. I gave up on this.
I also never managed to gain anything by SE in my engine. But in SF the gain is prodigious. More than 20 elo IIRC.

So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
I tested the SE in stockfish a couple of years back. I found no significant gain, and reported same here.
It may be interesting if you test again, because I think the value of singular extension for Stockfish became bigger later (at least based on Stockfish-Stockfish games).

Here is the latest test from the Stockfish framework, from almost one year ago.
Note that the Stockfish team stopped the test in the middle, so we do not have an unbiased estimate, but it is probably safe to say that singular extensions give Stockfish at least 10 Elo at the 15+0.05 time control against the previous version.

http://tests.stockfishchess.org/tests/v ... 49c4e73429

ELO: -24.37 +-7.2 (95%) LOS: 0.0%
Total: 3342 W: 509 L: 743 D: 2090
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Sven Schüle wrote:
lucasart wrote:
Sven Schüle wrote:
lucasart wrote:So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
It is obvious that the rating difference between two engines (or versions of the same engine) E1 and E2 depends on the set of opponents that you use to calculate it. If you obtain the rating difference from games between E1 and E2 only ("self-play") you get a difference D1. If you let E1 play against a gauntlet, then E2 against the same gauntlet, you get a difference D2 which may or may not be equal to D1. The reason for that is simply the non-transitivity of playing strength in chess.

For some engines E1/E2 practical tests might show that D2 is often very close or equal to D1, but for others (e.g. Crafty) this may be different. For SF as well as for other engines I'd say that you can't tell it without actually trying. Also the personal goals may differ: while someone wants to get testing results that resemble common rating lists as close as possible, someone else may be satisfied to find out whether a new version of his engine beats the previous version.
I don't argue the fact that D1 != D2. That's obvious, and typically self-play increases rating difference. That is an observation backed by evidence!

But as for D1 and D2 having opposite signs, I have never seen any valid evidence of that in my years of testing. Apart from hand waving, never seen any proof.

PS: I'm not talking about a theoretical possibility here. I'm talking about a real life scenario backed by evidence (accounting for compounded error bars etc.)
Since you seem to agree that the case (D1 * D2 < 0), i.e. opposite signs, is theoretically possible, I would indeed expect evidence from the SF team that for SF (D1 * D2 >= 0) is always true and the opposite does not occur. Of course you may claim that the SF strategy has proven to be very successful, but as long as only D1 is known and never a "D2" you can't say how many accepted patches would have been rejected due to negative D2 and how many rejected patches would have been accepted vice versa. I see no reason to believe that the case of opposite signs would practically never happen, and Crafty is one example where it has actually occurred as reported by Bob. Maybe other people can provide data to support this.

Practically, for (D1 * D2 < 0) to occur you need a change that performs worse against more than one half of the gauntlet but better against the remaining part of it, or exactly vice versa, and in each case the previous version of your engine belongs to the remaining part.

I have never looked into the technical details of the SF testing framework but I think it should be fairly easy to provide a modified version of it that is based on gauntlets with a fixed set of opponents (possibly adding the previous SF version as another reference engine). Bob can do it with his cluster, and I'm pretty sure the SF group can do that as well!
The first time I saw this was fiddling around with king safety. I had removed the old offensive king safety code (this is a LONG time back, and this was the code that tried to initiate attacks, such as pawn-storms and such) which didn't work very well (it was somewhat passive or insensitive to what I considered to be significant king safety changes in the position/game.)

A rewrite and just basic tuning produced something that I thought was a bit better. To be sure I had not broken anything, I ran a crafty vs crafty' (crafty' had new king safety code) match and tuned a bit. And crafty' was winning by a pretty significant margin. When I dropped it into the gauntlet test, the roof fell in. As I looked at games to see what was going on, the new code was simply way too aggressive when the opponent had an idea of how to actually attack/defend on the kingside. It would initiate attacks that were speculative, and the opponent would defend reasonably and Crafty' would simply end up in positions that were wrecked from a pawn structure perspective (or worse, of course).

Quite a few times when I added something completely new to Crafty, it would beat its older cousin, but then do worse against the gauntlet. That's why I have always maintained that self-play is not a bad testing/debugging approach, but that by itself it can produce wrong answers. Most of my self-testing is for stress-testing things, as I get 2x the info when I am interested in "does this code break?" (notably parallel search code).

I have seen cases where self-test looks good, but gauntlet looks bad.

I have seen cases where self-test looks good, but gauntlet looks even (one has to think a bit about what to do here; in my case, if the code is cleaner or simpler, I'll keep it, otherwise not).

I have NOT seen a case (as of yet) where self-test looked bad but gauntlet looked good, although logic says that if the one happens, the other must be possible too. However, since self-play is not my normal testing approach, this might simply be a lack of enough testing of both types.

There are other testing issues I have also previously reported. The most important was the time-control issue. To date, I do not recall ANY evaluation changes, other than king safety, that were sensitive to time control. I had a few (back in that same king safety testing I mentioned above) where faster games would suggest a change was good, but slower games would say "worse". That was a product of an eval saying one thing but a deeper search showing it was wrong tactically. I have seen a LOT of cases where a search change was very sensitive to time control. My normal quick-testing has been to play 30K games at a time control of 10s + 0.1s increment, which completes in under an hour. My next stop has always been 1m + 1s, which takes more like 12 hours or so. Normal eval changes have shown almost perfect correlation between the two time controls, but search changes not so much.

One good example was the SE/threat tests. At fast time controls they don't have much time to kick in, so if they are bad, they don't look as bad as they really are if you use very fast games. When I was testing over the past 6 months, it has been 1m + 1s exclusively to give the search enough time to reach a depth where the extensions are being hit enough to really influence the game everywhere.

For SE particularly (Hsu SE, not Robbolito SE) it seems to pass the "eye test" pretty well. It will spot some WAC tricks quite a bit sooner, and when you look at PVs, you see extensions in the right place (I had modified Crafty so that any non-check extension added a "!" to the end of the move, as I had done years ago for EGTB probes, where "!" was used to say "this is the only move that leads to an optimally short mate"). So looking at the output you think "this is not bad", which is EXACTLY what I had done in Cray Blitz. And it was almost exactly what Hsu had done in 1988: testing on tactical positions to see if it "looked better" and then playing a total of 20 self-test games, which is almost a random result. A good one was Fine #70, for example: white (and black on occasion) has exactly one move that preserves the win (Kb1! as the first move), and then white has just one correct move at each position beyond ply 1, depending on what black does (the coordinating-squares idea, sort of like very distant opposition). But each time, doing better tactically did not translate into doing better in real games.

As for the idea Marco tested, that one I am certain is bad, i.e. if you fail low on a normal null-move, and you fail high on a real move, extend. The reason I am sure here is that Hsu recommended a value of 150, which Thomas refined a bit (but still over a pawn). I tested a bunch of different values here, and when I dropped below an offset of 100, the scores started to plummet. A number like alpha-31 or alpha-30 produced a -50 Elo change. Hsu originally used -150, with a pawn value of 128, for reference.

I also tried a bunch of different null-move R values. My current R = 3 + depth/6 was better, but not as good as no SE. I thought about this test for a while and concluded that a bigger R means less overhead but less accuracy, and vice versa. So I tried more aggressive R such as R = depth/2, and a few others that were pretty far out there. The problem is that the trade-off is between errors and overhead, which are inversely proportional. LMR is a similar animal, and it took a while to get to where I am currently. Apparently LMR (at least for Crafty) makes SE/TE ineffective and redundant. As LMR gets more aggressive, null-move doesn't help as much as it used to.

A few years ago someone asked "what is null-move worth?" and since I had not tested this in a while, I gave it a run. Removing null-move cost me 40 Elo. Removing LMR (which at the time was a static reduction, as pretty much everyone used initially) also cost me about 40 Elo. Removing BOTH cost me about 120 Elo. But what was funny was the overlap. No LMR/null-move was -120; no LMR cost me -40, which means null-move is adding 80; no null-move cost me -40, which means LMR was adding 80; but once I had one, no matter which one, the other only added 50% of what it would add by itself.

Hsu and Thomas eventually reported a +9 Elo gain for SE after reasonable levels of testing, without null-move or LMR. After dumping several trees, I concluded that we already have a sort of SE. For example, the hash move is searched first, and I don't think anybody reduces the first move at any ply, hence it is extended automatically, relative to its siblings, by not being reduced. History counters push moves with fail-high tendencies up in the move list, reducing them less (or pseudo-extending them more). I've become convinced that SE simply doesn't fit very well with LMR, since they are already doing almost exactly the same thing.
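To make the "pseudo-extension" point concrete, this is the kind of reduction scheme I mean (the formula and constants are illustrative only, not Crafty's or anyone else's actual numbers):

Code: Select all

// Sketch: late moves are reduced more, moves with good (fail-high) history less.
#include <algorithm>
#include <cmath>

int lmr_reduction(int depth, int moveNumber, int historyScore) {
  if (moveNumber <= 1 || depth < 3)
    return 0;                         // the first (hash) move is never reduced
  double r = std::log(depth) * std::log(moveNumber) / 2.0;
  r -= historyScore / 8000.0;         // fail-high history buys some depth back
  return std::max(0, (int)r);
}

A move that keeps failing high ends up searched a ply or two deeper than its late siblings, which is essentially the same effect SE is trying to buy.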

Results from others would be interesting. My results have covered the past 6+ months, and all I have to show for it is a much cleaner search function with the original check extension and nothing else... The parallel search is a bit improved, however, since the rewriting led me to fix some split issues there that were a bit of a bottleneck.

I have saved both the SE and TE versions, but I am not sure I will revisit them again. One note: I did NOT modify anything else when adding either SE or TE, i.e. I did not try a less aggressive LMR or null-move search. It is certainly possible that changes there might make a difference. One idea I thought about but did not test was to try to flag a position as "dangerous" and just dial back LMR for all or some of the moves, rather than trying to extend one. Right now I believe there is more to gain by working on ways to better order the moves. In years gone by, all we needed was a "good enough" move ordered first to optimize tree size. Now we need more, since we treat moves differently depending on where they occur in the move list. Better ordering now has the chance of actually making a program play better, as opposed to just offering a small speed gain by producing smaller trees...
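Since I mentioned ordering, here is the kind of scoring I have in mind, just as an illustration (the values and fields are made up, not from any particular engine): hash move first, then captures by MVV/LVA, then killers, then history for the quiet moves.

Code: Select all

// Sketch only: a typical move-ordering score.
struct Move { int from, to, captured, piece; };

int order_score(const Move &m, const Move &hashMove, const Move &killer,
                const int history[64][64]) {
  if (m.from == hashMove.from && m.to == hashMove.to)
    return 1000000;                                  // hash move first
  if (m.captured)
    return 100000 + 10 * m.captured - m.piece;       // MVV/LVA
  if (m.from == killer.from && m.to == killer.to)
    return 90000;                                    // killer
  return history[m.from][m.to];                      // quiet moves: fail-high history
}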
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Uri Blass wrote:
bob wrote:
lucasart wrote:
bob wrote: (1) simple singular extensions as found in robbolito/stockfish/who knows what other program. The idea is that (a) a move from the hash table is a candidate singular move. Ignoring a few details, you search every other move (except for the hash move) using an offset window, and if all other moves fail low against that window, this move gets extended. Never thought much of it, and at one point I removed it from Stockfish and my cluster testing (gauntlet NOT self-testing) suggested it was a zero gain/loss idea. I've tested this extensively on Crafty and reached the same conclusion, it doesn't help me at all. When I do self-testing, there were tunings that seemed to gain 5-10 Elo, but when testing against a gauntlet, they were either Elo losing or break-even. I gave up on this.
I also never managed to gain anything by SE in my engine. But in SF the gain is prodigious. More than 20 elo IIRC.

So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
I tested the SE in stockfish a couple of years back. I found no significant gain, and reported same here.
It may be interesting if you test again, because I think the value of singular extension for Stockfish became bigger later (at least based on Stockfish-Stockfish games).

Here is the latest test from the Stockfish framework, from almost one year ago.
Note that the Stockfish team stopped the test in the middle, so we do not have an unbiased estimate, but it is probably safe to say that singular extensions give Stockfish at least 10 Elo at the 15+0.05 time control against the previous version.

http://tests.stockfishchess.org/tests/v ... 49c4e73429

ELO: -24.37 +-7.2 (95%) LOS: 0.0%
Total: 3342 W: 509 L: 743 D: 2090
I certainly had better results at 10+0.1 games. But for search modifications, I use at least 1m+1s games and when something looks good I run longer tests since the tree is exponential in nature, and more time adds more depth which can expose issues a very fast search would not produce...

The Hsu algorithm is by far the best way to implement this. It really uses the idea of "a move is singular". But it is quite expensive and when we see the depths we are seeing today, the cost appears to outweigh the gain every time when real games are used instead of tactical positions where SE can really look good.

None of the SE/TE results I mentioned earlier were done at 10+0.1. I only used that to do a quick test to make sure I had not broken something. TE is trivial to do. Full-blown SE à la Hsu is quite complex and even treats fail-high moves differently from PV moves (pv-singular vs fh-singular). I definitely did a good bit of debugging there just to get it to work correctly. Then the sticky trans/ref table adds another level of complexity.
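To give a feel for the Hsu-style test (heavily simplified; the names, margin and re-search depth are my assumptions, and the sticky trans/ref table bookkeeping is omitted entirely):

Code: Select all

// Sketch only: Hsu-style singularity test. A move is singular if every
// alternative falls at least `margin` below the reference value.
#include <vector>

struct Position;                     // engine-specific, assumed
struct Move { int from, to; };       // simplified for the sketch

std::vector<Move> legal_moves(const Position &pos);                             // assumed
int search_move(Position &pos, const Move &m, int alpha, int beta, int depth);  // assumed

bool is_singular(Position &pos, const Move &best, int reference, int depth, int margin) {
  int bound = reference - margin;
  for (const Move &m : legal_moves(pos)) {
    if (m.from == best.from && m.to == best.to)
      continue;                      // skip the candidate singular move itself
    // zero-window re-search at reduced depth: does any alternative reach bound?
    if (search_move(pos, m, bound - 1, bound, depth - 2) >= bound)
      return false;
  }
  return true;                       // nothing else comes close: extend `best`
}

// pv-singular: call with reference = the PV score at a PV node.
// fh-singular: call with reference = beta at a node whose first move failed high.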
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Joerg Oster wrote:
mcostalba wrote:
mcostalba wrote:
bob wrote: Don't make the SAME mistake he did.
Sorry but there is no mistake. I have started with threshold = alpha (test still running and not going badly, btw) because it is the baseline.

I have already queued up a test with threshold = alpha -100

http://tests.stockfishchess.org/tests/v ... 55a3c87aab
A bit of explanation.

In SF we do a null move search only for non-PV nodes, and in case of a fail-high we perform a verification search.

I have used the same already existing code to:

- Extend null move search also to PV nodes
- In this case we use alpha - <some value> instead of beta as threshold
- In this case we check for a fail-low instead of a fail-high
- In this case, if the verification search fails high, flag the node to be extended instead of returning

So by reusing the existing null search + verification code, the patch turns out to be very simple.
If I understand correctly, Donninger only wanted to extend nodes near the horizon.
From chessprogramming.wiki:
Donninger's idea is to extend the search one ply if a null move near the horizon (e.g. at depths <= 3) does not fail high and the null move score plus a constant margin (e.g. minor piece value) is <= alpha while the static evaluation at the node is >= beta (i.e. fails high). In order to get meaningful results for the null move score, you need to do it with a full alpha-beta window instead of a zero window (this is a known error in Donninger's original article).
Maybe this is a further possible refinement.
Null-move reduction needs to be adjusted for those cases, of course.
Remember also that this idea did NOT originate with Donninger. Null move was originally defined in a paper by Don Beal ("Selective Search without Tears", JICCA 1986). Several of us were already using it for the 1986 WCCC event, as a result of his paper. The NM threat idea was reported by Hsu in the 1988 singular extension paper and defined more clearly in Anantharaman's 1991 JICCA paper, where he gave a LOT of test results on the different extension ideas/modifications tested inside the Deep Thought framework. Hsu and group did the opposite of what you suggest, only doing the extensions far from the tips, to keep the cost bearable. Doing NM threat extensions near the tips is hugely expensive as opposed to closer to the root.

I don't follow your mention of a "known error" in the TE idea, however. You can't safely do a null-move search with a non-null window. That would allow the null move to actually become a "best move", but it is not a legal move... So the idea of an "actual score" for playing a null move doesn't make much sense to me. Saying "not playing a move is bad or good" has a logical meaning...


Maybe we are not using the same terminology.