Resolving once in a trillion crashes

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1766
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Resolving once in a trillion crashes

Post by AndrewGrant »

This should essentially be a blog post or something, but I'm going to write it here, and write out what I've just gone through because doing so is helping me to understand exactly what just happened.

Part 1:

A couple of days ago I sent Ethereal v13.50 out to a few members of rating lists, in hopes of giving them some advance time to maybe get some results in for Ethereal. One person (nameless, that is up to them!) was quick to share a few thousand games back to me, but had 3 disconnects over the course of 3,000 games. The immediate reaction was to fire up some tests on OpenBench. This person was playing single-threaded games, with Syzygy, and no adjudication. So I started two tests, both of 250,000 games, one for Fischer Random (http://chess.grantnet.us/test/23570/) and another for Standard (http://chess.grantnet.us/test/23569/).

To my amazement, there was exactly one crash. And it was on my own machine -- even better! So I take a look at the PGN, and attempt to replay the game with a series of go depth X commands, since the PGN saves the eval/ply/move at each step of the game. But unfortunately, no luck -- the crash could not be reproduced -- but technically neither could the game. You see, in practice when playing under time controls, engines don't search to exact depths and stop. Every so often Stockfish for example will search to depth 15, get a fail-high, and then end the search, never finishing depth 15. Ethereal does not do exactly that, but it is possible for Ethereal to abort in the middle of a depth when it feels it is spending too much time.

Okay, so I can't reproduce the exact game that was played. Ethereal is deterministic though, so if I could find the exact node counts searched at each ply of the game, I could reproduce it...

Part 2:

In come the attempts to generate a crash again. Now in that 500,000 game test, my own machine only played about 5,000 or so of the games. And it's worth noting that I have the only machine on OpenBench using Syzygy at the moment. So between my own crash, and the crash from the tester, one commonality that appears promising is Syzygy -- which is a daunting idea since that code is mostly out of my control.

So I launch hundreds of thousands of games on my own machine at various time controls with Syzygy, using fixed-depth games in order to reproduce the issue if a crash occurs. Well, hundreds of thousands of games later, there are no crashes. Now I'm worried that it has to do with fixed depth not being a real time control. So I add some code to Ethereal to report the exact node counts and have them saved at each ply. Cue up the same tests. No luck. Hyperbullet games, longer games, Fischer games, standard games, no luck. I even made an opening book of the positions that appeared in the crashed game, and played essentially the same game, with minor variations due to CPU speed, thousands of times. No luck.

So at this point I have exactly one crash at a reasonable TC. And no way to reproduce it... Well, almost no way. You see, Ethereal will only abort the search at a node count that is one less than a multiple of 1024. This is a common thing in engines, to avoid slamming the system with time requests. So let's take an example.
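
In sketch form (illustrative names only, not Ethereal's actual code), that pattern is just:

    // Only poll the clock when the node counter is one less than a multiple of
    // 1024, so we are not hammering the OS with time queries on every node.
    if ((thread->nodes & 1023) == 1023) {
        if (elapsed_ms(thread) >= thread->max_time)
            thread->stop = 1;   // abort mid-depth, abandoning the current iteration
    }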

At the first move of the game, Ethereal searched to d15 in K nodes, and reported the best move e4 with score +0.20. The next search went to d16, and had exactly N nodes. If I then start a new game, have Ethereal first search to d15, then push e4 and the opponent's response, and then search to d16, I would hope to again have exactly N nodes. But I don't. This is because in the first search, Ethereal looked a bit past d15.

Instead of doing "go depth 15", we do "go nodes K". We will get the same result. But what if we do "go nodes K+1024"? A different result. And again different if we do "go nodes K+2048". Well eventually, we will find the exact K+X value such that the following search reaches d16 in exactly N nodes -- although false positives are possible.
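
Written out as a loop (replay_and_count() is a purely hypothetical helper that restarts the engine, replays the game so far, runs "go nodes k" for this move, pushes the two moves, runs the next search, and returns its node count; reported_nodes and target_nodes stand in for the K and N above), the scan is just:

    // Ethereal can only have stopped at k, k+1024, k+2048, ..., so walk the
    // offsets until the following search reproduces the recorded result.
    // False positives are possible, exactly as described above.
    uint64_t k = reported_nodes;
    while (replay_and_count(k) != target_nodes)
        k += 1024;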

So we have to play the game from the starting position, with thousands of variations, until finding the exact node count before going on. And every time you fail, you have to restart the game and repeat the searches, since you need to have the exact same TT, caches, history tables, etc. The result is playing the "same" game, with minor differences, tens of thousands of times until finding the EXACT variation that occurred before.

So I took the game, and at every move performed this process to find the exact node count searched at each step, even though that information was not anywhere to be found. I found a few false positive lines, but eventually I managed to find the entire game. I could now, up until the very last move of the game, issue a series of "go nodes" commands to get the exact output.

So I put this into practice.... no crash on my Windows machine. But if we go back to the original Linux box, we crashed! Every time! Perfect. A reproducible bug is a fixable one. And while I have exactly one case study to work on out of millions and millions of games, it might just be enough.

Part 3:

From here, solving the bug is easy. A quick run through valgrind and a build with -fsanitize=address, and we get the two reports which lead us to the issue.

In Ethereal's NNUE, and many NNUEs, when we have a position we want to evaluate, we have to update the accumulator. So what we do is ask if our parent has an up-to-date version. If it does, we copy it, apply the changes from the most recent move, and then calculate. In Ethereal I do this process recursively, which I believe is different from anyone else: if any of the ancestors of this node has an accurate accumulator, we will update all the accumulators along the family tree until updating our own.
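
A stripped-down sketch of that idea (illustrative names and layout, not the actual Ethereal code; it assumes some ancestor, typically the root after a full refresh, is known to be accurate):

    // Each ply on the search stack owns an accumulator plus an 'accurate' flag.
    // To make this ply's accumulator usable, first make the parent accurate
    // (recursively), then apply the feature delta for the move that led here.
    void ensure_accurate(Accumulator *stack, int ply) {
        if (stack[ply].accurate)
            return;
        ensure_accurate(stack, ply - 1);                 // walk up the family tree
        apply_move_delta(&stack[ply], &stack[ply - 1]);  // add/remove the changed pieces
        stack[ply].accurate = 1;
    }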

The root node of the search might as well be God. The root node is the first person. All other nodes are children of God. God simply exists, and has no parents. So naturally, if we try to update God's parents, we run into issues. An invalid memory access, obviously. So guarding against that is enough to handle the situation. Perfect.
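
In code, the guard is nothing more than a bound on the lookback, something like (again only a sketch):

    // Decide whether an incremental update is possible: walk back toward the
    // root looking for an accurate ancestor, but never step past ply 0.
    int find_accurate_ancestor(const Accumulator *stack, int ply) {
        while (ply >= 0) {            // this bound is the guard; without it we read stack[-1]
            if (stack[ply].accurate)
                return ply;           // found one; update forward from here
            ply--;
        }
        return -1;                    // nothing accurate; fall back to a full refresh
    }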

But why on earth does this happen SO RARELY? Ethereal processed literally trillions of positions, to get this one crash. How can a simple memory access bug be so rare?

Part 4:

Ethereal's recursive update algorithm has one oversight. If the current position is ALREADY accurate, then we say "Yeah, you can update the entire lineage up to a certain point". The position HAS an accurate familial relation. It just happens to be ... itself. So what would happen if NONE of the node's parents were accurate? Well, we would assume that there is one somewhere. So if the current position is at depth=14, we would look at d=13, d=12... all the way to d=0, the root node. But the root is inaccurate too. So let's look at d=-1. Problems.

Fair enough. But why is this still not common? It's possible for the entire history of positions to be cached in the TT/EvalCache, so evaluate() is never called and the accumulators along the line are never updated. So it's not that crazy (confirmed) for a position to have no accurate parents.

Well, there is another question. How can a node already have an accurate accumulator? Any move made to get to the position would invalidate what was already there. Even a NULL move would, since we don't have a copy of the last ply's accumulator just sitting around. Any new position will have an inaccurate accumulator. And yet this one does?

LMR re-searches. It's common in LMR schemes to do a search(), and then another search() of the same position back-to-back if the former beat some margin. So I actually get millions of nodes with "already accurate" accumulators at the top of search() when doing re-searches.
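
The shape is the standard one (a generic LMR sketch, not Ethereal's exact conditions):

    // Late move reductions, after making the move on the board: search it at a
    // reduced depth first. If the reduced search beats alpha, immediately
    // re-search the very same child position at full depth -- two back-to-back
    // calls into search() on an identical position.
    value = -search(board, -alpha - 1, -alpha, depth - 1 - reduction);
    if (value > alpha && reduction > 0)
        value = -search(board, -alpha - 1, -alpha, depth - 1);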

But here is the absolute kicker. Almost always -- all but once in a trillion nodes -- we do NOT need to call evaluate() a second time. We have just seen this position. It's sitting in the transposition table almost always. And if it's not there, Ethereal has a per-thread Evaluation table that stores the value.
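
So the second visit nearly always short-circuits on a cache probe before it ever touches the accumulator. Roughly (eval_cache_probe/store are hypothetical helper names, not Ethereal's):

    // Probe the per-thread evaluation cache before doing a fresh NNUE eval.
    // Only on a miss do we call evaluate() and touch the accumulator at all,
    // which is why the broken path is reached so rarely.
    int eval;
    if (!eval_cache_probe(thread->evalCache, pos->hash, &eval)) {
        eval = evaluate(thread, pos);                        // updates the accumulator
        eval_cache_store(thread->evalCache, pos->hash, eval);
    }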

Part 5:

So what must go wrong for this bug to occur?

1. Ethereal must be in a position in which all parents were cached into the TT/EvalCache. (Common)
2. Ethereal must then look at a child of this position twice in a row via LMR re-searches. (Common)
3. This position must not already be cached somewhere. (Pretty uncommon given that the entire lineage already exists, but it happens)
4. Between the start of the 1st search and the start of the 2nd, the evaluation stored in two separate hash tables has to be overwritten. (What the fuck)
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
hgm
Posts: 27829
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Resolving once in a trillion crashes

Post by hgm »

It might be an idea to always append the 'nodes searched' info to some file created for this purpose, and clear it at the start of a new game, or when the engine terminates normally. Then that info would be left behind after an engine crash, and you could just request that file from the tester.
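
Something along these lines would do it (just a sketch):

    // After every completed search: append the node count to a crash log.
    FILE *log = fopen("crashlog.txt", "a");
    if (log) {
        fprintf(log, "%llu\n", (unsigned long long) thread->nodes);
        fclose(log);
    }

    // On "ucinewgame" and on a normal "quit": discard the log, so the file is
    // only left behind when the engine died mid-game.
    remove("crashlog.txt");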
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Resolving once in a trillion crashes

Post by mar »

or have a way to effectively disable TT and EC, say using 1-entry only. this plus asan should've caught the bug much faster, of course easier said now that the bug has been found and fixed
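
If the table index is computed with a mask, a debug build could simply force the mask to zero (DEBUG_TINY_TABLES is a made-up flag, purely to illustrate the idea):

    // Normally: index = hash & (numEntries - 1). With the debug flag, every
    // probe and store lands on entry 0, so cached results are overwritten
    // almost immediately and the tables are effectively disabled.
    #ifdef DEBUG_TINY_TABLES
        const uint64_t mask = 0;
    #else
        const uint64_t mask = table->numEntries - 1;
    #endif
        Entry *entry = &table->entries[pos->hash & mask];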
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Resolving once in a trillion crashes

Post by jdart »

I think you don't really need to search back to the root. In fact if you always stopped the lookback before the root, you wouldn't miss much in terms of opportunities for reusing an accumulator value.
Look
Posts: 366
Joined: Thu Jun 05, 2014 2:14 pm
Location: Iran
Full name: Mehdi Amini

Re: Resolving once in a trillion crashes

Post by Look »

I imagine you may place some defensive assert for depth at the places you decrement it.
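
For example (purely illustrative):

    // Catch a walk past the root (or off the far end of the stack) the moment
    // it happens, rather than as a once-in-a-trillion memory error.
    assert(ply >= 0 && ply < MAX_PLY);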
Farewell.
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Resolving once in a trillion crashes

Post by emadsen »

AndrewGrant wrote: Tue Jan 18, 2022 8:58 am This should essentially be a blog post or something, but I'm going to write it here, and write out what I've just gone through because doing so is helping me to understand exactly what just happened.
Very interesting post, Andrew! Thanks for sharing.

Not having written an NNUE engine, I can't say I fully understand the details of the accumulator logic. Though I think I grasp the essentials: the speed in NNUE comes from incremental updates instead of full calculations (of a network layer?). These updates are done in child nodes using the parent node as a starting point. Sometimes the parent node's accumulator is out-of-date, so the code traverses up the search tree until it finds an up-to-date node, and in this extremely rare situation, the code runs right off the end of the tree and dereferences ply[-1]... crash.

Interesting insight about LMR re-searches being a contributing factor. This made me add a TODO note to my code to prevent duplicate static evaluation of a position when it is re-searched due to an LMR fail high. That can be avoided, without requiring an eval cache, simply by passing a parameter in the recursive call to the search method.
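
One way to realise that, sketched in C for consistency with the rest of the thread (the ss/evalComputed plumbing is made up, not MadChess or Ethereal code): the parent invalidates the child's slot before each new move, the reduced search fills it, and the LMR re-search of the identical position reuses it.

    // In the parent, before each new move:
    (ss+1)->evalComputed = 0;

    // Reduced search: the child computes its static eval and caches it in (ss+1).
    value = -search(board, ss+1, -alpha - 1, -alpha, depth - 1 - reduction);

    // Full-depth re-search of the exact same position: (ss+1)->evalComputed is
    // still set, so the child skips the second evaluate() call.
    if (value > alpha && reduction > 0)
        value = -search(board, ss+1, -alpha - 1, -alpha, depth - 1);

    // At the top of search():
    if (!ss->evalComputed) {
        ss->staticEval   = evaluate(thread, board);
        ss->evalComputed = 1;
    }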

Your write-up reminded me of my own Engine Crash Detective Story I wrote a year and a half ago.
Look wrote: Wed Jan 19, 2022 2:59 pm I imagine you may place some defensive assert for depth at the places you decrement it.
Asserts are a great defensive programming technique. But they're not a panacea. They can create Heisenbugs, as I explain in my post. Often Debug.Asserts slow down an engine because of all the extra validation performed. This could prevent the condition that triggers an extremely rare bug, or delay it so that it takes more time or more games to trigger. If you knew ahead of time what would trigger the extremely rare bug, you could remove the slow Debug.Asserts (board integrity checks, Zobrist incremental & full comparison, etc.) and leave only the fast Debug.Asserts. However, with that foreknowledge, you wouldn't have written buggy code :)
My C# chess engine: https://www.madchess.net
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Resolving once in a trillion crashes

Post by connor_mcmonigle »

mar wrote: Tue Jan 18, 2022 10:33 am or have a way to effectively disable TT and EC, say using 1-entry only. this plus asan should've caught the bug much faster, of course easier said now that the bug has been found and fixed
In my understanding of Andrew's description, this bug would never appear if the TT and EC were disabled (as then accumulators stored in the accumulator stack would always be accurate from root to any position).
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Resolving once in a trillion crashes

Post by Daniel Shawul »

I have such a crash that I can’t reproduce and probably never will. It has to do with FRC but it passes perft(6) for hundreds of positions, no crashes after 10s of thousands of games. Not sure if saving node counts help because it is threaded search, and MPI too so zero chance I can reproduce this. At this point I am content with putting some safeguards and see if it happens again, after all I am not sending a spaceship to mars …
AndrewGrant
Posts: 1766
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Resolving once in a trillion crashes

Post by AndrewGrant »

Daniel Shawul wrote: Thu Jan 20, 2022 7:34 am I have such a crash that I can’t reproduce and probably never will. It has to do with FRC but it passes perft(6) for hundreds of positions, no crashes after 10s of thousands of games. Not sure if saving node counts help because it is threaded search, and MPI too so zero chance I can reproduce this. At this point I am content with putting some safeguards and see if it happens again, after all I am not sending a spaceship to mars …
Syzygy and MPI are two things where, if they start to crash with incredible rarity, I don't know where I would begin looking for a fix.

I'm sure you've tried this or thought of this, but you presumably have an is_pseudo_legal() function somewhere. If you made any assumptions about castling, they might no longer be true with FRC.

Furthermore, one extremely slow but effective way to test those functions, when your Move_t is something like 16 bits, is to do a PERFT and at every node run is_pseudo_legal() for all i=0 to i=INT16_MAX, and verify that you get the same things you generate. Terribly slow...
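
Something like this (a sketch; move_list_contains() is a hypothetical helper that scans the generated move list):

    // At every perft node, feed every possible 16-bit encoding into
    // is_pseudo_legal() and check that the set it accepts is exactly the set
    // the generator produced. Terribly slow, but it cross-checks both sides.
    void validate_pseudo_legal(const Board *board, const Move *gen, int genCount) {
        for (unsigned m = 0; m <= 0xFFFF; m++) {
            int claimed   = is_pseudo_legal(board, (Move) m);
            int generated = move_list_contains(gen, genCount, (Move) m);
            assert(claimed == generated);   // any mismatch is a bug on one side
        }
    }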
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Resolving once in a trillion crashes

Post by Daniel Shawul »

AndrewGrant wrote: Thu Jan 20, 2022 7:45 am
Daniel Shawul wrote: Thu Jan 20, 2022 7:34 am I have such a crash that I can’t reproduce and probably never will. It has to do with FRC but it passes perft(6) for hundreds of positions, no crashes after 10s of thousands of games. Not sure if saving node counts help because it is threaded search, and MPI too so zero chance I can reproduce this. At this point I am content with putting some safeguards and see if it happens again, after all I am not sending a spaceship to mars …
Syzygy and MPI are two things where, if they start to crash with incredible rarity, I don't know where I would begin looking for a fix.

I'm sure you've tried this or thought of this, but you presumably have an is_pseudo_legal() function somewhere. If you made any assumptions about castling, they might no longer be true with FRC.

Furthermore, one extremely slow but effective way to test those functions, when your Move_t is something like 16 bits, is to do a PERFT and at every node run is_pseudo_legal() for all i=0 to i=INT16_MAX, and verify that you get the same things you generate. Terribly slow...
Yes, I have actually fixed a couple of bugs there in is_pseudo_legal, but they had the opposite effect. It turns out that the checks there were too restrictive and were dumping out completely fine castling moves. Unfortunately my castling from-to is represented as the actual from and to squares of the king, and those cases where the king doesn't move were considered illegal. This had made the engine significantly weaker, but fixing it doesn't solve my bug. I guess other engines represent castling moves as (from square of king, from square of rook), so they don't make this error.
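
For what it's worth, the (king from, rook from) encoding sidesteps the ambiguity because the two squares always differ, even when the king stays put in FRC (a sketch of a generic 16-bit packing, not your engine's format):

    // Two common 16-bit encodings for a castling move:
    //   (a) king-from | (king-to << 6)   -- in FRC this can collapse to from == to
    //   (b) king-from | (rook-from << 6) -- squares always differ (Stockfish-style),
    //       so the move can never look like an illegal "null" king move
    uint16_t castleA = (uint16_t)(kingFrom | (kingTo   << 6));
    uint16_t castleB = (uint16_t)(kingFrom | (rookFrom << 6));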

I modified my perft() a while ago to do the actual incremental move generation steps the engine goes through in a real search. Although this is slower, perft() should be a sanity test, not a speed competition. It was not able to catch the is_pseudo_legal move errors because there were no killer moves or TT moves to try. But I guess now I can fill in these killer moves with some moves from other branches, or even fill them with a random value from [0, int16_max] as you suggested!