Branko Radovanovic wrote: ↑Sat Nov 17, 2018 8:19 pm
How can one hurt fairness by including non-crashing games of a crashing engine?

Very simple: by introducing the so-called "tester bias". I'll illustrate with an example from my own practice.
Early in my testing career I was testing an old Winboard engine under the Cute Chess GUI and observed an unusually high proportion (over 30%) of abnormal terminations, mostly stalled connections. My standard practice is to replay such games with the exact same opening lines, and that is what I proceeded to do. The crashes persisted, albeit on a progressively smaller scale, with some games having to be replayed three or four times; I, too, was bent on getting nice clean results and persevered.
When the test was complete at last, I found to my astonishment that the engine's performance exceeded its current rating by over 250 Elo! Why? No losses, and just a handful of draws, even against opponents that were guaranteed to draw some of its "blood".
A closer look at the mechanics of the situation revealed the reason for the stalls: the engine was trying to resign in lost positions but did so in a nonstandard way, by sending "computer resigns" to the GUI instead of the proper "(computer) resign". The GUI ignored the nonstandard command, while the engine, considering the game over, stopped communicating. By replaying the games until the "crashes" disappeared, I unintentionally introduced a severe distortion into the results. In effect, it amounted to manually picking the wins and draws and discarding all the losses. Hence the term "tester bias".
Just another example of how naively following the dictums of common sense while failing to consider hidden nuances can sometimes lead to unexpected and undesirable results.
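To put rough numbers on the effect, here is a minimal sketch of the arithmetic. It assumes that a stall can happen only in a lost game, that every stalled game is replayed until it finishes, and that each replay is an independent game with the same underlying win/draw/loss probabilities. The probabilities and the names `Outcome`, `recordedOutcome`, `score` and `eloFromScore` are made up for illustration; they are not measurements from the test described above.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical true per-game outcome probabilities of the engine under test.
struct Outcome {
    double win;
    double draw;
    double loss;
};

// Distribution of the *recorded* results when stalled games are replayed
// until they finish. Assumptions: only lost games can stall, they do so with
// probability stallOnLoss, and every replay is an independent game drawn
// from the same true distribution.
Outcome recordedOutcome(const Outcome& truth, double stallOnLoss) {
    assert(stallOnLoss >= 0.0 && stallOnLoss <= 1.0);
    const double keptLoss = truth.loss * (1.0 - stallOnLoss);
    const double kept = truth.win + truth.draw + keptLoss;
    assert(kept > 0.0);  // otherwise no game would ever finish
    return {truth.win / kept, truth.draw / kept, keptLoss / kept};
}

// Expected score per game: a win counts 1, a draw counts 1/2.
double score(const Outcome& o) {
    return o.win + 0.5 * o.draw;
}

// Elo difference implied by an expected score (standard logistic model).
double eloFromScore(double s) {
    const double inf = std::numeric_limits<double>::infinity();
    if (s >= 1.0) return inf;
    if (s <= 0.0) return -inf;
    return 400.0 * std::log10(s / (1.0 - s));
}
```

For instance, with a true 40/20/40 win/draw/loss split and every loss stalling, `recordedOutcome({0.4, 0.2, 0.4}, 1.0)` contains no losses at all, the score rises from 50% to about 83%, and `eloFromScore` turns that into roughly 280 Elo of pure replay artifact.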
Daniel Shawul wrote: ↑Sat Nov 17, 2018 6:41 pm
Edit:
Indeed cutechess implements things correctly like I suspected. It waits for a "feature done 1" before initializing.

    else if (name == "done")
    {
        write("accepted done", Unbuffered);
        m_initTimer->stop();

        if (val == "1")
            initialize();
        return;
    }

That code may not be present in the old TCEC fork of cutechess if they are still using it. There were commits to the master branch related to hanging engines only a few months ago.