The problem here is that your claims have nothing to do with 'reality'. You did not actually run an engine match, then add randomness to the engines or play them at a much faster TC to increase their intrinsic evaluation noise, and compare the match results. You just pulled two totally arbitrary match results (10/90/0 and 55/0/45) out of your hat, of which the one with more draws happened to be statistically more significant. It is just as easy to doctor the comparison so that the result with fewer draws is more significant: just pull 10/90/0 and 100/0/0 out of the hat... That the win/loss difference should be preserved on changing TC, rather than the win/loss ratio or some more subtle relation between the two, is just your conviction. Nothing more. And it seems to be based on nothing: no actual result, no underlying model... Only the flawed logic that what I say must be wrong because I am not willing to admit that it is.

syzygy wrote: ↑Fri Apr 21, 2023 12:37 am
What is totally unrealistic is that adding noise would help. Just plug in a random generator and you'll instantly get more accurate statistics. Right.
Both intuitively and in reality noise is unwelcome. That you are confused by the knowledge that "draws do not count" is just that, it confused you.
Some time ago you were suggesting that 1 Elo difference could not be measured. Clearly it can be with the right testing set up and enough games. But noise is not going to help there.
As for 'intuitively', you can only speak for yourself. And your intuition might very well be flawed.
As a matter of fact there are many cases where noise improves a result, and this is common knowledge amongst testers. E.g. when playing from a suite of opening positions, randomly picking one per game (or per pair of games) greatly improves the significance of the match compared to always using the same start position, where a deterministic stronger engine might easily lose 100-0, simply because it replays the same lost game 50 times.
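To make that concrete, here is a toy simulation (the per-opening probabilities are invented purely for illustration, and the engines are assumed perfectly deterministic): with one fixed opening a 100-game match carries no more information than a single game, while independently drawn openings shrink the spread of the match score by a factor sqrt(100) = 10.

[code]
# Toy sketch (hypothetical probabilities, deterministic engines assumed):
# compare the spread of match scores when every game replays ONE opening
# versus when each game draws a fresh opening from a book.
import random

random.seed(1)

N_GAMES = 100        # games per match
N_MATCHES = 5000     # simulated matches, to estimate the spread of the result
P_WIN, P_LOSS = 0.10, 0.05   # assumed chances per opening for the stronger engine

def one_game():
    """Score of one independent game: 1 = win, 0.5 = draw, 0 = loss."""
    r = random.random()
    return 1.0 if r < P_WIN else (0.0 if r < P_WIN + P_LOSS else 0.5)

def match(fixed_opening):
    if fixed_opening:
        return one_game()            # the same game, replayed N_GAMES times
    return sum(one_game() for _ in range(N_GAMES)) / N_GAMES

for fixed in (True, False):
    scores = [match(fixed) for _ in range(N_MATCHES)]
    mean = sum(scores) / N_MATCHES
    spread = (sum((s - mean) ** 2 for s in scores) / N_MATCHES) ** 0.5
    label = "fixed opening  " if fixed else "random openings"
    print(f"{label}: average score {mean:.3f}, spread between matches {spread:.3f}")
[/code]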
To give a more mathematically inclined example:
If the engines playing a game are modelled as randomly blundering away some evaluation score compared to the theoretically best move, the score performs a random walk along the score axis during the game, and after many moves the distribution of the evaluation of the game position approaches a normal (Gaussian) distribution. When the level of play is very high, the score blundered per move is on average low, and the width of this distribution grows only slowly. When the initial score is close to equality (as it will be in the usual case of testing from balanced positions), the width may still be narrow compared to the threshold score needed to win the game. So although the score has drifted in the direction of a win for the stronger engine, you would have to be very far out in the tails of the distribution to actually exceed the win threshold. The probability of this can be astronomically small, so that you would basically not observe any wins at all, just draws.
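To illustrate, here is a small Monte Carlo of that random walk. All constants are invented, and the asymmetric blundering is lumped into a net per-move drift for the stronger engine plus a symmetric error term; the point is only that when the per-move error is small compared to the win threshold, essentially every game ends in a draw.

[code]
# Monte Carlo sketch of the random-walk model (invented numbers): per move the
# stronger engine gains a small amount on average, plus a symmetric random error;
# a game is scored as won/lost when the running score crosses a threshold,
# otherwise it is a draw.
import random

random.seed(42)

N_GAMES = 10_000
N_MOVES = 80
DRIFT = 0.5            # assumed average per-move gain of the stronger engine (cp)
NOISE = 5.0            # assumed per-move random error (cp); small = high-level play
WIN_AT = 300.0         # score at which the game is effectively decided

def play_one_game():
    score = 0.0
    for _ in range(N_MOVES):
        score += DRIFT + random.gauss(0.0, NOISE)
        if score >= WIN_AT:
            return "win"
        if score <= -WIN_AT:
            return "loss"
    return "draw"

results = [play_one_game() for _ in range(N_GAMES)]
for outcome in ("win", "draw", "loss"):
    print(f"{outcome:>5}: {results.count(outcome)} / {N_GAMES}")
[/code]

With these numbers virtually all 10,000 games come out as draws; raising NOISE is exactly what the next paragraph is about.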
Adding noise, even by perturbing the evaluation symmetrically (e.g. by now and then forcing a randomly picked engine to play a poorer move than the one it would have chosen on its own), increases the width of the distribution but not the drift towards the win threshold. A much larger part of the distribution can then stick out above the win threshold, resulting in many more wins, while the average final score is still so far from the loss threshold that losses remain negligible. (E.g. you could be 1 standard deviation away from the win threshold and 3 from the loss threshold, giving ~16% wins and 0.1% losses, while with half the standard deviation you would have been 2 and 6 standard deviations away from the respective thresholds, resulting in 2.2% wins and 0.0000001% losses.) And a 16/84/0 result is far more significant than a 2/98/0 result. So the added noise really does improve the significance of the result, in this example.
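The arithmetic behind those percentages, plus a crude significance measure (the score excess over 50% divided by its standard error, for an arbitrarily chosen 1000-game match), in a short sketch:

[code]
# Win/loss chances are the normal tail areas at 1 and 3 standard deviations
# (noisy, wide distribution) versus 2 and 6 (quiet, narrow distribution); the
# match significance is expressed as a simple z-statistic over N_GAMES games.
from math import erf, sqrt

def tail(z):
    """Upper tail of the standard normal distribution, P(Z > z)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

N_GAMES = 1000                       # assumed match length, just for illustration

for label, z_win, z_loss in (("noisy (wide)  ", 1.0, 3.0),
                             ("quiet (narrow)", 2.0, 6.0)):
    p_win, p_loss = tail(z_win), tail(z_loss)
    p_draw = 1.0 - p_win - p_loss
    mean = p_win + 0.5 * p_draw                      # expected score per game
    var = p_win + 0.25 * p_draw - mean ** 2          # per-game score variance
    z_match = (mean - 0.5) / sqrt(var / N_GAMES)     # how many SEs above 50%
    print(f"{label}: {100*p_win:5.1f}% wins, {100*p_loss:9.7f}% losses, "
          f"score {100*mean:.2f}%, z over {N_GAMES} games = {z_match:.1f}")
[/code]

The noisy case comes out around 13-14 standard errors above an even score, the quiet case below 5, with the same underlying strength difference driving both.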