Unfortunately this is an
inverse-square relationship. The standard error of the total score over an N-game match against a roughly equal opponent is about 0.4*sqrt(N). So for 100 games you would have a standard error of 4 points = 4%, and near equality each 1% of score is worth about 7 Elo. So after playing 100 games against opponents of known strength you have a standard error of 28 Elo, and a 95%-confidence interval of about +/- 55 Elo on the rating of E.
Now if you also play E' for 100 games, it gets its own, independent rating estimate with a 95%-confidence range of +/- 55 Elo. The 95%-confidence range of the difference between the two is then sqrt(2)*55 = 77 Elo. So if the difference between E and E' is more than 77 Elo, you could identify the better one with 95% confidence by letting each play 100 games.
If you want to get 95% confidence for a smaller difference, say 11 Elo, that is 7 times smaller, so you need 49 times as many games. That is 4900 games for E, and 4900 games for E'. To resolve 5 Elo as reliably you would have to play ~20,000 games each. (Ouch!)
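For concreteness, here is a minimal sketch of that arithmetic in Python. The 0.4-points-per-game and 7-Elo-per-percent constants are just the rough values quoted above, not exact statistics, so the outputs agree with the numbers in the text only up to rounding.

```python
import math

# Ballpark constants from the text above: per-game standard deviation of the score
# is ~0.4 points, and near equality 1% of score is worth about 7 Elo.
SIGMA_PER_GAME = 0.4
ELO_PER_PERCENT = 7.0

def elo_error_95(n_games):
    """Approximate 95%-confidence half-width (in Elo) of one version's rating after n_games."""
    score_se_percent = 100.0 * SIGMA_PER_GAME / math.sqrt(n_games)  # SE of the score percentage
    return 2.0 * ELO_PER_PERCENT * score_se_percent                 # ~2 sigma for 95%

def games_needed(elo_difference):
    """Games per version needed to resolve elo_difference between E and E' at ~95% confidence."""
    width_one_game = math.sqrt(2.0) * elo_error_95(1)  # difference of two estimates is sqrt(2) wider
    return math.ceil((width_one_game / elo_difference) ** 2)

print(elo_error_95(100))   # ~56 Elo for a single version after 100 games
print(games_needed(77))    # ~100 games each
print(games_needed(11))    # ~5000 games each
print(games_needed(5))     # ~25,000 games each (the '~20,000' above uses rounder numbers)
```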
Playing a round-robin is an enormous waste of time, as you would also be playing engines against each other that you don't want to test. Those games contribute no information toward discriminating E from E'. Never do it; just play gauntlets.
The only way out of this inverse-square trap is making the games
less independent. Seen in that light, playing with a book (where the engines randomly select opening lines from it) is a bad idea. The games will certainly be independent then, and there is nothing you can do about it: even two identical, deterministically playing engines would play completely different and independent games, just because the opponents selected different openings. Playing from fixed positions with deterministic engines and deterministic opponents would at least allow you to see in a small number of games that E and E' are identical. If you had made a modification that only affects the end-game (e.g. let the evaluation recognize KBK and KNK, and score them as draws), they would end up in the same set of end-games, as the games would be identical move for move until the point where they could make, or refrain from, the fatal mistake of 'winning' the opponent's piece at the expense of their own last pawn. The early moves would just be a way to generate a representative sample of end-games, so that you can see how important it is to recognize this in practice.
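As an aside, the KBK/KNK recognition mentioned above amounts to very little code. A minimal sketch, using the python-chess board representation (an assumption; a real engine would test its own material counters, and the function name here is made up):

```python
import chess

def is_trivial_draw(board: chess.Board) -> bool:
    """Recognize KBK and KNK: a lone minor piece cannot force mate, so score these as draws."""
    non_kings = [p.piece_type for p in board.piece_map().values()
                 if p.piece_type != chess.KING]
    return len(non_kings) == 1 and non_kings[0] in (chess.KNIGHT, chess.BISHOP)

# In the evaluation, a hit would simply return the draw score (0) instead of the material score.
```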
So play from a (large) number of positions that are representative of the early middle game.
If your opponents or your own engine are not deterministic enough to end up in the same end-game from the same initial position, force them to do so: play a game between E and O, take the first position from that game where there are only 8 pieces (incl. kings, excl. pawns) on the board, and use that as the starting point for the game between E' and O (if you want to test the above end-game eval). If you are worried that a generic set of early middle-game positions is not representative for your engine, generate the positions yourself by applying the same strategy to the early middle game: play a game between E and O with their own books until they are out of book, and use those positions as starting points for E' vs O as well. If your change was in the engine rather than the book, you want to test the effect of that change, not compare the quality of two book lines.
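Harvesting such start positions from a baseline E-vs-O game is a small scripting job. A sketch using python-chess (an assumption; the file name and function name are placeholders):

```python
import chess
import chess.pgn

def first_eight_piece_fen(pgn_path):
    """FEN of the first position in the game with at most 8 non-pawn pieces (kings included)."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
        non_pawns = sum(1 for p in board.piece_map().values()
                        if p.piece_type != chess.PAWN)
        if non_pawns <= 8:
            return board.fen()
    return None  # the game never reached such an end-game

# first_eight_piece_fen("E_vs_O.pgn") would then serve as the start position for E' vs O.
```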
Test the quality of your evaluation and search under time controls that do not interfere with determinism. Even if this is not the best way to manage time, it is extremely unlikely that a change which gives an improvement at deterministic time controls (fixed depth or fixed number of nodes, cleared hash, exact-depth hash hits only) would not give a similar improvement under more advanced (but noisier) time controls.
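A sketch of what such a deterministic test game could look like, driving two UCI engines at fixed depth with python-chess (the paths and depth are placeholders; hash clearing and exact-depth hash hits have to be arranged inside the engines themselves):

```python
import chess
import chess.engine

def play_fixed_depth_game(white_path, black_path, start_fen, depth=10):
    """Play one game at a fixed search depth, which keeps deterministic engines deterministic."""
    board = chess.Board(start_fen)
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    try:
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(depth=depth))  # 'go depth N' in UCI terms
            board.push(result.move)
        return board.result()
    finally:
        white.quit()
        black.quit()
```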
If opponents do not support a fully deterministic mode, try to force them to play equal games by selection. Play both E and E' against O simultaneously, let O think only once, and use its move against both E and E' for as long as E and E' play the same moves. Let E and E' always finish their search iterations, let E determine the number of iterations by reading the clock, and then (for as long as they play the same game) force E' (and E'', and ...) to use the same number of iterations on a move-by-move basis. This is not as difficult as it seems: you would have one 'baseline version' of your engine that you use as a reference, and for every game in its gauntlet you would store moves and depth (as is usually done in a PGN file). You would program the test versions of your engine to read that PGN, and as long as their moves are the same as in one of the games in the PGN, use the same number of iterations (rather than a preset time) to search the next move as were used for the next move in that game.
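A sketch of that PGN-replay bookkeeping, again with python-chess. The depth is assumed to be stored in the move comments in a '<score>/<depth>' format (as cutechess-cli writes it), so the regular expression is an assumption you would adapt to whatever your own tools log:

```python
import re
import chess.pgn

DEPTH_RE = re.compile(r"/(\d+)")  # picks the depth out of comments like '+0.30/12 0.52s'

def baseline_depths(pgn_path):
    """Map each move-sequence prefix (UCI moves joined by spaces) to the depth searched next."""
    table = {}
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            prefix = []
            node = game
            while node.variations:
                child = node.variation(0)
                match = DEPTH_RE.search(child.comment or "")
                if match:
                    table[" ".join(prefix)] = int(match.group(1))
                prefix.append(child.move.uci())
                node = child
    return table

# A test version that has played the moves in 'prefix' so far would look up
# baseline_depths(...)[" ".join(prefix)] and search exactly that many iterations.
```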
Measure the time they use, rather than enforcing it, and correct for small differences according to an empirical formula (e.g. 1 Elo penalty per 1% extra total time usage).
After the two versions branch off, you can play multiple games starting from the first position where they differed, to account for opponent variability (to make it less likely that you think E' made the better move just because the opponent's continuation was atypically poor). You could again play E and E' simultaneously against O (a number of times) from the positions just after their original games diverged ('tree-wise comparison').
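Finding that branch point is again a small scripting job. A sketch (python-chess assumed, file names are placeholders) that returns the two positions reached just after E and E' first chose different moves:

```python
import chess
import chess.pgn

def branch_positions(pgn_e_vs_o, pgn_eprime_vs_o):
    """Return (fen_after_E_move, fen_after_Eprime_move) at the first divergence, or None."""
    with open(pgn_e_vs_o) as fa, open(pgn_eprime_vs_o) as fb:
        game_a = chess.pgn.read_game(fa)
        game_b = chess.pgn.read_game(fb)
    board = game_a.board()
    for move_a, move_b in zip(game_a.mainline_moves(), game_b.mainline_moves()):
        if move_a != move_b:
            after_a, after_b = board.copy(), board.copy()
            after_a.push(move_a)
            after_b.push(move_b)
            return after_a.fen(), after_b.fen()
        board.push(move_a)
    return None  # the games never diverged
```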
There are really lots of things one can do to improve testing accuracy without having a super-cluster, other than just throwing your hands up in the air (a condition known as MIPS addiction)!