Yes, the 5-nomial model with counted pairs of games fits the jackknifed values well.
Great. So theory and practice agree!
For the balanced openings the variance, as in the data, is 10%-15% smaller than the trinomial one; for the unbalanced ones it is smaller by a factor of 2-2.5 compared to trinomial. The 5-nomial model fits this well and should be used for side-reversed matches. Also, it means that unbalanced positions will stop much faster (are more relevant),
You are right that, all other things being equal, there is no downside to computing the variance correctly, and it makes the (G)SPRT more efficient. If your findings are confirmed (I hope people try), then they show that the trinomial model is quite wrong for paired games in typical testing scenarios and should be replaced by the 5-nomial model.
as in all my past tests the (w-l) value is not much different between the balanced and unbalanced cases.
Well, an Elo model can make some predictions about this. Alas, not about the effect of correlations (your tests seem to indicate this is the dominant effect). So this is uncharted territory, I think.
Yes, the 5-nomial fits the jackknifed data _very_ well, to within 0-3% in variance across the tests. The 3-nomial is off by at least 8% for ultra-balanced openings and by up to a factor of 2.5 in variance for unbalanced ones. I hope other people will confirm these findings, and that both your "master LLR formula" and the 5-nomial variance will be included in tests with side-reversed games (most of the tests people do). I have some small tools for performing jackknifing, and I have sets of openings ranging from extremely balanced to very unbalanced; if people are interested, just PM me.
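To make the trinomial vs 5-nomial comparison concrete, here is a minimal Python sketch. It assumes the quantity of interest is the variance of the mean score per game, and that the games are stored in play order so that each consecutive pair is one side-reversed pair. The function names and the toy data are hypothetical, not taken from the tools mentioned above.

```python
def trinomial_variance(games):
    """Variance of the mean score, treating every game as an independent trial.

    `games` holds per-game scores (1, 0.5 or 0) from one engine's point of view.
    """
    n = len(games)
    mean = sum(games) / n
    var_per_game = sum((s - mean) ** 2 for s in games) / n
    return var_per_game / n


def pentanomial_variance(games):
    """Variance of the mean score, treating each side-reversed pair as one trial.

    Assumes games 2k and 2k+1 form one pair, so the pair score takes one of
    the five values 0, 0.5, 1, 1.5, 2.
    """
    if len(games) % 2:
        raise ValueError("need an even number of games")
    pairs = [games[i] + games[i + 1] for i in range(0, len(games), 2)]
    n_pairs = len(pairs)
    mean = sum(pairs) / n_pairs
    var_per_pair = sum((p - mean) ** 2 for p in pairs) / n_pairs
    # Mean score per game is (sum of pair scores) / (2 * n_pairs),
    # hence the factor 4 in the denominator.
    return var_per_pair / (4 * n_pairs)


# Toy data: very unbalanced openings where the same colour wins both games,
# so most pairs score exactly 1 point in total.
unbalanced = [1, 0, 1, 0, 0.5, 0.5, 1, 0]
print(trinomial_variance(unbalanced), pentanomial_variance(unbalanced))
```

In this extreme toy case every pair totals one point, so the pair-based variance collapses to zero while the per-game estimate does not. That is the same direction as the effect reported for unbalanced openings, only exaggerated.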
There is an issue with repeated openings, where both multinomial and jackknifing fail. In some cases it's not important, when the number of openings is much larger than the number of games, but in, say, the SF framework the two are comparable in size. I took the following approach: I played each opening as a side-reversed pair TWO times in sequence, for a total of 4 consecutive games from the same opening. I performed jackknifing by slicing correctly into blocks of 8 and 4 consecutive games, and incorrectly into blocks of only 2 consecutive games, and computed the variances. Slicing into 2 games is equivalent to jackknifing a database with repeated openings where we don't know where the repeats are.
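The slicing procedure can be sketched as a delete-one-block jackknife. This is only an illustration under stated assumptions: the statistic is taken to be the mean score per game (the actual tools may jackknife a different statistic), and `block_jackknife_variance` and the toy game list are hypothetical names and data.

```python
def block_jackknife_variance(games, block_size):
    """Delete-one-block jackknife variance of the mean score.

    `games` is the list of per-game scores in play order; consecutive
    `block_size` games are treated as one unit. Trailing games that do not
    fill a whole block are dropped.
    """
    n_blocks = len(games) // block_size
    used = games[: n_blocks * block_size]
    total = sum(used)
    n = len(used)

    # Mean score with each block left out in turn.
    leave_one_out = []
    for b in range(n_blocks):
        block_sum = sum(used[b * block_size:(b + 1) * block_size])
        leave_one_out.append((total - block_sum) / (n - block_size))

    loo_mean = sum(leave_one_out) / n_blocks
    spread = sum((x - loo_mean) ** 2 for x in leave_one_out)
    return (n_blocks - 1) / n_blocks * spread


# Toy data: each opening is played 4 times in a row and the results within
# an opening are perfectly correlated (an exaggerated fixed-nodes-like case).
games = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
v4 = block_jackknife_variance(games, 4)  # blocks match the repeats
v2 = block_jackknife_variance(games, 2)  # blocks cut through the repeats
print(v4, v2)
```

With blocks of 2 the two halves of each repeated opening are treated as independent, so the estimated variance comes out smaller than with blocks of 4, which is the failure mode described above.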
Fixed time control (ultra-fast 2''+0.02''):
Variance for jackknife with 8 games: 0.0378226
Variance for jackknife with 4 games: 0.0377792
Variance for jackknife with 2 games: 0.0366197
With fixed time, incorrect slicing into 2 games underestimates the variance by about 3%.
Sanity check - fixed nodes, stronger correlation between games:
Variance for jackknife with 8 games: 0.0515040
Variance for jackknife with 4 games: 0.0515589
Variance for jackknife with 2 games: 0.0279226
With fixed nodes, the incorrect jackknifing with 2 games gives a variance about 1.8 times smaller than the real one.
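For anyone who wants to recheck the two ratios, here is a small Python sketch using only the variances listed above. The variable names are mine, and the 4-game jackknife is taken as the reference value.

```python
# Jackknife variances as reported above (4-game and 2-game slicing).
fixed_time = {4: 0.0377792, 2: 0.0366197}
fixed_nodes = {4: 0.0515589, 2: 0.0279226}

# Relative shortfall of the 2-game estimate at fixed time control.
time_shortfall = 1 - fixed_time[2] / fixed_time[4]

# Factor by which the 2-game estimate is too small at fixed nodes.
nodes_factor = fixed_nodes[4] / fixed_nodes[2]

print(f"fixed time:  2-game variance is {time_shortfall:.1%} too small")
print(f"fixed nodes: 2-game variance is too small by a factor of {nodes_factor:.2f}")
```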
All in all, at least at ultra-fast time controls, repeated openings do not seem to be a large issue (but only one test was performed).