Well, I suppose everyone knew this, and you are just trying to batter down an open door.
Dann Corbit wrote: Sun Jul 05, 2020 1:06 pm
you will notice that there are about 1000 items in each bin, except for the first and last bins, which have about 500 each.
That means that it is about equally likely that the LOS algorithm will give any of the possible answers from "I am absolutely sure A is stronger than B" to "I am absolutely sure that A is not stronger than B". We can also turn it around and say the same thing in the other direction.
It is about equally likely that the LOS algorithm will give any of the possible answers from "I am absolutely sure B is stronger than A" to "I am absolutely sure that B is not stronger than A".
I offered this data last week in table form and even as a relational database table, but nobody seemed very interested.
The point is that, exactly because this is true, the likelihood that the LOS between equal engines is >99% is 98 times smaller than the likelihood that the LOS is between 1% and 99%, because there are 98 bins in that range and all bins get equal numbers of hits. So if you get a LOS of 0.99+, you are either dealing with a 1-in-a-hundred fluke, or there is some other reason.
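To make the bin-counting argument concrete, here is a minimal Python sketch. It is not the code that produced the data quoted above. It assumes the common normal-approximation formula LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))), which uses only wins and losses. The function names, the draw rate, and the match sizes are all illustrative assumptions.

```python
import math
import random


def los(wins, losses):
    """Likelihood of superiority, normal approximation (assumed formula).

    Only decisive games enter; draws are not an argument at all.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


def simulate_match(games, draw_rate, rng):
    """Play `games` games between two equal engines; return (wins, losses)."""
    wins = losses = 0
    for _ in range(games):
        if rng.random() < draw_rate:
            continue  # a draw changes neither count
        if rng.random() < 0.5:  # equal engines: decisive games are coin flips
            wins += 1
        else:
            losses += 1
    return wins, losses


def los_histogram(matches=2000, games=400, draw_rate=0.5, bins=100, seed=1):
    """Count how many simulated self-play matches land in each LOS bin."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(matches):
        w, l = simulate_match(games, draw_rate, rng)
        idx = min(int(los(w, l) * bins), bins - 1)  # LOS == 1.0 goes in last bin
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    counts = los_histogram()
    total = sum(counts)
    print("fraction with LOS >= 0.99:", counts[-1] / total)
    print("fraction with 0.01 <= LOS < 0.99:", sum(counts[1:-1]) / total)
```

With equal engines the top bin should collect roughly 1% of the matches and the 98 middle bins roughly 98%, which is the 1-in-a-hundred figure above. The bins are only approximately equal because win and loss counts are integers.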
I don't think any criticism of you doing this is justified; it is a perfectly valid test for the method. It is just that the results you get from this do not reflect badly in any way on the method. They might reflect very badly on your interpretation of the numbers that the method provides, but that is quite another issue.
Anyway, you questioned why I would use a LOS algorithm to test superiority of an engine over itself. I mean, it does sound kind of silly because we already know the answer, "It's not superior to itself." but that is exactly why the experiment is important. If the algorithm claims that the program is superior to itself when it is not superior to itself, then that indicates a problem.
What we see here is that the LOS algorithm coughs up an answer that is bad most of the time.
That is because as we have more and more trials, the wins and losses of the same engine do move towards the mean. However, the raw number of wins and losses that are not exactly on the mean will increase in spread (even though, on average, they compute a better mean). This destroys the LOS calculation.
Again this weird belief that draws would be able to tell you anything about what happens in the other cases. You still did not tell whether you also have such delusions in other areas of life. If I tell you that 99% of the people that use XBoard only use it for Chess, do you think that this affects whether the other 1% use it for Shogi or Xiangqi? How do you think the ratio of the number of Xiangqi and Shogi users must be affected if more people started to use XBoard for Chess?
Now, the LOS calculation is not the worst calculation in the world. It also tells us the same thing that common sense tells us. If engine A has more wins than engine B, it is probably stronger. But because it does not care about draws it is missing important information (including the number of games in total).
In principle they could provide a matrix with the LOS between every pair of engines. But it is of course most interesting for engines that are close. Anyone would be able to guess that Rybka is almost certainly stronger than Fairy-Max. The point is that error bars of the Elo do not give the full story; you would have to know the covariances between all pairs as well to compare ratings. The LOS gives you results of that without you having to do the calculation.
Another important reason for testing LOS using an engine against itself is that the only thing I have ever seen it used for is for a tiebreaker. For instance CCRL uses it to tell differences between adjacent engines on their list. That makes them (by definition of the ordered list) fairly close in strength to each other.
Also note that LOS is partially transitive: if both A > B and B > C with high LOS, you can be sure that A > C with even better LOS. The opposite is not true: if both LOS are close to 50%, the LOS between A and C can still be very large, when B is just poorly determined w.r.t. the pair of them, but happens to fall in between.
Draws do not provide any evidence for which one is stronger. You are the only one that insists that when the fraction of draws is large, the propensity for winning should be as large as that for losing. The rest of the world believes these to be completely independent properties.
It may be that with only a couple thousand games the error spread has not grown large enough to dominate. And so the answers may be OK. But I also think it is faulty for throwing out draw numbers. Draw numbers impart vital information about strength (as demonstrated by the Elo calculation) and therefore tossing out that information makes the algorithm more prone to bad guesses.