Ways to avoid "Draw Death" in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Assuming a 'prior' is just fooling yourself: it is a fancy way of saying that the conclusion is based on assumptions rather than data.

5-1 is not statistically significant, but if we just assume that such a score makes it almost certain the first engine is better, then, guess what? It suddenly proves the first engine is better!
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Assuming a 'prior' is just fooling yourself: it is a fancy way of saying that the conclusion is based on assumptions rather than data.

5-1 is not statistically significant, but if we just assume that such a score makes it almost certain the first engine is better, then, guess what? It suddenly proves the first engine is better!
I see it differently. A non-uniform prior for Wins and Losses, like the one I presented, just says that a 5-1 outcome is more likely to occur than 3-3. Then, a posteriori, given 5-1 for whichever engine, the result is less likely to revert to the opposite imbalance or to equality, assuming imbalance is likely (the non-uniform prior). A bit of an art and belief, but not unreasonable.

For example: my gut feeling is that, whichever engine is stronger, the Win/Loss ratio in this match lies in the "reasonable region" of 2-4 (based, for example, on Andreas' extensive tests). Then a "reasonable" prior is W**2 * L**(1/2).
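To make this concrete, here is a minimal numerical sketch of how such a prior enters the LOS computation. I read W**2 * L**(1/2) as an unnormalized density on the win probability p among decisive games, symmetrized into a 50/50 mixture with its mirror image since either engine may be the stronger one; that reading, and the helper names, are my own assumptions, not an exact reproduction of the setup discussed above:

```python
# Sketch: posterior LOS = P(p > 1/2 | W wins, L losses), where p is the
# win probability among decisive games (draws are ignored).  Plain
# numerical integration; no libraries needed.

def los(wins, losses, prior=lambda p: 1.0, steps=100000):
    """LOS under an arbitrary (unnormalized) prior density on p."""
    total = upper = 0.0
    for i in range(1, steps):                # interior grid points only
        p = i / steps
        post = prior(p) * p**wins * (1.0 - p)**losses
        total += post
        if p > 0.5:
            upper += post
    return upper / total

def skewed_prior(p):
    # My reading of W**2 * L**(1/2): an unnormalized density p^2 (1-p)^0.5,
    # symmetrized (50/50) with its mirror image.
    return 0.5 * (p**2 * (1.0 - p)**0.5 + p**0.5 * (1.0 - p)**2)

los_uniform = los(5, 1)                 # uniform prior on p gives 0.9375
los_skewed = los(5, 1, skewed_prior)
print(los_uniform, los_skewed)
```

The uniform-prior value for +5 -1 is the familiar 93.75%; swapping in a different prior shifts the number, which is the whole point of the disagreement in this thread.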
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:It might be hard to check this, for lack of a referee that plays/evaluates perfectly. You might think Stockfish is so much better than other engines that it can be used as such, but in fact it is just as stupid strategically as weaker engines, if not more so.

A clear example (yes, again an extreme one) is the 3Q vs 7N position. Stockfish lacks the strategic knowledge of the Scharnagl effect that makes this a lost position for the Queens, and evaluates it as +8 at low depth. And then the score drops in small steps to -16 as the game progresses, or, in any fixed position, as its look-ahead depth allows it to see deeper into the game. So to Stockfish it looks like it is making a relatively small error on every move, compared to its deeper searches, while in fact it didn't really make any error at all, and is just seeing the consequences of starting in a hopeless position. If it had been started in a position where it had the choice whether or not to convert into this hopeless position (e.g. where it could still force a Q-for-2N trade before the Knights get organized), it would not take the opportunity to save the game, and the entire error would be made in that one move, sealing its fate. It cannot be solved by analyzing the game backwards, because in any position there will be several moves that apparently look equally good if you don't have the required evaluation knowledge, but are in fact equally bad.

This is rather typical of strategic errors. The unrecognized disadvantage restricts your options, so that your best moves are just a bit worse than those of the opponent. The positional horizon effect will then smooth out material losses: the score already drops gradually before an actual material loss occurs, as the engine plans to postpone the loss by making smaller positional sacrifices that push it over the horizon, before it can see that the loss is unavoidable.

BTW, what does sound like paranoia to me is the assumption that playing strength at Balanced Chess would correlate so poorly with that at Unbalanced Chess that one would ever need to live with the vastly larger statistical error that Balanced Chess gives you for the same number of games. That would almost require anti-correlation, or no correlation at all. And even if that were the case, it is still a good question which of the two abilities is more relevant to the typical user. It seems to me that Balanced Chess is a dead horse: either it measures the same thing as Unbalanced Chess, but in an incredibly less efficient way, or it measures something different, which the average user is not much interested in. It cannot win. So there is little point in flogging it with advanced statistical methods.
I agree, but I don't think that is what we were discussing. For a perfect player there are no small errors anyway: all it sees is 0 or -Infinity, and all we could do is count those two. Several years ago I had Houdini (1.5?) analyze the self-play games of a weak and an even weaker engine, as I recall. And you are right in the sense that, close to perfection, those error-distribution curves might become pathological.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

Laskos wrote:I see it differently. A non-uniform prior for Wins and Losses, like the one I presented, just says that a 5-1 outcome is more likely to occur than 3-3. Then, a posteriori, given 5-1 for whichever engine, the result is less likely to revert to the opposite imbalance or to equality, assuming imbalance is likely (the non-uniform prior). A bit of an art and belief, but not unreasonable.

For example: my gut feeling is that, whichever engine is stronger, the Win/Loss ratio in this match lies in the "reasonable region" of 2-4 (based, for example, on Andreas' extensive tests). Then a "reasonable" prior is W**2 * L**(1/2).
So in the end you are substituting gut feeling for data, disguising that by using statistical jargon that to most readers would just be mumbo-jumbo.

You assume (without any basis I can see) that one of the engines must be much weaker (for the task at hand, in this case Balanced Chess) than the other. So after 5-1 that would make it very unlikely that it is the much weaker engine that has the 5 wins. While it would be quite normal that a slightly weaker engine manages a 5-1 lead. But you artificially excluded ('by gut feeling') that it could be only slightly weaker, so you make yourself believe that you have now proved it must be much stronger.

By following this procedure, you would classify slightly weaker engines as "far stronger with high confidence", almost habitually. That doesn't sound sensible at all, to me.

It also makes little sense to me to do this two-sided. To justify the prior assumption that one of the engines is much weaker, you must have a preconception about their strength, and it is hard to imagine you would not have any idea which of the two is so much stronger. And if that were the case, the whole match becomes futile: you could just feed the 0-0 score into the prior, and declare the suspected strong one the winner with high LOS without playing any games. The problem is usually not that you get two executables and don't know which of the two is Stockfish and which is micro-Max.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:
Laskos wrote:I see it differently. A non-uniform prior for Wins and Losses, like the one I presented, just says that a 5-1 outcome is more likely to occur than 3-3. Then, a posteriori, given 5-1 for whichever engine, the result is less likely to revert to the opposite imbalance or to equality, assuming imbalance is likely (the non-uniform prior). A bit of an art and belief, but not unreasonable.

For example: my gut feeling is that, whichever engine is stronger, the Win/Loss ratio in this match lies in the "reasonable region" of 2-4 (based, for example, on Andreas' extensive tests). Then a "reasonable" prior is W**2 * L**(1/2).
So in the end you are substituting gut feeling for data, disguising that by using statistical jargon that to most readers would just be mumbo-jumbo.

You assume (without any basis I can see) that one of the engines must be much weaker (for the task at hand, in this case Balanced Chess) than the other. So after 5-1 that would make it very unlikely that it is the much weaker engine that has the 5 wins. While it would be quite normal that a slightly weaker engine manages a 5-1 lead. But you artificially excluded ('by gut feeling') that it could be only slightly weaker, so you make yourself believe that you have now proved it must be much stronger.

By following this procedure, you would classify slightly weaker engines as "far stronger with high confidence", almost habitually. That doesn't sound sensible at all, to me.

It also makes little sense to me to do this two-sided. To justify the prior assumption that one of the engines is much weaker, you must have a preconception about their strength, and it is hard to imagine you would not have any idea which of the two is so much stronger. And if that were the case, the whole match becomes futile: you could just feed the 0-0 score into the prior, and declare the suspected strong one the winner with high LOS without playing any games. The problem is usually not that you get two executables and don't know which of the two is Stockfish and which is micro-Max.
Well, you seem to have a gripe not so much with my method of deriving LOS as with the foundations of Bayesian statistics. Bayesian statistics is based on "beliefs", "plausibilities", "naturalness" and the like, like it or not. You can object to it as much as you like, but you might be surprised how often it is used in empirical theoretical sciences, by some bright minds, for example in supersymmetry phenomenology or string model building. There you will often hear terms such as "naturalness" and "plausibility" ("gut feeling" is avoided for no good reason; it sounds bad, but it is basically the same thing). That is why many supersymmetry people are panicking now: the LHC ruled out the most "natural" of their models, and they have to find new ways of judging what is "plausible" and what is not. I don't think all these people were dumber than you for using Bayesian reasoning.

I base my "feelings" on a lot of collected empirical data. When I say that the statistical insignificance of +5 -1 =84 in 90 games is due not so much to the Win/Loss ratio approaching 1 as to the high draw rate, I have empirical data to back it up. If you don't have any, that is your problem. Nobody knows the "true prior", but I have a degree of confidence (belief) in my prior and in the derived a posteriori quantities.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ways to avoid "Draw Death" in Computer Chess

Post by hgm »

I have no gripe with Bayesian statistics, but with the prior that you use. There is no limit to the silliness you can get out of a Bayesian analysis by choosing a sufficiently crazy prior. Therefore it always has to be applied with caution.

Obviously the 'improvement' you seem to get here is simply what you put in through the prior, not what the data suggest. Your prior is a self-fulfilling prophecy of your belief that all engines that draw a lot against each other must be very different in strength. You claim to have data to back that up.

Well, I have no data at all, but I don't need any to know when I am being conned. Surely, if I play a super-strong engine against itself at long TC in Balanced Chess, it must have a high draw rate. (Does your data contradict that?) And that despite the fact that the strength difference is exactly zero. If the data you collected supports your belief, you have simply been biased in collecting it. Engines of equal strength exist at every strength level, because each engine is exactly as strong as itself. Modest patches of strong engine versions also produce engines that differ very little in strength; engine developers are creating and testing them all the time. Saying that strong engines cannot possibly be close in strength (which is exactly what your prior does) is just nonsense.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Ways to avoid "Draw Death" in Computer Chess

Post by AlvaroBegue »

H.G.'s criticism of Bayesian inference is very common among scientists and engineers: If we can't agree on what the prior distribution looks like, we cannot agree on the conclusions.

I tend to favor Bayesian statistics more than most, but in this particular case Kai seems to be using the prior to make it look like we have more information than we really do.

So we have a coin, flipped it 6 times and it came down heads 5 times and tails once. How certain are we that the coin is not fair? Well, for a Bayesian analysis you need to know something about the origin of the coin, so we can have some a priori distribution for the hidden parameter p, the true probability of getting heads. Unfortunately in the real world you don't know where the coin came from. So the best we can do is try to quantify the evidence that we got from the observations we have. For instance, you can compute how often you expect to get a result as lopsided as the one you observed, if the coin were fair. In this case, it's 11% of the time. That's easy to interpret and doesn't depend on a prior that we can't agree on. That's why it's valuable. Go LOS!
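The 11% figure can be checked directly; it is the one-sided tail, i.e. the chance a fair coin gives a result at least as heads-heavy as 5 of 6:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability a fair coin shows at least 5 heads in 6 flips:
print(binom_tail(6, 5))   # 7/64 ~ 0.109, the ~11% quoted above
```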
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

hgm wrote:Saying that strong engines cannot possibly be close in strength (which is exactly what your prior does) is just nonsense.
I was not saying that, and the prior doesn't say that. I (and the prior) was saying that two top engines are unlikely to have a Win/Loss ratio very close to 1; more likely it is significantly different from 1. That is all, "more likely", and I have a belief in that based on my own and others' previous tests and experiments with them. LOS with that prior is independent of draws. And I ask you: does +16 -4 =980 mean the engines are close in strength or not? We can diverge even on that.
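For the concrete numbers: a standard draw-independent LOS (uniform prior over the win probability among decisive games only) on +16 -4 =980, next to the Elo difference implied by the score. This is a sketch; the helper names are mine:

```python
from math import comb, log10

def los_decisive(wins, losses):
    """Draw-independent LOS: P(p > 1/2) for p ~ Beta(wins+1, losses+1),
    i.e. a uniform prior on the per-decisive-game win probability.
    Uses the identity I_x(a, b) = P(Bin(a+b-1, x) >= a)."""
    n = wins + losses + 1
    below = sum(comb(n, i) for i in range(wins + 1, n + 1)) / 2**n
    return 1.0 - below

def elo_diff(wins, losses, draws):
    """Elo difference implied by the overall score fraction."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return 400.0 * log10(score / (1.0 - score))

print(los_decisive(16, 4))      # ~0.996: one side is almost surely stronger
print(elo_diff(16, 4, 980))     # ~4 Elo: yet the gap in strength is tiny
```

So by LOS one side is almost certainly stronger, while by Elo the engines are very close; the two notions of "close in strength" can indeed diverge, which is exactly the question posed above.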
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

AlvaroBegue wrote:H.G.'s criticism of Bayesian inference is very common among scientists and engineers: If we can't agree on what the prior distribution looks like, we cannot agree on the conclusions.

I tend to favor Bayesian statistics more than most, but in this particular case Kai seems to be using the prior to make it look like we have more information than we really do.

So we have a coin, flipped it 6 times and it came down heads 5 times and tails once. How certain are we that the coin is not fair? Well, for a Bayesian analysis you need to know something about the origin of the coin, so we can have some a priori distribution for the hidden parameter p, the true probability of getting heads. Unfortunately in the real world you don't know where the coin came from. So the best we can do is try to quantify the evidence that we got from the observations we have. For instance, you can compute how often you expect to get a result as lopsided as the one you observed, if the coin were fair. In this case, it's 11% of the time. That's easy to interpret and doesn't depend on a prior that we can't agree on. That's why it's valuable. Go LOS!
I don't know why you are so infatuated with p-values and the frequentist approach. Yes, it gives scientific predictions. But a p-value stopping rule is more of an art, and is not even sound on theoretical grounds: the Type I error is unbounded, and for an infinite number of games it reaches 100%. The p-value (LOS with a uniform prior) can be used, clumsily and inefficiently, as a stopping rule in our computer chess case, because the divergence is logarithmic, so over some range of data the Type I error can be controlled. Did you ever see how much Type I error accumulates with a p-value 0.05 stopping rule over 1000 games? I could dig up my older posts here.
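The accumulation is easy to demonstrate by simulation. A sketch, with equal-strength opponents modeled as a fair win/loss coin (no draws) and a normal-approximation z-test re-run after every game; the function name and parameters are my own:

```python
import random
from math import sqrt

def false_positive_rate(max_games=1000, trials=1000, z_crit=1.96,
                        min_games=20, seed=1):
    """Fraction of equal-strength matches that reach |z| > z_crit at least
    once when the significance test is repeated after every game -- the
    accumulated Type I error of a naive p < 0.05 stopping rule."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        wins = 0
        for n in range(1, max_games + 1):
            wins += rng.random() < 0.5          # two exactly equal engines
            if n >= min_games:
                z = (wins - 0.5 * n) / (0.5 * sqrt(n))
                if abs(z) > z_crit:             # "significant" -- stop early
                    hits += 1
                    break
    return hits / trials

print(false_positive_rate())   # well above the nominal 5%
```

With continuous monitoring the realized false-positive rate by game 1000 is several times the nominal 5%, which is the unboundedness being described.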
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Ways to avoid "Draw Death" in Computer Chess

Post by AlvaroBegue »

Laskos wrote:
AlvaroBegue wrote:H.G.'s criticism of Bayesian inference is very common among scientists and engineers: If we can't agree on what the prior distribution looks like, we cannot agree on the conclusions.

I tend to favor Bayesian statistics more than most, but in this particular case Kai seems to be using the prior to make it look like we have more information than we really do.

So we have a coin, flipped it 6 times and it came down heads 5 times and tails once. How certain are we that the coin is not fair? Well, for a Bayesian analysis you need to know something about the origin of the coin, so we can have some a priori distribution for the hidden parameter p, the true probability of getting heads. Unfortunately in the real world you don't know where the coin came from. So the best we can do is try to quantify the evidence that we got from the observations we have. For instance, you can compute how often you expect to get a result as lopsided as the one you observed, if the coin were fair. In this case, it's 11% of the time. That's easy to interpret and doesn't depend on a prior that we can't agree on. That's why it's valuable. Go LOS!
I don't know why you are so infatuated with p-values and the frequentist approach. Yes, it gives scientific predictions. But a p-value stopping rule is more of an art, and is not even sound on theoretical grounds: the Type I error is unbounded, and for an infinite number of games it reaches 100%. The p-value (LOS with a uniform prior) can be used, clumsily and inefficiently, as a stopping rule in our computer chess case, because the divergence is logarithmic, so over some range of data the Type I error can be controlled. Did you ever see how much Type I error accumulates with a p-value 0.05 stopping rule over 1000 games? I could dig up my older posts here.
I am not proposing to use p-value as a stopping rule. If you were to use the t-value of the posterior distribution in your Bayesian setup as a stopping rule, that wouldn't fare too well either. This is not at all what we are talking about.