Crafty and Stockfish question

Discussion of chess software programming and technical issues.

Moderator: Ras

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Crafty and Stockfish question

Post by Milos »

lkaufman wrote: Is there some interface that allows a negative depth input and handles it properly? The Fritz interface I normally use doesn't allow this.
Arena.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question... Larry????

Post by bob »

Look at the previous post. This might be more in line with what you are looking for?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

jhaglund wrote:
bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "semantic" war here. :)

What you are doing, incorrectly, is assuming that the scores map proportionally to winning or losing chances. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To see why, suppose I multiplied all scores by 100, including material. You would now see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. It would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

What I think could be useful here:

Implementing a concurrent search (SFSB)...
http://talkchess.com/forum/viewtopic.ph ... 70&t=35066

Playing the moves out to a complete game, or just to x plies + y, generating statistics of each evaluation +/-..., and using that to choose your move or for move ordering (see the sketch after the tables below).

1.e4 ...
... e5 (34%W)(32%L)(34%D)
... c5 (33%W)(30%L)(37%D)
... c6 (31%W)(28%L)(41%D)
... Nc6 (29%W)(30%L)(41%D)
...
or...

Average eval at end of x plies...

1.e4 ...
... e5 (+.34)
... c5 (+.15)
... c6 (+.11)
... Nc6 (+.09)
...

or...


Order moves by Eval pv[x] + pv[ply[y]]

1.e4 ...
... e5 (+.4)
... c5 (+.3)
... c6 (+.2)
... Nc6 (+.1)
...

:wink:
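A minimal sketch of the quoted playout idea, in Python, with every name hypothetical: a random stand-in replaces the real play-to-the-end-or-x-plies step, and the collected W/L/D percentages are used to order the candidate moves.

Code: Select all

import random

def order_by_playout_stats(moves, playouts_per_move, play_one_game):
    """Play each candidate move out many times, collect win/loss/draw
    percentages for the side to move, and order moves by win percentage."""
    ordered = []
    for move in moves:
        results = [play_one_game(move) for _ in range(playouts_per_move)]
        n = float(playouts_per_move)
        w = results.count(1.0) / n  # wins
        l = results.count(0.0) / n  # losses
        d = results.count(0.5) / n  # draws
        ordered.append((move, w, l, d))
    ordered.sort(key=lambda entry: entry[1], reverse=True)
    return ordered

# Stand-in playout: a real version would finish the game (or stop after
# x plies) and return 1.0 / 0.5 / 0.0; a random result just exercises the code.
for move, w, l, d in order_by_playout_stats(
        ["e5", "c5", "c6", "Nc6"], 100,
        lambda mv: random.choice([1.0, 0.5, 0.0])):
    print("%-4s (%.0f%%W)(%.0f%%L)(%.0f%%D)" % (move, w * 100, l * 100, d * 100))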
I'm not a fan. The problem I see is that while we are perfectly willing to use an N-1 ply search to order moves for the N-ply search, I would not be willing to use an N-15 ply search to order moves for an N-ply search. And that is what you would be doing, because those "games" would have to be so shallow, in order to complete them in reasonable time, that they would amount to really minimal searches.

I prefer the idea I proposed in another thread: simply play a million games or so, then create buckets for each eval range (say 0 to .25, .25 to .50, etc.), go through each log file, and, for every search whose eval fell into one of the buckets, add the game result (0, .5 or 1.0) to that bucket. After going through a million games, compute the average for each bucket, which converts an eval of 0 to .25 into a winning probability. Ditto for .25 to .5. Then the eval could be a pure number between 0.0 and 1.0. Or, perhaps even better, double the number and subtract 1.0, so that the numbers fall in the range -1.0 (absolutely lost) to 0.0 (drawish) to 1.0 (absolutely won). Or they could be scaled in whatever way someone wants, perhaps even via an option. I think I might play around with this, just for fun, to see what the winning probability looks like for each possible scoring range.
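A minimal sketch of that bucketing scheme, assuming the logs have already been reduced to (eval-in-pawns, game-result) pairs; the bucket width and the final rescaling to -1.0 ... +1.0 follow the description above.

Code: Select all

BUCKET_WIDTH = 0.25  # eval ranges: 0 to .25, .25 to .50, ...

def bucket(eval_pawns):
    """Index of the eval range a score falls into (negative evals
    get negative bucket indices)."""
    return int(eval_pawns // BUCKET_WIDTH)

def winning_probabilities(records):
    """Average the game results (0, .5 or 1.0) that landed in each bucket."""
    totals, counts = {}, {}
    for eval_pawns, result in records:
        b = bucket(eval_pawns)
        totals[b] = totals.get(b, 0.0) + result
        counts[b] = counts.get(b, 0) + 1
    return {b: totals[b] / counts[b] for b in totals}

def rescaled(prob):
    """Map a probability in [0, 1] to -1.0 (lost) / 0.0 (drawish) / +1.0 (won)."""
    return 2.0 * prob - 1.0

# Toy data: three searches whose eval fell in 0-.25, one in .25-.50.
table = winning_probabilities([(0.10, 1.0), (0.20, 0.5), (0.05, 0.5), (0.30, 1.0)])
for b, p in sorted(table.items()):
    print("bucket %2d: win prob %.2f -> rescaled %+.2f" % (b, p, rescaled(p)))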
User avatar
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: Crafty and Stockfish question

Post by marcelk »

bob wrote: I prefer the idea I proposed in another thread: simply play a million games or so, then create buckets for each eval range (say 0 to .25, .25 to .50, etc.), go through each log file, and, for every search whose eval fell into one of the buckets, add the game result (0, .5 or 1.0) to that bucket. After going through a million games, compute the average for each bucket, which converts an eval of 0 to .25 into a winning probability. Ditto for .25 to .5. Then the eval could be a pure number between 0.0 and 1.0. Or, perhaps even better, double the number and subtract 1.0, so that the numbers fall in the range -1.0 (absolutely lost) to 0.0 (drawish) to 1.0 (absolutely won). Or they could be scaled in whatever way someone wants, perhaps even via an option. I think I might play around with this, just for fun, to see what the winning probability looks like for each possible scoring range.
I did it a year ago. You'll get something that looks like this:

[image: winning probability plotted against evaluation]

Details here. The curves are nothing unexpected. See also http://chessprogramming.wikispaces.com/ ... 2C+and+ELO.
In general, evaluations can be mapped to and from winning probabilities using this relation.
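The form usually quoted for this relation ties a pawn advantage P to a winning percentage as W = 1 / (1 + 10^(-P/4)); a sketch, assuming that form:

Code: Select all

import math

def win_expectancy(pawns):
    """Winning expectancy from a pawn advantage: W = 1 / (1 + 10^(-P/4))."""
    return 1.0 / (1.0 + 10.0 ** (-pawns / 4.0))

def pawn_advantage(w):
    """Inverse mapping: the pawn advantage implied by a winning expectancy."""
    return 4.0 * math.log10(w / (1.0 - w))

print("%.2f" % win_expectancy(1.5))   # ~0.70: what a +1.5 eval claims
print("%.2f" % pawn_advantage(0.55))  # ~0.35: the ~55% White scores Larry cites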

I agree with Larry that in early opening positions the engines' scores are much higher than expected. I see exactly the same in my engine. One theory is that this is an effect of optimizing the evaluation with a bias toward the middle game: if middle-game scores carry a heavier weight than opening positions, then optimizing tends to trade accuracy in one for the other, either because there are more middle-game positions than opening positions (and the difference is not weighted for), or because those positions are simply more important to the game result. This is speculation.

IMO such anomalies could in theory lead to poorer move selection if they cause the program to mix up 'normal' evaluations with 'wild' scores in the same search tree. The program could choose a path to an advantage that is not real instead of one that is realistic but gets a lower score. Whether or not that really happens is speculation: normally the book takes you beyond this point.

As a general principle, I'd strive to make evaluations match winning expectations well. Better moves should follow from that eventually.
I have the feeling there is something to learn from this effect, which may ultimately result in slightly better engines.
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question... Larry????

Post by lkaufman »

bob wrote: Look at the previous post. This might be more in line with what you are looking for?
If you mean your "bucket" idea, that is indeed a reasonable way to translate evaluations into probability of winning. But it does not address my concern, which is that the evals in Stockfish, Crafty, and many others will almost certainly "translate" to very unrealistic probabilities of winning in positions where one side has a marked dynamic advantage (mobility and/or better piece locations). I hope you try your idea so you can confirm that this is indeed the case.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Crafty and Stockfish question... Larry????

Post by BubbaTough »

My thoughts echo Larry's exactly in this area. Unfortunately, my tuning efforts so far have shown that tuning to real win percentages was not as effective as I had hoped (though real winning percentages have acted as good guidelines for areas of eval to tune in other ways). Perhaps I just do not have the hardware to make the idea work (or perhaps my intuition on how evals should match up with winning percentage is simply wrong).

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question... Larry????

Post by bob »

BubbaTough wrote:My thoughts echo Larry's exactly in this area. Unfortunately, my tuning efforts so far have shown that tuning to real win percentages was not as effective as I had hoped (though real winning percentages have acted as good guidelines for areas of eval to tune in other ways). Perhaps I just do not have the hardware to make the idea work (or perhaps my intuition on how evals should match up with winning percentage is simply wrong).

-Sam
I am not talking about tuning or modifying my eval at all. I am only talking about taking the final eval that I would show for each PV change and running it through a function that maps it from a (now) internal Crafty evaluation to an external probability of winning. The program would not change at all; this is just a procedure that modifies the eval before it is shown. Many programs already do this to a degree, because they use something other than 100 for a pawn. If you use (say) 256, then you have to scale the eval you display by multiplying the normal eval by 100 / 256 to get it back into the pawn=100 range.
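A sketch of this purely cosmetic step (names hypothetical): internal units are rescaled to centipawns for display, and could just as well be looked up in a bucket table like the one earlier in the thread to show a winning probability instead.

Code: Select all

PAWN_VALUE = 256  # hypothetical internal units, as in the pawn=256 example

def display_centipawns(internal_score):
    """Cosmetic-only conversion: internal units -> pawn=100 scale."""
    return internal_score * 100 // PAWN_VALUE

def display_probability(internal_score, prob_table, bucket_width=0.25):
    """One step further: show the winning probability for the eval's bucket
    (prob_table as produced by the bucket sketch earlier in the thread)."""
    b = int((internal_score / PAWN_VALUE) // bucket_width)
    return prob_table.get(b)

print(display_centipawns(384))  # internal +384 displays as +150 centipawns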
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Crafty and Stockfish question... Larry????

Post by BubbaTough »

bob wrote:
BubbaTough wrote:My thoughts echo Larry's exactly in this area. Unfortunately, my tuning efforts so far have shown that tuning to real win percentages was not as effective as I had hoped (though real winning percentages have acted as good guidelines for areas of eval to tune in other ways). Perhaps I just do not have the hardware to make the idea work (or perhaps my intuition on how evals should match up with winning percentage is simply wrong).

-Sam
I am not talking about tuning or modifying my eval at all. I am only talking about taking the final eval that I would show for each PV change and running it through a function that maps it from a (now) internal Crafty evaluation to an external probability of winning. The program would not change at all; this is just a procedure that modifies the eval before it is shown. Many programs already do this to a degree, because they use something other than 100 for a pawn. If you use (say) 256, then you have to scale the eval you display by multiplying the normal eval by 100 / 256 to get it back into the pawn=100 range.
I understand. You are only talking about a cosmetic change. My thought (which I think reflects Larry's original sentiment, though he is free to correct me) is that there is something wrong with today's programs that can only be fixed by non-cosmetic changes, since the evaluations do not match up as well as they should with probabilities. This is illustrated by the fact that positions with low probabilities of winning often receive much higher scores than positions with higher probabilities (particularly in the opening). I think most strong players, even experts (it does not take a grandmaster), understand that the engine's implied probability is simply wrong early in the opening. This inaccuracy is less obvious after development has mostly been completed, when engines are better at judging things. The weakness in opening judgment matters, since a primary use of engines is the analysis of openings. It is also not obvious that this weakness is somehow necessary; it seems to me that making the evaluation reflect win probability more accurately should improve actual engine performance, though my evidence to support this is slight.

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question... Larry????

Post by bob »

BubbaTough wrote:
bob wrote:
BubbaTough wrote:My thoughts echo Larry's exactly in this area. Unfortunately, my tuning efforts so far have shown that tuning to real win percentages was not as effective as I had hoped (though real winning percentages have acted as good guidelines for areas of eval to tune in other ways). Perhaps I just do not have the hardware to make the idea work (or perhaps my intuition on how evals should match up with winning percentage is simply wrong).

-Sam
I am not talking about tuning or modifying my eval at all. I am only talking about taking the final eval that I would show for each PV change and running it through a function that maps it from a (now) internal Crafty evaluation to an external probability of winning. The program would not change at all; this is just a procedure that modifies the eval before it is shown. Many programs already do this to a degree, because they use something other than 100 for a pawn. If you use (say) 256, then you have to scale the eval you display by multiplying the normal eval by 100 / 256 to get it back into the pawn=100 range.
I understand. You are only talking about a cosmetic change. My thought (which I think reflects Larry's original sentiment, though he is free to correct me) is that there is something wrong with today's programs that can only be fixed by non-cosmetic changes, since the evaluations do not match up as well as they should with probabilities. This is illustrated by the fact that positions with low probabilities of winning often receive much higher scores than positions with higher probabilities (particularly in the opening). I think most strong players, even experts (it does not take a grandmaster), understand that the engine's implied probability is simply wrong early in the opening. This inaccuracy is less obvious after development has mostly been completed, when engines are better at judging things. The weakness in opening judgment matters, since a primary use of engines is the analysis of openings. It is also not obvious that this weakness is somehow necessary; it seems to me that making the evaluation reflect win probability more accurately should improve actual engine performance, though my evidence to support this is slight.

-Sam
I completely disagree. You/Larry want the evaluation to match a human perception about the advantage and who has it. I see nothing at all requiring such a thing to play good chess. We simply need an evaluation that causes the program to play the best move each time. And tuning leads to this. Whether a program thinks white is +1.5 before making a single move or +0.0 has no bearing on how well the engine will play the game. What is important is that better moves raise the score higher than worse moves.

This fetish about wanting the eval to match a human's perception makes no sense. What human examines 100M nodes per second to choose which move to make? What human does alpha/beta, minimax, LMR, null-move and all the other things a computer does? So why, since a computer plays the game so differently, is there a sense of "wrongness" when the computer's evaluation seems out of touch with a human's evaluation?

Now if the goal is to somehow map computer evaluations into something more palatable for humans, that can be done. But there is absolutely no reason to physically change a computer's evaluation to make it match a human's perception better, any more than it would be useful to use a more human-like approach to playing the game...
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Crafty and Stockfish question... Larry????

Post by BubbaTough »

What philosophy generates the most effective evaluation is probably not solvable by debate...our thoughts on the issue just differ.

Regarding a cosmetic translation from eval number to win probability, the funny thing is, I don't think there is a need for one, because humans seem to have already adapted to centipawns. Multiple times I have seen chess players, from weaker players to IMs and GMs, talk about how good they feel a position is in terms of centipawns. It is odd to me to listen to titled players talk about how a position is "about +.4", but it happens. People spend so much time interacting with computer programs that it has fundamentally changed how they talk about the size of advantages.

-Sam