Questions for the Stockfish team

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Questions for the Stockfish team

Post by lkaufman »

What a strange remark. As far as I can tell the major engine rating lists are all very accurate and honest. Of course they can't predict with great accuracy how engines would do against top human GMs, nor can they predict performance at super-long time controls, but other than that I think they are fine. There is some distortion due to arbitrary choice of opponents, but I believe this is on the order of ten Elo points plus or minus.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions for the Stockfish team

Post by bob »

mcostalba wrote:
bob wrote:
mcostalba wrote:
bob wrote: Don't follow your comments.
I have used official CEGT lists for single core CPU as reference for the numbers I have given.
I use my cluster test results exclusively.
I know this ;-)

But you cannot simply ignore that public lists do exist and are the independent official sources for engine comparisons. In a public discussion on engine comparison we should stick to them, because they are guaranteed to be independent.

You, as an engine author, should ask yourself a much more interesting question: "Why are my Elo references _so_ different from the public ones?"....and then you have two choices: try to stick to them, or prove that they are garbage.
I think I already answered that.

(1) opening books are a pain. We have a good one, but we do not release it, as we use it in tournaments. That means that if we play using a book in external testing, the book is an unknown quantity. If all engines use the same book, does that favor one engine or not? If I know what book is being used, I can certainly do a bit of tuning, since I am not then trying to optimize things for _all_ possible openings. Is such a book tuned for a specific engine, which would put other engines using those openings at a disadvantage? Our tournament book has been produced by analyzing various positions to determine which ones Crafty seems to play best.

(2) time controls can be an issue. All of my testing uses increment-type games. Chess servers have become the "norm" and that's the way the games are typically played, unless they are pure sudden-death in N minutes. I've simply not taken the time to do any significant time-control tuning. Tracy has done a bit of this and produced significant improvements for certain time controls.

Professional opening books are a real weapon. And they are outside the engine's sphere of influence. I don't want to include that kind of noise in my testing, I am purely interested in how the engine plays, "on its own" from a wide variety of different opening positions.

That's really all there is to this. I suppose we could start testing at 40/5 repeating (40 moves in 5 minutes, then repeat) to improve the results there, since I already know that 23.2 has a pretty serious problem in that kind of game (loses on time occasionally, but more importantly simply does not allocate time very smartly). But for the moment, there are more interesting things to test.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions for the Stockfish team

Post by bob »

Tord Romstad wrote:
bob wrote:I use my cluster test results exclusively. No book issues or anything else, just plain and simple "engine vs engine." Already found one serious timing bug that will influence rating lists that use "repeating" time controls that I never use or test with (for example, 40/60 repeating: 40 moves in 60 minutes, then repeat). Gross error in time usage in that.

I prefer fischer-clock games to avoid time scrambles also.
So do I.

This is a danger we are all facing: We are always optimizing for our own favorite testing conditions, and there is always a risk that the strength will suffer when somebody else tests the program under different conditions.

For people with limited testing resources, like us in the Stockfish team, the public rating lists are invaluable. I understand why it is not as interesting to you, with your ability to get thousands of games very quickly on your cluster, but I still think it would be useful for you to have a look at the public rating lists once in a while, in order to see how Crafty performs under different conditions and discover bugs like the one described above.
I simply don't pay much attention. If we were to reach a point where we have no ideas to test, I might jump into trying to look at how others are testing and see if there is anything that needs addressing. I did not know we had the repeating time-control bug until Frank pointed it out in a post here a week or two ago.

Your partner, MC, commented that he could not understand why G2.x is ahead of Crafty in some lists since it has been out for a year or two. Here's a newsflash for him: I don't spend time going through other programs. Again, if and when I run out of ideas, I may well start to look around. The TT approach to a _very_ weak singular extension idea is not particularly new. There are a couple of other "almost-singular-extension" ideas as well. And then there is the real McCoy as defined by Hsu. You might notice that the TT-singular idea is not in Crafty, while in the past there have been other "sorta-singular-extension" ideas a couple of times. What's the moral? I don't spend any time digging through other programs to try and copy ideas. At least not until my "to-do" list has been emptied, and there is a long list of things remaining, including re-testing the old SE stuff I did several years ago, which might work better in light of the rather ridiculous depths we see today with LMR and forward pruning working together so well.

If you guys want to stomp through ip* and take what's worth taking, that doesn't bother me. But to assume that I am doing the same is a bit out in left field. Doesn't mean I won't, but I certainly have not, so far. I find this much more fun when I work to come up with ideas myself, rather than just writing code to implement an idea someone else has either created or copied from yet another third party...

Let's see if MC hangs around doing this for 20 or 30 or 40 years. If so, then he can talk. I do what I enjoy, and no more. And that has kept me involved in this for 42 years now, come October. For me, it is more important to enjoy this than to worry about how high up or down a list I am. Too many have burned out and fallen by the wayside following that pursuit.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Questions for the Stockfish team

Post by zamar »

bob wrote: Let's see if MC hangs around doing this for 20 or 30 or 40 years. If so, then he can talk. I do what I enjoy, and no more. And that has kept me involved in this for 42 years now, come October. For me, it is more important to enjoy this than to worry about how high up or down a list I am. Too many have burned out and fallen by the wayside following that pursuit.
Personally I don't expect to enjoy computer chess for 20 years or more. For me, improving and understanding a strong chess engine is an interesting challenge (like solving a puzzle). Because I'm very new to this hobby, I often find myself trying to visualize and understand how, e.g., null move or singular extensions affect the search tree, or how the scores get backed up from the leaves to the root. As a newbie I also prefer to take a look at other programs' source code, see how they've done things, and try those things in Stockfish.

But I'd expect that within five years or so, I'll get bored with this hobby. When I've seen enough, understood enough, tested my own ideas enough and contributed to the open-source chess community enough, it will be time to do something else. For example, "computer go" is something I'd definitely want to try at some point in my life.

So I think it's a question of different goals, and there is nothing wrong with that. You are working with a perspective of tens of years. I want to reach good results quickly, and when I've seen enough, it's time to give up and do something else. Of course I can't speak for Marco or Tord; I have no idea how long they expect to keep doing computer chess in the future.
Joona Kiiski
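(For readers trying to picture the null-move pruning Joona mentions, here is a minimal textbook-style sketch, not Stockfish's or any other engine's actual code; Position, evaluate, qsearch and the null-move helpers are hypothetical.)

Code:
// Null-move pruning sketch: let the opponent move twice. If a reduced-depth
// search still fails high after "passing", assume a real search would fail
// high too and cut off without searching any real moves.
int search(Position& pos, int alpha, int beta, int depth) {
    if (depth <= 0)
        return qsearch(pos, alpha, beta);      // drop into quiescence search

    const int R = 3;                           // typical null-move depth reduction
    if (!pos.in_check() && depth > R && evaluate(pos) >= beta) {
        pos.make_null_move();                  // "pass": the opponent moves again
        int score = -search(pos, -beta, -beta + 1, depth - 1 - R);
        pos.undo_null_move();
        if (score >= beta)
            return beta;                       // fail-high: prune this subtree
    }

    // ... normal move loop; leaf scores are backed up toward the root
    // through the alpha/beta bounds ...
    return alpha;
}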
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Questions for the Stockfish team

Post by mcostalba »

zamar wrote: I want to reach good results quickly, and when I've seen enough, it's time to give up and do something else. Of course I can't speak for Marco or Tord; I have no idea how long they expect to keep doing computer chess in the future.
I have the same approach as you. Life is too short to make long-term plans ;-)

As long as I am having fun I will toy with SF; once I'm not, I'll pass the torch: I have no wish to set a new world record for longevity with this hobby.

Regarding the ideas, well, I don't consider ideas of my own any more rewarding to implement than ideas from someone else. I have fun implementing and testing good ideas and seeing them work, no matter who the author was; I have no urge to put my mark on them.

Stockfish is open because it is a book on chess engine development, but instead of being presented as a collection of papers or as "how to" documentation, it is presented in the form of actual source code, which, in my personal opinion, is the best way to present and teach software-related material.

Given this role, it would be silly and against the underlying approach to discard good material just because it is not original or not "made here". People who want to understand this should put themselves in the role of someone writing a scientific book on a given subject: they collect information from as many interesting sources as possible, organize it, and build up the book. They don't ignore good material just because they didn't get the idea first.

And this is what SF is: a live book on chess engine development.

And please let me add that it is the best book currently out there in terms of material coverage, state of the art, readability and overall quality. That is where my pride goes, not into the "new ideas" being mine, which I don't care about at all.
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: Questions for the Stockfish team

Post by frankp »

zamar wrote:I don't like making comparisons with any specific program (like Crafty), but I can list some major points why I think Stockfish is stronger than most other modern programs:

* Relaxed singular extension
* Logarithmic LMR
* Complicated king safety evaluation
* Fine tuned tables and constants in evaluation
* Speed-optimized code
* Aggressive late move pruning at low depths
Out of interest, did you measure the Elo increase from the complicated king safety eval? After a comment by Bruce around a CCT several years ago, I implemented something similar (king cover wrecked, pieces that could reach squares (protected and unprotected) around the king, etc.), but was never convinced it was any better than king cover + simple tropism. Perhaps it is a question of tuning?
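(For comparison, the "king cover + simple tropism" baseline frankp describes might look roughly like the following generic sketch; the helper names and weights are purely illustrative and not taken from any of the engines discussed.)

Code:
// Hypothetical "king cover + simple tropism" term, in centipawns.
// pawn_shield_count() and chebyshev_distance() are assumed helpers.
int simple_king_safety(const Position& pos, Color us) {
    Square ksq = pos.king_square(us);

    // 1. King cover: penalize each missing pawn on the three shield
    //    squares in front of the king.
    int missing = 3 - pawn_shield_count(pos, us, ksq);
    int score = -20 * missing;                          // e.g. -20 cp per missing pawn

    // 2. Simple tropism: every enemy piece near our king adds pressure,
    //    weighted by piece type and scaled by distance.
    static const int weight[] = { 0, 1, 2, 2, 3, 5, 0 };    // none,P,N,B,R,Q,K
    for (Square s : pos.pieces(~us)) {
        int d = chebyshev_distance(s, ksq);             // 1..7
        score -= weight[pos.piece_type_on(s)] * (8 - d);
    }
    return score;                                       // added to our side's evaluation
}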
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Questions for the Stockfish team

Post by Tord Romstad »

bob wrote:
Tord Romstad wrote:
bob wrote:I use my cluster test results exclusively. No book issues or anything else, just plain and simple "engine vs engine." Already found one serious timing bug that will influence rating lists that use "repeating" time controls that I never use or test with (for example, 40/60 repeating: 40 moves in 60 minutes, then repeat). Gross error in time usage in that.

I prefer fischer-clock games to avoid time scrambles also.
So do I.

This is a danger we are all facing: We are always optimizing for our own favorite testing conditions, and there is always a risk that the strength will suffer when somebody else tests the program under different conditions.

For people with limited testing resources, like us in the Stockfish team, the public rating lists are invaluable. I understand why it is not as interesting to you, with your ability to get thousands of games very quickly on your cluster, but I still think it would be useful for you to have a look at the public rating lists once in a while, in order to see how Crafty performs under different conditions and discover bugs like the one described above.
I simply don't pay much attention. If we were to reach a point where we have no ideas to test, I might jump into trying to look at how others are testing and see if there is anything that needs addressing. I did not know we had the repeating time-control bug until Frank pointed it out in a post here a week or two ago.
And if you had paid more attention to the public rating lists, there is a good chance that you would have discovered this bug much earlier. That was my point. By ignoring the public rating lists, you miss the chance to easily discover bugs that don't appear under the particular testing conditions you use.

Another point is that most users these days are not aware of how awesome Crafty is, because it's so vastly underrated on the public lists. I know you don't care much about what the average computer chess enthusiast thinks about Crafty, but I and your other fans do care. We want Crafty to get the recognition it deserves.
Your partner, MC, commented that he could not understand why G2.x is ahead of Crafty in some lists since it has been out for a year or two. Here's a newsflash for him: I don't spend time going through other programs. Again, if and when I run out of ideas, I may well start to look around. The TT approach to a _very_ weak singular extension idea is not particularly new. There are a couple of other "almost-singular-extension" ideas as well. And then there is the real McCoy as defined by Hsu. You might notice that the TT-singular idea is not in Crafty, while in the past there have been other "sorta-singular-extension" ideas a couple of times. What's the moral? I don't spend any time digging through other programs to try and copy ideas. At least not until my "to-do" list has been emptied, and there is a long list of things remaining, including re-testing the old SE stuff I did several years ago, which might work better in light of the rather ridiculous depths we see today with LMR and forward pruning working together so well.
I'm not sure what any of this has to do with what I wrote, but we haven't claimed anywhere in the thread that our singular extension implementation is new, unique to Stockfish, or our own original idea. Joona just listed it as one of several factors that make Stockfish strong. It wasn't an example of how we are somehow better chess programmers than you. We are not, and we never claimed to be.

I've said it previously in the thread, but because people evidently didn't listen, I'll say it again: Please don't let this thread degenerate into a Crafty vs Stockfish flamewar! Reading the CCC is depressing enough in the first place, but a thread where the good guys start fighting about nothing without any intervention of the trolls is just too much. We're not rivals; we're on the same side!

And by the way, I don't spend any time digging through other programs to try and copy ideas either. It's been a very long time since I last had a look at the source code of another program.
Let's see if MC hangs around doing this for 20 or 30 or 40 years. If so, then he can talk. I do what I enjoy, and no more. And that has kept me involved in this for 42 years now, come October. For me, it is more important to enjoy this than to worry about how high up or down a list I am. Too many have burned out and fallen by the wayside following that pursuit.
I certainly won't hang around for 20 years or more. Like you, I don't worry about how high up or down a list I am, but the lists serve as a useful way to measure progress. It doesn't matter much whether Stockfish is number 1 or number 20 on the list, but it matters that version X+1 of Stockfish is stronger than version X.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Questions for the Stockfish team

Post by Tord Romstad »

frankp wrote:Out of interest, did you measure the Elo increase from the complicated king safety eval?
Not recently. The complicated king safety was the main difference between Glaurung 2.0 and Glaurung 2.1, and hasn't seen any dramatic changes since then. The Elo difference between these two versions seems to be about 30 points, but it is possible that some of these 30 Elo points are caused by other, more minor improvements in 2.1.
After a comment by Bruce around a CCT several years ago, I implemented something similar (king cover wrecked, pieces that could reach squares (protected and unprotected) around the king, etc.), but was never convinced it was any better than king cover + simple tropism. Perhaps it is a question of tuning?
I don't think so. Our king safety is one of the few parts of the evaluation function which are still almost entirely untuned. Indeed, one of the disadvantages of complexity is that it makes tuning difficult.

It's possible that a simpler king safety would have been better, but since I spent a considerable amount of time designing the current king safety evaluation, I'm a little too emotionally attached to the code to just throw it all out. Moreover, although the effect on strength is disputable, the effect on style is not.
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: Questions for the Stockfish team

Post by frankp »

Tord Romstad wrote: It's possible that a simpler king safety would have been better, but since I spent a considerable amount of time designing the current king safety evaluation, I'm a little too emotionally attached to the code to just throw it all out. Moreover, although the effect on strength is disputable, the effect on style is not.
Yes, I know the feeling.... you just cannot help believing the 'more sophisticated' approach (and code) must be better. It is pleasing when it whips up an attack from apparently nowhere, but (at least in my case) it is often over-optimistic about its kingside chances; the attack fails and the resulting position resembles a train crash.
Mangar
Posts: 65
Joined: Thu Jul 08, 2010 9:16 am

Re: Questions for the Stockfish team

Post by Mangar »

Hi,

I'd like to go back to the original question. The first thing I didn't understand was the strength of the Rybka Beta. Today I believe that an evaluation function that is heavily optimized for a non-LMR engine is not optimal for an LMR engine. Thus, developing a new eval while already having LMR in the search could be the "trick" of Rybka 1.0 Beta. Maybe Stockfish has the same "trick".

As for the search improvements in Stockfish, I think there is no single idea that improved strength; it is the combination of pruning methods. Example:

LMR in the last 3-4 plies brings some Elo. But if you do it, your search gets a little too unstable (often re-searching after LMR at plies 5-7 with results < alpha). As far as I have seen, Stockfish does huge cutoffs in the last plies (futility, VBP, razoring, static null move) without getting the instability of LMR at the last plies.
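(For readers unfamiliar with the terms, here is a rough illustration of "logarithmic LMR" and of the kind of cheap last-ply cutoffs Volker describes; the formula and margins below are made up for illustration and are not the actual Stockfish values.)

Code:
#include <algorithm>
#include <cmath>

// Logarithmic late-move reduction: later moves and deeper searches are
// reduced more, but only logarithmically, so the reduction grows slowly.
int lmr_reduction(int depth, int move_number) {
    double r = std::log(double(depth)) * std::log(double(move_number)) / 2.0;
    return std::max(0, int(r));
}

// Futility pruning near the leaves: if the static eval plus an optimistic
// margin still cannot reach alpha, skip quiet moves instead of searching them.
bool futile(int static_eval, int alpha, int depth) {
    static const int margin[] = { 0, 100, 200, 300 };   // illustrative margins
    return depth <= 3 && static_eval + margin[depth] <= alpha;
}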

Stockfish has a drawback: it is not fast in positions that need a large sacrifice first in order to win later. IMHO, Stockfish's implementation of singular extensions is "only" fixing the drawbacks of its heavy pruning. If you do not prune that much, you will not gain much from Stockfish's singular extensions.
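(In rough terms, the TT-based singular extension test mentioned earlier in the thread works like this: before searching the hash move, search all the other moves at reduced depth with a window slightly below the hash score; if everything else fails low, the hash move is "singular" and gets extended by a ply. Below is a generic sketch with hypothetical helper names and margins, not Stockfish's actual code.)

Code:
// Generic TT-based singularity test.  tt_move has score tt_score stored
// at sufficient depth in the transposition table.
bool is_singular(Position& pos, Move tt_move, int tt_score, int depth) {
    int singular_beta = tt_score - 2 * depth;          // illustrative margin
    // Search every move EXCEPT the hash move, at reduced depth, with a
    // null window just below singular_beta.
    int score = search_excluding(pos, tt_move,
                                 singular_beta - 1, singular_beta, depth / 2);
    // If no other move comes close to the hash move's score, the move is
    // singular and the caller extends it by one ply.
    return score < singular_beta;
}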

Conclusion: Everything they do fits together very well. That is why Stockfish is so strong.

Greetings Volker
Mangar Spike Chess