Scalability study for chess.

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Scalability study for chess.

Post by bob »

Don wrote:
bob wrote:
Don wrote:Will a program with a very good evaluation function improve more with depth than one with a poor evaluation function?

I'm doing a scalability study for chess to help understand this. It's similar to other studies that have been done in the past, which determine how much a chess program improves with depth of search, but this study also tries to answer the above question.

In the study, I'm testing Toga with 11 different levels using 2 different evaluation functions, one weak and one strong. The details of the study are posted here with numbers and graphs:

http://greencheeks.homelinux.org:8015/

Any comments are welcome. I may extend the study to depth 12 or 13 after I've accumulated more data at the current levels.

This site is temporary and I'm simply hosting it on my home computer, while I'm looking for a more permanent place to host it. Does anyone have a publicly available site I can use for this?

I will also be making the PGN files containing all the games available. The study is ongoing and will continue for as long as my time and patience permit.

I am also interested in other studies which may have a bearing on this. I know that Hans Berliner did a study many years ago with two versions of Hitech (one of which he called Lotech), but the number of games was too low to have any real statistical meaning.
My initial thought is that there is no significant difference. The difference between the two programs is smaller at the lower end, but that could just be compression, since the Elo down there is so low. At the upper end the gap widens, but not by a lot, and I suspect that is normal, since the "smarter" program ought to play better, and may well gain just a bit since it is artificially faster than its opponent. (Two programs searching to the same depth give an advantage to the slower program, since it gets to do more computation per move than its opponent. The best example might be Chessmaster, which is very slow; searching to depth 11 might take forever in wild positions, but it would hardly miss a thing at that depth.)
Bob,

I thought there would be more improvement than I actually see for the strong version, but we clearly need more data to get a good picture.

I would have preferred to do a time (or nodes) based study, as I already mentioned in another post, since that is the most relevant for practical programs. Such a study would show the "real" difference in the two evaluation functions.

You pose an interesting question about the older programs that were far less selective. How would they do against modern programs at the same depth? How much is modern selectivity hurting programs today?

If you were to do a study of full-width vs. selectivity by time, I'm pretty sure you would see that selective programs improve much faster. They play weaker at a given depth, but the extra speed more than compensates for this disadvantage, and indeed that is the whole point of selectivity.

I could test full-width vs. selective by depth later. Perhaps I will hack Toga to allow testing by fixed nodes and do this very study too.
I would suspect you are correct. But fixed-depth testing makes it tough to draw conclusions, since a ply for program X is not the same as a ply for program Y. A set number of nodes is OK, but it eliminates a significant part of an engine, because it disables any time-allocation tricks a program might use (fail low = more time, instant replies give more time later, easy moves save time, etc.). So such tests are also a bit inaccurate, particularly if one of the two opponents has no clever time-allocation code, as that will often be the deciding factor between two fairly equal opponents...
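
To make that concrete, here is a minimal sketch of the kind of time-allocation tricks I mean. All names and constants are invented for illustration, not taken from any real engine; this is exactly the machinery a fixed-node or fixed-depth test switches off:

Code: Select all

/* Hypothetical time-allocation sketch (all names and constants are
 * invented for illustration): the kind of clock handling that a
 * fixed-node or fixed-depth test disables entirely. */
#include <stdbool.h>

typedef struct {
    int remaining_ms;   /* time left on the clock */
    int moves_to_go;    /* moves until the next time control, 0 if unknown */
} Clock;

/* Base budget: an even share of the remaining time. */
static int base_budget_ms(const Clock *c)
{
    return c->remaining_ms / (c->moves_to_go > 0 ? c->moves_to_go : 30);
}

/* Adjust the budget using feedback from the search. */
static int adjusted_budget_ms(const Clock *c, bool failed_low, bool easy_move)
{
    int budget = base_budget_ms(c);

    if (failed_low)
        budget *= 3;            /* fail low = spend more time resolving it */
    if (easy_move)
        budget /= 4;            /* obvious move = bank the savings for later */

    /* Never commit more than half the remaining clock to one move. */
    if (budget > c->remaining_ms / 2)
        budget = c->remaining_ms / 2;
    return budget;
}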

All of my cluster testing is time-based, to let the engines use anything they have available to beat their opponent...
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scalability study for chess.

Post by Don »

Hi Bob,

Yes, I know there are arguments for and against using node counts. However, when I use node counts for testing I am not concerned with the contribution of time control algorithms to the overall strength of the program. If I were testing the absolute strength of two different engines I would of course consider this very important. In this case I'm trying to isolate the effect of the evaluation functions.

This is similar to how you use opening books in testing. If you are trying to test which of two engines is better, you let them use the opening book the author has designated as best. However, if you are more interested in the overall strength of the actual computing engine, the opening book can be a distraction. For instance, some opening-book lines are designed specifically to steer the program away from lines it doesn't play well, which isn't useful if you are trying to build up the evaluation function. Of course it's very useful if you are trying to win a tournament.

So I guess the short response to you is that I agree: fixed-node testing is not a good way to test the overall strength of the program. But it's an excellent tool for general program research and development, because it makes testing clean and predictable, even when testing with several different kinds of machines under different loads. It is more natural than fixed-depth testing and tests most aspects of the chess program far better. So in my view it's the best way to test when normal time-control matches are out of the question.
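
As a rough sketch, the hook an engine needs for this is tiny; something like the following (invented names, not Toga's actual code). Counting nodes instead of reading the clock is what makes the run reproducible:

Code: Select all

/* Minimal sketch of a fixed-node limit (invented names, not Toga's or
 * Glaurung's actual code). Counting nodes instead of reading the clock
 * makes a run reproducible regardless of machine speed or system load. */
#include <stdint.h>
#include <stdbool.h>

static uint64_t nodes = 0;
static uint64_t node_limit = 0;     /* 0 means "no node limit" */
static bool stop_search = false;

/* Call once at the top of every node the search visits. */
static inline void count_node(void)
{
    ++nodes;
    if (node_limit != 0 && nodes >= node_limit)
        stop_search = true;         /* search unwinds when it sees this flag */
}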

Probably most things can be isolated like this. For instance, you can test the time-control algorithm once, then improve the evaluation without being required to retest the time-control algorithm after each small improvement. It's also true that everything tends to interact to a degree, but it's pretty much impossible to test every possible interaction the way you wish you could.
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Scalability study for chess.

Post by Dann Corbit »

Don wrote:Will a program with a very good evaluation function improve more with depth than one with a poor evaluation function?

I'm doing a scalability study for chess to help understand this. It's similar to other studies that have been done in the past, which determine how much a chess program improves with depth of search, but this study also tries to answer the above question.

In the study, I'm testing Toga with 11 different levels using 2 different evaluation functions, one weak and one strong. The details of the study are posted here with numbers and graphs:

http://greencheeks.homelinux.org:8015/

Any comments are welcome. I may extend the study to depth 12 or 13 after I've accumulated more data at the current levels.

This site is temporary and I'm simply hosting it on my home computer, while I'm looking for a more permanent place to host it. Does anyone have a publicly available site I can use for this?

I will also be making the PGN files containing all the games available. The study is ongoing and will continue for as long as my time and patience permit.

I am also interested in other studies which may have a bearing on this. I know that Hans Berliner did a study many years ago with two versions of Hitech (one of which he called Lotech), but the number of games was too low to have any real statistical meaning.
I think that a second experiment would also be interesting:
A simpler evaluation should be faster, so what is the cost of individual components of the evaluation, in time versus depth?

For example, suppose that some component of the evaluation clearly produces a smarter answer, but it takes a great deal of time to compute. Will eliminating that component increase or decrease the Elo, given the same amount of time?

So I suggest that if you can turn eval components on and off, you do a component-by-component on/off toggle, with 500 games per component.

I have a hunch that in very fast time-control games, the value of depth will make turning things off more valuable than in slower-paced games.
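
If the eval components were controlled by a bitmask, the harness could toggle them one at a time. A hypothetical sketch (names and weights invented, not Toga's actual structure):

Code: Select all

/* Hypothetical component-by-component toggle (names and weights are
 * invented): a bitmask lets the test harness disable one evaluation
 * term at a time and replay the same match conditions. */
#include <stdint.h>

enum {
    EVAL_MATERIAL    = 1u << 0,
    EVAL_MOBILITY    = 1u << 1,
    EVAL_KING_SAFETY = 1u << 2,     /* often the expensive one */
    EVAL_PAWN_STRUCT = 1u << 3
};

static uint32_t eval_mask = ~0u;    /* all components on by default */

/* features[i] holds the raw value of term i for the current position. */
int evaluate(const int features[4])
{
    int score = 0;
    if (eval_mask & EVAL_MATERIAL)    score += features[0];
    if (eval_mask & EVAL_MOBILITY)    score += 4 * features[1];
    if (eval_mask & EVAL_KING_SAFETY) score += 8 * features[2];
    if (eval_mask & EVAL_PAWN_STRUCT) score += 2 * features[3];
    return score;
}

/* Harness: turn one component off, play 500 games, compare the Elo. */
void disable_component(uint32_t component) { eval_mask &= ~component; }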
Tony

Re: Scalability study for chess.

Post by Tony »

Don wrote:
Tord Romstad wrote:
Don wrote:It's very odd that Tord recently posted about fixed-node levels, because this was a factor here. My testing method of choice has always been "fixed nodes" and I tested that way for years. However, most programs do not seem to support this method of testing. Had Toga supported it, I would have tested levels that were each 2X more nodes than the previous level.

Testing by time, as you suggest, is subject to more ambiguities, especially when I'm doing this test on a computer that does not have a constant load. Fixed depth was my only reasonable choice without hacking Toga.
You could use Glaurung instead, you know. It supports fixed-node searches, and if you ask the author for the latest development version, you'll get a program not significantly weaker than Toga for your tests. :wink:

Tord
Tord,

That would be perfect for this. Can you make me a Linux build of your best stable version for this study? I would prefer a 64-bit version if it performs better, but I have the libraries installed to run 32-bit binaries.
My experience is that with a better evaluation, the effective branching factor becomes smaller. That is something you will miss with a fixed-node search.
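
A rough way to measure that: estimate the effective branching factor from the node counts of successive iterations. A hypothetical helper, not any engine's actual code:

Code: Select all

/* Hypothetical helper: estimate the effective branching factor from the
 * node counts of successive iterations. nodes[d-1] = total nodes needed
 * to finish the iteration to depth d. A better eval that shrinks this
 * ratio reaches deeper plies on the same node budget, which is exactly
 * the effect a fixed-node comparison hides. Requires max_depth >= 2. */
#include <math.h>

double effective_branching_factor(const double *nodes, int max_depth)
{
    /* Geometric mean of the per-ply growth ratios. */
    return pow(nodes[max_depth - 1] / nodes[0], 1.0 / (max_depth - 1));
}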

Tony