Ply vs ELO

andrejcher · Post by **andrejcher** » Tue Jun 28, 2011 9:59 am

Hi all,

I have a few questions about engine development.

1) What is the best way to calculate an engine's strength (Elo)?
2) How can I limit engine strength to an Elo value the user provides before the game? In other words, how do I calculate search depth from an Elo limit?
3) How do I test an engine to be sure that it works properly? Are there any standard test sets?

Thanks,
Andriy

hgm · Post by **hgm** » Tue Jun 28, 2011 10:56 am

1) Let it play a few thousand games against opponents with known Elo, in the range of ±300 Elo around it.
2) Depth is not a good way to tune engine strength, because at fixed depth it will play like an absolute moron in the end-game, while it is still unbeatably strong in the middle game. A better way to tune the strength is by thinking time or number of nodes. The rule of thumb there is that each doubling is worth 70 Elo points.
3) You can't. Errors like illegal moves you will of course find quickly enough, and after a few thousand test games that should be OK. Errors that only manifest themselves by sub-optimal move choice you might never find by testing. (Unless, perhaps, they cause horrendous blunders. If the engine plays 1.e3 ... 2.Ke2, it is a good guess you accidentally flipped the sign of its King Safety eval term.)
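
A minimal sketch of point 1, assuming the standard logistic Elo model: the engine's rating is estimated as the average opponent rating plus the difference implied by its overall score. The function names and the sample ratings are made up for illustration, and averaging opponents like this is only an approximation, which is one reason to keep them within a few hundred Elo of the engine.

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def performance_rating(opponent_elos, points):
    """Average opponent rating plus the difference implied by the overall score.

    opponent_elos: one known rating per game played
    points: total points scored (win = 1, draw = 0.5, loss = 0)
    """
    games = len(opponent_elos)
    average_opponent = sum(opponent_elos) / games
    return average_opponent + elo_diff_from_score(points / games)

# Hypothetical mini-match: 4 games, 2 points, against engines of known strength.
print(round(performance_rating([2000, 2100, 2200, 2300], 2.0)))  # 2150
```

With only a handful of games the estimate is very noisy, which is why a few thousand games are suggested.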

Desperado · Post by **Desperado** » Tue Jun 28, 2011 11:23 am

andrejcher wrote:Hi all,

I have a few questions about engine development.

1) What is the best way to calculate an engine's strength (Elo)?
2) How can I limit engine strength to an Elo value the user provides before the game? In other words, how do I calculate search depth from an Elo limit?
3) How do I test an engine to be sure that it works properly? Are there any standard test sets?

Thanks,
Andriy

1: As HGM already pointed out.

2: a) I think you mean a _skill level_ feature. I am not sure, but I don't think people use depth as the main control for skill levels. It is more likely a randomization of eval elements combined with search parameters such as depth.

b) Again as HGM said. One can add that at low depths, say up to ply 7, 8 or 9, each additional ply can easily be worth 100 Elo, and the gain decreases at higher plies. So going from 13 to 14 may only be worth 50 to 60 Elo in a fixed-depth comparison.

3: Here I am also not sure whether you are talking about _testing_ or _debugging_. There are a lot of test suites out there with different themes, such as mating, tactics, positional or endgame knowledge. But if you are talking about debugging, there are not many standard methods besides _Perft_, which lets you verify your move generation code. There are also a lot of EPD files and FENs around the net that let you compare node counts. Other things are mainly engine-dependent to debug, because of unique data structures.
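
For readers who have not seen perft: it counts the leaf nodes reachable in exactly N plies, and the count is compared against known-good numbers. The sketch below is generic over the move generator and uses tic-tac-toe as a toy stand-in, since a chess move generator does not fit in a few lines. The helper names (`ttt_moves`, `ttt_make`) are illustrative only; a real engine would plug in its own generator.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")  # position type
M = TypeVar("M")  # move type

def perft(pos: P, depth: int,
          legal_moves: Callable[[P], Iterable[M]],
          make_move: Callable[[P, M], P]) -> int:
    """Count leaf nodes reachable from pos in exactly `depth` plies."""
    if depth == 0:
        return 1
    return sum(perft(make_move(pos, move), depth - 1, legal_moves, make_move)
               for move in legal_moves(pos))

# Toy game standing in for chess: a board is a 9-character string of ".", "X", "O".
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def ttt_moves(board: str):
    """Empty squares, or no moves at all once someone has three in a row."""
    if any(board[a] != "." and board[a] == board[b] == board[c]
           for a, b, c in LINES):
        return []
    return [i for i, cell in enumerate(board) if cell == "."]

def ttt_make(board: str, square: int) -> str:
    """Return a new board with the side to move placed on `square`."""
    side = "X" if board.count(".") % 2 == 1 else "O"  # X moves first
    return board[:square] + side + board[square + 1:]

for depth in range(1, 4):
    print(depth, perft("." * 9, depth, ttt_moves, ttt_make))  # 9, 72, 504
```

For chess, the commonly cited counts from the starting position are 20, 400 and 8902 for depths 1 to 3; a mismatch at some depth points to a move generation bug.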

Michael

andrejcher · Post by **andrejcher** » Tue Jun 28, 2011 1:10 pm

1) it is clear
2) I'm not sure that using time to limit strength is a good idea. Then it will depend on hardware...
3) Yeah. I'm talking about testsuites for finding the best move. I've already found BT2450 and BT2630. Would be nice to have more...

Thanks,
Andriy

hgm · Post by **hgm** » Tue Jun 28, 2011 1:28 pm

2) This is why you can also use node count. That is hardware independent.
3) Test suites might tell you about the engine strength, but that doesn't tell you if it functions properly. As an extra complication, tuning the engine such that it scores better on test suites often lowers its Elo.
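
As a rough sketch of the node-count approach, combined with the 70-Elo-per-doubling rule of thumb from earlier in the thread: given a calibrated rating at a known node budget, halve the budget once for every 70 Elo the engine should lose. `full_elo` and `full_nodes` are hypothetical calibration values that would have to be measured for the engine in question, and the constant 70 is only a rule of thumb that will not hold across the whole range.

```python
def node_budget(target_elo: float, full_elo: float, full_nodes: int,
                elo_per_doubling: float = 70.0, min_nodes: int = 1) -> int:
    """Nodes per move that should roughly yield target_elo.

    full_elo / full_nodes: measured rating of the engine at a known node budget
    (hypothetical calibration data). Targets at or above full_elo get the
    full budget, since this sketch only weakens the engine.
    """
    if target_elo >= full_elo:
        return full_nodes
    halvings = (full_elo - target_elo) / elo_per_doubling
    return max(min_nodes, int(full_nodes / 2.0 ** halvings))

# Assumed calibration: about 2500 Elo at 1,000,000 nodes per move.
print(node_budget(2430, 2500, 1_000_000))  # 500000
print(node_budget(2360, 2500, 1_000_000))  # 250000
```

Because the budget is counted in nodes rather than seconds, the same setting should play at the same strength on any machine.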

Mark · Post by **Mark** » Tue Jun 28, 2011 8:36 pm

hgm wrote:2) This is why you can also use node count. That is hardware independent.
3) Test suites might tell you about the engine strength, but that doesn't tell you if it functions properly. As an extra complication, tuning the engine such that it scores better on test suites often lowers its Elo.

Regarding test suites, although they aren't great for making improvements in play, they are very useful when working on the search. I've been using the WAC test suite almost exclusively when developing my search, and it's been very helpful. My eval is only piece square and material but that doesn't matter for the tactical WAC test suite. I've gradually progressed from solving 116 out of 300 in one second to 250.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 28, 2011 8:51 pm

Mark wrote:
hgm wrote:2) This is why you can also use node count. That is hardware independent.
3) Test suites might tell you about the engine strength, but that doesn't tell you if it functions properly. As an extra complication, tuning the engine such that it scores better on test suites often lowers its Elo.
Regarding test suites, although they aren't great for making improvements in play, they are very useful when working on the search. I've been using the WAC test suite almost exclusively when developing my search, and it's been very helpful. My eval is only piece square and material but that doesn't matter for the tactical WAC test suite. I've gradually progressed from solving 116 out of 300 in one second to 250.

Good tactics are a necessary, but not sufficient, element of excellent chess play in general.
Great tactics are not necessary for a great chess engine, however. Even more so, an engine tuned exclusively with tactics will play poorly.

Don · Post by **Don** » Tue Jun 28, 2011 10:34 pm

andrejcher wrote:Hi all,

I have a few questions about engine development.

1) What is the best way to calculate an engine's strength (Elo)?
2) How can I limit engine strength to an Elo value the user provides before the game? In other words, how do I calculate search depth from an Elo limit?
3) How do I test an engine to be sure that it works properly? Are there any standard test sets?

Thanks,
Andriy

1. By playing many thousands of games. Determine in advance how many games to play, and use the error margins to estimate the possible error. Remember that BOTH (or ALL) programs being tested are subject to the error margins, so +10 may not mean what you think it does.
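
To make the error margins concrete, here is a small sketch that turns a win/draw/loss record into an Elo difference with an approximate 95% interval. It assumes the logistic Elo model and a normal approximation on the score fraction; rating tools typically use more careful statistics, so treat the numbers as a rough guide. The function names are illustrative.

```python
import math

def score_to_elo(score: float) -> float:
    """Elo difference implied by a score fraction (clamped away from 0 and 1)."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high); z = 1.96 gives an approximate 95% interval."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    std_error = math.sqrt(variance / games)
    return (score_to_elo(score),
            score_to_elo(score - z * std_error),
            score_to_elo(score + z * std_error))

# Hypothetical 1000-game match result.
elo, low, high = elo_with_margin(wins=360, draws=300, losses=340)
print(f"{elo:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

Even 1000 games without draws leave an interval of roughly ±20 Elo around an even score, so a measured +10 is well inside the noise.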

2. The answer may depend on what your purpose is for limiting the strength.

If you rate any engine at 2, 3, 4, 5, 6, 7, 8, etc. ply searches you will find that the Elo gain gets smaller with depth. It varies enormously between 1 and 14 ply or so, from something like 200+ Elo down to 40 or 50 Elo; you just have to test this for the engine in question. I suggest it makes more sense to go by nodes. If you limit strength by controlling the depth you will occasionally get searches that take many times longer than other searches, so I don't think it's a practical way to limit strength. Most modern programs will do a 17 ply search pretty quickly, but might hang up in some positions.

We generally use fixed depth for really short tests just to gather information about the speed of the program after the change and get a rough idea of whether there is an Elo gain. But then we immediately proceed to time control testing for the reasons HG points out. We rarely find that fixed depth testing returns the wrong result, but it can happen, especially if the change has much greater consequences at high depths in endings.

3. No easy answer. If you are a developer, make liberal use of assertions. Unit tests are almost always a really good thing, and we found many bugs in Komodo as soon as I implemented some unit tests. Basically there is perft (which is a kind of unit test), but that does not test the regular search.

In Komodo I have a built-in test function that searches 50 positions in "deterministic" mode and knows exactly how many nodes there should be, what the score should be, and the PV. That test is useful for verifying that many kinds of changes have no impact on the search tree. It does not test other kinds of changes such as evaluation or search changes, but it's very good at verifying that the program does not change between compilers. For example, if the 32-bit version of your program fails this test, then you are doing something wrong.

Thanks to this test, I discovered that I was relying on implementation-defined C compiler behavior for something.
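
A test of this kind could be sketched roughly as follows. This is not Komodo's code: the `SearchResult` type, the `search(fen, depth)` signature and the two positions are assumptions for illustration, and `stub_search` is a placeholder that only mimics a deterministic engine so the harness can run on its own.

```python
import zlib
from typing import Callable, Dict, List, NamedTuple, Tuple

class SearchResult(NamedTuple):
    nodes: int
    score: int            # centipawns
    pv: Tuple[str, ...]   # principal variation as move strings

Search = Callable[[str, int], SearchResult]

POSITIONS = [
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    "8/8/8/4k3/8/8/4P3/4K3 w - - 0 1",
]

def record_baseline(search: Search, positions: List[str],
                    depth: int) -> Dict[str, SearchResult]:
    """Run a trusted build once and remember its exact results."""
    return {fen: search(fen, depth) for fen in positions}

def verify(search: Search, baseline: Dict[str, SearchResult],
           depth: int) -> List[str]:
    """Return the positions whose nodes, score or PV differ from the baseline."""
    return [fen for fen, expected in baseline.items()
            if search(fen, depth) != expected]

def stub_search(fen: str, depth: int) -> SearchResult:
    """Placeholder for a real deterministic search (fixed depth, cleared
    hash table, single thread). The values are derived from a checksum."""
    seed = zlib.crc32(f"{fen}|{depth}".encode())
    return SearchResult(nodes=seed % 100_000, score=seed % 200 - 100,
                        pv=("e2e4",))

baseline = record_baseline(stub_search, POSITIONS, depth=6)
print(verify(stub_search, baseline, depth=6))  # [] means nothing changed
```

Any non-empty list after a refactoring or a compiler switch means the search tree changed, which is exactly what such a change should not do.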

CThinker · Post by **CThinker** » Wed Jun 29, 2011 3:46 am

I once devised a way to lower the engine's playing strength by simulating slow hardware.

What I did was to reduce the target search time, but still consume the full amount of time.

For example, if the target search time is 10 seconds, search() is told to run for only 5 seconds, and then the xboard shell will sleep for the remaining 5 seconds before printing out the move.

The sleep is necessary so that the clock would wind down in the expected fashion.
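
In outline, the idea looks something like the sketch below. The names `throttled_move`, `search` and `strength_fraction` are made up for illustration, and the clock and sleep functions are parameters only so the logic can be exercised without real waiting.

```python
import time

def throttled_move(search, target_seconds: float,
                   strength_fraction: float = 0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Search for only a fraction of the allotted time, then idle for the rest.

    search: callable taking a time budget in seconds and returning a move
    (a hypothetical engine entry point).
    """
    start = clock()
    move = search(target_seconds * strength_fraction)
    remaining = target_seconds - (clock() - start)
    if remaining > 0:
        sleep(remaining)  # let the game clock run down as if still thinking
    return move

# Example with a dummy search that returns immediately.
print(throttled_move(lambda budget: "e2e4", target_seconds=0.02))  # e2e4
```

With `strength_fraction=0.5` and a 10 second target this reproduces the 5 + 5 split from the example above.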
