Comparing two versions of the same engine

pedrox
Posts: 1056
Joined: Fri Mar 10, 2006 6:07 am
Location: Basque Country (Spain)

Re: Comparing two versions of the same engine

Post by pedrox »

Hi Fermin,

I would use EPD files, not opening books; I use Marc Lacroise's file.

I think you can test the new version against your old engine; if the difference is significant, for example 60-40, I would say there is a good chance the new version is better than the last. Of course, playing against more engines is better.

According to some of Bob's studies, to see a significant difference in Elo, for example 20 points, you should play around 2,500 games. But not everyone has a computer (or the patience) for so many games, and your engine is still at a stage where it can improve a lot, so you may not need that many.
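As a rough sketch of where numbers like these come from (a simple binomial model of the match score, ignoring draws, which in practice reduce the variance a little):

import math

def expected_score(elo_diff):
    """Expected score of the stronger engine for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, z=1.96):
    """Games until the expected score edge exceeds z standard errors (binomial model)."""
    p = expected_score(elo_diff)
    return math.ceil(z * z * p * (1.0 - p) / (p - 0.5) ** 2)

# With z = 1.96 (about 95% confidence) a 20 Elo edge needs a bit under 1,200 games;
# with z = 3 it comes out around 2,700, in the same ballpark as the 2,500 quoted above.
print(games_needed(20, 1.96), games_needed(20, 3.0))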

With Marc's or Noomen's positions, you could play about 60 games against 4 or 5 opponents; I think that would be sufficient.

Time control: 40/4 is better than 1+1.

Pedro
Karmazen & Oliver
Posts: 374
Joined: Sat Mar 10, 2007 12:34 am

Re: Comparing two versions of the same engine

Post by Karmazen & Oliver »

Kempelen wrote:Hi

I released Rodin v1.14 a few months ago. Now I am writing improvements for a new version and have doubts about running a match between both versions for testing purposes.

How many games, and what result, do you think would be necessary in a match between the two versions to know that the new one is stronger?

Thx
Excuse me, but all this about playing engine X version A against a group of rivals, and then engine X version B against the same group, may not give good results. :roll:

When you are programming, the first versions are very weak. Against mature engines, luck becomes a big factor and versions A and B will lose practically all their games, so we won't really get much information. :?

The best option is to play A against B directly, in a match of 12-24 games... the early versions should improve by beating the previous version. That is a better way of knowing whether a change has been a success or not...

Rodin v1.14??? It is a very young engine :?: ; rather than facing veteran engines, it is better to match the versions against each other.

You can do this in two ways. One is fixing the depth: since it is the same engine, the version with more chess knowledge should win at fixed depth (identical maximum ply in versions A and B). :idea:

The other way is to play at fast time controls, to see whether the added algorithms are too heavy and consume too many CPU clock cycles. :arrow:
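A minimal sketch of this fixed-depth self-play idea, assuming the python-chess library; the engine binaries are placeholder names, not actual Rodin builds:

import chess
import chess.engine

def play_fixed_depth_game(path_a, path_b, depth=8):
    """Play one game between two UCI engines, both limited to the same fixed depth."""
    white = chess.engine.SimpleEngine.popen_uci(path_a)
    black = chess.engine.SimpleEngine.popen_uci(path_b)
    board = chess.Board()
    try:
        while not board.is_game_over():
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
        return board.result()   # "1-0", "0-1" or "1/2-1/2"
    finally:
        white.quit()
        black.quit()

print(play_fixed_depth_game("./rodin_old", "./rodin_new", depth=8))   # hypothetical paths

Swapping chess.engine.Limit(depth=depth) for something like Limit(time=0.1) gives the fast-time-control variant described above.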

OK ? :wink:

Bye, from Spain.

oliver.-
Tony Thomas

Re: Comparing two versions of the same engine

Post by Tony Thomas »

Oliver, in my humble opinion version A vs B matches are sort of useless, mainly because a newer version of the engine will sometimes perform better against an older version but at the same time score considerably worse against other engines. So far the only example of A vs B testing that has worked is Rybka, mainly because it has no other suitable opponents within its strength range at any time control. I personally play 30-game matches against 14 opponents for Romi, and even then it is hard to judge the improvements unless I am testing a major change such as a completely new book or eval.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Comparing two versions of the same engine

Post by BubbaTough »

I would disagree with the opinion of most that playing against your old versions is useless... but would agree it is less useful than it seems like it should be.

In computer chess, your progress will be very slow if you insist on perfectly testing everything... most just don't have the time and resources. Often, you need "quick and dirty" tests to confirm that an idea you are trying is not ridiculous. Every once in a while, it is good to do a full, real test, and when doing that I would agree with others here that running lots of positions against lots of other engines (not your own) is best. My current testbed is 82 games against 5 opponents, but that keeps creeping up as I add more positions and opponents over time.

I use two "quick and dirty" tests. The first is a tactical testbed off the Arasan website (I use iq?) that I run at around 8 seconds a move. I often only try the last 83 positions, but at the beginning I used all 183. Improving your score on those positions will often (not always, but often) correlate with improving a young program. For example, it is a good first place to try things like check extensions, singular extensions, reductions, pruning, quiescence ideas, move ordering, and many other things you will be playing with. As chess coaches tell human players, chess is 99% tactics. If you add something intended to improve your tactics and it makes your results on a testbed like this worse, be suspicious that there is a bug, or that the overhead of your implementation is not worth it. Running this test takes 10-15 minutes instead of two days, so you can try many ideas.

Second, I will occasionally play short matches against old versions ... quick games ... and watch. This is particularly good for changes that are meant to improve your search or speed: if you are losing a large number of games to an old version where the only change is speed, start looking for a bug.

Another form of testing I do is to run my program on a chess server (ICC). I have often found bugs by logging on and seeing a big rating drop even though the program did fine in my internal testbeds. ICC gives me good variety and helps prevent accidental tuning, where my engine gets very good at beating a small set of opponents, or solving certain problems, but gets worse against everything else. Again, an ICC rating drop means nothing on its own and is in no way scientific, but if I log on and lose 4 times in a row to engines like Arasan or Crafty or Tinker, where I am used to scoring > 50%, I get suspicious that my changes are hurting rather than helping. Looking over those games for odd moves or patterns can also be very informative.
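A minimal sketch of this kind of EPD testbed run, assuming the python-chess library; the engine path and suite file name are placeholders, not the actual Arasan files:

import chess
import chess.engine

def run_suite(engine_path, epd_file, seconds_per_move=8.0):
    """Score a UCI engine on an EPD suite by comparing its move to the 'bm' field."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    solved = total = 0
    try:
        with open(epd_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board = chess.Board()
                ops = board.set_epd(line)              # parses the position and its opcodes
                if "bm" not in ops:
                    continue                           # skip positions without a best-move key
                total += 1
                played = engine.play(board, chess.engine.Limit(time=seconds_per_move))
                if played.move in ops["bm"]:           # "bm" is a list of acceptable moves
                    solved += 1
    finally:
        engine.quit()
    return solved, total

solved, total = run_suite("./myengine", "tactical_suite.epd")   # hypothetical paths
print(f"solved {solved}/{total}")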

If watching games between old versions or on ICC is not likely to tell you anything because you do not trust your own chess knowledge, I guess you are stuck with testbeds and matches. Still, I would advise not worrying too much about the "statistical significance" of your testing results...just use your intuition. If you are pretty sure what you added is useful, and minor testing does not contradict, keep it. If you added something complicated and suspect problems, major testing is a good idea. If you refuse to supplement the rigour of your tests with intuition and common sense, your engine progress will be very very very slow (unless you have Bobeskian resources).



Anyway, that is my two cents... feel free to ignore it; my opinions are usually minority positions and thus suspect. Most importantly, have fun :).

-Sam
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Comparing two versions of the same engine

Post by Michael Sherwin »

Kempelen wrote:Hi

I released Rodin v1.14 a few months ago. Now I am writing improvements for a new version and have doubts about running a match between both versions for testing purposes.

How many games, and what result, do you think would be necessary in a match between the two versions to know that the new one is stronger?

Thx
When an airplane propeller is turning at a certain speed, it appears to be turning backwards. That is because the eye's refresh rate picks up the light from the propeller just before the next blade reaches the same position.

There is an analogue of this effect in computer-vs-computer chess when you test two very similar versions of the same engine. Your new version can beat up on the older version by consistently seeing a certain tactic that the older version misses, yet the new code may not help against other engines and may even hurt. The exact opposite may also happen: a new version may look bad against the older version but be stronger overall. I call this effect 'pathological behavior' between two very close versions.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Comparing two versions of the same engine

Post by BubbaTough »

Michael Sherwin wrote:
Kempelen wrote:Hi

I released Rodin v1.14 a few months ago. Now I am writing improvements for a new version and have doubts about running a match between both versions for testing purposes.

How many games, and what result, do you think would be necessary in a match between the two versions to know that the new one is stronger?

Thx
When an airplane propeller is turning at a certain speed, it appears to be turning backwards. That is because the eye's refresh rate picks up the light from the propeller just before the next blade reaches the same position.

There is an analogue of this effect in computer-vs-computer chess when you test two very similar versions of the same engine. Your new version can beat up on the older version by consistently seeing a certain tactic that the older version misses, yet the new code may not help against other engines and may even hurt. The exact opposite may also happen: a new version may look bad against the older version but be stronger overall. I call this effect 'pathological behavior' between two very close versions.
Yes, 'pathological behavior' can happen, and it does happen more often than one would suspect. Still, if version A beats up version B and I were forced to put money on one of them, I would bet on version A, as would most people, I suspect. Considering the test useless is a significant exaggeration.

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Comparing two versions of the same engine

Post by bob »

BubbaTough wrote:
Michael Sherwin wrote:
Kempelen wrote:Hi

I released Rodin v1.14 a few months ago. Now I am writing improvements for a new version and have doubts about running a match between both versions for testing purposes.

How many games, and what result, do you think would be necessary in a match between the two versions to know that the new one is stronger?

Thx
When an airplane propeller is turning at a certain speed, it appears to be turning backwards. That is because the eye's refresh rate picks up the light from the propeller just before the next blade reaches the same position.

There is an analogue of this effect in computer-vs-computer chess when you test two very similar versions of the same engine. Your new version can beat up on the older version by consistently seeing a certain tactic that the older version misses, yet the new code may not help against other engines and may even hurt. The exact opposite may also happen: a new version may look bad against the older version but be stronger overall. I call this effect 'pathological behavior' between two very close versions.
Yes, 'pathological behavior' can happen, and it does happen more often than one would suspect. Still, if version A beats up version B and I were forced to put money on one of them, I would bet on version A, as would most people, I suspect. Considering the test useless is a significant exaggeration.

-Sam
If you play several _thousand_ games with A vs B, you might learn something, although I am not sure what that "something" is. If you play a hundred games, you are not going to learn anything at all unless one of the two versions is _far_ stronger than the other. Usually this is an incremental process, which means that A' is just slightly stronger than A. It takes a ton of games to make that determination. I am using 32,000 games to compare versions and it works quite accurately. For 32,000 games, the error bar is +/- 4, so you need an improvement of more than 8 Elo to be really certain the new version is better; otherwise the error bars cause the two rating ranges to overlap. Changing the in-check extension from 1.0 to 0.75 gives about a 10 Elo drop in rating, for reference, so making big rating jumps is not easy. Going from 1.0 to 0.5 is a 20 Elo deal...
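A quick sketch of where a figure like +/- 4 for 32,000 games comes from, using the same simple binomial approximation as earlier in the thread (draws, which tighten the bar further, are ignored):

import math

def elo_error_bar(games, z=1.96):
    """Approximate +/- Elo margin around a 50% score after `games` games (95% by default)."""
    se_score = z * math.sqrt(0.25 / games)                 # standard error of the score fraction
    # Slope of the Elo curve at a 50% score: 400 / (ln 10 * p * (1 - p)) with p = 0.5
    return se_score * 400.0 / (math.log(10) * 0.25)

for n in (100, 1000, 32000):
    print(f"{n:>6} games: +/- {elo_error_bar(n):.1f} Elo")
# 100 games gives roughly +/- 70 Elo, 1,000 games roughly +/- 22,
# and 32,000 games comes out near +/- 4, consistent with the figure above.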
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Comparing two versions of the same engine

Post by BubbaTough »

If you play several _thousand_ games with A vs B, you might learn something, although I am not sure what that "something" is. If you play a hundred games, you are not going to learn anything at all unless one of the two versions is _far_ stronger than the other. Usually this is an incremental process, which means that A' is just slightly stronger than A. It takes a ton of games to make that determination. I am using 32,000 games to compare versions and it works quite accurately. For 32,000 games, the error bar is +/- 4, so you need an improvement of more than 8 Elo to be really certain the new version is better; otherwise the error bars cause the two rating ranges to overlap. Changing the in-check extension from 1.0 to 0.75 gives about a 10 Elo drop in rating, for reference, so making big rating jumps is not easy. Going from 1.0 to 0.5 is a 20 Elo deal...
Well, somehow I have muddled my way to a reasonably strong engine with pretty bad testing practices, which have included self-testing at some points, ICC at some points, and test sets at some points, all strongly frowned-upon practices. The only times I have used approved methods are when I am not really actively developing but have lots of time on my hands to test. I would guess the reason I have managed to get a reasonably strong engine this way is that BUILDING A 2600-LEVEL ENGINE DOES NOT REQUIRE DETECTION OF SMALL IMPROVEMENTS. There are many bundles of improvements that give large Elo jumps, and bad-but-quick testing methods are sufficient to detect them. I know this is the case when building LearningLemming / Crafty-level engines (I am sure Crafty is stronger using multiple threads, but you know what I mean) since I have experienced it. And I am convinced it is true well past our level. That is why I try to emphasize to new engine writers that GOOD TESTING IS NOT IMPORTANT. In the first few years of development most engines improve so dramatically that any testing method will detect it.

Maybe I am wrong. Maybe I have been able to get away with poor testing practices by relying on my own chess sense too much, in a way others probably should not emulate. Or maybe I am just lucky. But all the maybes aside, it is my opinion that most engines are far weaker because of inappropriate attempts to tweak and test minor issues, and more engines would be stronger if less emphasis were placed on this.


OK, I will attempt to quiet down now and let the testing advice continue :oops:. Probably the last thing someone wants to hear when asking for testing advice is "don't worry about it". OK, maybe the second-to-last thing... the last thing they want to hear is "try running 32,000 games" :).

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Comparing two versions of the same engine

Post by bob »

BubbaTough wrote:
If you play several _thousand_ games with A vs B, you might learn something, although I am not sure what that "something" is. If you play a hundred games, you are not going to learn anything at all unless one of the two versions is _far_ stronger than the other. Usually this is an incremental process, which means that A' is just slightly stronger than A. It takes a ton of games to make that determination. I am using 32,000 games to compare versions and it works quite accurately. For 32,000 games, the error bar is +/- 4, so you need an improvement of more than 8 Elo to be really certain the new version is better; otherwise the error bars cause the two rating ranges to overlap. Changing the in-check extension from 1.0 to 0.75 gives about a 10 Elo drop in rating, for reference, so making big rating jumps is not easy. Going from 1.0 to 0.5 is a 20 Elo deal...
Well, somehow I have muddled my way to a reasonably strong engine with pretty bad testing practices, which have included self-testing at some points, ICC at some points, and test sets at some points, all strongly frowned-upon practices. The only times I have used approved methods are when I am not really actively developing but have lots of time on my hands to test. I would guess the reason I have managed to get a reasonably strong engine this way is that BUILDING A 2600-LEVEL ENGINE DOES NOT REQUIRE DETECTION OF SMALL IMPROVEMENTS. There are many bundles of improvements that give large Elo jumps, and bad-but-quick testing methods are sufficient to detect them. I know this is the case when building LearningLemming / Crafty-level engines (I am sure Crafty is stronger using multiple threads, but you know what I mean) since I have experienced it. And I am convinced it is true well past our level. That is why I try to emphasize to new engine writers that GOOD TESTING IS NOT IMPORTANT. In the first few years of development most engines improve so dramatically that any testing method will detect it.

Maybe I am wrong. Maybe I have been able to get away with poor testing practices by relying on my own chess sense too much, in a way others probably should not emulate. Or maybe I am just lucky. But all the maybes aside, it is my opinion that most engines are far weaker because of inappropriate attempts to tweak and test minor issues, and more engines would be stronger if less emphasis were placed on this.


OK, I will attempt to quiet down now and let the testing advice continue :oops:. Probably the last thing someone wants to hear when asking for testing advice is "don't worry about it". OK, maybe the second-to-last thing... the last thing they want to hear is "try running 32,000 games" :).

-Sam
Getting to "reasonably strong" is not that hard, and it can be done with any sort of ad hoc testing. But going beyond that becomes quite tedious and difficult, because the changes become minor Elo gains, and detecting minor Elo changes is a chore. I have found some interesting stuff over the past few months, but being able to play 32,000 games in an hour, or about a million in a day and a half, is a big help. I now have an idea of exactly what the give-check extension is worth, for example, and what it is worth to extend by 1/2 or 3/4 or 1 whole ply, or even 1.25 plies. The Elo change is not very big. A thousand games is not going to touch the accuracy needed to choose between 1/2 and 1.0 plies, for example. So once you get past the "now I will add an evaluation" stage, testing becomes difficult. 32,000 games gives a +/- 4 Elo error bar. If every change were worth more than +4, it would not take a year to break 3200. Unfortunately they are not. We have gained at least 75 real Elo in the past 6 months. At maybe 10 million games a week, that is 250 million games or so. :) Some tests gain 2-3-4 Elo. Here and there we pick up 10. Quite a few pick up nothing or lose... even if they sound quite reasonable...
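For readers wondering how an engine can extend by "half a ply" or "3/4 of a ply" at all, the usual trick is to keep search depth in small integer units. The sketch below is generic and heavily simplified (material-only evaluation, no quiescence search, python-chess for move generation); it is not Crafty's actual code:

import chess

PLY = 4                  # one full ply = 4 internal depth units
CHECK_EXTENSION = 3      # 0.75 ply; use 2 for 0.5 ply or 4 for a full ply
MATE = 100000
VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
          chess.ROOK: 500, chess.QUEEN: 900}

def evaluate(board):
    """Material balance from the side to move's point of view."""
    score = 0
    for piece, value in VALUES.items():
        score += value * (len(board.pieces(piece, board.turn))
                          - len(board.pieces(piece, not board.turn)))
    return score

def search(board, depth_units, alpha, beta):
    if board.is_checkmate():
        return -MATE
    if board.is_stalemate() or board.is_insufficient_material():
        return 0
    if depth_units < PLY:
        return evaluate(board)           # a real engine would run quiescence here
    for move in list(board.legal_moves):
        board.push(move)
        ext = CHECK_EXTENSION if board.is_check() else 0   # extend moves that give check
        score = -search(board, depth_units - PLY + ext, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta                  # fail-hard beta cutoff
        if score > alpha:
            alpha = score
    return alpha

print(search(chess.Board(), 2 * PLY, -MATE, MATE))   # 2-ply search from the start; should print 0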
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Comparing two versions of the same engine

Post by Michael Sherwin »

BubbaTough wrote:
Michael Sherwin wrote:
Kempelen wrote:Hi

I released Rodin v1.14 a few months ago. Now I am writing improvements for a new version and have doubts about running a match between both versions for testing purposes.

How many games, and what result, do you think would be necessary in a match between the two versions to know that the new one is stronger?

Thx
When an airplane propeller is turning at a certain speed, it appears to be turning backwards. That is because the eye's refresh rate picks up the light from the propeller just before the next blade reaches the same position.

There is an analogue of this effect in computer-vs-computer chess when you test two very similar versions of the same engine. Your new version can beat up on the older version by consistently seeing a certain tactic that the older version misses, yet the new code may not help against other engines and may even hurt. The exact opposite may also happen: a new version may look bad against the older version but be stronger overall. I call this effect 'pathological behavior' between two very close versions.
Yes, 'pathological behavior' can happen, and it does happen more often than one would suspect. Still, if version A beats up version B and I were forced to put money on one of them, I would bet on version A, as would most people, I suspect. Considering the test useless is a significant exaggeration.

-Sam
I never said that it was useless. I was just pointing out something to be aware of. If an author is going to use self-testing first, then the results should be verified afterwards by testing against other engines.