ICCR project is planning to be canceled

Matthias Gemuh · Post by **Matthias Gemuh** » Thu Jan 05, 2012 5:21 pm

Adam Hair wrote: I think you are putting a little too much emphasis on the necessity of having identical hardware speed and identical time control. It would give a more precise measurement of Elo. However, that measurement would be specific to the reference computer system. Certain engines may be stronger for that particular configuration relative to other engines, or vice versa. That prevents a truly accurate measurement of strength, which ideally would be in reference to a randomly chosen computer.

...

well said !!!

Don · Post by **Don** » Thu Jan 05, 2012 6:08 pm

Proper testing procedure:

You cannot mix testing conditions UNLESS you standardize exactly what that mix should be.

Example: I create a testing agency with 5 testers. 3 of them have AMD and 2 have Intel. I tell them to play whatever matches they want whenever they want at any time control they want and send me the games so that they can be rated. BAD!

So I get the idea to standardize the time by handicapping the faster computers, for instance by asking that they adjust their time control based on some reference program, i.e. how long it takes Shredder (for example) to do an 18 ply search. It should take S seconds and if it doesn't, make the adjustment. That is an improvement, but still wrong. Some programs will play much better on one hardware than another and the pairings are ad hoc, a very bad thing.
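
A minimal sketch of that handicap in Python, purely as an illustration. The function name, the 60 and 90 second timings, and the choice to scale the increment along with the base time are all assumptions, not something any testing group has specified.

```python
def scaled_time_control(base_minutes, increment_seconds,
                        reference_seconds, measured_seconds):
    """Scale a base+increment time control by how slow this machine is.

    reference_seconds: the agreed time S for the reference search.
    measured_seconds:  how long the same search took on this machine.
    """
    if reference_seconds <= 0 or measured_seconds <= 0:
        raise ValueError("timings must be positive")
    factor = measured_seconds / reference_seconds
    # Assumption: the increment is scaled too; a list could keep it fixed.
    return base_minutes * factor, increment_seconds * factor


# Hypothetical machine that needs 90 s for a search meant to take 60 s.
base, inc = scaled_time_control(5.0, 2.0,
                                reference_seconds=60.0,
                                measured_seconds=90.0)
print(f"{base:.1f}m + {inc:.1f}s")  # prints 7.5m + 3.0s
```

This only equalizes the reference program's speed. It does nothing about engines that scale differently across hardware, which is the objection raised above.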

The right way is to determine ahead of time exactly how each program will be tested on each machine and exactly what the pairings will be. The testing conditions do not even have to match, but they must be consistent. For example on one machine it may be decided that each program plays 5m + 2s Fischer, ponder on, 1000 games each between all players in the rating pool with hash table and other parameters decided in advance. Another tester may play a somewhat different time control with different test conditions or even different hash table sizes, but he will ALWAYS test exactly the same.
Intermediate results should be taken with a grain of salt until all games are complete from all machines.
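
As a rough sketch of fixing the pairings in advance, the Python below builds the same full round-robin schedule every time it is given the same pool. The function name and engine names are hypothetical, and reading "1000 games" as 1000 per pairing, split evenly by colour, is an assumption.

```python
from itertools import combinations


def fixed_schedule(engines, games_per_pairing):
    """Return every pairing with a fixed game count, colours split evenly."""
    if games_per_pairing % 2:
        raise ValueError("use an even game count so colours balance")
    half = games_per_pairing // 2
    schedule = []
    # Sorting makes the schedule independent of the order engines are listed.
    for a, b in combinations(sorted(engines), 2):
        schedule.append({"white": a, "black": b, "games": half})
        schedule.append({"white": b, "black": a, "games": half})
    return schedule


for entry in fixed_schedule(["EngineC", "EngineA", "EngineB"], 1000):
    print(entry)
```

Each tester would run such a schedule to completion under their own fixed settings, so no pairing is chosen ad hoc.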

How to interpret the results? You cannot interpret the results as proof that program X is better than program Y, you can only say that under the grand umbrella of how this particular test is conducted, program X is performing better than program Y. It's like saying that an IQ test measures how well you take IQ tests. Or computer benchmarks measure how well a given program performs the specified benchmark.

Ingo's test is an anal version of this and meets all the standards here. I have no problem with mixing in other results too, in fact that is probably a GOOD thing as it might be more robust, measuring each program over a variety of conditions. The only caveat is that the mix must always be the same and consistent. For example if Ingo were to get Intel machines in addition to his AMD and decided to double the sample size by running an equally stringent test on the Intel machines and then COMBINE the results, it's perfectly valid (even more valid due to greater samples) because the testing conditions are consistent.

It turns out that you can be a lot sloppier and "probably" still be within 20 or 30 ELO of any other sloppy result, assuming that the time control is not hugely disparate and the tests are not hugely flawed. However without stringent (consistent) testing conditions you are adding a significant amount of noise to the results. You should probably double or triple the stated error margins that you see on the posted results.
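
For illustration, here is one common way to turn a win/draw/loss record into an Elo difference with an approximate 95% margin, sketched in Python. It is a normal approximation and not necessarily what any rating list uses; the `noise_factor` parameter is a hypothetical knob for the "double or triple" advice above.

```python
import math


def elo_from_score(score):
    """Convert an average score in (0, 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96, noise_factor=1.0):
    """Return (elo, low, high) using a normal approximation on the score."""
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * score ** 2) / n
    half_width = noise_factor * z * math.sqrt(variance / n)
    low, high = score - half_width, score + half_width
    if low <= 0.0 or high >= 1.0:
        raise ValueError("too few games or too lopsided for this approximation")
    return elo_from_score(score), elo_from_score(low), elo_from_score(high)


# Hypothetical 1000-game match, first as stated, then with doubled margins.
print(elo_with_margin(400, 200, 400))
print(elo_with_margin(400, 200, 400, noise_factor=2.0))
```

The stated margin only covers sampling noise. Widening it is a crude way to allow for inconsistent conditions, not a statistical correction.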

Sedat Canbaz wrote: Dear Chess Friends,

Unfortunately, I am sorry to announce:
-I am planning to cancel my new project, ICCR

More details about why I am planning to stop working on the new ICCR project:
-I still strongly believe that such a list would give us high-quality games, played on super fast machines
-My plan to cancel this project is not due to a lack of interest in it...
*For example, even in a very short time, ICCR already has 3 testers who are waiting for instructions from me

The main reason for not continuing to work on ICCR is this:
-There will be wrong Elo calculations if adapted time controls are used on processors of different speeds
-I mean, if such an adapted rating is created, all participants' Elo standings will be affected...
-So far I could not find any correct solution for measuring the real Elo strength of the ICCR engines
-The biggest Elo calculation problem appears with e.g. 6c against 4c or 6c against 12c ...

*A simple example of a wrong Elo calculation if we do not combine the MP engines into one version:
How can we calculate the results of games played by:
Intel Core i7 920       @ 4.00 GHz    4 core   12454   50m+10s
and
AMD Phenom II X6 1035T    2.60 GHz    6 core   9283    70m+10s
-The right hardware Elo calculation (6c against 4c) should be done on two separate machines via Auto232 mode
*I mean, for a hardware speed test (6c against 4c):
-Engines should be tested with adapted time controls, on two separate machines via Auto232 mode
-Even if we combine the ICCR game results played by Quads/Six-Cores... into one chess engine version
-Then another Elo calculation problem appears: as Clemens Keck stated... some buggy MP engines will be affected by that
In my opinion, combining into one chess engine version is the right way, but only for NON-buggy MP chess engines
-But combining all MP participants will lead to other wrong Elo calculation results

One more little note:
-For the first time in my computer chess life, I could not complete a project successfully

And after all: I am sorry, dear Chess Friends, that I cannot continue to work on this new project

But I have good news:
-I plan to start a new rating list (15m+10s) based only on i7 Six-core machines
Note: more info coming soon...

BTW, I stopped the current hardware Elo speed test and the results are here:

Games:
http://www.sedatcanbaz.com/chess/games/ ... o_Test.rar

Best Wishes,
Sedat Canbaz

Sedat Canbaz · Post by **Sedat Canbaz** » Thu Jan 05, 2012 8:59 pm

My final words on this issue:

In my opinion:
1) It's not a good idea to create a rating list (played on the same hardware, 6 CPUs against 4 CPUs) with adapted time controls + hardware of different speeds
*In other words, if I actually create another Auto232 rating list (I mean with exactly the same conditions, with ponder off), we will see completely different Elo standings
*For more details, please check my previous notes

2) It's also not recommended to create a rating list if we combine the ICCR game results played by Quads/Six-Cores... into one chess engine version

3) The best/right/accurate way of measuring the engines' Elo strength is to run all engines against each other on the same hardware + same time control

BTW, if anybody disagrees with me... I will respect that too, but instead of comments... I would prefer to see and check the games (with annotations if possible)

Greetings,
Sedat

rbarreira · Post by **rbarreira** » Thu Jan 05, 2012 9:19 pm

Don wrote: Example: I create a testing agency with 5 testers. 3 of them have AMD and 2 have Intel. I tell them to play whatever matches they want whenever they want at any time control they want and send me the games so that they can be rated. BAD!

Bad for whom? For someone who wants to have a precise Elo measurement on a given piece of hardware? Yes, it is bad (or at least sub-optimal). But for someone who wants to have a general idea of the strength of an engine? Not bad. In fact it's better than running on a single piece of hardware, as it's less biased towards the hardware of choice.

It's always possible to organize the games into several categories (i.e. sort them by CPU brand, number of cores, time control, or whatever), and produce several different rating lists, as well as having a global rating list where all games are counted irrespective of hardware.
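
A minimal Python sketch of that kind of categorization, assuming each game is stored as a small record. The field names (`white`, `black`, `result`, `cpu`) and the engine names are hypothetical, and the output is a raw score table per category rather than a fitted Elo list.

```python
from collections import defaultdict

# Points scored by White for each PGN-style result string.
WHITE_POINTS = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}


def scores_by_category(games, category=None):
    """Return {category value: {engine: (points, games played)}}.

    With category=None every game lands in one global group named "all".
    """
    table = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for game in games:
        white_points = WHITE_POINTS[game["result"]]
        bucket = table[game[category] if category else "all"]
        bucket[game["white"]][0] += white_points
        bucket[game["white"]][1] += 1
        bucket[game["black"]][0] += 1.0 - white_points
        bucket[game["black"]][1] += 1
    return {value: {engine: tuple(total) for engine, total in engines.items()}
            for value, engines in table.items()}


games = [
    {"white": "EngineA", "black": "EngineB", "result": "1-0", "cpu": "Intel"},
    {"white": "EngineB", "black": "EngineA", "result": "1/2-1/2", "cpu": "Intel"},
    {"white": "EngineA", "black": "EngineB", "result": "0-1", "cpu": "AMD"},
]
print(scores_by_category(games, "cpu"))  # one table per CPU brand
print(scores_by_category(games))         # one global table
```

The same records could be grouped by core count or time control by passing a different field name, provided testers report those details with each game.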

Don · Post by **Don** » Thu Jan 05, 2012 9:26 pm

rbarreira wrote:
Don wrote: Example: I create a testing agency with 5 testers. 3 of them have AMD and 2 have Intel. I tell them to play whatever matches they want whenever they want at any time control they want and send me the games so that they can be rated. BAD!
Bad for whom? For someone who wants to have a precise Elo measurement on a given piece of hardware? Yes, it is bad (or at least sub-optimal).

But for someone who wants to have a general idea of the strength of an engine? Not bad. In fact it's better than running on a single piece of hardware, as it's less biased towards the hardware of choice.

If you read my post, I actually think it's good to play on a variety of different hardware and time controls.

What is bad is not having controlled conditions. This is not my opinion, it's a scientific principle.

It's always possible to organize the games into several categories (i.e. sort them by CPU brand, number of cores, time control, or whatever), and produce several different rating lists, as well as having a global rating list where all games are counted irrespective of hardware.

rbarreira · Post by **rbarreira** » Thu Jan 05, 2012 9:29 pm

Don wrote: What is bad is not having controlled conditions. This is not my opinion, it's a scientific principle.

So if each of 20 people offered to run 10,000 games with your engine you would deny it just because they all have different hardware?

I think it would be very useful to receive the games especially if people did send the hardware specifications each game ran on...

Don · Post by **Don** » Thu Jan 05, 2012 10:25 pm

rbarreira wrote:
Don wrote: What is bad is not having controlled conditions. This is not my opinion, it's a scientific principle.
So if each of 20 people offered to run 10,000 games with your engine you would deny it just because they all have different hardware?

I think it would be very useful to receive the games especially if people did send the hardware specifications each game ran on...

I don't know how to explain this in a way that is not long-winded. But think of it like this. If I use a yardstick that is inaccurate but I always use the same yardstick, I can compare any two objects and tell you which is longer.

Keep in mind that I can determine how strong any program is within 50 ELO very quickly with my own tests. The testing agencies tell me within 30 ELO or so and Ingo perhaps within 15-20 ELO. However Larry and I are interested in being able to measure 2-5 ELO improvements, at least with some reasonable measure of confidence.

You can see the frustration when one of the authors submits a program and gets completely different results from different testing agencies. A lot of this is statistical noise from low samples but the problem would not go away even if 1 million games were run for each program. Some programs run well on AMD and others run best on Intel. Some play better against strong opponents. In fact playing much weaker opponents tends to deflate the ratings of the stronger player. So the results are in part dependent on who runs the most games at any given point in time, and which opponents they choose to pair up.

So my argument is based on being able to make comparisons. I don't think you can be very certain just how strong one program is compared to another no matter what you do because the fact of the matter is that it depends on the testing conditions. What you CAN control is which yardstick you use. If the test is always run the same then you can directly compare the results and the only consideration you have to worry about is sample size.
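
To give a feel for the sample sizes involved, here is a back-of-the-envelope sketch in Python. It assumes a single match between two nearly equal engines, the standard logistic Elo curve, and a normal approximation; the function names and the `draw_rate` parameter are hypothetical.

```python
import math


def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def games_needed(elo_diff, draw_rate=0.0, z=1.96):
    """Rough game count before an edge of elo_diff exceeds z standard errors."""
    if elo_diff <= 0 or not 0.0 <= draw_rate < 1.0:
        raise ValueError("need elo_diff > 0 and 0 <= draw_rate < 1")
    edge = expected_score(elo_diff) - 0.5
    # Per-game standard deviation for near-equal engines; draws reduce it.
    per_game_sd = math.sqrt(0.25 * (1.0 - draw_rate))
    return math.ceil((z * per_game_sd / edge) ** 2)


for diff in (50, 20, 5, 2):
    print(diff, games_needed(diff), games_needed(diff, draw_rate=0.5))
```

Under these assumptions a 2 to 5 Elo improvement needs tens of thousands of games or more, which is why sample size remains the limiting factor even once the conditions are held fixed.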

rbarreira · Post by **rbarreira** » Fri Jan 06, 2012 1:10 am

Don wrote:
rbarreira wrote:
Don wrote: What is bad is not having controlled conditions. This is not my opinion, it's a scientific principle.
So if each of 20 people offered to run 10,000 games with your engine you would deny it just because they all have different hardware?

I think it would be very useful to receive the games especially if people did send the hardware specifications each game ran on...
I don't know how to explain this in a way that is not long-winded. But think of it like this. If I use a yardstick that is inaccurate but I always use the same yardstick, I can compare any two objects and tell you which is longer.

Keep in mind that I can determine how strong any program is within 50 ELO very quickly with my own tests. The testing agencies tell me within 30 ELO or so and Ingo perhaps within 15-20 ELO. However Larry and I are interested in being able to measure 2-5 ELO improvements, at least with some reasonable measure of confidence.

You can see the frustration when one of the authors submits a program and gets completely different results from different testing agencies. A lot of this is statistical noise from low samples but the problem would not go away even if 1 million games were run for each program. Some programs run well on AMD and others run best on Intel. Some play better against strong opponents. In fact playing much weaker opponents tends to deflate the ratings of the stronger player. So the results are in part dependent on who runs the most games at any given point in time, and which opponents they choose to pair up.

So my argument is based on being able to make comparisons. I don't think you can be very certain just how strong one program is compared to another no matter what you do because the fact of the matter is that it depends on the testing conditions. What you CAN control is which yardstick you use. If the test is always run the same then you can directly compare the results and the only consideration you have to worry about is sample size.

The yardstick analogy works for your purposes as an engine developer, but I'm not sure what the target audience or purpose of the rating lists is.

I suppose there are several people interested in the results of rating lists:

- Developers. They already run their own tests anyway for the purposes of improving the engine, so they don't rely on the rating lists for any real scientific measuring purpose (indeed they can't rely on them that way, unless they develop so slowly that each version only has one change).

- Very picky and savvy chess engine users who want the most accurate information about which engine is strongest on their own hardware (I assume this audience is small, and it does them no good to know, for example, what engine is strongest on Intel if they have an AMD or vice versa). For these users it's good to have measurements done on a variety of hardware as it's not as biased to any arbitrary yardstick someone else chose. As I said, it's even better if the rating list allows users to view results for a specific kind of hardware.

- Less picky engine users who just want to have a rough idea of which engines are strongest regardless of hardware. It is no problem for these users whether the same hardware is used in all games or not, as they are not interested in absolute precision...

Given the above I don't see why one should demand a scientific-like accuracy in a rating list. But maybe Sedat has a different idea about his target audience or the purpose of a rating list.

Sedat Canbaz · Post by **Sedat Canbaz** » Sat Jan 07, 2012 9:46 am

It seems there are still people (on the CSS Forum and here too) who have difficulty understanding exactly what I mean
So again I will try to explain that the adapted ratings don't give accurate Elo results

Moreover, it looks like we can call the adapted ratings an 'average Elo', and that can lead to an approx. 30-40 Elo difference compared with a rating based on exactly the same machine

By Ernest Bonnem:

Hi,

If you look at the error bars (you did not show them for CCRL, I put them below), the results appear not so strange...

Actually, going from 4 to 6 CPU (or threads) should gain something of the order of 27 Elo (see empirical formula at
http://rybkaforum.net/cgi-bin/rybkaforu ... #pid165304)

CEGT (rank, engine, rating, +, -)
1 Houdini 1.5a x64 6CPU 3290 +23 -23
2 Houdini 1.5a x64 4CPU 3275 +11 -11
CCRL (rank, engine, rating, +, -)
1 Houdini 2.0c 64-bit 6CPU 3404 +13 -12
  Houdini 2.0c 64-bit 4CPU 3377 +21 -20
  Houdini 2.0 64-bit 4CPU 3359 +20 -20
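
The error-bar point can be checked with a few lines of Python. This is only a sketch: it reads the two numbers after each rating as the plus and minus margins, and it treats overlapping intervals as "not clearly different", which is a rough heuristic rather than a formal significance test.

```python
def interval(rating, plus, minus):
    """Return the (low, high) range implied by a rating and its error bars."""
    return rating - minus, rating + plus


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


cegt_6cpu = interval(3290, 23, 23)       # Houdini 1.5a x64 6CPU
cegt_4cpu = interval(3275, 11, 11)       # Houdini 1.5a x64 4CPU
ccrl_20c_6cpu = interval(3404, 13, 12)   # Houdini 2.0c 64-bit 6CPU
ccrl_20c_4cpu = interval(3377, 21, 20)   # Houdini 2.0c 64-bit 4CPU
ccrl_20_4cpu = interval(3359, 20, 20)    # Houdini 2.0 64-bit 4CPU

print(overlaps(cegt_6cpu, cegt_4cpu))          # True
print(overlaps(ccrl_20c_6cpu, ccrl_20c_4cpu))  # True
print(overlaps(ccrl_20c_6cpu, ccrl_20_4cpu))   # False
```

For the same engine version, the 6CPU and 4CPU ranges overlap on both lists. The only non-overlapping pair compares Houdini 2.0c with Houdini 2.0, i.e. two different versions.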

Example 1:
-Let's say I am a CEGT/CCRL tester
So... I tested the same MP engine versions at 40/4, ponder OFF (i7 920 against AMD Phenom II X6 in Auto232 mode)
Do you expect that there will be a 30-40 Elo difference between the above processors (i7 Quad against AMD Six-Core)?

Example 2:
-Let's say I tested the same MP engine versions (AMD Phenom II X6 against i7 920 under SCCT Auto232 rating conditions)
Do you expect that there will be a 30-40 Elo difference between the above processors (i7 Quad 4 core against AMD Six-Core)?

In my opinion, when calculating the adapted ratings, the names of all tested processors should be mentioned too, I mean as in the SSDF rating list

In other words, it is a completely wrong and NOT recommended measuring system:
-to combine all games played on Intel/AMD 4CPUs into one engine version
Note: this is because AMD chips are slower than Intel chips (for chess)

About your empirical formula of 27 Elo:
Is there any proof? Is there such a hardware Elo test?

For example, a long time ago I did such a hardware Elo test:
http://sedatchess.110mb.com/index.php?p=1_31
Note: 2xQX9775@4.0GHz has almost the same kns values as Core i7 920@4.1GHz, but it performed 50 Elo better

BTW, maybe... we need serious proof before talking about an empirical formula of 27 Elo, but anyway, if this system is right, then
the empirical formula may give the gain of a 6-core relative to a Quad if all games are played on the Intel Nehalem architecture

In other words... the ICCR project will work accurately if all games are played on the Intel Nehalem Six-core architecture

One more thing, in my opinion:
Actually, combining all MP engines into one version is a better idea than combining all AMD 4CPUs and Intel 4CPUs into one engine version

At least this small test goes some way to showing that the Elo strengths are almost identical:

I'd just like to mention again:
Even if we combine the ICCR game results played by Quads/Six-Cores... into one chess engine version
-Then another Elo calculation problem appears: as Clemens Keck stated... some buggy MP engines will be affected by that

I hope this helps this time...

Best Wishes,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Sat Jan 07, 2012 10:02 am

Adam Hair wrote:
Sedat Canbaz wrote: Hello Adam,

Please see below (postings on the CSS forum)

Clemens Keck wrote:

Hi Sedat

Other tester groups also have 2, 4 and 6 CPU entries in their lists.
How do they do that? Could you not learn a good system from them?

Regards, Clemens

Dear Clemens,

You probably mean CEGT/CCRL?!

First of all, I'd like to say that I have great respect for the work of CEGT/CCRL

And both great teams put in a lot of unpaid effort, for which I am very thankful...

But anyway, I think both rating lists (CEGT/CCRL) include misleading results, especially for 6CPU and 4CPU
Or maybe for 4CPU and 2CPU or 1CPU too

For example, as far as I know, they own an AMD Phenom II X6

And they test 6CPU against 4CPU on the same hardware
In my opinion such a test is wrong, it will not give us the right Elo performance

BTW, strange results indeed, e.g. CEGT has a 15 Elo difference:
http://www.husvankempen.de/nunn/40_40%2 ... liste.html

1 Houdini 1.5a x64 6CPU 3290 23 23 600 73.0% 3117 36.0%
2 Houdini 1.5a x64 4CPU 3275 11 11 2641 73.0% 3102 34.9%

The CCRL rating has a 45 Elo difference:
http://computerchess.org.uk/ccrl/404/ra ... t_all.html

Houdini 2.0c 64-bit 6CPU 3404
Houdini 2.0 64-bit 4CPU 3359

And I strongly believe that in reality the AMD Phenom II X6 is not stronger than Intel Quads
Or maybe there will be a 5-10 Elo difference, no more, no less (it depends on the clock speed)

Honestly, this is one of the main reasons for canceling my new project, ICCR

Once more I'd like to thank you for your useful note

Greetings,
Sedat
Hi Sedat,

First of all, I am not going to try to convince you to do something that you do not want to do.

But I would like to clear up a bit of a misunderstanding. In your example above, you are comparing two different versions of Houdini. Furthermore, you have to take into account the error bars when comparing those two differences. The discrepancy you see has more to do with the number of games played than differences in computer systems.

Also, as I noted in my first post, if possible I would choose to test either 4 CPU or 6 CPU, but not both. That does get rid of one source of possible error. However, if that is not possible it still does not mean that highly useful data could not be generated from the time control you are proposing.

Good Luck with whatever you choose to do,

Adam

Thank you for your kind words, dear Adam

Sure, everybody has their own choice... and I wish you good luck with your tournaments too

Best Regards,
Sedat
