Beta for Stockfish distributed testing

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

Ajedrecista wrote:Me (I guess that also you):
s = sqrt{[mu*(1 - mu) - D/4]/(n - 1)}
Thanks Lucas and Jesus! I have updated the error bars calculation to include the draw ratio. Please let me know if you see any other errors :).
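
For anyone who wants to check the numbers, here is a rough Python sketch of that calculation (the function names and the sample game counts are just for illustration, not lifted from the fishtest code):

Code: Select all

import math

def score_error_bars(wins, losses, draws, z=1.96):
    # s = sqrt((mu*(1 - mu) - D/4) / (n - 1)), where mu is the mean score
    # per game and D is the draw ratio, as in the formula quoted above.
    n = wins + losses + draws
    mu = (wins + 0.5 * draws) / n   # mean score per game
    D = draws / n                   # draw ratio
    s = math.sqrt((mu * (1.0 - mu) - D / 4.0) / (n - 1))
    return mu - z * s, mu + z * s   # roughly a 95% interval on the score

def score_to_elo(score):
    # Convert a score fraction in (0, 1) to an Elo difference (logistic model).
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical example: 4000 games with 1400 wins, 1300 losses, 1300 draws.
low, high = score_error_bars(1400, 1300, 1300)
print(score_to_elo(low), score_to_elo(high))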
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

mcostalba wrote:I have just setup a google group:

https://groups.google.com/forum/?fromgr ... ishcooking

Hopefully it will be useful. It is more persistent than a chat, but at the same time has almost the same responsiveness.

Suggestions are welcomed !
Great idea Marco!
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing.

Post by Michel »

It seems likely that with all these resources now thrown at it Stockfish will soon be the strongest chess program in the world! Finally a community chess engine development project that seems to be doing everything right!

I'd like to contribute, but I am having some difficulty installing the client on CentOS 6. Would it be possible to supply a tarball that one can simply unpack and run on a vanilla Python install? I don't have the time to puzzle together all the various bits and pieces that are needed.

Since I assume many (as in MANY) non-technical users will want to run the client, it would be nice to make it as simple as possible for them as well (e.g. something like SETI@HOME).
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Beta for Stockfish distributed testing

Post by Don »

mcostalba wrote:
gladius wrote: The current queue of tests for Stockfish is up at http://54.235.120.254:6543/tests. Please note, operations like start run, delete run, etc. are restricted.
Gary, this is really another big step in the evolution of chess engine development.

What you've done will have far-reaching consequences for engine development. I'm not sure many of the people here fully understand the deep implications.

I'm not only talking about the amazing technical achievement of this very sophisticated distributed testing framework, but especially about the fact that testing and developing are now open, for the first time, to everybody.

With Glaurung we started to have open-source, GPL official releases.

Then, with Stockfish, we opened the development branch through GitHub, from which it is possible to track the building of a new release change by change.

And now the last step: the opening of the testing and validation process. People can now see (and contribute to) what we test, and see what works and what doesn't.

These are, IMHO, the three big milestones we have reached with Stockfish development; it is a world first and I'm proud of it.

If you consider that the majority of authors are still very secretive and jealous of their work, and that most engines are distributed only in binary form, you can appreciate even more the revolution that Stockfish development has been and still is.

Thanks Gary !
We started doing distributed testing with many volunteers donating machine time a couple of years ago, so this is not a new idea. At MIT, when developing Cilkchess, we did this too: students donated machine time from their workstations. That makes this idea AT LEAST 20 years old and I doubt I was the first. I don't know how this particular system works, but mine required people to just start a single executable with a simple configuration file on their machine, and it did all the rest. It was more like an alpha study to determine whether I should continue to refine and develop the system.

As it turns out, we concluded that it was not very useful for reliably measuring improvements of less than 20 ELO.

Here are some of the factors that could change the result:

1. The hardware that happened to be utilized at any moment.
2. The operating system in use (my system was cross-platform).
3. The weather.
4. What time people went to bed.

The last two are not a joke: the weather and bedtime affected the results, and with a little thought you should be able to figure out why.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing

Post by Michel »

That makes this idea AT LEAST 20 years old and I doubt I was the first.
This might be, but what I like about this system is that the test queue is not a black box but visible to the public (I am not sure whether the public can contribute tests, as I have not yet found the time to install the client).

Whatever the case may be, the openness with which this is run should generate a lot of new ideas.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing

Post by gladius »

Don wrote:We started doing distributed testing with many volunteers donating machine time a couple of years ago, so this is not a new idea.
It's not the idea that's new; it's that the framework is both public and open source.
Don wrote:As it turns out, we concluded that it was not very useful for reliably measuring improvements of less than 20 ELO.
I find this rather hard to believe! 20 ELO is a huge amount of error.

Of course, it's possible the tests we've been running so far have been completely useless, but the strength certainly appears to be increasing based on the results. This is only self-testing, though, so who knows how it fares against the wider world of engines.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Beta for Stockfish distributed testing

Post by Don »

Michel wrote:
That makes this idea AT LEAST 20 years old and I doubt I was the first.
This might be, but what I like about this system is that the test queue is not a black box but visible to the public (I am not sure whether the public can contribute tests, as I have not yet found the time to install the client).

Whatever the case may be, the openness with which this is run should generate a lot of new ideas.
There is some value in having non-developers contribute ideas, and we have received many good ideas from our testers - even the ones not familiar with software engineering practices.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Beta for Stockfish distributed testing

Post by Don »

gladius wrote:
Don wrote:We started doing distributed testing with many volunteers donating machine time a couple of years ago, so this is not a new idea.
It's not the idea that's new; it's that the framework is both public and open source.
Don wrote:As it turns out, we concluded that it was not very useful for reliably measuring improvements of less than 20 ELO.
I find this rather hard to believe! 20 ELO is a huge amount of error.

Of course, it's possible the tests we've been running so far have been completely useless, but the strength certainly appears to be increasing based on the results. This is only self-testing, though, so who knows how it fares against the wider world of engines.
I think I was being generous when I said 20 ELO; it was probably closer to 30 ELO when considering the data as a whole. I actually had in mind a system to minimize that further, which I never implemented, but which I'll explain shortly.

But first some background - and there may be some ideas here for you guys to improve this system based on what I did.

We allowed only Linux and Windows machines that had the required SSE support, and we also utilized a few foreign programs - but we could only use ones that were openly available, since my system had to fetch any binaries that were not already on the client.

I also tracked the OS and CPU so I could compile results by platform/OS combination. But this waters down the samples considerably. The idea was that you could rate these separately if you had enough volunteers. In fact, I would strongly suggest that this project do the same - it would probably get you within 10 or 15 ELO.

My system also ran a "calibration" test before starting a test on any particular client, in order to make the appropriate adjustment - much as a rating agency might do it.

The idea which I never implemented was to choose a canonical target platform - probably the most ubiquitous platform among your clients, such as the quad-core i7. If you TRACK all the details - the OS, the CPU, and perhaps anything else of relevance - you can over time maintain statistical data about platform differences and adjust accordingly. For example, let's say that for whatever reason Stockfish performs better on AMD relative to the program you are testing against. You can weight the results according to the performance on those platforms and the number of test games played on them, giving a platform little weight until you have enough data to do this accurately. In addition, you can simply post separate results for each hardware/OS combination. You can probably eliminate SOME of the error that way.
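
Something along these lines is all I mean - just a rough sketch to illustrate the bookkeeping; the data layout, the min_games threshold, and the function name are made up for the example, not taken from any real framework:

Code: Select all

from collections import defaultdict

def score_by_platform(results, min_games=1000):
    # results: iterable of (platform, wins, losses, draws) tuples, where
    # "platform" is e.g. an (OS, CPU family) pair reported by each client.
    totals = defaultdict(lambda: [0, 0, 0])
    for plat, w, l, d in results:
        totals[plat][0] += w
        totals[plat][1] += l
        totals[plat][2] += d

    per_platform = {}
    weighted, games = 0.0, 0
    for plat, (w, l, d) in totals.items():
        n = w + l + d
        mu = (w + 0.5 * d) / n
        per_platform[plat] = (mu, n)   # separate result per hardware/OS combination
        if n >= min_games:             # give a platform little weight until there is enough data
            weighted += mu * n
            games += n

    overall = weighted / games if games else None
    return per_platform, overall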

From what I am reading in this thread, you are limiting the platform, and even the OS and the programs that can run, by running only self-tests, so this is probably less of a factor.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing

Post by gladius »

Don wrote:I think I was being generous when I said 20 ELO; it was probably closer to 30 ELO when considering the data as a whole. I actually had in mind a system to minimize that further, which I never implemented, but which I'll explain shortly.
Classifying results by CPU/OS is an excellent idea. We currently record the OS, but not the CPU type.

The goal is definitely to control the environment that the tests run in as much as possible. cutechess manages the games, and it's all self-testing. We've seen small improvements of ~3-4 ELO hold up when scaled up in TC (60s+0.05s), and then again when included in current master and tested against 2.3.1 (again at 60s+0.05s, a decent TC for such fast testing).
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Beta for Stockfish distributed testing

Post by Don »

gladius wrote:
Don wrote:I think I was being generous when I said 20 ELO; it was probably closer to 30 ELO when considering the data as a whole. I actually had in mind a system to minimize that further, which I never implemented, but which I'll explain shortly.
Classifying results by CPU/OS is an excellent idea. We currently record the OS, but not the CPU type.

The goal is definitely to control the environment that the tests run in as much as possible. cutechess manages the games, and it's all self-testing. We've seen small improvements of ~3-4 ELO hold up when scaled up in TC (60s+0.05s), and then again when included in current master and tested against 2.3.1 (again at 60s+0.05s, a decent TC for such fast testing).
It's easy to get the CPU details on most platforms. I don't think you can get the CPU speed, but you can do your own timing. However, you cannot easily determine how well the processor family scales up with multiple tests running, and even that is an issue. The older CPUs did not do as well, and CPUs with more cores may not do as well as CPUs with fewer cores.

You COULD benchmark that as part of a global "system configuration" step when the system is initialized for the first time. To do this well, as you say, you want to control the environment as much as possible, but you also need to know as much about the environment as possible in order to provide the most consistent testing conditions you can manage.
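
A very rough sketch of what such a first-run configuration step might look like (the names here are invented for illustration, and a real client would probably run the engine's own bench command for the timing rather than a toy loop):

Code: Select all

import platform
import time

def describe_host():
    # Best-effort OS/CPU details; platform.processor() can be empty on some systems.
    return {
        "os": platform.system(),
        "os_version": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor() or "unknown",
    }

def calibration_score(seconds=2.0):
    # Crude single-core benchmark: count loop iterations completed in a fixed
    # wall-clock interval. It only gives a rough speed figure for weighting
    # or sanity-checking results from a given client.
    deadline = time.time() + seconds
    iterations = 0
    while time.time() < deadline:
        iterations += 1
    return iterations

if __name__ == "__main__":
    print(describe_host())
    print(calibration_score())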