Progress on Rustic

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Progress on Rustic

Post by lithander »

emadsen wrote: Sun Mar 28, 2021 7:35 pm Sven, can you elaborate on this? My initial reaction is I don’t agree with you. A rating is meaningful only in relation to the ratings of other engines. It has no intrinsic significance. If that relationship is not well established due to lack of play against a variety of opponents, or lack of play of opponents amongst themselves, then the rating cannot be trusted. But I have not conducted a formal study of the matter. If you have, would you please share your results?
I think you can skip the part where these engines play each other if you know their relative ratings (from CCRL) *and* if you use those ratings as anchors when you process your PGNs. In Ordo, for example, you can pass a CSV file via the -m parameter (it's explained in the manual).
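As an illustration (a sketch only: the engine names, ratings, and file names below are made up, and the anchor file follows the quoted-name-comma-rating layout described in Ordo's manual):

```shell
# Pin several opponents to their CCRL ratings so the candidate's rating
# comes out on the CCRL scale. Names and ratings here are illustrative.
cat > anchors.csv <<'EOF'
"Clueless 1.4" , 1900
"Pigeon 1.51"  , 1798
EOF

# -p: input PGN, -m: multiple-anchors file, -o: output report
ordo -p games.pgn -m anchors.csv -o ratings.txt
```

With multiple anchors, the engines you anchored no longer need to have played each other inside your own PGN for the candidate's rating to land near the CCRL scale.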
emadsen wrote: Sun Mar 28, 2021 10:39 pm This reinforces the argument that one should never release a software update without incrementing the version number. In my opinion, the onus is on the software publisher (in this case, you as the chess engine author) to release a new version with the bug fix and with an incremented version number in the download file and in the id response to the uci command. In my opinion, it's too much to ask CCRL to track this (two binaries with the same version number but different code). Just my opinion; I'm not speaking for them.
I know of build systems that increment the number with each build automatically, or that include the version-control revision in the build. Definitely a best practice.

But with my private project and without a build system, I would have to remember to increase the version manually with every source push. So far I've only changed the version number when making a build, and I would certainly increment the version whenever I release binaries. But that wouldn't stop someone from checking out a specific revision from git *after* important features have been added but *before* I tagged the next version; it would then play under the previous version number but much stronger.

Do you increase the version with each push manually or do you have an automatic system?
mvanthoor wrote: Sun Mar 28, 2021 10:55 pm The worst that could happen is that Alpha 2 is in the list at a somewhat lower rating than expected; lesson learned.
Well, that's still bad, because people will look for an engine rated ~1800, play Rustic Alpha 2 (the fixed build), which then plays 60 Elo stronger than advertised, and draw the wrong conclusions (e.g. their engines will appear weaker to them than they really are). So I think it's still worth investigating.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Progress on Rustic

Post by mvanthoor »

By the way: I'm testing again, running gauntlets where the engines that have a TT have it set to 256 MB (or their highest default); in the previous tests, all engines played at the default TT size. This should make the other engines stronger. This means that if this new list is also calibrated at Alpha 1 = 1677, just like the previous one, but with the other engines in the second gauntlet being stronger, Rustic Alpha 2 should turn out weaker. It could therefore be that my estimation was off because of this misconfiguration in the testing pool.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Progress on Rustic

Post by mvanthoor »

lithander wrote: Sun Mar 28, 2021 11:14 pm Well, that's still bad, because people will look for an engine rated ~1800, play Rustic Alpha 2 (the fixed build), which then plays 60 Elo stronger than advertised, and draw the wrong conclusions (e.g. their engines will appear weaker to them than they really are). So I think it's still worth investigating.
I've sent Gabor a PM. I'll see what the CCRL testers say. As they put all games into a database, it should be possible to delete them (and other references to Alpha 2 in the database) and run a test again with the fixed build. That would make the first test a waste, of course. It's a pity, but it can't be helped. I promise never to release anything with bugs again, I guarantee it :oops:

*deletes engine from hard disk*

Eh, wait; maybe that's not the right way to do things.

*clone repo from Github*

Next time I'll just do a new release with a higher version number.

As said, it's also possible to run another test with the fixed binary, at least as big as the last one (against the same engines as before), and that'll probably fix (most of) the rating automatically.

They did a second run with Alpha 1 in January, and they included TSCP in that run, which dropped Alpha 1's rating by about 20 points (as expected, as it plays poorly against TSCP). If they just replace the binary for all future tests (to be certain), the rating will probably fix itself (if it is even wrong... I'll see what my new test says tomorrow. I have another 1000 games running).
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Progress on Rustic

Post by lithander »

mvanthoor wrote: Sun Mar 28, 2021 11:18 pm This means that if this new list is also calibrated at Alpha 1 = 1677, just like the previous one, but with the other engines in the second gauntlet being stronger, Rustic Alpha 2 should turn out weaker.
You used Alpha 1's score as the single anchor? That might already be the problem, then. If you still have the PGNs, you could use multiple anchors at the CCRL ratings for each of the engines except Alpha 2. (As I mentioned above, that's possible and should assign Alpha 2 results that are more comparable to the CCRL ratings.)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Progress on Rustic

Post by Sven »

emadsen wrote: Sun Mar 28, 2021 7:35 pm
Sven wrote: Sun Mar 28, 2021 6:46 pm
mvanthoor wrote: Sat Mar 27, 2021 11:28 pm but it only achieved 1781 in the CCRL list. There are a few possibilities:

[...]
- I only test Rustic against other engines: the other engines don't play one another. (I don't have the computing power to do this; if I upgrade to a 16+ core computer at some time, I may have the option to do a few round-robin tournaments.) It could be that my results are skewed because of this.
You can safely exclude this as a reason. Results from games between the other engines don't influence the rating estimation you get for your engine. A long while ago I had the same thought as you but I learned that the opposite was true.
Sven, can you elaborate on this? My initial reaction is I don’t agree with you. A rating is meaningful only in relation to the ratings of other engines. It has no intrinsic significance. If that relationship is not well established due to lack of play against a variety of opponents, or lack of play of opponents amongst themselves, then the rating cannot be trusted. But I have not conducted a formal study of the matter. If you have, would you please share your results?

When I bought a new computer three years ago, the first thing I did was run tournaments among four classes of engines (separated by estimated strength), estimate Elo, then run more tournaments pitting the lower half of each class against the upper half of the class below it, to ensure good cross-pollination of games and eliminate isolated groups of engines. See my Tournaments page for details.

I viewed this as a critical prerequisite before ever attempting to measure the strength of MadChess. I combine these games with games from a gauntlet tournament pitting a particular version of MadChess against ten opponents of strength in the range +/- 100 Elo.

Your comment above suggests this is unnecessary. I'm struggling to understand what I'm missing here, because my technique has produced private Elo estimates of MadChess's strength that correlate highly with CCRL Elo measurements when I release a public version of my engine. (I use Ordo and anchor four engines to their CCRL Elo ratings.) Interested to hear your thoughts.
I can't present any formal study on this, just the observation I made in my own testing a couple of years ago (and I assume that nothing has changed in the way rating programs like BayesElo or Ordo calculate Elo ratings for chess engines). Playing additional games between the opponents of a "candidate engine" neither changes the resulting rating of the candidate significantly nor lowers its error bars.

I think that is caused by the way ratings are calculated. As you say, Elo ratings are relative to the pool of entities you are observing. For each entity (engine) you have a record of games against some opponents. Rating calculation for chess engines takes all games at once, "from scratch", in contrast to human ratings, which change over time, with the most recent tournament results having a bigger influence than older ones. The "from scratch" calculation is comparable to the rating calculation for a set of initially unrated human chess players. It works iteratively: you start by assigning an arbitrary average rating to your pool, then each iteration calculates new estimated ratings for each single engine by determining the rating for which the expected game results of that engine (based on the current estimates) would be equal to its actual game results. Calculation stops as soon as the estimates become stable enough.
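The iterative scheme described above can be sketched in a few lines. This is an illustrative toy, not BayesElo's or Ordo's actual algorithm (those use more sophisticated maximum-likelihood solvers), assuming the standard logistic expected-score model; all names are mine:

```rust
// Toy "from scratch" rating estimation: start every engine at the same
// rating, then repeatedly nudge each engine's rating until its expected
// score against its actual opponents matches its actual score.

/// Expected score of an engine rated `r_a` against one rated `r_b`.
fn expected(r_a: f64, r_b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((r_b - r_a) / 400.0))
}

/// `games[i][j]` = (points i scored against j, number of games played).
fn estimate(games: &[Vec<(f64, f64)>], iterations: usize) -> Vec<f64> {
    let n = games.len();
    let mut ratings = vec![1500.0; n];
    for _ in 0..iterations {
        let mut next = ratings.clone();
        for i in 0..n {
            let (mut exp, mut act, mut total) = (0.0, 0.0, 0.0);
            for j in 0..n {
                let (points, count) = games[i][j];
                if i != j && count > 0.0 {
                    exp += expected(ratings[i], ratings[j]) * count;
                    act += points;
                    total += count;
                }
            }
            if total > 0.0 {
                // Move toward the rating where expected == actual results.
                next[i] = ratings[i] + 400.0 * (act - exp) / total;
            }
        }
        ratings = next;
    }
    ratings
}

fn main() {
    // Two engines, A scores 6/10 against B: their rating difference
    // converges to the ~70 Elo that a 60% score implies.
    let mut games = vec![vec![(0.0, 0.0); 2]; 2];
    games[0][1] = (6.0, 10.0);
    games[1][0] = (4.0, 10.0);
    let r = estimate(&games, 200);
    println!("A - B = {:.1} Elo", r[0] - r[1]);
}
```

Because every rating is re-derived relative to the whole pool on each run, adding games anywhere in the pool can shift where the candidate lands, which is the effect discussed in the following paragraphs.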

Now you can look at the rating of your candidate. Our question was: what happens if you add more games between other engines, in which your candidate is not involved? Say your pool contains games between N+1 engines (your candidate and N other engines), and your candidate was involved in games against K of these. If you only add games between those K opponents, then obviously your candidate's rating won't be affected, since the additional games do not change the relative strength of the K opponents within the whole N+1 pool, and therefore the candidate's rating estimate (which is a rating relative to the N+1 pool) remains the same. However, if you add games between the K opponents and the other N-K engines that your candidate did not play, then you may be unlucky enough to observe that your K opponents perform stronger or weaker relative to the whole N+1 pool compared to the estimation without the additional games, so your candidate will be affected.

In the example above, K=N means that your candidate was involved in games against all other engines in the pool. That may be a frequent situation when testing your own engine: your pool only contains engines against which your candidate has played. It was a surprising observation for me as well (though with an "obvious" explanation, as I elaborated above) that adding games between the opponents did not affect the rating or error bars of my candidate.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Progress on Rustic

Post by emadsen »

Sven wrote: Mon Mar 29, 2021 8:50 am If you only add games between those K opponents, then obviously your candidate's rating won't be affected, since the additional games do not change the relative strength of the K opponents within the whole N+1 pool, and therefore the candidate's rating estimate (which is a rating relative to the N+1 pool) remains the same.

However, if you add games between the K opponents and the other N-K engines that your candidate did not play, then you may be unlucky enough to observe that your K opponents perform stronger or weaker relative to the whole N+1 pool compared to the estimation without the additional games, so your candidate will be affected.
Thanks Sven. That distinction makes sense to me. It's less absolute than your earlier statement. I understand you now and agree with you. I have observed the same effect (average rating of K opponents may shift as I run more games in the entire N pool, affecting perf rating of my engine versus K opponents).
My C# chess engine: https://www.madchess.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Progress on Rustic

Post by mvanthoor »

emadsen wrote: Mon Mar 29, 2021 5:56 pm Thanks Sven. That distinction makes sense to me. It's less absolute than your earlier statement. I understand you now and agree with you. I have observed the same effect (average rating of K opponents may shift as I run more games in the entire N pool, affecting perf rating of my engine versus K opponents).
Yes, and it also matters which engines are in the pool. For example: Clueless 1.4 consistently performs at +40 Elo against Rustic Alpha 2. Because Clueless is in the CCRL list at 1900 Elo, this would put Rustic Alpha 2 at 1860 Elo. However, Pigeon 1.51 consistently performs at +10 Elo against Rustic Alpha 2. As Pigeon 1.51 is at 1798 in the CCRL list, this would put Alpha 2 at 1788.

So 1781 can actually be the correct rating, at the lower end of the scale, if the pool contains lots of engines that perform very well against Rustic Alpha 2. I think I'm going to try to get together as many of the engines used by the CCRL testers as I can, and see what the result is (but many of the older engines are hard or impossible to find, I've noticed).

The reason the tested engine's rating shifts when the pool changes is that Elo determination is based solely on percentages. For example, a 60:40 (60%) result is a difference of +70 Elo, and it doesn't matter what the absolute strength of either engine is. If you add engines to the pool, the pool's average rating changes; say it goes up. If your engine plays well against the newcomer, so that the percentage of points your engine scores out of the total available points stays the same or even rises, the rating of your engine will improve.
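The percentage-to-Elo relation mentioned above is the standard logistic formula; a quick sketch (the function name is mine, not from any engine):

```rust
/// Elo difference implied by a score fraction, per the standard
/// logistic model: diff = -400 * log10(1/score - 1).
fn elo_diff(score: f64) -> f64 {
    -400.0 * (1.0 / score - 1.0).log10()
}

fn main() {
    // A 60% score maps to roughly +70 Elo, regardless of how strong
    // the two engines are in absolute terms.
    println!("{:.0}", elo_diff(0.60)); // prints 70
}
```

The formula is symmetric around 50%: a 40% score gives the same difference with the opposite sign.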

That's the reason Rustic Alpha 1 dropped 20 rating points when TSCP was in the second test run: Alpha 1 performs poorly against TSCP. If CCRL were to put Clueless 1.4 in a second test run for Alpha 2, then Alpha 2's rating would rise (assuming it plays as well against Clueless at the 2m+1s TC as it does at 1m+0.6s).
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Progress on Rustic

Post by emadsen »

lithander wrote: Sun Mar 28, 2021 11:10 pm I think you can skip the part where these engines play each other if you know their relative ratings (from CCRL) *and* if you use those ratings as anchors when you process your PGNs. In Ordo, for example, you can pass a CSV file via the -m parameter (it's explained in the manual).
Yes, I anchor four engines to their CCRL rating. My question to you is, how do you know the ratings of engines running on your PC with your particular CPU and memory configuration? How do you know that without running games on your PC?
lithander wrote: Sun Mar 28, 2021 11:10 pm So far I only changed the version number when making a build and would certainly increment the version whenever I release binaries. But that wouldn't stop someone from checking out a specific revision from git *after* important features have been added but *before* I tagged the next version; it would then play under the previous version number but much stronger.

Do you increase the version with each push manually or do you have an automatic system?
Because it's a hobby project with infrequent public releases, I manually increment the version number. Regarding git, you are correct: anyone can clone the repo and build any revision of the code. I publish official releases on my website's Downloads page, accompanied by a blog post and an announcement here on TalkChess. Wow, it's been 3.5 years since my last release. I've been working on other software projects, both professional and personal. I hope to complete MadChess 3.0 soon.
My C# chess engine: https://www.madchess.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Progress on Rustic

Post by mvanthoor »

Hi :)

In the last few weeks I haven't been too active with regard to Rustic's progress, because I didn't have much time.

I have a few days of vacation coming in May (and longer vacations in the summer), and some more free time in the next few weekends, so I may be able to get back to Rustic again. I'll either write some parts of my book/tutorial, or write the Killer/History heuristics. I did manage to write a build script, though, that builds all Rustic versions for the platform it's compiled on:

Windows 32-bit (1 version: 32-bit-generic)
Windows 64-bit, Linux 64-bit, MacOS 64-bit (intel) (5 versions each: generic, old, popcnt, bmi2, native)
Raspberry Pi 32-bit (1 version)

So now I can start a compile and get all the versions for that specific platform, which makes creating a release much faster. At some point I may look into GitHub Actions for doing that, but I also wanted a single build script for people who wish to build Rustic themselves and then test which version works (or which is the fastest, if they all work).
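The idea can be sketched like this (an illustrative loop, not Rustic's actual script: the variant-to-CPU mapping and output names are my assumptions, though -C target-cpu itself is a real rustc codegen option):

```shell
# Build CPU-specific variants of a Rust engine by varying target-cpu.
# Rough mapping: x86-64 ~ "generic", nehalem ~ "popcnt", haswell ~ "bmi2".
for cpu in x86-64 nehalem haswell native; do
    RUSTFLAGS="-C target-cpu=$cpu" cargo build --release
    cp target/release/rustic "rustic-$cpu"
done
```

Each pass recompiles with instructions enabled for that CPU baseline, so the "native" build is usually fastest on the machine that built it but may not run on older hardware.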

Rustic will now also pull its current version from Cargo.toml in SemVer notation (so the next version will be Alpha 3.0.0), and this will automatically be reflected in the engine's UCI id and file name. If I ever again post a binary that turns out to have some sort of bug, the newer version will be 0.0.1 higher, to prevent the kind of version mix-up that happened with Alpha 2.
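In Rust this comes almost for free: Cargo exposes the manifest version to the compiler at build time, so a sketch like the following (the function name is illustrative, not Rustic's actual code) keeps the UCI id string in sync with Cargo.toml:

```rust
// Cargo sets CARGO_PKG_VERSION from Cargo.toml for every build;
// option_env! falls back gracefully if compiled outside Cargo.
const VERSION: &str = match option_env!("CARGO_PKG_VERSION") {
    Some(v) => v,
    None => "0.0.0",
};

// Illustrative helper: build the UCI "id name" response string.
fn uci_id() -> String {
    format!("id name Rustic {}", VERSION)
}

fn main() {
    println!("{}", uci_id());
}
```

Because the version is read from the manifest at compile time, bumping Cargo.toml is the single source of truth; the binary's id response can never drift from it.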
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Progress on Rustic

Post by Mike Sherwin »

mvanthoor wrote: Fri Apr 16, 2021 12:14 am Hi :)

In the last few weeks I haven't been too active with regard to Rustic's progress, because I didn't have much time.

I have a few days of vacation coming in May (and longer vacations in the summer), and some more free time in the next few weekends, so I may be able to get back to Rustic again. I'll either write some parts of my book/tutorial, or write the Killer/History heuristics. I did manage to write a build script, though, that builds all Rustic versions for the platform it's compiled on:

Windows 32-bit (1 version: 32-bit-generic)
Windows 64-bit, Linux 64-bit, MacOS 64-bit (intel) (5 versions each: generic, old, popcnt, bmi2, native)
Raspberry Pi 32-bit (1 version)

So now I can start a compile and get all the versions for that specific platform, which makes creating a release much faster. At some point I may look into GitHub Actions for doing that, but I also wanted a single build script for people who wish to build Rustic themselves and then test which version works (or which is the fastest, if they all work).

Rustic will now also pull its current version from Cargo.toml in SemVer notation (so the next version will be Alpha 3.0.0), and this will automatically be reflected in the engine's UCI id and file name. If I ever again post a binary that turns out to have some sort of bug, the newer version will be 0.0.1 higher, to prevent the kind of version mix-up that happened with Alpha 2.
Take your time. I'd like a chance to catch up. :lol: