Hikaru vs. bots

Harvey Williamson
Posts: 2010
Joined: Sun May 25, 2008 11:12 pm
Location: Whitchurch. Shropshire, UK.
Full name: Harvey Williamson

Re: Hikaru vs. bots

Post by Harvey Williamson »

lkaufman wrote: Sun Dec 06, 2020 5:00 pm If anyone has data that could make the estimated rating of Fritz 2 on 1994 more accurate, please provide it.
This is from the SSDF list:

                        Rating   +     -    Games   Won  Oppo
289 Fritz 2 486/33 MHz   2032   29   -30   547      44%  2072


Their full list is here: https://ssdf.bosjo.net/long.txt
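As a rough sanity check on that row, the standard logistic Elo formula links a score percentage against a given average opposition to an implied rating. The sketch below is not SSDF's own calculation, which is not described in this thread; the function name `rating_from_score` is made up for illustration.

```python
import math

def rating_from_score(opp_avg: float, score: float) -> float:
    """Rating implied by a score fraction against average opposition
    under the logistic Elo model (illustrative helper, not SSDF code)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return opp_avg + 400.0 * math.log10(score / (1.0 - score))

# Fritz 2 on the 486/33: 44% against opposition averaging 2072
implied = rating_from_score(2072, 0.44)
print(round(implied))
```

With the listed 44% and 2072, this gives roughly 2030, which sits well inside the +29/-30 margin around the listed 2032.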
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Hikaru vs. bots

Post by lkaufman »

Harvey Williamson wrote: Sun Dec 06, 2020 5:10 pm This is from The SSDF list:

                        Rating   +     -    Games   Won  Oppo
289 Fritz 2 486/33 MHz   2032   29   -30   547      44%  2072


Their full list is here https://ssdf.bosjo.net/long.txt
Thanks. The quoted info in the previous post mentions Intel Pentiums at speeds of 200, 75, 60, and 50 MHz, which is quite a range. I don't know which was used for this event; does anyone? Clearly the hardware was much better than the 486/33 MHz. Perhaps the Fritz 4 Pentium 90 SSDF rating of 2236 is the closest we can come to what played in this event. If so, then adding about 550 would be the best estimate of FIDE blitz ratings. But since these are 40 moves in 2 hours ratings, they aren't so relevant for blitz. If you ignore this issue, this would put SF11 on 8 cores at about 4100 FIDE blitz.
Komodo rules!
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Hikaru vs. bots

Post by mwyoung »

lkaufman wrote: Sun Dec 06, 2020 5:42 pm
Harvey Williamson wrote: Sun Dec 06, 2020 5:10 pm This is from The SSDF list:

                        Rating   +     -    Games   Won  Oppo
289 Fritz 2 486/33 MHz   2032   29   -30   547      44%  2072


Their full list is here https://ssdf.bosjo.net/long.txt
Thanks. The quoted info in the previous post mentions Intel Pentium at speeds of 200, 75, 60, and 50 MHz, quite a range, I don't know which was used for this event, does anyone? Clearly the hardware was much better than the 486/33 MHz. Perhaps the Fritz 4 Pentium 90 SSDF rating of 2236 is the closest we can come to what played in this event. If so then adding about 550 would be the best estimate of FIDE blitz ratings. But since these are 40/2 hour ratings they aren't so relevant for blitz. This would put SF11 on 8 cores at about 4100 FIDE blitz if you ignore this issue.
This was an Intel event to showcase the new Pentium chips. If I remember correctly, it used a Pentium 90, but it could also have been the Pentium 100.

March 7, 1994
"P54C" (0.6 μm)
Model number	Frequency	Release date
Pentium 75	75 MHz	October 10, 1994
Pentium 90	90 MHz	March 7, 1994
Pentium 100	100 MHz	March 7, 1994
If you would like to play Fritz 2, you can play online here:
https://www.retrogames.cz/play_1425-DOS.php
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Hikaru vs. bots

Post by lkaufman »

mwyoung wrote: Sun Dec 06, 2020 6:14 pm
lkaufman wrote: Sun Dec 06, 2020 5:42 pm
Harvey Williamson wrote: Sun Dec 06, 2020 5:10 pm This is from The SSDF list:

                        Rating   +     -    Games   Won  Oppo
289 Fritz 2 486/33 MHz   2032   29   -30   547      44%  2072


Their full list is here https://ssdf.bosjo.net/long.txt
Thanks. The quoted info in the previous post mentions Intel Pentium at speeds of 200, 75, 60, and 50 MHz, quite a range, I don't know which was used for this event, does anyone? Clearly the hardware was much better than the 486/33 MHz. Perhaps the Fritz 4 Pentium 90 SSDF rating of 2236 is the closest we can come to what played in this event. If so then adding about 550 would be the best estimate of FIDE blitz ratings. But since these are 40/2 hour ratings they aren't so relevant for blitz. This would put SF11 on 8 cores at about 4100 FIDE blitz if you ignore this issue.
This event was a Intel event to showcase the new Pentium chips. If I remember correctly it was a Pentium 90.
But it could have also been the Pentium 100.

March 7, 1994
"P54C" (0.6 μm)
Model number	Frequency	Release date
Pentium 75	75 MHz	October 10, 1994
Pentium 90	90 MHz	March 7, 1994
Pentium 100	100 MHz	March 7, 1994
If you would like to play Fritz 2, you can play online here:
https://www.retrogames.cz/play_1425-DOS.php

I tried a quick game against it and won rather easily. It was only doing six-ply searches, so it is probably running even slower there than on the 1994 Pentium 90, maybe much slower. Based on your Pentium 90 comment, I'll revise my estimate of its SSDF rating to the mid to upper 2100s, so adding about 600 should give the FIDE blitz equivalent.
Komodo rules!
Fritz 0
Posts: 145
Joined: Fri Mar 11, 2022 12:10 pm
Full name: Branislav Đošić

Re: Hikaru vs. bots

Post by Fritz 0 »

I have also beaten it while spending only 10 minutes on the whole game, so it's officially very weak :) This cannot be the version that was the No. 2 blitz player in the world back in 1994. Its 6-ply search is obviously much sloppier than the same depth in Fritz 5.32 and later versions.
jkominek
Posts: 55
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Hikaru vs. bots

Post by jkominek »

Calibrating chess engines to human Elo scales is an interesting topic. Unfortunately there seems to be a lack of reliable calibration data. The SSDF rating list calibration is probably the best historical effort for the engines and embedded computers of the era (90s), for standard time controls. Ratings of current software are a long-distance extrapolation.

To Larry's original post I'd like to pose a hypothetical question. Suppose you had the opportunity to run a gauntlet of games against humans ranging from grade E to the world's elite. How would you design an Elo calibration experiment?

To motivate humans to play at their full strength I presume one would have to provide incentive in the form of money. I realize Rex Sinquefield appears to have little to no interest in computer chess, but for the sake of discussion let's say a Sinquefield-level amount of money is made available for the project. (Perhaps an ongoing Beat the Bots Tournament would have value as marketing hype.)

I imagine the rules would be designed so that players are encouraged to give each game their best effort, as opposed to being incentivized to bail out of lost games early. And that for the players it would be more lucrative to score say 25% against a stronger opponent than to mop up the floor against weaker engines. How exactly that would be done I don't know. When it comes to making money humans are exceptionally creative at exploiting a system's flaws.

Beyond the aspect of incentive, how would you design the technical elements of the experiment, making use of the many Komodo/Dragon skill levels you've tuned as well as other available free software?
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Hikaru vs. bots

Post by lkaufman »

jkominek wrote: Sun Apr 03, 2022 7:25 pm Calibrating chess engines to human Elo scales is an interesting topic. Unfortunately there seems to be a lack of reliable calibration data. The SSDF rating list calibration is probably the best historical effort for the engines and embedded computers of the era (90s), for standard time controls. Ratings of current software is a long distance extrapolation.

To Larry's original post I'd like to pose a hypothetical question. Suppose you had the opportunity to run a gauntlet of games against humans ranging from grade E to the world's elite. How would you design an Elo calibration experiment?

To motivate humans to play at their full strength I presume one would have to provide incentive in the form of money. I realize Rex Sinquefield appears to have little to no interest in computer chess, but for the sake of discussion let's say a Sinquefield-level amount of money is made available for the project. (Perhaps an ongoing Beat the Bots Tournament would have value as marketing hype.)

I imagine the rules would be designed so that players are encouraged to give each game their best effort, as opposed to being incentivized to bail out of lost games early. And that for the players it would be more lucrative to score say 25% against a stronger opponent than to mop up the floor against weaker engines. How exactly that would be done I don't know. When it comes to making money humans are exceptionally creative at exploiting a system's flaws.

Beyond the aspect of incentive, how would you design the technical elements of the experiment, making use of the many Komodo/Dragon skill levels you've tuned as well as other available free software?
In principle the best incentive system for a rating test would be to reward the players based on how many Elo points they gained from some assigned rating over a specified number of games (for example, they get $100 base plus $2 per Elo point gained, which could be negative). The main problem isn't sponsorship, as even GMs will play for relatively small prizes, but rather online cheating when money is involved (an OTB test is impractically costly nowadays). We've actually had enough test games in the last couple of months for me to say that the Elo ratings in Dragon 3 will be pretty accurate, assuming a 15' + 10" Rapid time control, over the range of reliable FIDE ratings, maybe roughly 1600 to 2850.

I've noticed that now that chess.com rates ten-minute games as Rapid and holds weekly top-level Rapid tournaments (at 10 min and 10 min + 2 second time limits), the chess.com Rapid ratings look remarkably like real FIDE Rapid ratings (which in turn are similar to regular FIDE ratings at the top). So perhaps chess.com Rapid ratings should now be the reference point for engine ratings, this being the longest time control at which it is easy and practical to play against players with reliable ratings at that time control. It has the advantage of having defined ratings far below the 1000 level. Anyway, it wouldn't change much now, as it is so closely aligned with the FIDE scale.

Ideally we would test Dragon levels at 100 Elo intervals. The most important point is to settle on one time limit for all games; the faster the TC, the worse humans do versus (most) engines. 15' + 10" is the standard Rapid time control for the top-level GM events of the past couple of years and for the over-the-board World Rapid championship, but now these weekly chess.com events are at 10 min (+2 seconds for the second day), so I'm not so sure what time limit is really most appropriate for testing engines vs. humans now.
I wonder what time control is most preferred by amateurs who play their engines (weakened or old/weak engines) in timed games? Seems like ten minute chess is very popular now, but of course some increment is needed vs an engine since only the human should ever flag.
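The payout rule suggested above ($100 base plus $2 per Elo point gained from an assigned rating) can be sketched in a few lines. Only the $100 and $2 figures come from the post. The K-factor of 20, the choice to hold the assigned rating fixed over the series, and all function names are assumptions for illustration.

```python
def expected_score(player: float, opponent: float) -> float:
    """Expected score of `player` against `opponent` under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - player) / 400.0))

def elo_gain(assigned: float, opponents: list, results: list, k: float = 20.0) -> float:
    """Elo points gained over a series, with the rating held at the assigned value.
    k=20 is an assumed K-factor, not something specified in the thread."""
    return sum(k * (r - expected_score(assigned, o))
               for o, r in zip(opponents, results))

def payout(gain: float, base: float = 100.0, per_point: float = 2.0) -> float:
    """$100 base plus $2 per Elo point gained; a negative gain reduces the prize."""
    return base + per_point * gain

# Hypothetical example: a 2000-rated player wins twice against 2000-rated engine levels.
gain = elo_gain(2000, [2000, 2000], [1.0, 1.0])
print(gain, payout(gain))
```

Paying on points gained rather than on raw score means a player earns more by scoring against stronger engine levels than by beating weaker ones, since the expected score is already priced in.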
Komodo rules!
Fritz 0
Posts: 145
Joined: Fri Mar 11, 2022 12:10 pm
Full name: Branislav Đošić

Re: Hikaru vs. bots

Post by Fritz 0 »

I think that people mostly play blitz (although I personally don't like it), so you should base Dragon Elo ratings on that. A higher number of games will give more reliable data. If you establish accurate ratings for, say, 5'+3", we can always extrapolate to longer time controls from that, using the '90 points per doubling of time' formula for humans, which we discussed in another topic. It won't be perfect, of course, but no estimate will be.
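A minimal sketch of that extrapolation follows. The 90 points per doubling figure is the one quoted above. Converting an increment time control to total thinking time by assuming a 60-move game is an assumption made here, as are the function names.

```python
import math

def effective_seconds(base_min: float, inc_sec: float, moves: int = 60) -> float:
    """Total time per player for a game of `moves` moves (60 is an assumed game length)."""
    return base_min * 60.0 + inc_sec * moves

def human_rating_shift(from_tc, to_tc, per_doubling: float = 90.0, moves: int = 60) -> float:
    """Estimated Elo shift for a human (relative to an engine) when moving
    between time controls given as (base minutes, increment seconds)."""
    t_from = effective_seconds(from_tc[0], from_tc[1], moves)
    t_to = effective_seconds(to_tc[0], to_tc[1], moves)
    return per_doubling * math.log2(t_to / t_from)

# From 5'+3" blitz to 15'+10" rapid
print(round(human_rating_shift((5, 3), (15, 10))))
```

Under these assumptions, 5'+3" is 480 seconds and 15'+10" is 1500 seconds, a bit over one and a half doublings, so the estimated shift is roughly 150 points.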
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Hikaru vs. bots

Post by lkaufman »

Fritz 0 wrote: Mon Apr 04, 2022 1:19 am I think that people mostly play blitz (although I personally don't like it), so you should base Dragon Elo ratings on that. Higher number of games will give more reliable data. If you establish accurate ratings for, say, 5'+3'', we can always extrapolate longer time controls from that, using '90 points per doubling of time' formula for humans, which we discussed in another topic. It won't be perfect of course, but no estimation will be either in the first place.
I've heard that game/10' may be the most popular time limit online now, especially if you weight by time spent rather than just by number of games played. That's human vs human. If anything, I would think that humans would want to play a longer TC vs engines than vs other humans, as they will do better with more time vs engines. But the main point for me is that online blitz ratings are not similar to FIDE blitz (or rapid or standard) ratings, whereas chess.com Rapid ratings are VERY similar to FIDE ratings. At the high end, we see blitz ratings like 3100 or 3200 on chess.com, whereas the top of the chess.com Rapid list now (excluding players who don't play in the weekly Rapid championships) looks like this:

Hikaru 2841 vs FIDE Rapid 2851 (2760 standard)
Caruana 2791 vs FIDE Rapid 2784 (2781 standard)
Aronian 2785 vs FIDE Rapid 2705 (but 2779 standard)
MVL 2775 vs FIDE Rapid 2743 (2750 standard)
Giri 2773 vs FIDE Rapid 2730 (2761 standard)
Nepo 2749 vs FIDE Rapid 2821 (2773 standard)
Dubov 2748 vs FIDE Rapid 2712 (2702 standard)
Andreikin 2738 vs FIDE Rapid 2675 (2729 standard)
Abdusattorov 2726 vs FIDE Rapid 2661, but World Rapid Champ! (2670 standard)
So 2724 vs FIDE Rapid 2776 (2773 standard)
Xiong 2719 vs FIDE Rapid 2729 (2685 standard)

Since the FIDE Rapid ratings are somewhat lower than standard in general, and tend to be based on many fewer games, I think it is fair to compare the chess.com Rapid ratings with the higher of a given player's FIDE standard or Rapid rating, since those few with higher Rapid ratings are almost surely stronger in Rapid than in classical chess (relative to other humans). Then the rating differences for those players (chess.com minus FIDE) are: -10, 7, 6, 25, 12, -72, 36, 9, 56, -52, -10. The average difference is just 0.6 Elo!!, and the average absolute difference is 27 Elo. That is an amazingly good fit. At more moderate rating levels I believe the fit is also good (my own numbers are close).

So it's pretty clear to me that we should use chess.com Rapid ratings as the basis for engine ratings if we want to claim some relevance to the FIDE scale, and the time limit should probably be close to the average actually used online; perhaps 10' plus 5" would be a good choice.
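The comparison above is easy to reproduce. The sketch below uses only the ratings quoted in the post and applies the stated rule, chess.com Rapid minus the higher of FIDE Rapid and FIDE standard; the variable names are illustrative.

```python
from statistics import mean

# (player, chess.com Rapid, FIDE Rapid, FIDE standard), as quoted in the post
ratings = [
    ("Hikaru", 2841, 2851, 2760),
    ("Caruana", 2791, 2784, 2781),
    ("Aronian", 2785, 2705, 2779),
    ("MVL", 2775, 2743, 2750),
    ("Giri", 2773, 2730, 2761),
    ("Nepo", 2749, 2821, 2773),
    ("Dubov", 2748, 2712, 2702),
    ("Andreikin", 2738, 2675, 2729),
    ("Abdusattorov", 2726, 2661, 2670),
    ("So", 2724, 2776, 2773),
    ("Xiong", 2719, 2729, 2685),
]

# chess.com Rapid minus the higher of the two FIDE ratings
diffs = [cc - max(fide_rapid, fide_std) for _, cc, fide_rapid, fide_std in ratings]

mean_diff = mean(diffs)                       # signed average
mean_abs_diff = mean(abs(d) for d in diffs)   # average size of the mismatch
print(diffs)
print(round(mean_diff, 1), round(mean_abs_diff, 1))
```

On these eleven players the signed differences sum to 7, so the mean is under 1 Elo, and the mean absolute difference is about 27 Elo. With a sample this small, the signed mean mostly shows that there is no obvious systematic offset between the two scales.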
Komodo rules!
Fritz 0
Posts: 145
Joined: Fri Mar 11, 2022 12:10 pm
Full name: Branislav Đošić

Re: Hikaru vs. bots

Post by Fritz 0 »

Well, game in 10 minutes corresponds to 5'+5", so it's also blitz in my book (although on lichess it's considered rapid, I think). I agree it is a very popular time control now.