Here is the critical question, however. Tuning against another engine is absolutely within ICGA tournament rules; the rules are about copied code. So one has to ask: how similar is the Naum CODE to Rybka or Strelka or whatever? Right now, the observation that it was tuned over several versions neither confirms nor refutes anything about its originality or lack thereof.

Adam Hair wrote:
If you trace the changes in Naum with the similarity tool, you will see a relatively unique engine (v1.91, v2.0) start displaying increased similarity with Strelka 2.0B (v2.1, v2.2), then high similarity with Rybka 2.x (v3.1, v4.2). I do believe that Strelka was essential for tuning Naum to Rybka.

Laskos wrote:
I don't believe this Naum story. Try to optimize an engine's strength via a test suite instead of pure Elo-wise tests: you will get a weaker engine. Trying to optimize toward playing these neutral positions from Sim the way a stronger engine does, when your own engine has a very different eval, will only wreck the engine. There are hundreds of parameters to tune; it would be a miracle for a completely different eval to be tunable to the same parameters and still produce a stronger engine.

Uri Blass wrote:
The incentive is to have a stronger engine, and as far as I remember the programmer of Naum said he already did this with Rybka (I think with Rybka 2.3.2a, but I am not sure about the exact version).

I think we can at least agree that a big similarity is not something that can happen by accident: the engine is derived either from the code or from the output of another engine.
Code: Select all
Key:  1) Fruit 2.1          (time: 290 ms scale: 1.0)
      2) Naum 1.91          (time: 502 ms scale: 1.0)
      3) Naum 2.0           (time: 290 ms scale: 1.0)
      4) Naum 2.1           (time: 217 ms scale: 1.0)
      5) Naum 2.2           (time: 180 ms scale: 1.0)
      6) Naum 3.1           (time: 114 ms scale: 1.0)
      7) Naum 4.2           (time:  58 ms scale: 1.0)
      8) Rybka 1.0 Beta     (time: 171 ms scale: 1.0)
      9) Rybka 1.1          (time: 121 ms scale: 1.0)
     10) Rybka 1.2f         (time: 114 ms scale: 1.0)
     11) Rybka 2.1o         (time: 116 ms scale: 1.0)
     12) Rybka 2.2n2        (time:  76 ms scale: 1.0)
     13) Rybka 2.3.2a       (time:  60 ms scale: 1.0)
     14) Strelka 2.0 B      (time: 114 ms scale: 1.0)
     15) Thinker 5.4c Inert (time: 102 ms scale: 1.0)

         1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
 1.  ----- 48.51 47.32 53.12 52.33 54.38 54.92 55.75 56.00 55.32 55.00 55.16 56.71 57.60 53.82
 2.  48.51 ----- 68.44 51.30 51.75 44.53 46.03 45.63 46.10 46.53 46.04 45.61 47.48 48.35 46.54
 3.  47.32 68.44 ----- 52.49 53.69 43.93 45.36 45.31 45.62 45.42 44.93 45.44 46.97 46.78 45.91
 4.  53.12 51.30 52.49 ----- 71.11 54.33 55.54 54.81 56.06 55.29 55.06 55.41 54.92 57.71 53.56
 5.  52.33 51.75 53.69 71.11 ----- 53.30 54.99 53.74 54.71 54.45 53.65 54.75 53.92 56.55 52.76
 6.  54.38 44.53 43.93 54.33 53.30 ----- 67.01 59.48 65.06 67.84 68.66 63.22 60.33 61.51 57.48
 7.  54.92 46.03 45.36 55.54 54.99 67.01 ----- 60.37 64.20 64.25 64.13 64.54 62.01 62.99 58.24
 8.  55.75 45.63 45.31 54.81 53.74 59.48 60.37 ----- 67.30 65.32 64.72 65.25 62.19 68.52 58.76
 9.  56.00 46.10 45.62 56.06 54.71 65.06 64.20 67.30 ----- 73.61 72.51 69.66 65.11 68.57 60.34
10.  55.32 46.53 45.42 55.29 54.45 67.84 64.25 65.32 73.61 ----- 87.14 71.78 66.16 66.58 60.67
11.  55.00 46.04 44.93 55.06 53.65 68.66 64.13 64.72 72.51 87.14 ----- 72.07 64.91 65.96 59.86
12.  55.16 45.61 45.44 55.41 54.75 63.22 64.54 65.25 69.66 71.78 72.07 ----- 66.39 66.31 59.84
13.  56.71 47.48 46.97 54.92 53.92 60.33 62.01 62.19 65.11 66.16 64.91 66.39 ----- 65.19 59.59
14.  57.60 48.35 46.78 57.71 56.55 61.51 62.99 68.52 68.57 66.58 65.96 66.31 65.19 ----- 63.26
15.  53.82 46.54 45.91 53.56 52.76 57.48 58.24 58.76 60.34 60.67 59.86 59.84 59.59 63.26 -----
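The numbers above come from a move-matching test: each engine is run for a fixed time on a set of neutral positions, and the similarity score for a pair of engines is the percentage of positions on which both choose the same move. A minimal sketch of that idea (not the actual tool's code; real UCI engine interfacing is omitted, and the stand-in "engines" and position names below are invented for illustration):

```python
# Sketch of a move-matching similarity test: run two engines over the
# same position set and report the percentage of identical move choices.
# Real engine interfacing (UCI, fixed time per move) is omitted; the
# "engines" here are plain callables mapping a position to a move.

def similarity(engine_a, engine_b, positions):
    """Percentage of positions on which both engines pick the same move."""
    same = sum(1 for pos in positions if engine_a(pos) == engine_b(pos))
    return 100.0 * same / len(positions)

# Toy stand-ins: two 'engines' that agree on 3 of 4 positions.
positions = ["pos1", "pos2", "pos3", "pos4"]
engine_a = {"pos1": "e4", "pos2": "d4", "pos3": "Nf3", "pos4": "c4"}.get
engine_b = {"pos1": "e4", "pos2": "d4", "pos3": "Nf3", "pos4": "g3"}.get
print(similarity(engine_a, engine_b, positions))  # 75.0
```

An engine compared against itself scores 100, and two unrelated engines on a large neutral position set typically land in the 50s, which is why the 65+ scores in the matrix stand out.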
Uri's Challenge : TwinFish
Re: Uri's Challenge : TwinFish (Uri Blass)
If you tune based on the output of another engine, then I expect it to cause the code also to have a bigger similarity, because part of the code is the numbers in the various tables that many engines share, like piece-square tables, and I expect those numbers to be closer after you tune against another engine.
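Uri's claim, that tuning against another engine's output pulls shared tables toward that engine's numbers, can be illustrated with a toy fit. This is only a sketch with invented values, not anyone's actual tuning method: a single piece-square-table-style weight is adjusted by gradient descent to shrink the evaluation gap against a reference engine, and it converges to the reference engine's own internal value.

```python
# Toy illustration: tune one eval parameter (think: a piece-square-table
# entry) to minimize the squared difference between our evaluation and a
# reference engine's evaluation over sample positions.  All numbers are
# invented; the point is only that output-matching drags the parameter
# toward the reference engine's internal value.

def tune(reference_evals, features, value, lr=0.01, steps=2000):
    """Gradient descent on sum((value*f - ref)^2) over the samples."""
    n = len(features)
    for _ in range(steps):
        grad = sum(2.0 * (value * f - ref) * f
                   for ref, f in zip(reference_evals, features))
        value -= lr * grad / n
    return value

# The 'reference engine' secretly weights this feature at 30; tuning on
# its output recovers ~30, so the tables end up matching.
features = [1.0, 2.0, -1.0, 0.5]
reference = [30.0 * f for f in features]
print(round(tune(reference, features, value=10.0), 2))  # 30.0
```

Whether such convergence helps or, as Laskos argues, wrecks a structurally different eval is exactly what is in dispute; the sketch only shows why tuned numbers would end up looking alike.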
Re: Uri's Challenge : TwinFish (bob)
The PST values might be the same. Or everything could be multiplied by some odd number so that they match without looking the same. But PSTs are a tiny part of a program: search, evaluation, move generation, make/unmake, hashing, you name it. The programs would not have to look anything alike, IMHO. The only problem is that I am not willing to try something like that and waste all the time required. For example, the Rybka numbers don't look anything like Fruit's if you just look at the raw values; it required the work Zach did to see what actually happened, since Rybka used a really strange value for a pawn compared to Fruit, making all the numbers look different at first glance.
Re: Uri's Challenge : TwinFish (Rebel)
bob wrote:
If source comparison "can fail", then the similarity test is hopeless from the get-go, because source comparison is about 100x more accurate.

1. You are entitled to your (BTW circular) bold statement, but ever since some past cases and the controversy among programmers that followed, I would say source comparison can be the ultimate authority only given that all the programmers agree. If you can accept that not everybody agrees with you, then perhaps we can make progress.

2. OTOH, the similarity tester is unbiased: no emotional strings attached, no human errors, no tunnel vision, no like or dislike of persons, just cold numbers, and no false positives at 65+, which is the (tolerant) line I draw in the sand; so far I have been proven right after every source-code comparison.

3. If you realize what the similarity tester measures (see my post to Milos), then this should be obvious to an experienced programmer such as yourself with some basic understanding of statistics. False positives (AKA exceptions) do happen in an environment with millions of random variables; here we are dealing with just a couple of hundred engines.
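The scale of point 3 can be made concrete: with a couple of hundred engines there are tens of thousands of pairwise comparisons, so even a tiny per-pair false-positive rate predicts a few spurious matches. The 1-in-10,000 rate below is purely an assumed figure for illustration, not a measured property of the tester.

```python
from math import comb

# Expected number of spurious high-similarity matches among all
# pairings of a pool of engines.  The per-pair false-positive rate
# is an assumed illustration value, not a measured one.
engines = 200
pairs = comb(engines, 2)          # C(200, 2) = 19900 pairings
fp_rate = 1e-4                    # assumed chance a pair falsely scores 65+
expected = pairs * fp_rate
print(pairs, round(expected, 2))  # 19900 1.99
```

So the two positions are compatible: a very low per-pair error rate (Rebel's "no false positives" so far) still leaves a nonzero expected count over the whole pool (bob's "low != 0.0").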
Re: Uri's Challenge : TwinFish (Rebel)
bob wrote:
There is no such term as "statistically out of the question". It might have a low probability of happening, but low != 0.0. So there is absolutely room for a false positive, just as there is room for a false negative, as already shown.

And neither is there in court; but the accused goes to jail if there is a DNA match, which, as you know, is also not 100% reliable. In the end it is a matter of statistics.
Re: Uri's Challenge : TwinFish (bob)
My only point was that both false positives and false negatives will occur, which is THE reason this cannot be considered "proof" of either innocence or copying. It can be used as a filter: pass the test and chances are pretty good the program is original; fail it and chances are pretty good the program is a derivative. But that is really all. If you run as many tests as I run in a year, even a 30K-game match, which produces an error bar of +/-4 Elo, will occasionally produce a bad result. I had one a couple of weeks ago that puzzled me (a simple change dropped the Elo by 13). I couldn't see anything wrong, so I re-ran the 30K-game match three times, and all three came back within what I expected: back to normal rather than that odd drop.

It doesn't happen often, but it happens often enough to show that this IS statistical in nature: a 95% confidence interval still has a 1-in-20 chance of breaking. For the test I ran, an unchanged Elo was expected. Most of the time on such tests I get what I expect, just a validation that I didn't break something. Occasionally a change has an unexpected side effect that drops the Elo significantly (as above), or on occasion shows an unexpected gain; those get a further look. And on occasion the test turns out to be a statistical anomaly. Most of the time not.

Ergo, positives can be used to trigger further investigation, and negatives can (with more risk) be used to avoid triggering further digging; either can be wrong with a probability much greater than zero. A code comparison, by contrast, is as accurate as one chooses to make it; there is no statistical analysis involved.
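The +/-4 Elo error bar on a 30,000-game match follows from ordinary sampling statistics. A sketch of that arithmetic, assuming a per-game score standard deviation of about 0.5 (an upper bound; heavy draw rates shrink it) and the usual 95% normal interval:

```python
import math

# 95% error bar on measured Elo after N games, via the normal
# approximation.  Near a 50% score the Elo curve
# elo = -400*log10(1/s - 1) has slope 400 / (ln(10) * s * (1-s))
# with s = 0.5, i.e. roughly 695 Elo per unit of score.
# sigma = 0.5 per game is an assumed upper bound on the score spread.

def elo_error_bar(games, sigma=0.5, z=1.96):
    se_score = sigma / math.sqrt(games)          # std. error of mean score
    slope = 400.0 / (math.log(10) * 0.5 * 0.5)   # Elo per unit score at 50%
    return z * se_score * slope

print(round(elo_error_bar(30000), 1))  # 3.9 -- matching the quoted +/-4
```

Since the bar shrinks only with the square root of the game count, quadrupling the games merely halves it, which is why occasional out-of-bounds results are unavoidable at any practical match length.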
Re: Uri's Challenge : TwinFish (bob)
Quite a bit of difference between a simtest with a 95% confidence interval and a DNA test that is usually quoted as a 1-in-100,000,000 chance of being wrong (depending on the number of matching markers): 5 out of 100 is way bigger than 1 out of 100 million. But even DNA, by itself, is not an automatic conviction; there has to be other evidence to go along with it. Just proving I was somewhere does not, by itself, prove I committed a crime.
Re: Uri's Challenge : TwinFish (Uri Blass)
1) I do not think we get 5 wrong out of 100 when the similarity test shows more than 65% similarity; the probability is significantly smaller.

2) The probability of a wrongful conviction is clearly bigger than 1 in 100,000,000, based on what I know about the Innocence Project.
Re: Uri's Challenge : TwinFish (bob)
You are not following. The typical confidence interval used in computer testing is 95%; BayesElo, for example.

The 1 in 100,000,000 is a common statistic representing the probability of a false DNA match, nothing to do with convictions or anything else. They have DNA from a suspect and DNA from the crime scene (rape is by far the most common case), and the results come back with a 1-in-100,000,000 chance of a false match.

It seems those estimates are flawed after lots of additional investigation: there are several cases of two different people matching at the common "9 loci" level that is frequently used, and several cases with a match at 10 loci, etc. Still far better than the usual 95% confidence interval we use to establish a program's Elo.

That is the problem with a statistical answer: there is no 100% accurate answer, only some error bar that is considered acceptable. I don't consider the similarity tester bad, but I also do not consider it "proof" of anything whatsoever. It is just a suggestion, which some take with more weight than others. There is a danger in beginning to believe it is nearly perfect: nobody knows how many false matches there can be, but to assume there are none is certainly a bit off the wall.
Re: Uri's Challenge : TwinFish (Guenther Simon)
Does someone still have the source or a binary of it? (Wayback and web search no longer return anything for this.)

Tennison wrote (Fri Jan 31, 2014 9:46 am):

lucasart wrote:
I can make a few trivial changes to Stockfish and pass the similarity tests, any day!

Uri Blass wrote:
Then please do it and release the source. It may be interesting to know how much Elo you lose for it, and whether the engine you get is stronger than DiscoCheck (note that 60% is not enough; you need similarity smaller than 55%).

TwinFish 0.07

The similarity is less than 55%, and the Elo loss is only about 70-80.

This version of TwinFish is based on Stockfish dev 14 01 29 6:02PM (TimeStamp: 1391014933).

The only changes made to reach "<55%" similarity are completely asymmetric PSTs (based on Adam Hair's values). If you want to see the changes, just search for "Robber" in the source files.

There is only the source code, no binary. If someone wants to compile good binaries, that would be nice.

Don't forget: this version is only a joke, and I didn't steal Stockfish! ;-)

I'm very interested to see the result in the similarity dendrogram now!

TwinFish 0.07 is more related to Toga Hair than to Stockfish with Don's similarity tester, and there is no code from Toga in it!!! ;-)
I would like to do some experiments with it.
Thanks.