Some thoughts on QS

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some thoughts on QS

Post by diep »

Don wrote:
diep wrote:
Don wrote:
diep wrote: Additionally, it was you who back in the 90s already posted the observation that at bigger search depths the Elo win from basically anything is smaller.

Claiming the opposite now contradicts that claim, and is a rather naive claim.
I think you misunderstood something about what I said, either then or now. I believe that with depth any superiority is generally reduced - the Elo gap between a weak and a strong program closes in general - but a terribly written, unscalable program may actually lose ground with depth. You can easily write a program that does not scale well and loses ground to other programs with depth. It is not an absolute law that this can never happen.

The programs of 30 years ago - play them against Komodo and handicap Komodo to be equal in strength - then keep doubling the time control for each program and you will see Komodo's Elo increase relative to them with each doubling.
We're not speaking about the past here, however. You made the claim that LMR is worth 150 Elo points at a small search depth, which is not realistic, and THEREFORE you claimed it would give even more Elo at slower time controls than the superbullet you tested at.

I really have to see that first, and I note that even in this reply you admit - to summarize your statement here - that 'in general with depth superiority gets reduced'.

So that contradicts your claim that with increased depth it wins more than 150 Elo.

I seriously doubt that claim.
The level I tested at was quite fast so I admit that I do not know how it would work out at long time controls. With increasing time it's true that general superiority is reduced due to more draws and the fact that you get closer to perfect play. So I cannot say for sure that you are wrong.

But this is an experiment ANYONE can do with Komodo - Komodo has an option to turn off LMR. So perhaps someone would be willing to play Komodo with LMR against Komodo without LMR at something like 1+1 Fischer, which is substantially longer than I ran but fast enough to hope to get a few hundred games in a couple of days or so.
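For anyone running such a match, the final score can be converted into an Elo estimate with an error bar. A minimal sketch under the standard logistic Elo model (the function name and the example 120-80-100 result are invented for illustration):

```python
import math

def elo_diff(wins, losses, draws):
    """Logistic Elo difference implied by a match score, with a rough
    95% confidence interval (normal approximation on the mean score)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n                # fractional score
    # per-game result is 1, 0.5 or 0, so E[r^2] = (wins + 0.25*draws) / n
    mean_sq = (wins + 0.25 * draws) / n
    stdev = math.sqrt(max(mean_sq - score ** 2, 0.0) / n)

    def to_elo(s):
        s = min(max(s, 1e-6), 1 - 1e-6)            # clamp away from 0% / 100%
        return -400.0 * math.log10(1.0 / s - 1.0)

    return to_elo(score), to_elo(score - 1.96 * stdev), to_elo(score + 1.96 * stdev)

# e.g. a hypothetical 120 wins, 80 losses, 100 draws over 300 games
print(elo_diff(120, 80, 100))
```

A few hundred games gives error bars of a few tens of Elo, which is why both sides in this thread care about sample size.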

Don
The second mistake is the incesttesting.

It just doesn't work.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Some thoughts on QS

Post by Adam Hair »

Sven Schüle wrote:
Uri Blass wrote: Note that I do not understand the more-than-100-Elo improvement in
40/40: suddenly the top programs have better ratings in 40/40 relative to 40/4, and the differences between programs in 40/40 became bigger (weak programs did not earn rating points in the 40/40 list, but all the top programs earned rating points).
Obviously some change has happened in the CCRL rating lists very recently. I think it is related to using different BayesElo parameters. The change affects the whole list. If you look at the bottom of the rating list you will see some engines rated around 1600 (e.g. my old engine "Surprise" and Julien's "Prédateur") which had around Elo 1950-2000 prior to the "offset -100" change some weeks ago, and around 1850-1900 afterwards. Also my newer engine "KnockOut", which had been around 2250 and then 2150, is now below 2000.

So the overall scaling is now different.

Another point regarding your post is that you must never compare ratings from two different rating lists, even if both are from CCRL; i.e., any comparison of 40/40 ratings with 40/4 ratings is meaningless. You can compare ratings within each of these lists, and you can compare the relative ranking between the two lists, but the absolute rating numbers are always bound to exactly one list, since each list represents its own pool of games, and in the case of 40/40 vs. 40/4 the game pools are even fully disjoint.

Absolute rating differences should also not be compared between the 40/40 and 40/4 CCRL lists, since the scaling might differ somehow, and with the new BayesElo parameters I think the scaling even depends on properties of the corresponding set of games itself.

Sven
You are correct on everything you wrote.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some thoughts on QS

Post by Don »

diep wrote: The second mistake is the incesttesting.

It just doesn't work.
Ok, so play a gauntlet match then against a couple of other programs: Komodo with and without LMR, but never playing each other - only the foreign programs.

Larry and I have done a ton of incest experiments and what we found is that it's better to play foreign opponents because some changes work against Komodo but not other programs. HOWEVER, and I capitalize that, the difference is not that much - just enough to make a small improvement or regression turn the other way. And even then the biggest difference is in the scaling, self play tends to inflate the difference, but only rarely does it invalidate an actual improvement. But yes, that can happen on occasion.

We also test foreign programs perhaps mainly due to superstition, we feel it makes the program more robust.
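The pairing scheme described here - both versions meet every foreign opponent but never each other - is straightforward to generate. A sketch with invented engine names, assuming a simple color-alternating double round robin of candidate-versus-foreign games:

```python
from itertools import product

def gauntlet_pairings(candidates, foreign, rounds=2):
    """Pairings for a gauntlet: each candidate version plays every
    foreign opponent, but the candidate versions never meet each other."""
    games = []
    for rnd in range(rounds):
        for cand, opp in product(candidates, foreign):
            # alternate colors each round
            white, black = (cand, opp) if rnd % 2 == 0 else (opp, cand)
            games.append((white, black))
    return games

schedule = gauntlet_pairings(["Komodo-LMR", "Komodo-noLMR"],
                             ["EngineA", "EngineB", "EngineC"])
print(len(schedule))   # 12 games: 2 versions x 3 opponents x 2 rounds
```

In practice a tool such as cutechess-cli does this scheduling; the point of the sketch is only that the two Komodo versions never appear in the same pairing.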

Send me your best Diep and I will run some tests to find out if your ideas are better.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some thoughts on QS

Post by diep »

Don wrote:
diep wrote: The second mistake is the incesttesting.

It just doesn't work.
Ok, so play a gauntlet match then against a couple of other programs: Komodo with and without LMR, but never playing each other - only the foreign programs.
There are Yanks that qualify?
Larry and I have done a ton of incest experiments and what we found is that it's better to play foreign opponents because some changes work against Komodo but not other programs. HOWEVER, and I capitalize that, the difference is not that much - just enough to make a small improvement or regression turn the other way. And even then the biggest difference is in the scaling, self play tends to inflate the difference, but only rarely does it invalidate an actual improvement. But yes, that can happen on occasion.

We also test foreign programs perhaps mainly due to superstition, we feel it makes the program more robust.

Send me your best Diep and I will run some tests to find out if your ideas are better.
The important argument against incesttesting is that it is intuitively already very weird to believe that if you lock up a GM in a prison, throw a board and a few pieces into his cell, and have him play games against himself, it would be possible to get any strength measurement that way. Something like that would only work in a book (the logical deduction of this is that you could become world champion by training against yourself in a prison cell; there actually is a book in literature that follows this principle).

Oh, as for Diep: the current version not only searches deeper, its ego has also been fixed. It now wins nearly 100% against the old version at superbullet.

The old version, at slow time controls against other programs, had been tested at around 3000 Elo.

A 100% score now works out to a tad above 700 Elo points.

Does that mean it has 3700 elo now?
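The arithmetic behind this rhetorical question can be sketched under the standard logistic Elo model: a +700 edge predicts roughly a 98% score, a literal 100% score has no finite Elo equivalent, and in any case a self-play margin is a statement about one pair of engines, not about rating in a pool. Function names here are invented for illustration:

```python
import math

def expected_score(elo_diff):
    """Expected score for a given Elo advantage, standard logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse: Elo difference implied by a fractional score, 0 < score < 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(expected_score(700), 3))   # 0.983: even +700 Elo concedes some points
# A literal 100% score has no finite Elo equivalent:
# elo_from_score(1.0) would divide by zero.
```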

Incesttesting, in Diep's case, always favoured the deeper-searching version, even if that version scored 20% less against a mix of opponents.

Just about every serious chess programmer I know has confirmed that point, so much so that Johan de Koning even baptized it incesttesting.

Now you're using THAT sort of bad science to prove something?

With a few of those 'changes', just testing Diep against Diep, I can prove Diep to be 4000 Elo. That was the case 15 years ago as well, and it hasn't changed. Most chess programmers I know are aware of this phenomenon.

Yet when it is convenient for you, you use both a super-bullet time control and incesttesting.

I call that convenience science.
It is not science though.

It's deliberately spreading nonsense into the ether. Such fair-weather messages about a program's superiority - based on numbers other than scores against opponents - I used to hear a lot of in the days that Chrilly worked for a sheikh...

...every month or so we heard about it AGAIN being a factor of 2 faster in nps, meanwhile having a near-perfect speedup on the 64 processors, and other great messages, such as that GMs contributed to its chess knowledge.

With your postings it's as if I'm seeing a similar muppet show again.

Do you have a sponsor or something?
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Some thoughts on QS

Post by michiguel »

diep wrote:
Don wrote:
diep wrote: The second mistake is the incesttesting.

It just doesn't work.
Ok, so play a gauntlet match then against a couple of other programs: Komodo with and without LMR, but never playing each other - only the foreign programs.
There are Yanks that qualify?
Larry and I have done a ton of incest experiments and what we found is that it's better to play foreign opponents because some changes work against Komodo but not other programs. HOWEVER, and I capitalize that, the difference is not that much - just enough to make a small improvement or regression turn the other way. And even then the biggest difference is in the scaling, self play tends to inflate the difference, but only rarely does it invalidate an actual improvement. But yes, that can happen on occasion.

We also test foreign programs perhaps mainly due to superstition, we feel it makes the program more robust.

Send me your best Diep and I will run some tests to find out if your ideas are better.
The important argument against incesttesting is that it is intuitively already very weird to believe that if you lock up a GM in a prison, throw a board and a few pieces into his cell, and have him play games against himself, it would be possible to get any strength measurement that way. Something like that would only work in a book (the logical deduction of this is that you could become world champion by training against yourself in a prison cell; there actually is a book in literature that follows this principle).

Oh, as for Diep: the current version not only searches deeper, its ego has also been fixed. It now wins nearly 100% against the old version at superbullet.

The old version, at slow time controls against other programs, had been tested at around 3000 Elo.

A 100% score now works out to a tad above 700 Elo points.

Does that mean it has 3700 elo now?

Incesttesting, in Diep's case, always favoured the deeper-searching version, even if that version scored 20% less against a mix of opponents.

Just about every serious chess programmer I know has confirmed that point, so much so that Johan de Koning even baptized it incesttesting.

Now you're using THAT sort of bad science to prove something?

With a few of those 'changes', just testing Diep against Diep, I can prove Diep to be 4000 Elo. That was the case 15 years ago as well, and it hasn't changed. Most chess programmers I know are aware of this phenomenon.

Yet when it is convenient for you, you use both a super-bullet time control and incesttesting.

I call that convenience science.
It is not science though.

It's deliberately spreading nonsense into the ether.
What you call incest testing is actually the way to go for first screenings. In experimental science you always try to increase the sensitivity of the measurements, even if it is to the detriment of accuracy. Later, once you have detected good candidates and the signal is bigger, you confirm with other, more accurate (even if less sensitive) methods. So, in this case, self-testing is fine if it is later confirmed with, say, a gauntlet against a very diverse set of opponents.

Miguel
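The two-stage idea above can be caricatured in a toy Monte-Carlo: a cheap, sensitive self-play screen whose signal is exaggerated, followed by a larger unbiased confirmation run. Everything here is invented for illustration - in particular the `inflation` factor, a crude stand-in for the self-play bias being debated - and this is nobody's actual methodology:

```python
import random

random.seed(1)

def p_from_elo(d):
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def play(p_win):
    """One game as a Bernoulli trial (draws folded into the win probability)."""
    return 1.0 if random.random() < p_win else 0.0

def screen_then_confirm(true_gain, inflation=2.0, screen_games=200, confirm_games=1000):
    """Two-stage test: a sensitive self-play screen whose signal is inflated,
    then a larger unbiased gauntlet-style confirmation run."""
    # stage 1: self-play sees an exaggerated version of the change
    s1 = sum(play(p_from_elo(true_gain * inflation)) for _ in range(screen_games)) / screen_games
    if s1 <= 0.5:
        return "rejected at screen"
    # stage 2: the gauntlet measures the real gain with more games
    s2 = sum(play(p_from_elo(true_gain)) for _ in range(confirm_games)) / confirm_games
    return "confirmed" if s2 > 0.5 else "rejected at gauntlet"

print(screen_then_confirm(400))  # a large real gain survives both stages: confirmed
```

The design point is exactly Miguel's: the screen may over-read the effect, but it is only used to decide what gets the expensive, unbiased test.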
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some thoughts on QS

Post by diep »

michiguel wrote:
diep wrote:
Don wrote:
diep wrote: The second mistake is the incesttesting.

It just doesn't work.
Ok, so play a gauntlet match then against a couple of other programs: Komodo with and without LMR, but never playing each other - only the foreign programs.
There are Yanks that qualify?
Larry and I have done a ton of incest experiments and what we found is that it's better to play foreign opponents because some changes work against Komodo but not other programs. HOWEVER, and I capitalize that, the difference is not that much - just enough to make a small improvement or regression turn the other way. And even then the biggest difference is in the scaling, self play tends to inflate the difference, but only rarely does it invalidate an actual improvement. But yes, that can happen on occasion.

We also test foreign programs perhaps mainly due to superstition, we feel it makes the program more robust.

Send me your best Diep and I will run some tests to find out if your ideas are better.
The important argument against incesttesting is that it is intuitively already very weird to believe that if you lock up a GM in a prison, throw a board and a few pieces into his cell, and have him play games against himself, it would be possible to get any strength measurement that way. Something like that would only work in a book (the logical deduction of this is that you could become world champion by training against yourself in a prison cell; there actually is a book in literature that follows this principle).

Oh, as for Diep: the current version not only searches deeper, its ego has also been fixed. It now wins nearly 100% against the old version at superbullet.

The old version, at slow time controls against other programs, had been tested at around 3000 Elo.

A 100% score now works out to a tad above 700 Elo points.

Does that mean it has 3700 elo now?

Incesttesting, in Diep's case, always favoured the deeper-searching version, even if that version scored 20% less against a mix of opponents.

Just about every serious chess programmer I know has confirmed that point, so much so that Johan de Koning even baptized it incesttesting.

Now you're using THAT sort of bad science to prove something?

With a few of those 'changes', just testing Diep against Diep, I can prove Diep to be 4000 Elo. That was the case 15 years ago as well, and it hasn't changed. Most chess programmers I know are aware of this phenomenon.

Yet when it is convenient for you, you use both a super-bullet time control and incesttesting.

I call that convenience science.
It is not science though.

It's deliberately spreading nonsense into the ether.
What you call incest testing is actually the way to go for first screenings. In experimental science you always try to increase the sensitivity of the measurements, even if it is to the detriment of accuracy. Later, once you have detected good candidates and the signal is bigger, you confirm with other, more accurate (even if less sensitive) methods. So, in this case, self-testing is fine if it is later confirmed with, say, a gauntlet against a very diverse set of opponents.

Miguel
I am starting to realize why Don only wants to test against FOREIGN chess programs.
Uri Blass
Posts: 10312
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Some thoughts on QS

Post by Uri Blass »

Don wrote:
diep wrote: The second mistake is the incesttesting.

It just doesn't work.
Ok, so play a gauntlet match then against a couple of other programs: Komodo with and without LMR, but never playing each other - only the foreign programs.

Larry and I have done a ton of incest experiments and what we found is that it's better to play foreign opponents because some changes work against Komodo but not other programs. HOWEVER, and I capitalize that, the difference is not that much - just enough to make a small improvement or regression turn the other way. And even then the biggest difference is in the scaling, self play tends to inflate the difference, but only rarely does it invalidate an actual improvement. But yes, that can happen on occasion.

We also test foreign programs perhaps mainly due to superstition, we feel it makes the program more robust.

Send me your best Diep and I will run some tests to find out if your ideas are better.
I believe that testing a program against previous versions
usually gives correct results (maybe you can be wrong about the direction of small differences, but I do not believe you can be wrong about the direction of big differences).

I would like to see a single case where chess strength is proven
not to be transitive.

The CCRL FRC list has many 100-game matches, and I could not find big surprises.

There is not even a single case where the program that is weaker based on rating scored 59.5%.
Even cases where the lower-rated program scores more than 55% are relatively rare.

I wonder if you can generate 3 programs A, B and C where
A scores at least 60% against B,
B scores at least 60% against C, and
C scores at least 60% against A,
so everyone can prove that chess strength is not transitive without playing too many games.

If 60% is too hard for you then try 55%.
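The statistics behind Uri's choice of 60% versus 55% can be sketched: under a normal approximation (draws ignored), the number of games needed before a score is distinguishable from 50% grows with the inverse square of the margin, which is why 60% over a 100-game match is meaningful while 55% needs several hundred games. The function name is invented for illustration:

```python
import math

def games_to_detect(score, z=1.96):
    """Rough number of games before a given score is distinguishable from 50%
    at the two-sided 95% level (normal approximation, draws ignored)."""
    # require |score - 0.5| > z * sqrt(0.25 / n), solved for n
    return math.ceil((z ** 2 * 0.25) / (score - 0.5) ** 2)

print(games_to_detect(0.60))   # 97: a 100-game match is just enough
print(games_to_detect(0.55))   # 385: a 55% edge needs several hundred games
```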
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some thoughts on QS

Post by diep »

Uri, you can guess what you want, but there are a dozen chess programmers who have hard proof from their own engines that it is totally useless to test your own engine against your own engine using different selective algorithms.

I simply presented facts, not speculation - unlike you.

The theoretical insight is pretty simple. If you just test your own engine against your own engine, you are basically testing against a subset of reality, and all you achieve is seeing that subdomain of reality more accurately.

Every selective algorithm brings its own risk. Some bring more than others.

By testing your selectivity on just a subset of reality, it is pretty trivial that this is not a correct scientific approach, as you can never be sure that you have merely optimized for the subset rather than for the much bigger reality.

In practice this incesttesting goes wrong when testing algorithms that bring a risk with them. It is not surprising it goes wrong there, as you basically stretch some search lines (because of bigger iteration depth) and shorten others (because of pruning/reduction). This creates horizon effects and makes it harder to get accurate positional scoring, as for a given position P with depthleft d, when the selective algorithm is applied to the same position P with depthleft d, you are comparing moves m searched to different depths d'.

Trivially that removes positional insight at the same search depthleft d.

Trivially, for the same evaluation function e, when we search a tad deeper, we in practice see search lines more accurately. Especially the mainline m we see tactically deeper.

That is a huge advantage, given the same evaluation function e.

Such tactics, of course, always overrule the evaluation function e. As we already have positional insight to iteration depth i for the normal version, versus i + x with x > 0 for the more selective version, and using the lemma that positionally we learn little with each additional ply of depth, the statistical odds are tiny that our positional insight, other than material, changes a lot.

Yet we do see material better, at least in the mainline, even if we have slightly weaker positional understanding in the selective version.

The same effect does not apply when testing against other programs, as they might have a totally different evaluation function; therefore, if our program loses some positional knowledge because of the added selectivity, it is not certain that this loss expresses itself the same way against them.

So there are two effects, one statistical and one theoretical, that explain why incesttesting is a bad idea.
Uri Blass
Posts: 10312
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Some thoughts on QS

Post by Uri Blass »

The theoretical insight is not so simple when we talk about search changes.

If A2 is weaker than A1 when they play against an unrelated program B, then
it means that A2 makes mistakes against B that A1 does not make against B,
and if you start from the relevant positions where A2 makes those mistakes, it is going to lose against A1 as well.

I have not read a convincing explanation of why the positions where A2 is weaker happen only against B.

Note that when we talk only about evaluation changes, I think it may be logical to have something non-transitive when you change a symmetric evaluation to a non-symmetric evaluation.

Imagine we have A1, A2 and B (A2 is a modified A1).

Both A1 and A2 have a common weakness: they have no king safety evaluation.

A2 at least knows that it does not know about king safety, so it has a bonus for trading down to the endgame, even at the price of an inferior position.

Of course A1 is going to beat A2 in a match, because A1 is going to be happy to trade down to a better endgame.

When both of them play against B (a program with king safety evaluation but no endgame knowledge), things are different.

A1 is going to get an inferior endgame against B, but it is going to beat B because it is stronger in the endgame.

A2 is not going to get into an endgame against B, and B is going to beat A2 often via king attacks.
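Uri's construction amounts to saying that strength is not one-dimensional. With invented head-to-head scores matching his story, a single Elo scale cannot fit all three results at once - the gap A2-vs-B predicted from the two matches involving A1 disagrees badly with the direct result:

```python
import math

def elo_gap(score):
    """Elo difference implied by an expected score, logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Invented head-to-head scores matching the story above:
s_a1_a2 = 0.60   # self-play: A1 beats A2 (happy to trade into won endgames)
s_a1_b  = 0.55   # A1 also edges out B (outplays it in the endgame)
s_a2_b  = 0.35   # yet A2 collapses against B (king attacks decide)

# If strength were a single number, the gaps would have to add up:
predicted_a2_b = elo_gap(s_a1_b) - elo_gap(s_a1_a2)   # A2 minus B, via A1
actual_a2_b = elo_gap(s_a2_b)
print(round(predicted_a2_b), round(actual_a2_b))      # about -36 vs -108
```

No assignment of one rating per engine reproduces all three scores, which is the non-transitivity Uri is describing.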
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Some thoughts on QS

Post by Houdini »

diep wrote: So there are two effects, one statistical and one theoretical, that explain why incesttesting is a bad idea.
There is theory, and there is practice.

Your style of argumentation is very much like Bob Hyatt's, claiming things don't work based on some past experience from a very long time ago, citing "a dozen chess programmers who have hard proof" etc. Very often this is to support a claim that something "does not work" when in fact, in my and other people's experience, it works very well.

From various comments on this forum it appears that for all current top engine authors, auto-testing is an important part of their development cycle. Your comments suggest that you're slightly out of touch.

Robert