Singular Extensions

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 03, 2010 1:10 am

My goal is also to know the truth. I have been working on SE for a couple of months now with no benefit from it whatsoever, while I see claims of at least 40 Elo here and there. IMO this has been disproven already, as none of the authors with SE said here that they use a TC longer than 5 + 5. Where did that claim come from originally? Can we see the test results?

I'm not a "believer". I'm interested in the issue because I want to know whether it's worth spending much time trying to make ttSE work better. In that respect the test was, to me, not complete. The 5+5 result was an urban-myth debunking point for Bob, but the 10+10 result was somehow swept under the carpet.

It is not fun to run someone else's program, and for a long time at that. I wouldn't do 10 + 10 myself, let alone 60 + 60. So for such long-TC requests to be reasonable, the requester must show the requestee some games showing there is indeed a significant gain. Otherwise why would he play games just to disprove someone else's hunch? Bear in mind that we have never had any kind of extension that scales well with time before.

Don · Post by **Don** » Tue Aug 03, 2010 1:49 am

Daniel Shawul wrote:
The best way to contribute is to run some tests of your own.
So now you are directing the blame elsewhere when a flaw in your test is pointed out. What? You did 360 or so games and call it a contribution, while you assumed I did nothing.

What I'm saying is that all you do is talk. That makes me believe it won't matter how we do our testing, you would find it flawed and always be able to produce some reason why it's not the way you think it should be.

You claim that Bob mysteriously stopped his test, etc. Of course it was suspicious since it did not match what you expected.

I played many games with it in various forms. I have already posted here many times that I got nothing out of any of them. I played 9000 or more games at 40/30.

So I am relying on facts, and you are relying on a hunch. For starters, Scorpio, Crafty and Spike have got nothing out of it so far.
Send me Doch with an SE option that I can enable/disable, so I can test down to 1 Elo point at 1+1 (the same as what you are doing) against many opponents. Send it and we will see what it is made of.

I actually find it _really_ hard to believe that your form of SE (PV nodes only, with 0.8 pawn) gives you better results than Stockfish's. At least SF does it at every node where there is a fail high. That is vastly more nodes than you try it on.

I don't believe that SE is going to work the same in any 2 programs. It certainly must have something to do with other extensions and reductions you are already using.

You almost seem upset that it's working for me. Do you think I should take it out on your recommendation?
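
For readers following the disagreement about where SE is applied, here is a minimal sketch of the singularity test, under stated assumptions. The names `is_singular`, `extension_plies`, `margin_cp` and `pv_only` are hypothetical and come from no engine discussed here. The 80-centipawn default reads the "0.8 pawn" above as the singular margin, which is an assumption, as is the one-ply extension. The alternative scores are assumed to come from a reduced-depth search that excludes the hash move; that search is not shown.

```python
from typing import Iterable


def is_singular(tt_score_cp: int, alt_scores_cp: Iterable[int], margin_cp: int = 80) -> bool:
    """Return True if the hash move looks singular.

    The hash move counts as singular when every alternative move scores
    below (hash score - margin). All scores are in centipawns.
    """
    threshold = tt_score_cp - margin_cp
    return all(score < threshold for score in alt_scores_cp)


def extension_plies(node_is_pv: bool, pv_only: bool, singular: bool) -> int:
    """Extend by one ply (assumed) when the move is singular and the node qualifies.

    pv_only=True models a "PV nodes only" policy; pv_only=False models
    trying the test at any node type.
    """
    if pv_only and not node_is_pv:
        return 0
    return 1 if singular else 0
```

With `pv_only=True` the test is simply skipped at non-PV nodes, which is why the two policies can differ so much in how many nodes they touch.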

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 03, 2010 2:01 am

What I'm saying is that all you do is talk. That makes me believe it won't matter how we do our testing, you would find it flawed and always be able to produce some reason why it's not the way you think it should be.

You claim that Bob mysteriously stopped his test, etc. Of course it was suspicious since it did not match what you expected.

What a stupid thing to say! If you can't properly read what people posted, just stop posting. I never said anything like that.
Recheck the threads and maybe you will find something like that in Ralph's post.

Ralph Stoesser · Post by **Ralph Stoesser** » Tue Aug 03, 2010 2:13 am

bob wrote:
Ralph Stoesser wrote:
Daniel Shawul wrote: 5 + 5 gives enough depth so why ask for more ??

Because 10+10 was measured (comparatively much) stronger, with an increasing tendency? Don't you believe in holy cluster test results??

But suddenly the test was stopped ... surprise, surprise.
Where is this "much" stronger coming from? I got roughly +5 at one time control, +17 at another. That is not "much stronger".

Why say "roughly" about exact measurements?

bob wrote: I've aborted this test. The ttSE version is +4 Elo stronger, maybe. I have started a 10min+10s match, although with a lot fewer games. I'll report tomorrow, although I am not sure how long it will run.
Rank Name                      Elo    +    - games score oppo. draws
   1 Stockfish 1.8 64bit      2850    4    4 30193   82%  2550   20%
   2 Stockfish 1.8noSE 64bit  2846    4    4 30246   82%  2551   21%

TC 5+5: +4 Elo

bob wrote: I finally stopped the test last night. The error bar was down to +/- 8, the difference was +18 Elo. Not insignificant, but also not in line with the claims I had seen on freechess. One person there claimed +100 or so, which would be remarkable for any change.

TC 10+10: +18 Elo

In each case these are the latest results you reported yourself.

In absolute terms a +18 Elo difference may look tiny, but in relative terms it is much more than the 5+5 result. Roughly: TC doubled, Elo gain quadrupled.

Isn't that worth looking into more deeply?
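
As a rough check on the two figures quoted above, the sketch below compares the intervals implied by the reported error bars. It assumes the ±4 from the 5+5 table can be applied to the 4-Elo difference itself, which is a simplification because that table lists one error bar per engine rather than one for the difference. Non-overlapping intervals are only a coarse indication, not a formal significance test. The helper names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]


def interval(gain: float, margin: float) -> Interval:
    """Interval implied by a reported gain and its +/- error bar."""
    return (gain - margin, gain + margin)


def overlap(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


blitz = interval(4, 4)    # 5+5: +4 Elo, assuming +/- 4 applies to the difference
longer = interval(18, 8)  # 10+10: +18 Elo, +/- 8 as reported

print(blitz, longer, overlap(blitz, longer))
print("ratio of gains:", 18 / 4)
```

Under these assumptions the intervals are (0, 8) and (10, 26), so they do not overlap, and the ratio of the two gains is 4.5.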

Uri Blass · Post by **Uri Blass** » Tue Aug 03, 2010 6:36 am

bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
Daniel Shawul wrote: Duh, I am _not_ comparing 1.7.1 and 1.8.
The point is that Stockfish 1.7 and 1.8 both have SE, and yet their blitz and long-time-control ratings remain the same. If it gave it a push we should see its benefits there too, no?
No. If SF gets the same rating at both short and long time controls, why is it you think you can pick out one thing (such as SE) and claim that this is proof that SE does not help or hurt it at long time controls?
I go with Daniel here. Most of those programs in the lists are not SE-based. If SE "picks up Elo" as the depth increases should it not widen the gap between itself and other programs below it that won't pick up that same boost since they don't have SE?

It could be (and almost certainly is the case) that some things in SF scale better than others. They have the same trouble everyone else does, it's very difficult to get a lot of games in at long time controls.

So some of the things in SF probably help the program even more at longer time controls and some things help less, or even hurt it at longer time controls.

The fact that it does not get weaker or stronger at long time controls means that on average they balance out. It doesn't mean you can pick a feature at random and say this proves that feature does not help or hurt at long time controls.
No, but if you have features that do worse and offset any potential gain from something else, just because you go deeper, should not those kinds of features be removed a.s.a.p.? Why have something that does _worse_ as hardware improves?
The reason to have something that does worse as hardware improves is that it simply does better than previous versions at all time controls.

I believe that a simple speed improvement does worse as hardware improves because of small diminishing returns.

Being 10 times faster may give you 220 Elo at a 5+5 time control and only 200 Elo at a 10+10 time control, and there may be improvements that are equivalent to speed improvements.
This is simply not possible. If it does better at _all_ time controls, then by definition it will do better on faster hardware, because increased hardware speed tomorrow is the same as a longer time control today.

My point is not that it is not going to do better on faster hardware, but that the rating improvement may be bigger at blitz because of some good changes that are not about SE (in spite of the fact that the benefit of SE is bigger at long time controls).

Uri

bob · Post by **bob** » Tue Aug 03, 2010 6:51 am

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
Daniel Shawul wrote: Duh, I am _not_ comparing 1.7.1 and 1.8.
The point is that Stockfish 1.7 and 1.8 both have SE, and yet their blitz and long-time-control ratings remain the same. If it gave it a push we should see its benefits there too, no?
No. If SF gets the same rating at both short and long time controls, why is it you think you can pick out one thing (such as SE) and claim that this is proof that SE does not help or hurt it at long time controls?
I go with Daniel here. Most of those programs in the lists are not SE-based. If SE "picks up Elo" as the depth increases should it not widen the gap between itself and other programs below it that won't pick up that same boost since they don't have SE?

It could be (and almost certainly is the case) that some things in SF scale better than others. They have the same trouble everyone else does, it's very difficult to get a lot of games in at long time controls.

So some of the things in SF probably help the program even more at longer time controls and some things help less, or even hurt it at longer time controls.

The fact that it does not get weaker or stronger at long time controls means that on average they balance out. It doesn't mean you can pick a feature at random and say this proves that feature does not help or hurt at long time controls.
No, but if you have features that do worse and offset any potential gain from something else, just because you go deeper, should not those kinds of features be removed a.s.a.p.? Why have something that does _worse_ as hardware improves?
The reason to have something that does worse as hardware improves is that it simply does better than previous versions at all time controls.

I believe that a simple speed improvement does worse as hardware improves because of small diminishing returns.

Being 10 times faster may give you 220 Elo at a 5+5 time control and only 200 Elo at a 10+10 time control, and there may be improvements that are equivalent to speed improvements.
This is simply not possible. If it does better at _all_ time controls, then by definition it will do better on faster hardware, because increased hardware speed tomorrow is the same as a longer time control today.
My point is not that it is not going to do better on faster hardware, but that the rating improvement may be bigger at blitz because of some good changes that are not about SE (in spite of the fact that the benefit of SE is bigger at long time controls).

Uri

The original comment was "perhaps SE helps more than expected, but some other feature gets worse at longer time controls, to offset part of the SE gain."

My comment is simply that such a thing is a result of poor testing, because one should _never_ test multiple changes at the same time. If something hurts at longer time controls, which in my testing is extremely rare, then it should have already been removed. You are saying effectively the same thing, except that it has been overlooked in testing and was allowed to slip into the program somehow without being detected. While possible, improbable comes to mind as a better descriptive term. It is very difficult to design something that is worse as you go deeper. Sometimes extensions can do that since the deeper you go, the more opportunity there is for extensions to blow up the tree. But it really is quite difficult. And it would be pretty retarded to miss this and let it remain in the program.

I've run several hundred million games over the past 3 years or so, and I have only _rarely_ found changes that look good at fast time controls and bad at long ones. And I really do mean rarely. The big majority of those have to do with time allocation, in our case, as a change will do fine at one T/C and do poorly at another. But so far, no eval changes have shown that at all, and no search changes I can think of, although that doesn't mean that it couldn't happen. But it does suggest it would be quite rare considering the number of test games where we have run short and long games back-to-back to see how things measure up.

The LMR discussion is a case in point. Intuition might suggest that it does better as you go deeper. Actual testing says it might be marginally better. But notice that in my previous testing it was pretty consistent at very fast games, and even at 5+5, and the program took a small (about +12 Elo) jump at 10+10, which could be more the influence of something else (such as time allocation) or of an increase in EBF that alters how the time allocation works, etc.

Mangar · Post by **Mangar** » Tue Aug 03, 2010 2:30 pm

bob wrote: My comment is simply that such a thing is a result of poor testing, because one should _never_ test multiple changes at the same time.

Hi,

This is not always true. There might be single changes that don't improve anything individually while a combination of them does.
I recently had a situation where 4 changes to reductions/extensions, tested separately, didn't bring anything, but all together gained a good amount of Elo. I think that testing only single changes gives you a high chance of getting stuck in a local maximum.

Greetings Volker

Don · Post by **Don** » Tue Aug 03, 2010 2:46 pm

Mangar wrote:
bob wrote: My comment is simply that such a thing is a result of poor testing, because one should _never_ test multiple changes at the same time.
Hi,

This is not always true. There might be single changes that don't improve anything individually while a combination of them does.
I recently had a situation where 4 changes to reductions/extensions, tested separately, didn't bring anything, but all together gained a good amount of Elo. I think that testing only single changes gives you a high chance of getting stuck in a local maximum.

Greetings Volker

I suppose it's in the interpretation. I would view this as a single change or if you prefer a single "compound change" because it's all part of the same thing.

You and Bob bring up an interesting issue: can changes be tested in combination? H.G. Muller suggested something called an orthogonal multi-tester many months (perhaps years) ago.

It may be that you CAN combine changes if you set up your testing accordingly, but you are still testing them individually as it is required that you separate them. You would not test 2 separate things combined into a single change unless you were convinced that it makes sense and you were looking specifically for interactions - but you could even do that with multi-testing. You could look for pair-wise interactions of everything you test for that matter.
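
One way to picture testing changes in combination while still separating them is a full on/off factorial layout. The sketch below is only an illustration under that assumption; whether it matches the orthogonal multi-tester mentioned above is not something this thread confirms. The function names are hypothetical and the Elo numbers are made up.

```python
from itertools import product
from typing import Dict, List, Tuple

Config = Tuple[int, ...]  # one 0/1 flag per change (1 = change enabled)


def all_configs(n_changes: int) -> List[Config]:
    """Every on/off combination of n_changes independent switches."""
    return list(product((0, 1), repeat=n_changes))


def main_effect(elo: Dict[Config, float], i: int) -> float:
    """Average Elo with change i on minus average Elo with it off."""
    on = [v for cfg, v in elo.items() if cfg[i] == 1]
    off = [v for cfg, v in elo.items() if cfg[i] == 0]
    return sum(on) / len(on) - sum(off) / len(off)


def interaction(elo: Dict[Config, float], i: int, j: int) -> float:
    """Half the difference between the effect of i with j on and with j off."""
    def effect_of_i(j_state: int) -> float:
        on = [v for cfg, v in elo.items() if cfg[i] == 1 and cfg[j] == j_state]
        off = [v for cfg, v in elo.items() if cfg[i] == 0 and cfg[j] == j_state]
        return sum(on) / len(on) - sum(off) / len(off)
    return (effect_of_i(1) - effect_of_i(0)) / 2


# Made-up results: neither change helps alone, both together gain 20 Elo.
synergy = {(0, 0): 0.0, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 20.0}
# Made-up results: the two changes simply add up (no interaction).
additive = {(0, 0): 0.0, (1, 0): 5.0, (0, 1): 7.0, (1, 1): 12.0}

print(main_effect(synergy, 0), interaction(synergy, 0, 1))
print(main_effect(additive, 0), interaction(additive, 0, 1))
```

In the made-up `synergy` data, testing either change alone against the baseline shows nothing, while the factorial averages show a main effect of 10 and an interaction of 10. In the `additive` data the interaction term is 0. The cost is that the number of configurations doubles with each added change (16 for four changes).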

Daniel Shawul · Post by **Daniel Shawul** » Tue Aug 03, 2010 5:08 pm

This is not always true. There might be single changes that don't improve anything individually while a combination of them does.
I recently had a situation where 4 changes to reductions/extensions, tested separately, didn't bring anything, but all together gained a good amount of Elo. I think that testing only single changes gives you a high chance of getting stuck in a local maximum.

If your changes are _orthogonal_ they should add up. Extensions and reductions are highly correlated, so I am not surprised by your result. You can only get trapped in a local maximum if you have high correlation, as is often the case for many eval parameters. There was a discussion about this when Remi announced his QLR. If you tune two eval parameters like "Rook on open file" and "mobility" separately, then superposition might not work because of the obvious correlation between the two terms. Because of things like that it is advisable to test things one at a time unless you are sure there is interaction.

bob · Post by **bob** » Tue Aug 03, 2010 5:21 pm

Mangar wrote:
bob wrote: My comment is simply that such a thing is a result of poor testing, because one should _never_ test multiple changes at the same time.
Hi,

This is not always true. There might be single changes that don't improve anything individually while a combination of them does.
I recently had a situation where 4 changes to reductions/extensions, tested separately, didn't bring anything, but all together gained a good amount of Elo. I think that testing only single changes gives you a high chance of getting stuck in a local maximum.

Greetings Volker

Again, I believe that is the result of a lack of testing. With enough games, the error resolution is good enough that you can measure _small_ changes. While I have not seen the necessity, one can play upwards of 150K games and get the error down to +/- 1 Elo. I can't imagine a real case where you make any one of four changes and get nothing, but make all 4 and get +40, again assuming that the changes are technically not related. If there is correlation between changes (say passed pawn advancement bonus vs passed pawn blockaded penalty) then changing both may well be better than just changing either one. But they are obviously correlated in what they do. But to suggest that SE is really better than it looks because something unrelated to SE is really worse at deeper depths says, to me, that a lack of testing is at the bottom of this.
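
To make the link between game count and error bar concrete, here is a minimal sketch using a normal approximation. It assumes independent games and treats the opponents' ratings as fixed. It is not the method used by any rating tool mentioned in this thread, and the function names are hypothetical.

```python
import math


def elo_margin(games: int, score: float, draw_rate: float, z: float = 1.96) -> float:
    """Approximate +/- Elo margin for a match result (delta method).

    score is the fraction of points scored (0..1), draw_rate the fraction
    of drawn games, z the normal quantile (1.96 is roughly 95%).
    """
    wins = score - draw_rate / 2.0                    # win fraction implied by score and draws
    variance = wins + 0.25 * draw_rate - score ** 2   # per-game variance of results 1, 0.5, 0
    std_err = math.sqrt(variance / games)             # standard error of the mean score
    # Slope of elo(s) = 400 * log10(s / (1 - s)) with respect to s
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * std_err * slope


def games_needed(target_margin: float, score: float, draw_rate: float, z: float = 1.96) -> int:
    """Games required to reach target_margin; the margin shrinks as 1/sqrt(games)."""
    per_game = elo_margin(1, score, draw_rate, z)
    return math.ceil((per_game / target_margin) ** 2)


print(round(elo_margin(30193, 0.82, 0.20), 1))
```

For the first row of the table quoted earlier in the thread (30193 games, 82% score, 20% draws) this approximation gives a margin of roughly ±4 Elo, in line with the error bar reported there. Because the margin shrinks with the square root of the number of games, halving it takes four times as many games.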

Re: Singular Extensions - long games
