Progress on Rustic

mvanthoor · Post by **mvanthoor** » Mon Apr 03, 2023 2:29 pm

lithander wrote: ↑Mon Apr 03, 2023 1:39 pm You obviously enjoy the benefit of a recent math class. We old folks have to make do with what we can remember and then deriving the partial derivate of something involving multip-dimensional vectors can easily seem like an insurmountable obstacle.

My last math class was 20 years ago, and I never heard of a partial derivative, even. I can get a derivative of something like "y = 5x³ + 6x² - x + 5", but what is a partial derivative of that? I'll have to look that up.

lithander wrote: ↑Mon Apr 03, 2023 1:39 pm Explaining it like that I feel like you don't even have to know what a derivative is to understand why the above code works well in tuning the weights to minimize error on a given dataset.

In the near future I'll have a look at it. I have to understand it perfectly if I'm to explain it in my documentation. When I start to actualy work with it, I'm sure I'll suddenly understand it completely, like I did with magic bitboards 3 years ago. I should have documented that then and there, because I now have to completely read up on that again. I don't understand one whit of the stuff you're talking about in Leorik's thread. It doesn't seem very difficult, but I forgot too much about magic bitboards to easily understand it.

mvanthoor · Post by **mvanthoor** » Mon Apr 03, 2023 2:32 pm

algerbrex wrote: ↑Mon Apr 03, 2023 2:27 am Feel free to message if you have any questions.

Thanks. I 'll make sure to pester you if need be.

Ras · Post by **Ras** » Mon Apr 03, 2023 2:41 pm

mvanthoor wrote: ↑Mon Apr 03, 2023 2:29 pmI can get a derivative of something like "y = 5x³ + 6x² - x + 5", but what is a partial derivative of that?

A partial derivative makes only sense when you have a function of more than one variable. Like f(x, y) = x² + y³. Then df/dx = 2x because when fiddling with x only, y is treated as constant. df/dy = 3y², same logic.

algerbrex · Post by **algerbrex** » Tue Apr 04, 2023 7:02 pm

lithander wrote: ↑Mon Apr 03, 2023 1:39 pm You obviously enjoy the benefit of a recent math class. We old folks have to make do with what we can remember and then deriving the partial derivate of something involving multip-dimensional vectors can easily seem like an insurmountable obstacle.

Guilty as charged

although I did have to reacquaint myself with some of the relevant theory since my multivariable calculus course was a couple of semesters ago now.

But I agree, all the details of the mathematics surrounding gradient descent can be a little opaque, but the overall idea is actually pretty straightforward, as you explained in your answer.

At some point I plan on writing out the derivation step by step in LaTeX, along with some explanation of what's going on and how to implement the idea in code. Just to make it easier for people who don't have the time or resources to learn all the relevant calculus theory to find the derivative themselves.

algerbrex · Post by **algerbrex** » Tue Apr 04, 2023 7:18 pm

mvanthoor wrote: ↑Mon Apr 03, 2023 2:32 pm
algerbrex wrote: ↑Mon Apr 03, 2023 2:27 am Feel free to message if you have any questions.
Thanks. I 'll make sure to pester you if need be.

Sure. Lithander's explanation of the intuition behind why gradient descent works is really nice as well, and I plan on writing a simplified explanation with a slightly more math-y flavor, but hopefully still clear. Because once you understand the idea behind gradient descent, really the main challenge is then figuring out how to compute the relevant derivatives.

mvanthoor · Post by **mvanthoor** » Tue Apr 04, 2023 11:39 pm

algerbrex wrote: ↑Tue Apr 04, 2023 7:18 pm Because once you understand the idea behind gradient descent, really the main challenge is then figuring out how to compute the relevant derivatives.

That is just the point. I understand how gradient descent works. I've used it in the past. Basically, to find a minimum in a function, you compute the derivative, and then find where the tangent is 0. You do this by picking a point on the function, see how steep the tangent is, and then pick a point further on the x-axis. If the tangent for that point gets less steep, you're moving in the right direction. If the tangent goes from (say) negative to positive, you've overshot your zero-point.

And so you go back and forth in smaller and smaller steps until you find where the tangent is 0. And that's either your optimum, or minimum.

Now just someone has to clearly explain to me how I do this for a function with something like 786 variables. (The words "partial derivative" hint at how it's done; with one variable at a time, but I'll have to look into it after I get this new computer fully set up.)

It was the same with magic bitboards: bitboards are easy to understand, but magics are not, because they are hard to explain. Until I found two explanations that very clearly explained how the magic numbers worked, and how they should be computed, and then it suddenly clicked and I had the function written in half an hour. Most posts on the internet that try to explain magic bitboards start out well with explaining the bitboards, and then descend into incomprehensible babble. (I'll try to find the two explanations I used in the past, for the magic bitboards and post them here.) The same is true for gradient descent and neural networks, to be honest. Everybody is easily able to explain the main idea, but then the explanation of WHY it works goes to hell.

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 05, 2023 1:33 am

mvanthoor wrote: ↑Tue Apr 04, 2023 11:39 pm It was the same with magic bitboards: bitboards are easy to understand, but magics are not, because they are hard to explain. Until I found two explanations that very clearly explained how the magic numbers worked, and how they should be computed, and then it suddenly clicked and I had the function written in half an hour. Most posts on the internet that try to explain magic bitboards start out well with explaining the bitboards, and then descend into incomprehensible babble. (I'll try to find the two explanations I used in the past, for the magic bitboards and post them here.) The same is true for gradient descent and neural networks, to be honest. Everybody is easily able to explain the main idea, but then the explanation of WHY it works goes to hell.

Magic Bitboards have gone poof. Kindergarten indexed Sub-Set Yielding (KiSSY) bitboards has vanished Magic as KiSSY is even faster than traditional PEXT. And Thomas is probably right now running the test for PEXT indexed Sub-Set Yielding (PiSSY) bitboards that will exceed KiSSY by a large margin. Why bother anymore with Magic?

I can't help myself. I haft to promote my new bitboard algorithm.

mvanthoor · Post by **mvanthoor** » Wed Apr 05, 2023 9:55 am

Mike Sherwin wrote: ↑Wed Apr 05, 2023 1:33 am Magic Bitboards have gone poof. Kindergarten indexed Sub-Set Yielding (KiSSY) bitboards has vanished Magic as KiSSY is even faster than traditional PEXT. And Thomas is probably right now running the test for PEXT indexed Sub-Set Yielding (PiSSY) bitboards that will exceed KiSSY by a large margin. Why bother anymore with Magic?

I can't help myself. I haft to promote my new bitboard algorithm.

That's just the thing. I have been out of the loop for nearly a year. Has this new version been explained anywhere, clearly and concisely, so I can implement it? I have a feeling that it has been developed over the course of a few weeks or months, over multiple threads in this forum, and now there are only two people who know how it works: You, and Thomas. In case someone would make a writeup of how these work, I may implement them in the future, especially if they are even faster without PEXT than Magic bitboards with PEXT. I'm certainly not going to dig through months of forum posts to piece it together.

I still have to look into PEXT as well. I assume that the only explanation about that is the classical case in the CPW.

lithander · Post by **lithander** » Wed Apr 05, 2023 9:59 am

Mike Sherwin wrote: ↑Wed Apr 05, 2023 1:33 am Magic Bitboards have gone poof. Kindergarten indexed Sub-Set Yielding (KiSSY) bitboards has vanished Magic as KiSSY is even faster than traditional PEXT. And Thomas is probably right now running the test for PEXT indexed Sub-Set Yielding (PiSSY) bitboards that will exceed KiSSY by a large margin. Why bother anymore with Magic?

I can't help myself. I haft to promote my new bitboard algorithm.

You're a bit premature with your promotion, though. My implementation is in a managed language and I posted just one match result run on one computer at one time control setting against a compilation made by Daniel that we don't even know the modifications of. Within the error margins KiSS was equal(!) with PEXT. Magic Bitboards weren't tested.

I have been doing more rigorous tests since the weekend and the first thing was to separate the improvements from switching to .Net 8 from the presumed improvements of change the slider generating code.
Marcel posted in this very thread that he only uses concurrency up to the amount of physical cores on his system because hyper-threading has been problematic in the past and so I've also been changing the concurrency from 22 (on a 12-core, 24-thread system) to 12 and the time control from 5s+0.5s to 5s+0.2s and while PEXT didn't suffer I could not reproduce the good results of KiSS. I have no explanation as to why. Also my other experiments with using subsets together with the PEXT instruction are not too impressive as well.

I'm still collecting data and will share it in due time. I'm also ready with the code and can push the branch to github, soon. Everyone who's interested will be able to compile custom Leorik versions and run their own matches. But Magic Bitboards haven't gone poof before some patch replacing them is accepted to Stockfish

mvanthoor · Post by **mvanthoor** » Wed Apr 05, 2023 10:50 am

lithander wrote: ↑Wed Apr 05, 2023 9:59 am Marcel posted in this very thread that he only uses concurrency up to the amount of physical cores on his system because hyper-threading has been problematic in the past

What I've noticed in the past is that multithreading beyond the number of physical cores you have can skew the results. An engine that plays its game from a logical thread is much slower than one playing from a real core. If you run a huge match, this should equalize out between the engines, but in practice, it not always does. Then you get the result that an engine can perform much better or worse than expected in one match, and in another match, it's a different engine with weird results.

The current trend of putting E-cores in CPU's (the big.LITTLE architecture that Intel now uses) will complicate this further, because we get another type of thread in the computer. If these E-cores gain hyperthreading / SMT at some point, we get a fourth. Then an engine could be running on a real performance core, on a logical thread of one of the performance cores, on the E-core, or on on a logical thread of an E-core.

That would make match results very unpredictable IMHO, unless you run HUGE matches to try and equalize everything out (as in: each engine runs on each type of thread the same amount of time).

Even Turbo Boost is problematic, because, at least for AMD, it is controlled by temperature. The CPU will just boost until it hits 95C, so in the winter your computer will run faster than in the summer. Depending on the cooling in the system, you could even have Turbo Boost go up and down and use different frequencies during a match, which will also skew the results. Engines running in the start of the tournament may have a faster CPU than the ones running later.

Therefore I'm going to put my new CPU into 105W Eco mode (it'll cost something like 3% performance), undervolt it by 0.1V, and probably even limit the boost speed to a point where I can be sure that the CPU can hold that speed indefinitely; even in mid-summer.

But Magic Bitboards haven't gone poof before some patch replacing them is accepted to Stockfish

We'll see. In the end, the move generator itself just accounts for something like 10% of the computations required, so you hit the notion of diminishing returns quite quickly. I liken it to an old CD-burner. The first 1x burner took 74 minutes to write a CD.

1x = 74 minutes
2x = 37 minutes
4x = 18.5 minutes
8x = 9 minutes, 15 seconds
16x = 4 minutes, 35 seconds

And beyond that, nobody cared. Even if a burner was "24x", "32x" or even "40x" or faster, none I had ever dropped below 4 minutes. That is because of the CD's lead-in was still written at 1x, then the disc had to slowly ramp up to its maximum speed and then ramp down again to write the lead-out at 1x.

I feel it is the same with move generators: Fancy / Black Magic Bitboards + PEXT are so fast, and current evaluations and NN's take up so much computational power, that any improvement that can be measured outside the margin of error will probably be less than 5 Elo.

Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic

Re: Progress on Rustic