Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by dragontamer5788 »

https://www.pcmag.com/news/371784/seaga ... es-in-2020

While there has been a lot of talk about new CPUs, GPUs, and flash-storage (or "Optane") technologies, hard-drive progress continues to march forward. Seagate will release 18TB conventional hard drives next year, as well as 20TB shingled (SMR) drives (SMR has very poor write performance but higher capacity). WD has similarly announced 18TB and 20TB MAMR (microwave-assisted) drives, Seagate is expected to report more progress on its HAMR (heat-assisted) drives, and Toshiba is currently shipping 16TB drives. We can reasonably expect HDD progress to continue.

So my question to the community: what can the chess community do with such huge storage capacities?

* IIRC, the 7-man Syzygy tablebases weigh in at 18.4TB. Since Syzygy use is essentially read-only, a 20TB SMR drive could feasibly be used. Maybe 2x or 3x drives could be used for multithreading purposes: different threads could get their own hard drive to help speed up the search. With a 1 Gbps connection, you could expect to download the 18.4TB within about 3 days. But SMR drives are pretty slow at writing, 40 MB/s or less, so it would take 5+ days just to write the data onto the SMR drive!

* It might be more practical to wait for 20TB HAMR or MAMR drives and stay away from SMR for now. Or just RAID0 2x 18TB helium drives together.

* The 7-man Lomonosov tablebases weigh in at over 100TiB, requiring 6x 18TB hard drives to access. Larger hard drives are still needed!

* Larger opening books: I don't know much opening-book theory, but surely bigger is better? The wiki suggests that an entry can be anywhere from 16 bytes to 100 bytes per position (depending on the methodology). An 18TiB hard drive could therefore store somewhere between 180 billion and 1.2 trillion entries.

* EDIT: If we assume 500 writes per second, each 512 bytes long (the typical sector size of a hard drive), it would take 800+ days to fill an 18TiB drive. And 500 IOPS is on the high end of what HDDs can do. Realistically, the only way to fill 18TiB of hard-drive space is to write to the HDD sequentially, where typical HDDs achieve ~200MB/s. Any pragmatic use of an HDD would almost certainly require a 1TiB+ NVMe SSD (~100,000 writes/second and 2GB/s read/write speed) to cache and reorder data into a form that optimizes the HDD's usage. Even in this ideal sequential scenario, it would take over a day before all 18TiB were written to disk. (A quick sanity check of these numbers is sketched after this list.)

* For inspiration, it should be noted that IBM's "quantum supremacy" counter-paper leveraged 64PiB of hard drives for the 53-qubit simulation, and would require 128PiB for a 54-qubit simulation. Big hard drives can still accelerate big compute problems; the question is how.
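Sanity-checking those figures (a rough sketch; the transfer rates and IOPS are the assumed values above, not measurements):

Code: Select all

#include <stdio.h>

int main(void) {
    const double TB  = 1e12;            /* decimal terabyte, as drive vendors count */
    const double TiB = 1099511627776.0; /* binary tebibyte                          */

    /* Download 18.4 TB of Syzygy tables over a 1 Gbps link (~125 MB/s). */
    printf("download at 1 Gbps:       %.1f days\n", 18.4 * TB / 125e6 / 86400);

    /* Write the same data to an SMR drive at ~40 MB/s. */
    printf("write to SMR at 40 MB/s:  %.1f days\n", 18.4 * TB / 40e6 / 86400);

    /* Fill 18 TiB with random 512-byte writes at 500 IOPS. */
    printf("fill 18 TiB at 500 IOPS:  %.0f days\n", 18 * TiB / 512 / 500 / 86400);

    /* Fill 18 TiB sequentially at ~200 MB/s. */
    printf("fill 18 TiB sequentially: %.1f days\n", 18 * TiB / 200e6 / 86400);

    /* Opening-book capacity at 100 vs 16 bytes per entry. */
    printf("book entries: %.0f to %.0f billion\n",
           18 * TiB / 100 / 1e9, 18 * TiB / 16 / 1e9);
    return 0;
}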
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by Zenmastur »

dragontamer5788 wrote: Fri Nov 08, 2019 7:41 pm https://www.pcmag.com/news/371784/seaga ... es-in-2020

It's long been the case that it's more cost-, time-, and power-efficient to store every root position that's analyzed for longer than a second. This includes all positions not already stored in an EGTB, even if 128-byte records are used. The analysis is a little more complex for non-root positions, but many of these should be saved as well, depending on the time used to search them.

Regards,

Zenmastur
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by Dann Corbit »

Zenmastur wrote: Fri Nov 08, 2019 8:21 pm
It's long been the case that it's more cost-, time-, and power-efficient to store every root position that's analyzed for longer than a second. This includes all positions not already stored in an EGTB, even if 128-byte records are used. The analysis is a little more complex for non-root positions, but many of these should be saved as well, depending on the time used to search them.
I would simplify that to storing every PV node in the hash table.
PV nodes are incredibly rare in an ocean of chess positions.
But if you have a gang of machines running around the clock, you will need some serious storage to hold them all.
I would store them in a database, not just a file.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by dragontamer5788 »

Dann Corbit wrote: Fri Nov 08, 2019 9:32 pm
Zenmastur wrote: Fri Nov 08, 2019 8:21 pm
It's long been the case that it's more cost-, time-, and power-efficient to store every root position that's analyzed for longer than a second. This includes all positions not already stored in an EGTB, even if 128-byte records are used. The analysis is a little more complex for non-root positions, but many of these should be saved as well, depending on the time used to search them.
I would simplify that to storing every PV node in the hash table.
PV nodes are incredibly rare in an ocean of chess positions.
But if you have a gang of machines running around the clock, you will need some serious storage to hold them all.
I would store them in a database, not just a file.
Too small. Assuming a 20-ply PV x 128 bytes per node == 2.5kB of storage per second of analysis. That's only 2.5kB written every second.

Aiming at the ~200MB/s an HDD can sustain sequentially, maybe storing the entire transposition table per analyzed position would be a good starting point. Anything less wouldn't realistically use the 18TiB coming next year.

A 1GB transposition table would take 5 seconds to read and 5 seconds to write (at 200MB/s). Over those 10 seconds, Stockfish can probably traverse around 1 billion nodes (~100 meganodes per second for 10 seconds). Assuming 128 bytes per node, a 1GB transposition table would therefore represent roughly the top 1% of nodes "cached" for future reference across a 10-second search.

-------

EDIT: LeelaZero is only on the order of 100 knps, so over 10 seconds LeelaZero will only generate about 1 million positions in total. LeelaZero therefore doesn't even generate enough data (even if the entire MCTS tree were stored) to saturate a hard drive's read/write bandwidth of 200MB/s (i.e. 2GB worth of reads/writes per 10 seconds).
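The arithmetic behind those estimates, as a quick sketch (the node rates and the 128-byte entry size are the assumptions above):

Code: Select all

#include <stdio.h>

int main(void) {
    const double tt_bytes   = 1e9;    /* 1 GB transposition table            */
    const double hdd_seq    = 200e6;  /* ~200 MB/s sequential HDD throughput */
    const double sf_nps     = 100e6;  /* assumed Stockfish nodes per second  */
    const double lc0_nps    = 100e3;  /* assumed LeelaZero nodes per second  */
    const double entry_size = 128.0;  /* bytes per stored node, as above     */

    printf("TT dump or load:        %.0f s each way\n", tt_bytes / hdd_seq);
    printf("nodes searched in 10 s: %.0f million\n", sf_nps * 10 / 1e6);
    printf("TT entries kept:        %.1f million (%.1f%% of nodes)\n",
           tt_bytes / entry_size / 1e6,
           100.0 * (tt_bytes / entry_size) / (sf_nps * 10));
    printf("Leela nodes in 10 s:    %.1f million\n", lc0_nps * 10 / 1e6);
    return 0;
}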
But if you have a gang of machines running around the clock, you will need some serious storage to hold them all.
The main issue is that a gang of machines would probably have an 18TiB hard drive per machine. If you have 10 computers, it's entirely reasonable to assume 10x 18TB hard drives (one per machine), or even 2x or 3x hard drives per machine.
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by Zenmastur »

dragontamer5788 wrote: Fri Nov 08, 2019 9:51 pm
I don't think you are considering the whole problem. But, first things first.

Finding a hard drive that can actually sustain 200Mb/s will be an issue. Second, in order for any of the data written to be useful, it must be probed, AND if a record is found, its data must be from a search that is deep enough to actually be useful. If not, it will have to be updated after an appropriate-depth search is done.

This implies that both reads and writes will be interleaved. Probing will hit random locations on the hard drive, so the heads will be continually thrashing. Not only will you have to contend with head movement and settle time, rotational latency will also be a big issue. Head transfer time will be, on average, 1/3 of the full-stroke latency, and average rotational latency will be ½ the rotation time. This will severely limit the number of reads and writes that can be performed per second. So 200Mb/s is a pipe dream.
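To put rough numbers on that (a sketch assuming a typical 7200 RPM drive with roughly 8.5 ms average seek; real drives will vary):

Code: Select all

#include <stdio.h>

int main(void) {
    const double rpm        = 7200.0;
    const double avg_seek_s = 0.0085;                 /* assumed: ~1/3 of full stroke */
    const double avg_rot_s  = 0.5 * 60.0 / rpm;       /* half a rotation: ~4.2 ms     */
    const double per_io_s   = avg_seek_s + avg_rot_s; /* cost of one random access    */

    printf("random IOPS:       ~%.0f\n", 1.0 / per_io_s);
    printf("at 512 B per I/O:  ~%.2f MB/s\n", 512.0 / per_io_s / 1e6);
    printf("at 64 KiB per I/O: ~%.1f MB/s\n", 65536.0 / per_io_s / 1e6);
    return 0;
}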

While writing everything in the TT may sound like a good idea, it isn't. In order for this system to save time, power, or money, the time to probe the data and return useful info must be less than the time a similar search would take. Most of the info in the TT hasn't been searched to any significant depth, so it would be a waste of time to store it in a file. In addition, the time to probe a position must be divided by the probability that an entry of sufficient depth is actually in the file. This means that many probes must be performed to get a single useful hit, which implies that a hit must save many times the minimum probe time to be of any use, i.e. to save time, money, or power.
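The break-even condition can be written out directly (a sketch; the probe time and hit probability here are placeholder assumptions, not measurements):

Code: Select all

#include <stdio.h>

/* Probing only pays off if, on average,
 *     p_hit * time_saved_per_hit > time_per_probe,
 * i.e. each useful hit must save at least time_per_probe / p_hit. */
int main(void) {
    const double time_per_probe = 0.013; /* assumed ~13 ms per random HDD access           */
    const double p_hit          = 0.02;  /* assumed: 2% of probes find a deep-enough entry */

    printf("each hit must save more than %.2f s of search to break even\n",
           time_per_probe / p_hit);
    return 0;
}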

In short, the ENTIRE process must be looked at to determine what is actually useful.

Regards,

Zenmastur
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by dragontamer5788 »

Finding a hard drive that can actually sustain 200Mb/s will be an issue.
Virtually all hard drives sustain 200MB/s these days. Your typical 5TB or 6TB hard drive reaches these speeds, and future hard drives have more heads and platters. That's only on sequential reads and writes, but 200 MB/s (capital B: bytes, not bits) is entirely possible.
Finding a hard drive that can actually sustain 200Mb/s will be an issue. Second, in order for any of the data written to be useful, it must be probed, AND if a record is found, its data must be from a search that is deep enough to actually be useful. If not, it will have to be updated after an appropriate-depth search is done.
My proposed process is as follows (a rough sketch of the save/restore part appears after the list):

1. Some meta-analysis program determines that Node P31415 should be analyzed (Root -> 3rd move -> 1st move -> 4th move -> 5th move).

2. You run your Alpha-Beta engine on Node P31415 for 10 seconds with a 1GB transposition table. This generates a PV and a "best move" for the position, as well as roughly 1GB of transposition-table data.

3. Node P31415, its best moves calculated thus far, the PV, and its associated 1GB of transposition table are written to the hard drive.

4. The meta-analysis program chooses other nodes (maybe Node P27182) to analyze, again for 10 seconds with a 1GB transposition table.

5. Meta-analysis continues to choose nodes until it eventually decides to revisit Node P31415 for some reason. At this point, it reads the old 1GB transposition table from the hard drive and then begins the AB search again (which should give a better result this time, because you'll have better move orderings from the old transposition table, as well as a huge number of cached results).
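A minimal sketch of steps 2, 3, and 5 (search_with_tt() is a hypothetical stand-in for whatever engine interface you actually use, not a real API):

Code: Select all

#include <stdio.h>
#include <stdlib.h>

#define TT_BYTES 1073741824UL   /* 1 GiB transposition table */

/* Hypothetical stand-in for the real engine search. */
static void search_with_tt(const char *fen, void *tt, size_t tt_bytes, int seconds) {
    (void)fen; (void)tt; (void)tt_bytes; (void)seconds;  /* a real engine would search here */
}

static void save_tt(const char *path, const void *tt) {
    FILE *f = fopen(path, "wb");   /* one sequential write of the whole table */
    if (f) { fwrite(tt, 1, TT_BYTES, f); fclose(f); }
}

static void load_tt(const char *path, void *tt) {
    FILE *f = fopen(path, "rb");   /* one sequential read when the node is revisited */
    if (f) { fread(tt, 1, TT_BYTES, f); fclose(f); }
}

static void analyze_node(const char *node_id, const char *fen) {
    char path[256];
    void *tt = calloc(1, TT_BYTES);
    snprintf(path, sizeof path, "tt_for_position_%s", node_id);

    load_tt(path, tt);                      /* step 5: reuse the old table if it exists */
    search_with_tt(fen, tt, TT_BYTES, 10);  /* step 2: 10-second search with a 1GB TT   */
    save_tt(path, tt);                      /* step 3: dump the table back to disk      */
    free(tt);
}

int main(void) {
    analyze_node("P31415", "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1");
    return 0;
}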
This implies that both reads and writes will be interleaved. Probing will hit random locations on the hard drive, so the heads will be continually thrashing. Not only will you have to contend with head movement and settle time, rotational latency will also be a big issue. Head transfer time will be, on average, 1/3 of the full-stroke latency, and average rotational latency will be ½ the rotation time. This will severely limit the number of reads and writes that can be performed per second. So 200Mb/s is a pipe dream.
That's why I like the idea of saving off the transposition tables in their entirety. It's always going to be a fully sequential, idealized scan from beginning to end.

It's basically:

Code: Select all

FILE *f = fopen("tt_for_position_P31415", "rb");
fread(tt_pointer, 1, 1073741824, f);  /* one sequential 1 GiB read of the saved table */
fclose(f);
This is as ideal as it gets for hard drives.
DustyMonkey
Posts: 61
Joined: Wed Feb 19, 2014 10:11 pm

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by DustyMonkey »

Some things to keep in mind.

o) Some engines are much better than others at re-starting a search with an already-filled HT, for whatever reason(s).
o) Transposition tables can now be so big (32GB, 64GB, ...) that holding even multi-minute searches as a non-lossy connected graph might now be a superior approach in those cases. For instance, a quad-core system can get less than 10 million nodes/sec but be loaded with 128GB of memory. Apply a cuckoo hash to it, which doesn't ever need to do replacement until the HT is approaching full....
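For reference, the displacement step at the heart of cuckoo hashing looks roughly like this (a toy sketch with bare 64-bit keys; a real table would store full entries and rehash or grow on failure):

Code: Select all

#include <stdint.h>
#include <stdio.h>

#define SLOTS (1u << 20)   /* toy table size; key 0 is reserved to mean "empty" */
#define MAX_KICKS 500

static uint64_t table[2][SLOTS];  /* two sub-tables, one per hash function */

static uint32_t h1(uint64_t k) { return (uint32_t)(k % SLOTS); }
static uint32_t h2(uint64_t k) { return (uint32_t)((k >> 20) % SLOTS); }

/* Insert a nonzero key: evict and re-place ("kick") until a free slot turns up.
 * Returns 0 on success, -1 when too many kicks occur (time to rehash or grow). */
int cuckoo_insert(uint64_t key) {
    int side = 0;
    for (int kick = 0; kick < MAX_KICKS; kick++) {
        uint32_t idx = (side == 0) ? h1(key) : h2(key);
        uint64_t evicted = table[side][idx];
        table[side][idx] = key;
        if (evicted == 0) return 0;  /* landed in an empty slot: done       */
        key  = evicted;              /* otherwise push the victim onward... */
        side ^= 1;                   /* ...into its slot in the other table */
    }
    return -1;
}

int cuckoo_contains(uint64_t key) {
    return table[0][h1(key)] == key || table[1][h2(key)] == key;
}

int main(void) {
    cuckoo_insert(0x123456789abcdef0ULL);
    printf("%d\n", cuckoo_contains(0x123456789abcdef0ULL)); /* prints 1 */
    return 0;
}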
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by dragontamer5788 »

DustyMonkey wrote: Sat Nov 09, 2019 7:26 pm Some things to keep in mind.

o) Some engines are much better than others at re-starting a search with an already-filled HT, for whatever reason(s).
o) Transposition tables can now be so big (32GB, 64GB, ...) that holding even multi-minute searches as a non-lossy connected graph might now be a superior approach in those cases. For instance, a quad-core system can get less than 10 million nodes/sec but be loaded with 128GB of memory. Apply a cuckoo hash to it, which doesn't ever need to do replacement until the HT is approaching full....
I do believe the "best hash table" award goes to linear probing with Robin Hood hashing these days, but I guess that's a bit off-topic. Cuckoo hashing has some nifty parallel implementations, for what it's worth.

---

I think I get the gist of what you're saying. Storing the whole hash table as the set of nodes (and then maybe recalculating the edges, so those don't need to be stored) does seem reasonable. However, a 4-bit x 64-slot mailbox + 4 bytes of misc state == 36 bytes per node. The engine needs some extra information, but that will probably fit in 28 bytes (cached evaluation at depth, upper bound vs lower bound, a few of the best moves, etc.). So that rounds off to ~64 bytes per node, maybe a bit less if you run gzip on it before writing to disk (stream output to NVMe flash, gzip on flash, then write to disk?).
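Something like the following layout, give or take which fields count as essential (a sketch of the assumed 64-byte record, not any particular engine's format):

Code: Select all

#include <stdint.h>
#include <assert.h>

/* One stored node, using the sizes discussed above. */
struct stored_node {
    uint8_t  mailbox[32];   /* 64 squares x 4 bits: piece code per square       */
    uint32_t misc;          /* side to move, castling, en passant, 50-move rule */
    int16_t  score;         /* cached evaluation                                */
    uint8_t  depth;         /* depth the score was searched to                  */
    uint8_t  bound;         /* exact / upper bound / lower bound                */
    uint16_t best_moves[4]; /* a few best moves for move ordering               */
    uint8_t  spare[16];     /* room left over, still 64 bytes total             */
};

static_assert(sizeof(struct stored_node) == 64, "8 nodes per 512-byte sector");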

In any case, 10 million nodes/second is around 640MB/s of data generated. Since hard drives only do 200MB/s, for every 100 seconds of analysis the hard drive will spend roughly 320 seconds writing. 4x hard drives in RAID0 would get you the bandwidth you need, but... 10 million nodes/second seems far too low an estimate in any case.

Quad-core is also a severe underestimate of modern machines. The Ryzen 3700X is $300 for 8 cores/16 threads, so 8 cores seems standard. Threadripper starts at 24 cores ($1300) and will probably top out at 64 cores. The year 2019 is at 100 million+ nodes/second, and the years 2020+ will be even faster.

Also consider that half your RAM should be reserved for the hard drive. Ideally, you'd want half of your RAM for analysis while the second half is used for hard-drive reads/writes, so that your CPU can work in parallel with the disk. So 128GB of RAM would leave only 64GB for analysis, which is less than 10 seconds of search for a Threadripper.
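The mismatch in numbers (a sketch using the rates assumed above):

Code: Select all

#include <stdio.h>

int main(void) {
    const double node_bytes = 64.0;   /* per-node record size from above      */
    const double hdd_seq    = 200e6;  /* one drive, sequential                */
    const double quad_nps   = 10e6;   /* the quad-core estimate above         */
    const double hedt_nps   = 100e6;  /* assumed Threadripper-class node rate */
    const double ram_bytes  = 64e9;   /* half of 128GB kept for analysis      */

    printf("quad-core generates:     %.0f MB/s\n", quad_nps * node_bytes / 1e6);
    printf("100 s of analysis needs: %.0f s of writing\n",
           100 * quad_nps * node_bytes / hdd_seq);
    printf("HEDT fills 64GB of RAM in %.0f s\n", ram_bytes / (hedt_nps * node_bytes));
    return 0;
}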

----

As such, only a lossy transposition table can realistically be written to disk. A lossless hash table might work on the slowest CPUs of 2020+ with the fastest storage available... but desktops and HEDTs generate too many nodes per second to realistically store them all on a hard drive (too little bandwidth) or in RAM (too little capacity).
DustyMonkey
Posts: 61
Joined: Wed Feb 19, 2014 10:11 pm

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by DustyMonkey »

I don't know if the 36 bytes per node you are talking about is worth it, but from your comments

"assuming 20ply depth"
"for 10-seconds"

Storing these graphs in enormous memory is trivial (minimizing time); storing them far more compactly than what is time-efficient is also trivial (minimizing space).

The time/space tradeoff doesn't eliminate either option. Both can be employed as needed. Think like a demo coder :)
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Space/Time Tradeoff in year 2020+: What can the community do with 20TB+ Hard Drives?

Post by dragontamer5788 »

DustyMonkey wrote: Tue Nov 12, 2019 4:51 am I don't know if the 36 bytes per node you are talking about is worth it
Hmm, a 128-bit hash (16 bytes) needs about 2^64 nodes before there's a good chance of a collision. I don't know if I'd be comfortable with anything smaller than that. Maybe somewhere around 32 bytes total per node (16-byte hash + 16 bytes of misc information) to halve the bandwidth and storage requirements. Either way, the huge node rates of modern 8-core, 16-core, and 64-core machines will utterly swamp virtually any hard drive that may come out in the next 5 years.
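The birthday-bound arithmetic behind that (a sketch; the position count is just illustrative):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Expected number of colliding pairs among n random b-bit hashes: ~ n^2 / 2^(b+1). */
    const double n = 1e15;   /* illustrative: a few months at 100M+ nodes/second */

    printf("128-bit hash: ~%.1e expected collisions\n", n * n / pow(2.0, 129.0));
    printf(" 64-bit hash: ~%.1e expected collisions\n", n * n / pow(2.0, 65.0));
    return 0;
}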
but from your comments

"assuming 20ply depth"
"for 10-seconds"

Storing these graphs in enormous memory is trivial (minimizing time); storing them far more compactly than what is time-efficient is also trivial (minimizing space).

The time/space tradeoff doesn't eliminate either option. Both can be employed as needed. Think like a demo coder :)
Of course. The natural approach is to ignore the leaf nodes, thus saving ~92% of the space. The CPU can probably recompute leaf nodes faster than it can read them from the hard drive anyway. But cutting out the leaf nodes means that the stored structure is no longer "a non-lossy connected graph", to put it in your words. :-)

Each ply you cut reduces your space by around 92% (assuming 32 moves per ply but only ~13 of those moves visited, thanks to alpha-beta pruning). So cutting the leaf nodes saves 92% of the space, cutting leaf nodes plus their immediate parents saves 99.36%, and cutting leaf nodes plus two levels of parents saves 99.95%.
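The same numbers as a quick calculation (a sketch; the branching factor is the one implied by the 92% figure above):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void) {
    const double b = 12.5;   /* effective branching factor implied by the ~92% figure */

    for (int cut = 1; cut <= 3; cut++)
        printf("drop the last %d plies: keep ~%.2f%%, save ~%.2f%%\n",
               cut, 100.0 / pow(b, cut), 100.0 - 100.0 / pow(b, cut));
    return 0;
}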

---------

I'm trying to think of how a workload could benefit from storing these hash tables and search results. Let's say you had a game of 60 moves, and the computer stored the TT for all 60 moves it was responsible for (or, if doing self-play, all 120 moves would have a stored hash table). Would that stored information be useful to anybody?

Hmm... at a minimum, it would allow "retrograde analysis" of the chess engine's thinking process, e.g. "Why did the engine miss move X?". Probably useful for engine developers, but I'm not entirely sure what it buys a typical chess player or chess analyst.