Perft(14) Weekly Status Report

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Perft(14) Weekly Status Report

Post by sje »

Perft(14) Weekly Status 2014-08-24

There are 400,068 perft(7) results so far, about 0.42% of the 96,400,068 needed.

The gzip versions of all the 965 initial work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipunits
with the names wu7.000.gz, wu7.001.gz, up to wu7.964.gz. Together, these are about 500 MB of data.

The gzip versions of all the completed work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipresults
with the names wu7.000.sum.gz, wu7.001.sum.gz, etc.

Completed work units (5): 000-003, 964
In progress (8): 004-011
Not yet started (952): 012-963

Completion percentages of work units in progress:

004: 65%
005: 33%
006: 32%
007: 7%
008: 51%
009: 15%
010: 23%
011: 33%

I have done more work on making an OpenCL perft() application by cobbling together some old C language source, and I now have a version running and producing correct results in tests. However, there are some shortcomings at present:

1) Recursion is not yet killed off, a requirement for OpenCL kernels (one way to do this is sketched at the end of this post).
2) No facility for setting initial FEN positions.
3) No facility for automated handling of work units.

Once these are resolved, I can distribute the source for review and operational deployment. If you have a fancy graphics card with lots of cores and a reasonable amount of memory, then you should be able to assist with the perft(14) project without too much hassle.
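
Regarding item 1, here is a minimal sketch of one way to replace the recursion with an explicit stack of move lists, since OpenCL C does not allow recursive kernels. The types and routines below (Position, Move, GenerateMoves, MakeMove, UnmakeMove) are placeholders, not the actual ChessCore interface:

/* Non-recursive perft() sketch: an explicit stack of move-list frames
   replaces the call stack.  Placeholder interface, not ChessCore. */

#define MAXPLY   16
#define MAXMOVES 256

typedef struct Position Position;      /* opaque; supplied by the chess core  */
typedef unsigned int Move;             /* placeholder move encoding           */

extern int  GenerateMoves(Position *pos, Move *moves);   /* returns move count */
extern void MakeMove(Position *pos, Move m);
extern void UnmakeMove(Position *pos);                   /* retracts last move */

typedef struct { Move moves[MAXMOVES]; int count; int next; } Frame;

unsigned long long PerftIterative(Position *pos, int depth)
{
    Frame stack[MAXPLY];
    unsigned long long nodes = 0;
    int ply = 0;

    if (depth <= 0)
        return 1;

    stack[0].count = GenerateMoves(pos, stack[0].moves);
    stack[0].next  = 0;

    for (;;)
    {
        Frame *f = &stack[ply];

        if (f->next == f->count)       /* all moves at this ply tried      */
        {
            if (ply == 0)
                break;                 /* back at the root: finished       */
            ply--;
            UnmakeMove(pos);           /* retract the move that led here   */
            continue;
        }

        MakeMove(pos, f->moves[f->next++]);

        if (ply + 1 == depth)          /* leaf level: count it and retract */
        {
            nodes++;
            UnmakeMove(pos);
        }
        else                           /* descend one ply deeper           */
        {
            ply++;
            stack[ply].count = GenerateMoves(pos, stack[ply].moves);
            stack[ply].next  = 0;
        }
    }

    return nodes;
}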
User avatar
Ajedrecista
Posts: 1967
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Perft(14) weekly status report.

Post by Ajedrecista »

Hello Steven:
sje wrote:The gzip versions of all the 965 initial work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipunits
with the names wu7.000.gz, wu7.001.gz, up to wu7.964.gz. Together, these are about 500 MB of data.

The gzip versions of all the completed work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipresults
with the names wu7.000.sum.gz, wu7.001.sum.gz, etc.
Good to know! Just to be a bit more clear, the full URLs are:

https://dl.dropboxusercontent.com/u/316 ... wu7.000.gz
https://dl.dropboxusercontent.com/u/316 ... wu7.001.gz
https://dl.dropboxusercontent.com/u/316 ... wu7.002.gz

[...]

https://dl.dropboxusercontent.com/u/316 ... 000.sum.gz
https://dl.dropboxusercontent.com/u/316 ... 001.sum.gz
https://dl.dropboxusercontent.com/u/316 ... 002.sum.gz

[...]

I write this because I got stuck with the incomplete URLs and almost reported that your links do not work, but then I remembered that these links are in the other Perft(14) thread.

Regards from Spain.

Ajedrecista.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Also on Google Drive

Post by sje »

I have also placed all of the perft(14) files on Google Drive, but I have not yet figured out how to share them read-only with everyone.

The OpenCL perft() code is now fully non-recursive. It can now also decode FEN board positions, and soon it will have a somewhat automated facility for processing work units.

But so far, the code has been tested only in a pure CPU environment; testing on a GPU is still to come. If someone doesn't have a fancy GPU, they can still run the program on a CPU as long as they have an ANSI C compiler to make an executable from the source.

If the host environment has OpenCL capability, then the CPU can run the program in OpenCL mode which will automatically parallelize the work over multiple cores. Or so they say.

No work done yet on running OpenCL on Linux.

The perft() program has three C source files:

ChessCore.h -- Definitions
ChessCore.c -- Constant values and routines
main.c -- Test driver

The test driver will be replaced by the automated work unit handler. The driver, and later the handler, runs on the host CPU, while the code in the ChessCore.c file runs as an OpenCL kernel. Each invocation of the kernel processes one perft(n) for a single input FEN record. It is the responsibility of the handler to load a work unit, send the data to the kernel, get the results, and then write the result file.
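
To make the division of labor concrete, here is a rough sketch of what that host-side flow might look like using the standard OpenCL C API. This is not the actual handler (which does not exist yet); run_work_unit, NREC, RECLEN and the fixed record layout are assumptions, and error checking is omitted:

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define NREC   100000   /* FEN records per work unit (assumed)       */
#define RECLEN 100      /* bytes per input FEN record (approximate)  */

/* Copy one work unit's records to the device, launch one work item per
   record, and read back the 64-bit perft() sums for the .sum file. */
void run_work_unit(cl_context ctx, cl_command_queue q, cl_kernel k,
                   const char *fens, cl_ulong *sums, cl_int depth)
{
    size_t global = NREC;

    /* One big input buffer of records, one output buffer of sums. */
    cl_mem fen_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    (size_t)NREC * RECLEN, (void *)fens, NULL);
    cl_mem sum_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    (size_t)NREC * sizeof(cl_ulong), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &fen_buf);
    clSetKernelArg(k, 1, sizeof(cl_mem), &sum_buf);
    clSetKernelArg(k, 2, sizeof(cl_int), &depth);

    /* The OpenCL runtime deals the NREC work items out to whatever cores
       the device has, CPU or GPU. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: wait for all perft() calls, then fetch the results. */
    clEnqueueReadBuffer(q, sum_buf, CL_TRUE, 0,
                        (size_t)NREC * sizeof(cl_ulong), sums, 0, NULL, NULL);

    clReleaseMemObject(fen_buf);
    clReleaseMemObject(sum_buf);
}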

The OpenCL scheduler works by dealing out the input data to the available processors and retrieving the results automatically.

The scheduler is called with two big memory buffers: a set of input records, and a place for the output results. For each perft() call, the input record is about 100 bytes long, and the output is eight bytes long. For a work unit with 100,000 records, this is about 10 MB of data; not too much.
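
Given those two buffers, the kernel entry point might look something like the sketch below. Again, this is illustrative only; perft_from_fen() stands in for the non-recursive perft code in ChessCore.c:

#define RECLEN 100   /* assumed fixed record length, in bytes */

ulong perft_from_fen(__global const char *fen, int depth);   /* stand-in prototype */

/* One work item per FEN record: roughly 100 bytes in, one 64-bit count out. */
__kernel void perft_wu(__global const char *fens,   /* NREC * RECLEN bytes in */
                       __global ulong *sums,        /* NREC * 8 bytes out     */
                       const int depth)
{
    const size_t i = get_global_id(0);

    sums[i] = perft_from_fen(&fens[i * RECLEN], depth);
}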

Each processing core needs about 8 KB of memory for code and data when running perft(). My best GPU has 1,344 cores and 2 GB of RAM, so even a full complement of simultaneous perft() runs uses only about 1,344 × 8 KB ≈ 10.5 MB, a small fraction of the available GPU RAM.

Using a single core of a 2.66 GHz Xeon 5150, the program calculates perft(7) in about 440 seconds with a node frequency of about 7 MHz. This could certainly be improved with some source optimizations.
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Also on Google Drive

Post by ZirconiumX »

The main problem with OpenCL on GPUs is that there is no way to make a kernel that runs equally well on nVidia and AMD. I don't know how well this rule of thumb holds for modern architectures, but AMD GPUs are best at integer, vector work, and nVidia GPUs are best at scalar, floating-point work.

Since Perft is integer, scalar work, neither GPU is optimal.

Matthew:out
Some believe in the almighty dollar.

I believe in the almighty printf statement.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

OpenCL throughput

Post by sje »

I can't say anything yet about OpenCL throughput for my little perft() -- but the time is soon.

What I can say is that I've written the code to be portable across all OpenCL-capable hardware, both GPUs and CPUs. Further, all arithmetic is done with 32-bit integers, with only the summation variable being 64 bits long. From what I understand, this should help, as most GPUs handle 32-bit integers better than 64-bit integers.
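
As a small illustration of that point (in OpenCL C terms, not the actual source), the only place a 64-bit value appears is where the 32-bit leaf counts are accumulated:

/* All per-move and per-square work stays in 32-bit uints; the running
   total is the lone 64-bit variable, widened only at the summation. */
ulong sum_leaf_counts(const uint *leaf_counts, uint n)
{
    ulong total = 0;
    uint  i;

    for (i = 0; i < n; i++)
        total += (ulong)leaf_counts[i];

    return total;
}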

Also, the memory footprint is small and that should help. Even if it's substantially increased due to adding speed optimization code, it will still be small.

Further, the C source will be released so that others may try their hand with improvements and with implementing hardware specific versions.

--------

The fastest GPU I have is an NVIDIA GeForce GTX 775M graphics processor with 2 GB of GDDR5 memory. This little board is no slouch, although it's not as capable as some cards costing many hundreds of dollars. Whatever OpenCL throughput I can squeeze out of it will be much less than what others can get from better hardware.

http://www.videocardbenchmark.net/gpu.p ... e+GTX+775M
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: OpenCL throughput

Post by sje »

The other OpenCL GPUs I have available running under Mac OS/X:

NVIDIA GeForce 320M (256MB) OpenCL: 1.0
NVIDIA GeForce 9400 (256MB) OpenCL: 1.0
INTEL HD Graphics 4000 (1024MB) OpenCL: 1.2
AMD ATI Radeon HD 5670 (512MB) OpenCL: 1.2

I have a Linux box with a fairly recent 1 GB RAM GPU, but don't know the exact details -- yet.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Perft(14) Weekly Status Report

Post by sje »

Perft(14) Weekly Status 2014-08-31

There are more than 500,000 perft(7) results so far, about 0.52% of the 96,400,068 needed.

Completed work units (6): 000-003, 008, 964
In progress (8): 004-007, 009-012
Not yet started (951): 013-963

Completion percentages of work units in progress:

004: 99%
005: 46%
006: 42%
007: 11%
009: 29%
010: 53%
011: 69%
012: 0%

--------

The OpenCL version of Oscar for Mac OS/X 10.7+ should be ready soon.

A possibly helpful site with benchmark information about OpenCL platforms: http://compubench.com/
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Perft(14) Weekly Status Report

Post by ibid »

I know some of those machines are older dual cores, but given the amount
of hardware you are using, the computation should be going MUCH faster.

As an experiment I had my program (the one used to compute perft(13), not
gperft) start computing perft(14) as 7+7 as you are doing. In 8 hours it
completed 231,000 perft(7)'s -- or a rate of about 0.72% a day. So I would
expect the entire computation to take perhaps 5 months on the one machine.
Unfortunately running 24/7 isn't practical for me right now.

I did download a couple of your work units and noticed they seem to be sorted
first by the count of how many times the position occurs. This is surely not optimal:
you want consecutive positions to be as similar as possible to increase the
effectiveness of the hash table in picking up transpositions between perft(7)
computations, not just within each one. I order the positions in the engine's
natural search order, but simply sorting by FEN string would probably do quite
well.
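
The sort itself is trivial once a work unit is loaded. A minimal sketch (assuming the records are held as an array of FEN strings; the names are illustrative only):

#include <stdlib.h>
#include <string.h>

/* Compare two FEN records lexicographically. */
static int cmp_fen(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort the work unit so that similar positions sit next to each other,
   improving transposition table hits across consecutive perft(7) calls. */
static void sort_work_unit(char **fens, size_t count)
{
    qsort(fens, count, sizeof(char *), cmp_fen);
}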

Anyhow, it is perhaps something to look into before you get too far along...

-paul
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Perft(14) Weekly Status Report

Post by sje »

First, Symbolic is not primarily a perft() program and is not optimized for perft() speed. It does have a very good record of accurate move operations, though. But it would take a lot of effort to make a perft()-optimized version of Symbolic -- along with much testing -- and that effort is better spent on writing a perft()-specific program, one with source that can be freely distributed.

When processing a perft(), Symbolic first clears its transposition table. This is done purposely for two reasons: to keep old entries from clogging the table and to isolate any errors to a single calculation. If an error were allowed to propagate through many subsequent positions, then it would be very difficult to determine the cause of the error or to re-create the conditions needed to reproduce the error. Therefore, the ordering of positions in a work unit is irrelevant.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Perft(14) Weekly Status Report

Post by syzygy »

sje wrote:It does have a very good record of accurate move operations, though.
Doesn't any chess engine / perft implementation that is not outright buggy?

I haven't really followed this, but what is the point of reporting on the progress made by an only average perft implementation? Wouldn't it be enough to report the final result?