Perft(14) Weekly Status Report

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Perft(14) Weekly Status Report

Post by sje »

Perft(14) Weekly Status 2014-08-24

There are 400,068 perft(7) results so far, about 0.42% of the 96,400,068 needed.

The gzip versions of all the 965 initial work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipunits
with the names wu7.000.gz, wu7.001.gz, up to wu7.964.gz. Together, these are about 500 MB of data.

The gzip versions of all the completed work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipresults
with the names wu7.000.sum.gz, wu7.001.sum.gz, etc.

Completed work units (5): 000-003, 964
In progress (8): 004-011
Not yet started (952): 012-963

Completion percentages of work units in progress:

004: 65%
005: 33%
006: 32%
007: 7%
008: 51%
009: 15%
010: 23%
011: 33%

I have done more work on making an OpenCL perft() application by cobbling together some old C language source, and I now have a version running and producing correct results in tests. However, there are some shortcomings at present:

1) Recursion is not yet killed off, a requirement for OpenCL kernels (one way to do this is sketched at the end of this post).
2) No facility for setting initial FEN positions.
3) No facility for automated handling of work units.

Once these are resolved, I can distribute the source for review and operational deployment. If you have a fancy graphics card with lots of cores and a reasonable amount of memory, then you should be able to assist with the perft(14) project without too much hassle.
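
Regarding item 1, here is a minimal sketch of one way to replace the recursion with an explicit stack of move lists, since OpenCL C does not allow recursive kernels. The types and routines below (Position, Move, GenerateMoves, MakeMove, UnmakeMove) are placeholders, not the actual ChessCore interface:

/* Non-recursive perft() sketch: an explicit stack of move-list frames
   replaces the call stack.  Placeholder interface, not ChessCore. */

#define MAXPLY   16
#define MAXMOVES 256

typedef struct Position Position;      /* opaque; supplied by the chess core  */
typedef unsigned int Move;             /* placeholder move encoding           */

extern int  GenerateMoves(Position *pos, Move *moves);   /* returns move count */
extern void MakeMove(Position *pos, Move m);
extern void UnmakeMove(Position *pos);                   /* retracts last move */

typedef struct { Move moves[MAXMOVES]; int count; int next; } Frame;

unsigned long long PerftIterative(Position *pos, int depth)
{
    Frame stack[MAXPLY];
    unsigned long long nodes = 0;
    int ply = 0;

    if (depth <= 0)
        return 1;

    stack[0].count = GenerateMoves(pos, stack[0].moves);
    stack[0].next  = 0;

    for (;;)
    {
        Frame *f = &stack[ply];

        if (f->next == f->count)       /* all moves at this ply tried      */
        {
            if (ply == 0)
                break;                 /* back at the root: finished       */
            ply--;
            UnmakeMove(pos);           /* retract the move that led here   */
            continue;
        }

        MakeMove(pos, f->moves[f->next++]);

        if (ply + 1 == depth)          /* leaf level: count it and retract */
        {
            nodes++;
            UnmakeMove(pos);
        }
        else                           /* descend one ply deeper           */
        {
            ply++;
            stack[ply].count = GenerateMoves(pos, stack[ply].moves);
            stack[ply].next  = 0;
        }
    }

    return nodes;
}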
User avatar
Ajedrecista
Posts: 1967
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Perft(14) weekly status report.

Post by Ajedrecista »

Hello Steven:
sje wrote:The gzip versions of all the 965 initial work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipunits
with the names wu7.000.gz, wu7.001.gz, up to wu7.964.gz. Together, these are about 500 MB of data.

The gzip versions of all the completed work units are in the directory https://dl.dropboxusercontent.com/u/31633927/zipresults
with the names wu7.000.sum.gz, wu7.001.sum.gz, etc.
Good to know! Just to be a bit more clear, the full URLs are:

https://dl.dropboxusercontent.com/u/316 ... wu7.000.gz
https://dl.dropboxusercontent.com/u/316 ... wu7.001.gz
https://dl.dropboxusercontent.com/u/316 ... wu7.002.gz

[...]

https://dl.dropboxusercontent.com/u/316 ... 000.sum.gz
https://dl.dropboxusercontent.com/u/316 ... 001.sum.gz
https://dl.dropboxusercontent.com/u/316 ... 002.sum.gz

[...]

I write this because I got stuck with the incomplete URLs and almost reported that your links do not work, but then I remembered that these links are in the other Perft(14) thread.

Regards from Spain.

Ajedrecista.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Also on Google Drive

Post by sje »

I have also placed all of the perft(14) files on Google Drive, but I have not yet figured out how to share them read-only with everyone.

The OpenCL perft() code is now fully non-recursive. It can now also decode FEN board positions, and soon it will have a somewhat automated facility for processing work units.

But so far, the code has been tested only in a pure CPU environment; testing on a GPU is still to come. If someone doesn't have a fancy GPU, they can still run the program on a CPU as long as they have an ANSI C compiler to make an executable from the source.

If the host environment has OpenCL capability, then the CPU can run the program in OpenCL mode which will automatically parallelize the work over multiple cores. Or so they say.

No work done yet on running OpenCL on Linux.

The perft() program has three C source files:

ChessCore.h -- Definitions
ChessCore.c -- Constant values and routines
main.c -- Test driver

The test driver will be replaced by the automated work unit handler. The driver, and later the handler, runs on the host CPU, while the code in the ChessCore.c file runs as an OpenCL kernel. Each invocation of the kernel processes one perft(n) for a single input FEN record. It is the responsibility of the handler to load a work unit, send the data to the kernel, get the results, and then write the result file.
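
To make the division of labor concrete, here is a rough sketch of what that host-side flow might look like using the standard OpenCL C API. This is not the actual handler (which does not exist yet); run_work_unit, NREC, RECLEN and the fixed record layout are assumptions, and error checking is omitted:

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define NREC   100000   /* FEN records per work unit (assumed)       */
#define RECLEN 100      /* bytes per input FEN record (approximate)  */

/* Copy one work unit's records to the device, launch one work item per
   record, and read back the 64-bit perft() sums for the .sum file. */
void run_work_unit(cl_context ctx, cl_command_queue q, cl_kernel k,
                   const char *fens, cl_ulong *sums, cl_int depth)
{
    size_t global = NREC;

    /* One big input buffer of records, one output buffer of sums. */
    cl_mem fen_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    (size_t)NREC * RECLEN, (void *)fens, NULL);
    cl_mem sum_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    (size_t)NREC * sizeof(cl_ulong), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &fen_buf);
    clSetKernelArg(k, 1, sizeof(cl_mem), &sum_buf);
    clSetKernelArg(k, 2, sizeof(cl_int), &depth);

    /* The OpenCL runtime deals the NREC work items out to whatever cores
       the device has, CPU or GPU. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: wait for all perft() calls, then fetch the results. */
    clEnqueueReadBuffer(q, sum_buf, CL_TRUE, 0,
                        (size_t)NREC * sizeof(cl_ulong), sums, 0, NULL, NULL);

    clReleaseMemObject(fen_buf);
    clReleaseMemObject(sum_buf);
}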

The OpenCL scheduler works by dealing out the input data to the available processors and retrieving the results automatically.

The scheduler is called with two big memory buffers: a set of input records, and a place for the output results. For each perft() call, the input record is about 100 bytes long, and the output is eight bytes long. For a work unit with 100,000 records, this is about 10 MB of data; not too much.
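
Given those two buffers, the kernel entry point might look something like the sketch below. Again, this is illustrative only; perft_from_fen() stands in for the non-recursive perft code in ChessCore.c:

#define RECLEN 100   /* assumed fixed record length, in bytes */

ulong perft_from_fen(__global const char *fen, int depth);   /* stand-in prototype */

/* One work item per FEN record: roughly 100 bytes in, one 64-bit count out. */
__kernel void perft_wu(__global const char *fens,   /* NREC * RECLEN bytes in */
                       __global ulong *sums,        /* NREC * 8 bytes out     */
                       const int depth)
{
    const size_t i = get_global_id(0);

    sums[i] = perft_from_fen(&fens[i * RECLEN], depth);
}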

Each processing core needs about 8 KB of memory for code and data when running perft(). My best GPU has 1,344 cores and 2 GB of RAM, so even a full complement of simultaneous perft() runs uses only about 1,344 × 8 KB ≈ 10.5 MB, a small fraction of the available GPU RAM.

Using a single core of a 2.66 GHz Xeon 5150, the program calculates perft(7) in about 440 seconds with a node frequency of about 7 MHz. This could certainly be improved with some source optimizations.
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Also on Google Drive

Post by ZirconiumX »

The main problem with OpenCL on GPUs is that there is no way to make a kernel that runs equally well on nVidia and AMD. I don't know how well this rule of thumb holds for modern architectures, but AMD GPUs are best at integer, vector work, and nVidia GPUs are best at scalar, floating-point work.

Since Perft is integer, scalar work, neither GPU is optimal.

Matthew:out
Some believe in the almighty dollar.

I believe in the almighty printf statement.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

OpenCL throughput

Post by sje »

I can't say anything yet about OpenCL throughput for my little perft() -- but the time is soon.

What I can say is that I've written the code to be portable across all OpenCL-capable hardware, both GPUs and CPUs. Further, all arithmetic is done with 32-bit integers, with only the summation variable being 64 bits long. From what I understand, this should help, as most GPUs handle 32-bit integers better than 64-bit integers.
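
As a small illustration of that point (in OpenCL C terms, not the actual source), the only place a 64-bit value appears is where the 32-bit leaf counts are accumulated:

/* All per-move and per-square work stays in 32-bit uints; the running
   total is the lone 64-bit variable, widened only at the summation. */
ulong sum_leaf_counts(const uint *leaf_counts, uint n)
{
    ulong total = 0;
    uint  i;

    for (i = 0; i < n; i++)
        total += (ulong)leaf_counts[i];

    return total;
}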

Also, the memory footprint is small and that should help. Even if it's substantially increased due to adding speed optimization code, it will still be small.

Further, the C source will be released so that others may try their hand with improvements and with implementing hardware specific versions.

--------

The fastest GPU I have is an NVIDIA GeForce GTX 775M graphics processor with 2 GB of GDDR5 memory. This little board is no slouch, although it's not as capable as some cards costing many hundreds of dollars. Whatever OpenCL throughput I can squeeze out of it will be much less than what others can get from better hardware.

http://www.videocardbenchmark.net/gpu.p ... e+GTX+775M
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: OpenCL throughput

Post by sje »

The other OpenCL GPUs I have available running under Mac OS/X:

NVIDIA GeForce 320M (256MB) OpenCL: 1.0
NVIDIA GeForce 9400 (256MB) OpenCL: 1.0
INTEL HD Graphics 4000 (1024MB) OpenCL: 1.2
AMD ATI Radeon HD 5670 (512MB) OpenCL: 1.2

I have a Linux box with a fairly recent 1 GB RAM GPU, but don't know the exact details -- yet.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Perft(14) Weekly Status Report

Post by sje »

Perft(14) Weekly Status 2014-08-31

There are more than 500,000 perft(7) results so far, about 0.52% of the 96,400,068 needed.

Completed work units (6): 000-003, 008, 964
In progress (8): 004-007, 009-012
Not yet started (951): 013-963

Completion percentages of work units in progress:

004: 99%
005: 46%
006: 42%
007: 11%
009: 29%
010: 53%
011: 69%
012: 0%

--------

The OpenCL version of Oscar for Mac OS/X 10.7+ should be ready soon.

A possibly helpful site with benchmark information about OpenCL platforms: http://compubench.com/
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Perft(14) Weekly Status Report

Post by ibid »

I know some of those machines are older dual cores, but given the amount
of hardware you are using, the computation should be going MUCH faster.

As an experiment I had my program (the one used to compute perft(13), not
gperft) start computing perft(14) as 7+7 as you are doing. In 8 hours it
completed 231,000 perft(7)'s -- or a rate of about 0.72% a day. So I would
expect the entire computation to take perhaps 5 months on the one machine.
Unfortunately running 24/7 isn't practical for me right now.

I did download a couple of your work units and noticed they seem to be sorted
first by the count of how many times the position occurs. This is surely not optimal:
you want consecutive positions to be as similar as possible to increase the
effectiveness of the hash table in picking up transpositions between perft(7)
computations, not just within each one. I order the positions in the engine's
natural search order, but simply sorting by FEN string would probably do quite
well.
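
The sort itself is trivial once a work unit is loaded. A minimal sketch (assuming the records are held as an array of FEN strings; the names are illustrative only):

#include <stdlib.h>
#include <string.h>

/* Compare two FEN records lexicographically. */
static int cmp_fen(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort the work unit so that similar positions sit next to each other,
   improving transposition table hits across consecutive perft(7) calls. */
static void sort_work_unit(char **fens, size_t count)
{
    qsort(fens, count, sizeof(char *), cmp_fen);
}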

Anyhow, it is perhaps something to look into before you get too far along...

-paul
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Perft(14) Weekly Status Report

Post by sje »

First, Symbolic is not primarily a perft() program and is not optimized for perft() speed. It does have a very good record of accurate move operations, though. But it would take a lot of effort to make a perft()-optimized version of Symbolic -- along with much testing -- and that effort is better spent on writing a perft()-specific program, one with source that can be freely distributed.

When processing a perft(), Symbolic first clears its transposition table. This is done purposely for two reasons: to keep old entries from clogging the table and to isolate any errors to a single calculation. If an error were allowed to propagate through many subsequent positions, then it would be very difficult to determine the cause of the error or to re-create the conditions needed to reproduce the error. Therefore, the ordering of positions in a work unit is irrelevant.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Perft(14) Weekly Status Report

Post by syzygy »

sje wrote:It does have a very good record of accurate move operations, though.
Doesn't any chess engine / perft implementation that is not outright buggy?

I haven't really followed this, but what is the point of reporting on the progress made by an only average perft implementation? Wouldn't it be enough to report the final result?