Comparison of all known Sliding lookup algorithms

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

abulmo2 wrote: Sat Jan 01, 2022 7:35 pm
dangi12012 wrote: Sat Jan 01, 2022 6:23 pm Hyperbola Quintessence is in there via the "XorRookSub" algo (I am happy if we get another name, but Hyperbola Quintessence is the engine name and not the name of the sliding lookup algorithm)
Daniel Inführ's case is worse than I thought ... :-(
Please provide a dictionary to translate your invented name to the conventional one or stick to the conventional one. Hyperbola Quintessence is the name of an algorithm, derived from the Hyperbola project by Ryan Mack, but not an engine name. Hyperbola Quintessence applies to bishop diagonal moves, but not to rook moves on ranks, so calling it ***Rook*** is somewhat misleading.
Point taken, I will rename it to Hyperbola Quintessence, even though the algorithm is listed under a different name on the CPW:
https://www.chessprogramming.org/Subtra ... king_Piece
This algorithm is applied to both bishops and rooks to get the attacked squares.
Thanks for using the "ü" in my name :)
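For reference, the core of the subtraction trick is one line per direction: (o - 2s) masked to the line, computed once forward and once on the vertically flipped board. A minimal sketch of the idea, assuming precomputed line masks (not the exact code from the repository):

Code: Select all

#include <cstdint>

// Hyperbola Quintessence sketch: attacks along one masked line are
// (o - 2s) ^ reverse(reverse(o) - 2*reverse(s)), where reverse is a
// vertical flip implemented as a byte swap.
static uint64_t bswap64(uint64_t b) {
    return __builtin_bswap64(b); // single bswap instruction on x86-64
}

// occ: full occupancy, slider: single-bit board of the sliding piece,
// mask: the file/diagonal/anti-diagonal mask through the slider's square.
static uint64_t line_attacks(uint64_t occ, uint64_t slider, uint64_t mask) {
    uint64_t o = occ & mask;
    uint64_t forward = o - 2 * slider;
    uint64_t reverse = bswap64(bswap64(o) - 2 * bswap64(slider));
    return (forward ^ reverse) & mask;
}
Note that the byte-swap flip works for files, diagonals and anti-diagonals but not for ranks, which is exactly the objection above - rank attacks need a separate method.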

Updated and pushed to git:

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
Megalooks Random Occupation/s:
Exploading:     145.22MOps      6 kB    Optimal perf: imul64
Reference:      70.73MOps       8 kB    Optimal perf: none
KoggeStone:     111.53MOps      0 kB    Optimal perf: none
RotatedBoard:   90.83MOps       14 kB   Optimal perf: none
QBB Algo:       177.18MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        211.01MOps      8 kB    Optimal perf: countr_zero, countl_zero
SlideArithm:    250.06MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SISSY BB:       231.83MOps      1409 kB Optimal perf: none
Hyperbola Qsc:  296.16MOps      2 kB    Optimal perf: bswap
Hash Variable:  410.54MOps      729 kB  Optimal perf: imul64
Hash Plain:     510.31MOps      2306 kB Optimal perf: imul64
Hash Fancy:     556.65MOps      694 kB  Optimal perf: imul64
Pext  :         908.34MOps      843 kB  Optimal perf: pext_u64
HyperCube:      312.43MOps      841 kB  Optimal perf: none
If someone wants to add a distinct algorithm that is not on the list, please PM me.
Also, it's time for others to post their own results, please!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
jdart
Posts: 4397
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Comparison of all known Sliding lookup algorithms

Post by jdart »

That's all nice, but movegen performance is a small contributor to the performance of most engines.
I have PEXT support but turning it on has only a single-digit performance effect, at best, and is of course worse on older AMD systems with slow PEXT.
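For completeness, the usual way to make PEXT optional is a compile-time switch that falls back to magic multiplication where PEXT is microcoded. A sketch under the usual fancy-magic table layout (names are illustrative, not from any particular engine):

Code: Select all

#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Illustrative per-square entry in the usual fancy-magic layout.
struct Entry {
    uint64_t mask;           // relevant occupancy mask
    uint64_t magic;          // magic factor (unused on the PEXT path)
    unsigned shift;          // 64 - popcount(mask)
    const uint64_t* attacks; // pointer into the shared attack table
};

static inline uint64_t sliding_attacks(const Entry& e, uint64_t occ) {
#if defined(__BMI2__)
    return e.attacks[_pext_u64(occ, e.mask)];                // hardware PEXT
#else
    return e.attacks[((occ & e.mask) * e.magic) >> e.shift]; // magic fallback
#endif
}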
pferd
Posts: 134
Joined: Thu Jul 24, 2014 2:49 pm

Re: Comparison of all known Sliding lookup algorithms

Post by pferd »

WSL Ubuntu with default Makefile

Code: Select all

Verify Engines...OK!

Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
Megalooks Random Occupation/s:
Exploading: 	81.17MOps	6 kB	Optimal perf: imul64
Reference: 	92.74MOps	8 kB	Optimal perf: none
KoggeStone: 	84.66MOps	0 kB	Optimal perf: none
RotatedBoard: 	124.30MOps	14 kB	Optimal perf: none
QBB Algo: 	140.98MOps	0 kB	Optimal perf: countr_zero, countl_zero
BobMike: 	200.66MOps	8 kB	Optimal perf: countr_zero, countl_zero
SlideArithm: 	204.73MOps	2 kB	Optimal perf: bzhi_u64, blsmsk_u64
SISSY BB: 	105.26MOps	1409 kB	Optimal perf: none
Hyperbola Qsc: 	235.29MOps	2 kB	Optimal perf: bswap
Hash Variable: 	332.45MOps	729 kB	Optimal perf: imul64
Hash Plain: 	389.00MOps	2306 kB	Optimal perf: imul64
Hash Fancy: 	419.22MOps	694 kB	Optimal perf: imul64
Pext  : 	503.25MOps	843 kB	Optimal perf: pext_u64
HyperCube: 	154.92MOps	841 kB	Optimal perf: none

Megalooks Simulated Game / s:
Exploading: 	43.70MOps	6 kB	Optimal perf: imul64
Reference: 	34.74MOps	8 kB	Optimal perf: none
KoggeStone: 	78.30MOps	0 kB	Optimal perf: none
RotatedBoard: 	45.41MOps	14 kB	Optimal perf: none
QBB Algo: 	104.42MOps	0 kB	Optimal perf: countr_zero, countl_zero
BobMike: 	149.43MOps	8 kB	Optimal perf: countr_zero, countl_zero
SlideArithm: 	134.23MOps	2 kB	Optimal perf: bzhi_u64, blsmsk_u64
SISSY BB: 	139.99MOps	1409 kB	Optimal perf: none
Hyperbola Qsc: 	152.74MOps	2 kB	Optimal perf: bswap
Hash Variable: 	262.36MOps	729 kB	Optimal perf: imul64
Hash Plain: 	261.20MOps	2306 kB	Optimal perf: imul64
Hash Fancy: 	286.30MOps	694 kB	Optimal perf: imul64
Pext  : 	352.58MOps	843 kB	Optimal perf: pext_u64
HyperCube: 	225.64MOps	841 kB	Optimal perf: none
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

BIG UPDATE!

The comparison now also tests all threads for every algorithm!
https://github.com/Gigantua/Chess_Movegen

Please run it and post here - I am very interested in Intel and old Ryzen!

For example, thread scaling with PEXT lookup looks like this (16 physical cores, 32 threads):
Pext - 843 kB - 10 Billion Queen lookups/s!

Code: Select all

        threads speed   scaling Megalookups/s
        1       1.00    100.00% 722.24
        2       2.28    113.99% 1646.53
        3       3.38    112.56% 2438.97
        4       4.56    113.96% 3292.25
        5       5.22    104.46% 3772.37
        6       6.21    103.47% 4483.97
        7       7.34    104.85% 5301.12
        8       8.42    105.24% 6080.64
        9       8.64    96.01%  6240.68
        10      10.04   100.37% 7249.13
        11      10.64   96.68%  7681.30
        12      11.54   96.14%  8332.61
        13      12.42   95.57%  8973.10
        14      12.65   90.32%  9133.02
        15      13.22   88.15%  9550.34
        16      14.14   88.36%  10210.59
        17      9.65    56.74%  6966.21
        18      10.06   55.88%  7264.44
        19      10.35   54.46%  7473.77
        20      10.10   50.50%  7294.35
        21      11.03   52.50%  7962.76
        22      11.13   50.59%  8038.07
        23      11.60   50.44%  8379.71
        24      12.26   51.09%  8856.50
        25      12.34   49.34%  8909.64
        26      13.10   50.37%  9458.50
        27      13.36   49.48%  9649.75
        28      13.52   48.28%  9764.09
        29      14.04   48.42%  10142.08
        30      13.93   46.44%  10062.72
        31      14.21   45.83%  10260.82
        32      13.99   43.71%  10101.17
        33      11.21   33.98%  8098.11
        34      11.54   33.95%  8336.76
        35      10.80   30.86%  7800.18
        36      10.94   30.38%  7899.89
        37      12.27   33.16%  8860.31
        38      10.29   27.07%  7429.71
        39      11.83   30.33%  8542.42
        40      10.78   26.94%  7782.48
It seems that my concerns were unwarranted and there is no such thing as L2 congestion for any of the algorithms. Of course the behaviour could be different if other lookups happen at the same time in tables that together no longer fit in L2. But as it stands, most algos get pretty good scaling, up to the ~50% one would expect from hyperthreading. (Some even reach 60% in the end - that's why hyperthreading was invented in the first place.)


So generally it seems it's best either to commit to the number of physical cores or to go to the full number of threads; anything in between seems suboptimal.
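For anyone reproducing these numbers: the per-thread-count measurement boils down to spawning N threads that hammer the lookup on random inputs and summing their counters. A rough sketch of such a harness (not the repository's exact code):

Code: Select all

#include <atomic>
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Measure megalookups/s of an attack getter at a given thread count.
template <typename F>
double megalookups(F attacks, unsigned nthreads, double seconds = 1.0) {
    std::atomic<uint64_t> total{0};
    std::atomic<bool> stop{false};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; t++) {
        pool.emplace_back([&, t] {
            std::mt19937_64 rng(t + 1); // per-thread RNG, fixed seed
            uint64_t sink = 0, n = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                sink ^= attacks(rng() & 63, rng()); // square, occupancy
                n++;
            }
            total += n;
            volatile uint64_t keep = sink; (void)keep; // defeat dead-code elim.
        });
    }
    std::this_thread::sleep_for(std::chrono::duration<double>(seconds));
    stop = true;
    for (auto& th : pool) th.join();
    return total / (seconds * 1e6);
}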

Here are the maximum numbers per algo:
And I am looking forward to PMs if anyone would like to add another algo!

Code: Select all

Exploading:     2038.30MOps     31Threads
Reference:      1241.28MOps     31Threads
KoggeStone:     1478.40MOps     32Threads
RotatedBoard:   1900.48MOps     32Threads
QBB Algo:       2322.44MOps     32Threads
BobMike:        2909.39MOps     32Threads
SlideArithm:    3298.31MOps     32Threads
SISSY BB:       2313.30MOps     30Threads
Hyperbola Qsc:  3786.93MOps     32Threads
Hash Variable:  5282.95MOps     32Threads
Hash Plain:     4960.57MOps     32Threads
Hash Fancy:     6906.90MOps     16Threads
Pext  :         10260.82MOps    31Threads
HyperCube:      4315.78MOps     32Threads
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
abulmo2
Posts: 460
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Comparison of all known Sliding lookup algorithms

Post by abulmo2 »

dangi12012 wrote: Sun Jan 02, 2022 7:04 pm Please run it and post here - I am very interested in Intel and old Ryzen!
On an old Ryzen 1700x at 3.5 GHz under Fedora Linux 35:

Code: Select all

Summary:
Exploading: 	653.50MOps 	15Threads
Reference: 	613.16MOps 	14Threads
KoggeStone: 	745.83MOps 	16Threads
RotatedBoard: 	913.70MOps 	24Threads
QBB Algo: 	1129.91MOps 	24Threads
BobMike: 	1428.38MOps 	22Threads
SlideArithm: 	1668.16MOps 	16Threads
SISSY BB: 	1056.31MOps 	22Threads
Hyperbola Qsc: 	2195.63MOps 	24Threads
Hash Variable: 	2945.87MOps 	23Threads
Hash Plain: 	2630.34MOps 	24Threads
Hash Fancy: 	3407.62MOps 	22Threads
Pext  : 	341.14MOps 	21Threads
HyperCube: 	1469.87MOps 	15Threads

Megalooks Simulated Game Single Thread/ s:
Exploading: 	47.31MOps	6 kB	Optimal perf: imul64
Reference: 	36.96MOps	8 kB	Optimal perf: none
KoggeStone: 	82.68MOps	0 kB	Optimal perf: none
RotatedBoard: 	46.52MOps	14 kB	Optimal perf: none
QBB Algo: 	119.15MOps	0 kB	Optimal perf: countr_zero, countl_zero
BobMike: 	126.89MOps	8 kB	Optimal perf: countr_zero, countl_zero
SlideArithm: 	129.22MOps	2 kB	Optimal perf: bzhi_u64, blsmsk_u64
SISSY BB: 	172.24MOps	1409 kB	Optimal perf: none
Hyperbola Qsc: 	201.79MOps	2 kB	Optimal perf: bswap
Hash Variable: 	284.75MOps	729 kB	Optimal perf: imul64
Hash Plain: 	266.36MOps	2306 kB	Optimal perf: imul64
Hash Fancy: 	302.37MOps	694 kB	Optimal perf: imul64
Pext  : 	33.11MOps	843 kB	Optimal perf: pext_u64
HyperCube: 	231.74MOps	841 kB	Optimal perf: none
Richard Delorme
Ras
Posts: 2694
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Comparison of all known Sliding lookup algorithms

Post by Ras »

dangi12012 wrote: Sun Jan 02, 2022 7:04 pm Please run it and post here - I am very interested in Intel and old Ryzen!
Ryzen 5 3400G, 3.7 GHz, 2x16 GB DDR4-2933, Linux Mint 20.2 with kernel 5.4, compiled from the makefile (Clang 10), no other applications running during the test. The CPU has four physical and eight logical cores, but the optimum thread count is much higher. I guess the limit here isn't computing power, but the memory interface.

Code: Select all

Verify Engines...OK!

AMD Ryzen 5 3400G with Radeon Vega Graphics    
Megalooks Random Positions/s:
Megalookups Multithreaded Random Positions/s:

Exploading - 6 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	66.25
	2	1.82	91.15%	120.76
	3	2.78	92.73%	184.29
	4	3.07	76.84%	203.62
	5	3.54	70.82%	234.57
	6	4.25	70.80%	281.39
	7	5.04	72.05%	334.10
	8	5.67	70.89%	375.68
	9	4.92	54.69%	326.10
	10	4.82	48.19%	319.26
	11	5.35	48.60%	354.18
	12	5.49	45.77%	363.84
	13	5.49	42.24%	363.77
	14	5.52	39.40%	365.45
	15	5.49	36.62%	363.88
	16	5.72	35.78%	379.22

Reference - 8 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	60.08
	2	2.16	107.85%	129.58
	3	2.31	77.09%	138.94
	4	2.93	73.19%	175.88
	5	3.88	77.63%	233.20
	6	4.54	75.72%	272.96
	7	5.24	74.91%	315.05
	8	6.17	77.14%	370.75
	9	5.42	60.18%	325.40
	10	5.50	55.05%	330.71
	11	5.44	49.43%	326.68
	12	5.64	47.04%	339.14
	13	6.16	47.37%	370.00
	14	5.59	39.96%	336.10
	15	5.82	38.80%	349.63
	16	6.37	39.84%	382.93

KoggeStone - 0 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	89.31
	2	1.99	99.44%	177.63
	3	3.04	101.32%	271.46
	4	2.64	65.97%	235.68
	5	2.93	58.70%	262.12
	6	3.66	61.08%	327.28
	7	4.08	58.35%	364.78
	8	4.71	58.90%	420.85
	9	4.45	49.50%	397.86
	10	4.39	43.87%	391.84
	11	4.39	39.94%	392.38
	12	4.23	35.23%	377.58
	13	4.71	36.26%	420.99
	14	4.91	35.05%	438.22
	15	4.61	30.73%	411.64
	16	4.90	30.60%	437.31

RotatedBoard - 14 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	71.59
	2	2.12	106.25%	152.13
	3	3.10	103.44%	222.15
	4	3.68	92.09%	263.71
	5	4.75	94.91%	339.71
	6	5.58	93.08%	399.82
	7	6.53	93.22%	467.16
	8	7.28	90.96%	520.97
	9	6.09	67.66%	435.91
	10	6.31	63.07%	451.49
	11	6.59	59.90%	471.68
	12	6.81	56.76%	487.65
	13	7.14	54.91%	510.98
	14	7.43	53.05%	531.68
	15	7.27	48.44%	520.18
	16	7.52	46.99%	538.26

QBB Algo - 0 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	121.08
	2	2.06	102.81%	248.97
	3	3.08	102.54%	372.46
	4	4.15	103.84%	502.90
	5	3.17	63.45%	384.15
	6	3.72	61.93%	449.92
	7	4.29	61.34%	519.92
	8	4.97	62.13%	601.85
	9	4.73	52.59%	573.02
	10	4.68	46.76%	566.17
	11	4.96	45.06%	600.10
	12	4.92	41.03%	596.10
	13	4.89	37.63%	592.34
	14	5.26	37.57%	636.91
	15	5.19	34.63%	628.94
	16	5.15	32.21%	624.00

BobMike - 8 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	157.83
	2	2.20	109.96%	347.12
	3	3.02	100.79%	477.26
	4	2.50	62.42%	394.05
	5	3.07	61.35%	484.14
	6	3.84	64.05%	606.57
	7	4.19	59.83%	660.99
	8	4.78	59.69%	753.69
	9	4.61	51.25%	727.98
	10	4.85	48.46%	764.83
	11	4.75	43.20%	750.07
	12	5.07	42.25%	800.21
	13	5.20	40.03%	821.36
	14	5.25	37.48%	828.11
	15	4.79	31.96%	756.67
	16	5.00	31.23%	788.73

SlideArithm - 2 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	149.60
	2	2.30	115.12%	344.44
	3	3.51	116.98%	525.02
	4	2.79	69.83%	417.86
	5	3.38	67.60%	505.67
	6	4.00	66.71%	598.84
	7	4.82	68.90%	721.53
	8	5.61	70.15%	839.63
	9	5.74	63.73%	858.07
	10	6.00	60.01%	897.82
	11	5.45	49.54%	815.24
	12	5.85	48.76%	875.42
	13	5.86	45.05%	876.23
	14	6.10	43.58%	912.86
	15	6.24	41.59%	933.19
	16	6.39	39.91%	955.34

SISSY BB - 1409 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	124.47
	2	1.72	85.84%	213.69
	3	3.18	106.10%	396.18
	4	2.99	74.82%	372.52
	5	3.38	67.67%	421.12
	6	4.03	67.19%	501.81
	7	4.57	65.35%	569.37
	8	4.59	57.32%	570.82
	9	4.27	47.40%	531.01
	10	4.39	43.88%	546.14
	11	4.52	41.14%	563.23
	12	4.59	38.23%	571.05
	13	4.94	38.00%	614.90
	14	4.84	34.57%	602.35
	15	4.98	33.22%	620.24
	16	4.85	30.28%	603.10

Hyperbola Qsc - 2 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	196.88
	2	2.12	105.89%	416.96
	3	3.25	108.31%	639.71
	4	4.21	105.37%	829.80
	5	3.89	77.81%	765.94
	6	4.19	69.79%	824.39
	7	5.16	73.71%	1015.83
	8	5.86	73.25%	1153.63
	9	5.67	62.98%	1115.92
	10	5.49	54.91%	1081.13
	11	5.51	50.13%	1085.68
	12	5.68	47.31%	1117.67
	13	6.12	47.07%	1204.60
	14	6.76	48.28%	1330.73
	15	6.96	46.39%	1369.96
	16	6.80	42.52%	1339.37

Hash Variable - 729 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	242.14
	2	1.93	96.29%	466.30
	3	2.88	95.88%	696.53
	4	2.96	73.93%	716.11
	5	3.99	79.80%	966.15
	6	5.63	93.76%	1362.17
	7	5.73	81.89%	1388.09
	8	5.67	70.87%	1372.90
	9	6.18	68.61%	1495.30
	10	5.95	59.48%	1440.37
	11	6.28	57.05%	1519.55
	12	6.79	56.59%	1644.25
	13	6.83	52.55%	1654.31
	14	7.29	52.07%	1765.04
	15	7.76	51.74%	1879.25
	16	7.38	46.13%	1787.05

Hash Plain - 2306 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	295.67
	2	1.53	76.46%	452.14
	3	2.11	70.34%	623.90
	4	2.68	66.99%	792.26
	5	3.25	65.02%	961.22
	6	3.79	63.24%	1121.81
	7	4.13	59.00%	1221.12
	8	4.93	61.60%	1457.17
	9	4.32	47.95%	1275.96
	10	4.70	46.96%	1388.55
	11	4.95	44.98%	1462.84
	12	4.94	41.19%	1461.35
	13	4.76	36.62%	1407.76
	14	5.07	36.21%	1498.89
	15	5.19	34.58%	1533.79
	16	5.05	31.55%	1492.67

Hash Fancy - 694 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	309.87
	2	1.76	88.01%	545.46
	3	2.13	71.10%	660.98
	4	3.52	88.02%	1090.93
	5	4.14	82.73%	1281.82
	6	4.03	67.17%	1248.84
	7	4.81	68.71%	1490.28
	8	5.33	66.58%	1650.40
	9	5.02	55.77%	1555.33
	10	5.16	51.60%	1598.87
	11	5.38	48.95%	1668.58
	12	5.60	46.63%	1733.97
	13	5.39	41.48%	1670.84
	14	5.68	40.59%	1761.05
	15	5.77	38.50%	1789.41
	16	6.16	38.48%	1907.84

Pext   - 843 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	34.48
	2	2.00	100.02%	68.98
	3	2.96	98.78%	102.18
	4	3.03	75.77%	104.51
	5	3.63	72.70%	125.34
	6	4.54	75.61%	156.43
	7	5.20	74.21%	179.14
	8	5.87	73.39%	202.45
	9	5.11	56.82%	176.34
	10	5.25	52.46%	180.90
	11	5.33	48.45%	183.78
	12	5.89	49.12%	203.24
	13	5.88	45.19%	202.60
	14	5.86	41.87%	202.15
	15	5.71	38.05%	196.78
	16	5.94	37.15%	204.95

HyperCube - 841 kB	
	threads	speed	scaling	Megalookups/s
	1	1.00	100.00%	133.31
	2	1.88	93.89%	250.33
	3	2.57	85.73%	342.87
	4	2.86	71.49%	381.21
	5	3.72	74.39%	495.81
	6	4.34	72.32%	578.42
	7	5.07	72.43%	675.83
	8	5.93	74.18%	791.14
	9	5.02	55.76%	669.01
	10	5.36	53.60%	714.56
	11	5.23	47.52%	696.88
	12	5.88	49.01%	784.06
	13	6.34	48.81%	845.79
	14	6.26	44.70%	834.21
	15	6.41	42.70%	853.92
	16	6.37	39.79%	848.68


Summary:
Exploading: 	379.22MOps 	16Threads
Reference: 	382.93MOps 	16Threads
KoggeStone: 	438.22MOps 	14Threads
RotatedBoard: 	538.26MOps 	16Threads
QBB Algo: 	636.91MOps 	14Threads
BobMike: 	828.11MOps 	14Threads
SlideArithm: 	955.34MOps 	16Threads
SISSY BB: 	620.24MOps 	15Threads
Hyperbola Qsc: 	1369.96MOps 	15Threads
Hash Variable: 	1879.25MOps 	15Threads
Hash Plain: 	1533.79MOps 	15Threads
Hash Fancy: 	1907.84MOps 	16Threads
Pext  : 	204.95MOps 	16Threads
HyperCube: 	853.92MOps 	15Threads

Megalooks Simulated Game Single Thread/ s:
Exploading: 	51.75MOps	6 kB	Optimal perf: imul64
Reference: 	39.92MOps	8 kB	Optimal perf: none
KoggeStone: 	104.30MOps	0 kB	Optimal perf: none
RotatedBoard: 	47.84MOps	14 kB	Optimal perf: none
QBB Algo: 	127.06MOps	0 kB	Optimal perf: countr_zero, countl_zero
BobMike: 	200.18MOps	8 kB	Optimal perf: countr_zero, countl_zero
SlideArithm: 	138.68MOps	2 kB	Optimal perf: bzhi_u64, blsmsk_u64
SISSY BB: 	197.16MOps	1409 kB	Optimal perf: none
Hyperbola Qsc: 	206.97MOps	2 kB	Optimal perf: bswap
Hash Variable: 	334.35MOps	729 kB	Optimal perf: imul64
Hash Plain: 	297.91MOps	2306 kB	Optimal perf: imul64
Hash Fancy: 	353.24MOps	694 kB	Optimal perf: imul64
Pext  : 	36.08MOps	843 kB	Optimal perf: pext_u64
HyperCube: 	255.97MOps	841 kB	Optimal perf: none
Rasmus Althoff
https://www.ct800.net
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Ras wrote: Mon Jan 03, 2022 1:25 pm Ryzen 5 3400G, 3.7 GHz, 2x16 GB DDR4-2933, Linux Mint 20.2 with kernel 5.4, compiled from the makefile
We can observe: don't use pext on Ryzen 1/2.
Second, interestingly, almost all algorithms seem to scale well beyond the core and thread count on lower to mid-end hardware. Keep in mind we have a limit of 4 cores / 8 threads here, yet more threads than that (16) still add a lot of performance. I think it's because more threads can issue their memory requests to the MMU, and it can keep up. 1600 vs 1900 MLU is significant.

Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Ras
Posts: 2694
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Comparison of all known Sliding lookup algorithms

Post by Ras »

dangi12012 wrote: Mon Jan 03, 2022 9:57 pm We can observe: don't use pext on Ryzen 1/2.
The 3400G is Zen+, not Zen2. AMD's numbering scheme for the 3000 APUs was somewhat misleading.

I also have a Zen2 laptop, but I'm hesitating to use that for benchmarks because it's more of a cooling system test than anything else. With my 3400G desktop, even sustained 100% CPU load doesn't exceed 43°C. Its CPU cooler is quite overdimensioned in order to stay silent.
I think it's because more threads can issue their memory requests to the MMU, and it can keep up.
Yeah, that's my guess for smaller systems - overdimensioned memory bandwidth.
Rasmus Althoff
https://www.ct800.net
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Update:
Preliminary CUDA result for fancy magic: 51933.8 Million Lookups/s
So an RTX 3080 is 7.5x faster than 16 Zen3 cores.

I suspect that you can apply these numbers linearly to other hardware of the same generation (Zen3 and Ampere):
8704 CUDA cores == 120 Zen3 cores.

I suspect that using the tricks of old times and manually writing 32 bit code would be a lot faster. GPU integer math is inherently 32 bit - a uint64_t is only an emulation.
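To make the emulation concrete: one 64-bit shift decomposes into two 32-bit instructions, which the funnel-shift intrinsics expose directly. A sketch of a manual 64-bit left shift on 32-bit halves (valid for shifts below 32; the compiler emits equivalent code for uint64_t):

Code: Select all

// 64-bit left shift on 32-bit halves, for 0 <= s < 32.
// __funnelshift_l(lo, hi, s) returns the upper 32 bits of (hi:lo) << s.
__device__ void shl64(unsigned lo, unsigned hi, unsigned s,
                      unsigned& out_lo, unsigned& out_hi)
{
    out_hi = __funnelshift_l(lo, hi, s);
    out_lo = lo << s;
}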

I am eager to see how some other algos will perform, especially those without a lookup table. I won't translate all of them - it's quite a bit of work to get everything working. Incremental algorithms like my own would only be testable inside a whole GPU engine.

Disclaimer
This applies to code that runs fully on the GPU. Passing 1 million occupancies and squares to the GPU and getting the attacks back is memory bound and quite a bit slower than running everything exclusively on the device.
So it's time for a GPU-only engine - the only thing that does not run on the GPU is the UCI interface.
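To illustrate what "runs fully on the GPU" means: the benchmark kernel generates its occupancies with curand on the device and only writes back a checksum, so no occupancy or attack buffers ever cross PCIe. A hedged sketch (the kernel is illustrative; Queen stands for any __device__ attack getter, e.g. the SlideArithm::Queen below):

Code: Select all

#include <curand_kernel.h>
#include <cstdint>

// Assumed to be defined elsewhere, e.g. SlideArithm::Queen.
__device__ uint64_t Queen(uint64_t sq, uint64_t occupation);

__global__ void bench(unsigned long long* checksum, int iterations)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(1234, tid, 0, &rng); // per-thread RNG state, device side
    unsigned long long sink = 0;
    for (int i = 0; i < iterations; i++) {
        uint64_t occ = ((uint64_t)curand(&rng) << 32) | curand(&rng);
        sink ^= Queen(curand(&rng) & 63, occ);
    }
    atomicXor(checksum, sink); // only this one value goes back to the host
}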
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

If you want to use CUDA yourself:
cu_common.h

Code: Select all

#pragma once
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand.h>
#include <curand_kernel.h>
#include <device_functions.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <vector_functions.h>

#ifndef __CUDACC__
#define __CUDACC__
#endif
#include "device_atomic_functions.h"


#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

#ifdef __INTELLISENSE__
//    Reverse the bit order of a 32 bit unsigned integer. 
__device__ unsigned int __brev(unsigned int  x) {};

//Reverse the bit order of a 64 bit unsigned integer.
__device__ unsigned long long int __brevll(unsigned long long int x) {};


//Return selected bytes from two 32 bit unsigned integers.
__device__ unsigned int __byte_perm(unsigned int  x, unsigned int  y, unsigned int  s) {};


//Return the number of consecutive high - order zero bits in a 32 bit integer.
__device__ int __clz(int  x) {};


//Count the number of consecutive high - order zero bits in a 64 bit integer.
__device__ int __clzll(long long int x) {};


//Find the position of the least significant bit set to 1 in a 32 bit integer.
__device__ int __ffs(int  x) {};


//Find the position of the least significant bit set to 1 in a 64 bit integer.
__device__ int __ffsll(long long int x) {};


//Concatenate hi : lo, shift left by shift & 31 bits, return the most significant 32 bits.
__device__ unsigned int __funnelshift_l(unsigned int  lo, unsigned int  hi, unsigned int  shift) {};


//Concatenate hi : lo, shift left by min(shift, 32) bits, return the most significant 32 bits.
__device__ unsigned int __funnelshift_lc(unsigned int  lo, unsigned int  hi, unsigned int  shift) {};


//Concatenate hi : lo, shift right by shift & 31 bits, return the least significant 32 bits.
__device__ unsigned int __funnelshift_r(unsigned int  lo, unsigned int  hi, unsigned int  shift) {};


//Concatenate hi : lo, shift right by min(shift, 32) bits, return the least significant 32 bits.
__device__ unsigned int __funnelshift_rc(unsigned int  lo, unsigned int  hi, unsigned int  shift) {};


//Compute average of signed input arguments, avoiding overflow in the intermediate sum.
__device__ int __hadd(int, int) {};


//Calculate the least significant 32 bits of the product of the least significant 24 bits of two integers.
__device__ int __mul24(int  x, int  y) {};


//Calculate the most significant 64 bits of the product of the two 64 bit integers.
__device__ long long int __mul64hi(long long int x, long long int y) {};


//Calculate the most significant 32 bits of the product of the two 32 bit integers.
__device__ int __mulhi(int  x, int  y) {};


//Count the number of bits that are set to 1 in a 32 bit integer.
__device__ int __popc(unsigned int  x) {};


//Count the number of bits that are set to 1 in a 64 bit integer.
__device__ int __popcll(unsigned long long int x) {};


//Compute rounded average of signed input arguments, avoiding overflow in the intermediate sum.
__device__ int __rhadd(int, int) {};


//Calculate |x - y| + z, the sum of absolute difference.
__device__ unsigned int __sad(int  x, int  y, unsigned int  z) {};


//Compute average of unsigned input arguments, avoiding overflow in the intermediate sum.
__device__ unsigned int __uhadd(unsigned int, unsigned int) {};


//Calculate the least significant 32 bits of the product of the least significant 24 bits of two unsigned integers.
__device__ unsigned int __umul24(unsigned int  x, unsigned int  y) {};


//Calculate the most significant 64 bits of the product of the two 64 unsigned bit integers.
__device__ unsigned long long int __umul64hi(unsigned long long int x, unsigned long long int y) {};


//Calculate the most significant 32 bits of the product of the two 32 bit unsigned integers.
__device__ unsigned int __umulhi(unsigned int  x, unsigned int  y) {};


//Compute rounded average of unsigned input arguments, avoiding overflow in the intermediate sum.
__device__ unsigned int __urhadd(unsigned int, unsigned int) {};


//Calculate |x - y| + z, the sum of absolute difference.
__device__ unsigned int __usad(unsigned int  x, unsigned int  y, unsigned int  z) {};

//////////////////////////////////////////////////////
//atomic functions

int atomicAdd(int* address, int val) {};
unsigned int atomicAdd(unsigned int* address, unsigned int val) {};
unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val) {};
float atomicAdd(float* address, float val) {};
double atomicAdd(double* address, double val) {};

typedef int __half2;
typedef short __half;
__half2 atomicAdd(__half2* address, __half2 val) {};
__half atomicAdd(__half* address, __half val) {};

int atomicSub(int* address, int val) {};
unsigned int atomicSub(unsigned int* address, unsigned int val) {};

int atomicExch(int* address, int val) {};
unsigned int atomicExch(unsigned int* address, unsigned int val) {};
unsigned long long int atomicExch(unsigned long long int* address, unsigned long long int val) {};
float atomicExch(float* address, float val) {};

int atomicMin(int* address, int val) {};
unsigned int atomicMin(unsigned int* address, unsigned int val) {};
unsigned long long int atomicMin(unsigned long long int* address, unsigned long long int val) {};

int atomicMax(int* address, int val) {};
unsigned int atomicMax(unsigned int* address, unsigned int val) {};
unsigned long long int atomicMax(unsigned long long int* address, unsigned long long int val) {};

unsigned int atomicInc(unsigned int* address, unsigned int val) {};

unsigned int atomicDec(unsigned int* address, unsigned int val) {};

int atomicCAS(int* address, int compare, int val) {};
unsigned int atomicCAS(unsigned int* address, unsigned int compare, unsigned int val) {};
unsigned long long int atomicCAS(unsigned long long int* address,
    unsigned long long int compare,
    unsigned long long int val) {};
unsigned short int atomicCAS(unsigned short int* address,
    unsigned short int compare,
    unsigned short int val) {};

int atomicAnd(int* address, int val) {};
unsigned int atomicAnd(unsigned int* address,
    unsigned int val) {};
unsigned long long int atomicAnd(unsigned long long int* address,
    unsigned long long int val) {};

int atomicOr(int* address, int val) {};
unsigned int atomicOr(unsigned int* address,
    unsigned int val) {};
unsigned long long int atomicOr(unsigned long long int* address,
    unsigned long long int val) {};

int atomicXor(int* address, int val) {};
unsigned int atomicXor(unsigned int* address, unsigned int val) {};
unsigned long long int atomicXor(unsigned long long int* address, unsigned long long int val) {};

template <typename T>
unsigned int __match_any_sync(unsigned mask, T value) {};
template <typename T>
unsigned int __match_all_sync(unsigned mask, T value, int* pred) {};

template <typename T>
T __shfl_sync(unsigned mask, T var, int srcLane, int width = warpSize) {};
template <typename T>
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width = warpSize) {};
template <typename T>
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width = warpSize) {};
template <typename T>
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width = warpSize) {};
#endif
SlideArithm.h

Code: Select all

#pragma once
#include "Cu_Common.h"

namespace SlideArithm {
    __device__ uint64_t MSB(uint64_t value)
    {
        return 63ull - __clzll(value);
    }

    __device__ uint64_t LSB(uint64_t value)
    {
        return __ffsll(value) - 1; // __ffsll is 1-based (ffs semantics); -1 yields the 0-based bit index the masks below expect
    }

    /* return the bitboard with the rook destinations */
    __device__ uint64_t Rook(uint64_t sq, uint64_t occupation)
    {
        uint64_t piece = 1ULL << sq;
        occupation ^= piece; /* remove the selected piece from the occupation */
        uint64_t piecesup = (0x0101010101010101ULL << sq) & (occupation | 0xFF00000000000000ULL); /* find the pieces up */
        uint64_t piecesdo = (0x8080808080808080ULL >> (63 - sq)) & (occupation | 0x00000000000000FFULL); /* find the pieces down */
        uint64_t piecesri = (0x00000000000000FFULL << sq) & (occupation | 0x8080808080808080ULL); /* find pieces on the right */
        uint64_t piecesle = (0xFF00000000000000ULL >> (63 - sq)) & (occupation | 0x0101010101010101ULL); /* find pieces on the left */
        return (((0x8080808080808080ULL >> (63 - LSB(piecesup))) & (0x0101010101010101ULL << MSB(piecesdo))) |
            ((0xFF00000000000000ULL >> (63 - LSB(piecesri))) & (0x00000000000000FFULL << MSB(piecesle)))) ^ piece;
        /* From every direction find the first piece and from that piece put a mask in the opposite direction.
           Put together all 4 masks and remove the moving piece */
    }

    /* return the bitboard with the bishop destinations */
    __device__ uint64_t Bishop(uint64_t sq, uint64_t occupation)
    {  /* it's the same as the rook */
        uint64_t piece = 1ULL << sq;
        occupation ^= piece;
        uint64_t piecesup = (0x8040201008040201ULL << sq) & (occupation | 0xFF80808080808080ULL);
        uint64_t piecesdo = (0x8040201008040201ULL >> (63 - sq)) & (occupation | 0x01010101010101FFULL);
        uint64_t piecesle = (0x8102040810204081ULL << sq) & (occupation | 0xFF01010101010101ULL);
        uint64_t piecesri = (0x8102040810204081ULL >> (63 - sq)) & (occupation | 0x80808080808080FFULL);
        return (((0x8040201008040201ULL >> (63 - LSB(piecesup))) & (0x8040201008040201ULL << MSB(piecesdo))) |
            ((0x8102040810204081ULL >> (63 - LSB(piecesle))) & (0x8102040810204081ULL << MSB(piecesri)))) ^ piece;
    }

    __device__ uint64_t Queen(uint64_t sq, uint64_t occupation) {
        return Rook(sq, occupation) | Bishop(sq, occupation);
    }
}
You can see that when no lookup table is needed, it is really simple to translate! I think every algo can be translated with high performance - except pext, and maybe Hyperbola Qsc will be slow with bswap64. These 100% won't translate (useless for the GPU): SISSY BB, Hash Plain, Hash Variable, RotatedBoard.
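On the bswap64 worry: CUDA has no 64 bit byte swap instruction, but it can be emulated with two PRMT instructions via __byte_perm (byte-reverse each 32 bit half, then exchange the halves), so Hyperbola Qsc may not be as slow as feared. A sketch:

Code: Select all

// Emulated x86 bswap64: selector 0x0123 byte-reverses a 32 bit word,
// then the two halves are exchanged.
__device__ uint64_t bswap64(uint64_t x)
{
    unsigned lo = (unsigned)x;
    unsigned hi = (unsigned)(x >> 32);
    unsigned rlo = __byte_perm(lo, 0, 0x0123); // reverse bytes of low half
    unsigned rhi = __byte_perm(hi, 0, 0x0123); // reverse bytes of high half
    return ((uint64_t)rlo << 32) | rhi;        // swapped halves
}
Alternatively __brevll (full bit reversal) can replace the flip entirely, since the algorithm only needs some order-reversing permutation of the masked line.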

Bonus:
If the algorithm's lookup table fits in 64 KB of space, then a __constant__ memory location can be used - it is served from each SM's dedicated constant cache. That will be extremely fast!
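A sketch of how that would look - the table name and size are illustrative, and cudaMemcpyToSymbol does the host-to-constant copy:

Code: Select all

#include <cuda_runtime.h>
#include <cstdint>

// __constant__ memory is limited to 64 KB per module.
__constant__ uint64_t c_attacks[2048]; // 16 kB illustrative lookup table

// Hypothetical device-side use: idx computed by whatever algorithm.
__device__ uint64_t lookup(unsigned idx)
{
    return c_attacks[idx];
}

// Host side: upload the precomputed table before the first kernel launch.
void upload(const uint64_t* host_table)
{
    cudaMemcpyToSymbol(c_attacks, host_table, sizeof(c_attacks));
}
One caveat: the constant cache is fastest when all threads of a warp read the same address; divergent indices serialize, which matters for occupancy-dependent lookups.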

So I will continue to translate C++ to CUDA and post updates here.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer