First, I define a benchmark that is completely fair, magnifies the quantity being measured, and is shielded from system noise as much as possible:
- SF vs. SF, Hash=16, Threads=1, 5-ply EPD openings
- same adjudication rules for both (Draw: movenumber=40, movecount=8, score=20; Resign: movecount=3, score=500)
- 1000 games, concurrency 7
- no I/O, to magnify the sensitivity of the measurement: redirect stdout to /dev/null, and do not write a PGN
- depth=1: cutechess-cli: 9.41 s, zinc: 54.9 s, overhead: 45.5 s
- depth=2: cutechess-cli: 11.6 s, zinc: 62.3 s, overhead: 50.7 s
- depth=3: cutechess-cli: 12.6 s, zinc: 60.8 s, overhead: 48.2 s
- depth=4: cutechess-cli: 15.5 s, zinc: 63.8 s, overhead: 48.3 s
- depth=5: cutechess-cli: 20.8 s, zinc: 69.2 s, overhead: 48.4 s
- depth=6: cutechess-cli: 31.1 s, zinc: 79.5 s, overhead: 48.4 s
- depth=7: cutechess-cli: 51.7 s, zinc: 101 s, overhead: 49.3 s
- depth=8: cutechess-cli: 89 s, zinc: 141 s, overhead: 52 s
- depth=9: cutechess-cli: 163 s, zinc: 214 s, overhead: 51 s
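As a sanity check on the numbers above: the overhead is nearly constant across depths, which is exactly what you'd expect if the cost is per-game bookkeeping rather than per-node work. A minimal sketch, using only the timing values copied from the list above:

```python
# Wall-clock times in seconds for the 1000-game match, copied from the list above.
depths = list(range(1, 10))
cutechess = [9.41, 11.6, 12.6, 15.5, 20.8, 31.1, 51.7, 89.0, 163.0]
zinc = [54.9, 62.3, 60.8, 63.8, 69.2, 79.5, 101.0, 141.0, 214.0]

overheads = [z - c for z, c in zip(zinc, cutechess)]
for d, o in zip(depths, overheads):
    print(f"depth={d}: overhead {o:5.1f} s")

mean_s = sum(overheads) / len(overheads)
# With 1000 games, N seconds of total overhead is exactly N ms per game.
print(f"mean overhead: {mean_s:.1f} s total, i.e. {mean_s:.1f} ms per game")
```

The spread is only ~45-52 s from depth 1 to depth 9 while the cutechess-cli baseline grows more than 17x, so the overhead really is a fixed per-game cost.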
I'm sure I could work on optimizing this, and find plenty of low-hanging watermelons. Eventually, I would reach a minimum overhead imposed by the choice of Python vs. C++. But ~50 s over 1000 games, i.e. about 50 ms per game, is totally negligible in practice.