Transformer Progress, Lc0 Blog, 2024-02-28
https://lczero.org/blog/2024/02/transformer-progress/
The CPW article regarding the CNN architecture is completely outdated; maybe the Lc0 team will find some time to give it an update?
From the blog post:
Our strongest transformer model, BT4, is nearly 300 elo stronger in terms of raw policy than our strongest convolution-based model, T78, with fewer parameters and less computation. We’ve tested dozens of modifications to get our transformer architecture to where it is today.
This module, which we call “smolgen”, allows the model to play as if it were an additional 50% larger with only a 10% reduction in throughput.
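Side note: the post doesn't spell out how smolgen works mechanically, but going by its description (a small module that generates extra attention logits from a compressed summary of the whole board), a minimal PyTorch sketch might look like the following. All names, layer sizes, and shapes here are my own illustrative assumptions, not Lc0's actual code:

    # Hypothetical sketch of a smolgen-style module: derive a per-head
    # additive bias for the attention logits from a compressed,
    # whole-board summary. Sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SmolgenSketch(nn.Module):
        def __init__(self, embed_dim=256, heads=8, per_sq=32, hidden=256, squares=64):
            super().__init__()
            self.heads = heads
            self.squares = squares
            # Compress each square's embedding to a small vector.
            self.compress = nn.Linear(embed_dim, per_sq)
            # Mix the flattened board summary into a global hidden state.
            self.dense1 = nn.Linear(per_sq * squares, hidden)
            # Produce one 64x64 logit bias per attention head.
            self.dense2 = nn.Linear(hidden, heads * squares * squares)

        def forward(self, x):
            # x: (batch, 64, embed_dim) -- one token per board square.
            b = x.shape[0]
            summary = self.compress(x).reshape(b, -1)   # (batch, 64*per_sq)
            hidden = torch.relu(self.dense1(summary))   # (batch, hidden)
            bias = self.dense2(hidden)                  # (batch, heads*64*64)
            return bias.reshape(b, self.heads, self.squares, self.squares)

    # The bias would be added to the scaled dot-product logits before softmax:
    #   attn = softmax(Q @ K^T / sqrt(d_k) + smolgen_bias)

The appeal of the idea is that the bias is a function of the global position rather than of individual query/key pairs, which is presumably how it buys the "50% larger" effect for only a 10% throughput cost.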
Here is a short summary of our timeline of progress. BT1 was our first transformer model, performing roughly on par with T78, our strongest convolution-based model. BT2 improved on BT1 by adding smolgen and increasing head count. BT3 further improved on BT2 by increasing head count again and adding the new embedding layer. BT4 built on BT3 by doubling model size to push our architecture to the limit.
The future of Leela is bright. Early experiments with relative positional encodings show that our architecture still has plenty of room for improvement. Also, we’re finally having success with INT8 quantization, which could speed up our models by 50% without quality degradation.

Interesting: first CNNs, then transformers; who knows what's next?
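The post doesn't detail the relative positional encodings either. One common form of the technique (in the spirit of Shaw et al., 2018, and not necessarily what Lc0 uses) is a learned per-head bias table indexed by the rank/file offset between squares, added to the attention logits; a minimal sketch under those assumptions:

    # Learned relative-position bias over the 8x8 board: one table entry
    # per (rank offset, file offset) pair, shared across all square pairs
    # with the same offset. Illustration of the general technique only.
    import torch
    import torch.nn as nn

    class RelativeBias(nn.Module):
        def __init__(self, heads=8, board=8):
            super().__init__()
            n = 2 * board - 1  # rank/file offsets range over -7..+7
            self.table = nn.Parameter(torch.zeros(heads, n, n))
            coords = torch.tensor([(r, f) for r in range(board) for f in range(board)])
            # rel[i, j] = (rank offset, file offset) from square i to square j,
            # shifted by board-1 so indices are non-negative.
            rel = coords[None, :, :] - coords[:, None, :] + (board - 1)
            self.register_buffer("rel", rel)

        def forward(self):
            # Returns a (heads, 64, 64) bias to add to Q @ K^T / sqrt(d_k)
            # before the softmax.
            return self.table[:, self.rel[..., 0], self.rel[..., 1]]

As for INT8: if you just want to play with it, PyTorch's torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8) quantizes the linear layers of a model; the 50% speedup the blog quotes presumably refers to Lc0's own inference backends rather than anything like this.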
Further papers:
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
https://arxiv.org/abs/2406.00877
Attention Is All You Need
https://arxiv.org/abs/1706.03762
--
Srdja