Ironically, a lot of the tricks Stockfish is using here are reminiscent of tricks that were used in the original AlphaGo and later discarded in AlphaGoZero.
In particular, the AlphaGo paper mentioned four neural networks of significance:
- a policy network trained on human pro games.
- a RL-enhanced policy network improving on the original SL-trained policy network.
- a value network trained on games generated by the RL-enhanced policy network
- a cheap policy network trained on human pro games, used only for rapid rollout simulations.
The cheap rollout policy network was discarded because DeepMind found that a "slow evaluations of the right positions" was better than "rapid evaluations of questionable positions". The independently trained value network was discarded because co-training a value and policy head on a shared trunk saved a significant amount of compute, and helped regularize both objectives against each other. The RL-enhanced policy network was discarded in favor of training the policy network to directly replicate MCTS search statistics.
The depth and branching factor in chess and Go are different, so I won't say the solutions ought to be the same, but it's interesting nonetheless to see the original AlphaGo ideas be resurrected in this form.
The incremental updates are also related to Zobrist Hashing, which the Stockfish authors are certainly aware of.
It's also different because Stockfish uses Alpha-beta treesearch instead of MCTS:
MCTS tries to deal with an explosion in search-space by only sampling very small parts of the searchspace and relying on a very good heuristic to guid that search process. In this case it is crucial to find the most relevant subset of the tree to explore it, so spending more time on your policy makes sense.
Alpha-beta pruning however always explores the entire searchtree systematically (up to a certain depth using itd) and prunes the searchtree by discarding bad moves. In this case you don't need as good of an evaluation function because you search the entire tree anyways. Rather you need the function to be fast, as it is evaluated on many more states.
In general AB-pruning only needs the heuristic for estimating the tail-end of the tree and for sorting the states based on usefulness, while MCTS uses all the above plus guiding the whole search process.
Spending tons of time on the heuristic is not useful as even the optimal search order can only double your searchdepth. Don't get me wrong that still a lot (especially considering exponential blowup) but MCTS can surpass this depth easily. The disadvantage is that MCTS loses a lot of the guarantees of AB-pruning and tends to "play down to his opponent" when trained using self-play because the exploration order is entirely determined by the policy.
Right now there is a lot of experimentation to try adjusting the network architecture. The current leading approach is a much larger net which takes in attack information per square (eg. is this piece attacked by more pieces than it's defended by?). That network is a little slower, but the additional information seems to be enough to be stronger than the current architecture.
Btw, the original Shogi developers really did something amazing. The nodchip trainer is all custom code, and trains extremely strong nets. There are all sorts of subtle tricks embedded in there as well that led to stronger nets. Not to mention, getting the quantization (float32 -> int16/int8) working gracefully is a huge challenge.
This is very nice. Reminds me a lot of tricks used for simple linear models. And it seems to work, given that Leela Chess Zero is losing with quite a gap.
Most of the times one would learn a model by just changing a single feature and then doing the whole sum made no sense.
A good example is learning a sequence of decisions, where each decision might have a cost associated to it, you can then say that the current decision depends on a previous one and vary the previous one to learn to recover from errors. If previous decision was bad, then you'd still like to make the best decision for current state.).
So even if your training data does not have this error-recovery example, you can just iterate through all previous decisions and make the model learn to recover.
An optimization in that case would be to just not redo the whole sum (for computing the decision function of a linear model).
"Leela Chess Zero is losing with quite a gap"
Check https://tcec-chess.com/ ATM Leela is in front again.
That being said it is not clear which approach is better e.g. a very smart but relatively slow evaluation function (Leela) or SFs approach to use a relatively dumb eval function with a more sophisticated search. It is pretty clear that Leelas search can be improved (Check https://github.com/dje-dev/Ceres for example)
That being said, assuming you can fully saturate multiple GPUs then the SF approach has the disadvantage that it cannot use the performance improvements for GPUs/TPUs which still are growing fast, whereas CPU performance only grows slowly these days.
The Stockfish mentality is: not everybody owns a GPU and Stockfish should be available for everybody and perform well . So they went for a CPU micro-arch optimized neural net, which is great. Maybe this is ultimately not the best for tournaments.
What I would find interesting is if they could give engines an energy budget instead of a time limit. Maybe that makes CPU vs GPU games more interesting & fair.
IMHO they did this because that was the only incremental way to improve strength. They always relied an a fast search with a relatively simplistic eval function. For this approach alpha beta search works well. What is not clear whether it works well for the Leela approach. All attempts so far at least have not been successful(could be for other reasons, such as training approach). SF strength seems to be their well optimized search function. Rewriting that would IMHO be equivalent to creating a new chess engine. That being said I still think that the success of SF NNUE is very remarkable.
Interesting point. Nvidia have been improving the int performance for quantized inference on their GPUs a lot. It might be a lot of work but could it be possible to scale up this NNUE approach to the point where it would be worthwhile to run on a GPU?
For single-input "batches" (seems like this is what's being used now?) it might never be worthwhile but perhaps if multiple positions could be searched in parallel and the NN evaluation batched this might start to look tempting?
Not sure what the effect of running PVS with multiple parallel search threads is. Presumably the payoff of searching with less information means you reach the performance ceiling quite a lot quicker than MCTS-like searches as the pruning is a lot more sensitive to having up-to-date information about the principal variation.
Disclaimer: My understanding of PVS is very limited.
Sure if someone can come up with an approach to run an NNUE (efficiently updatable) network on GPUs that might really be another breakthrough. But at a first glance it looks to me that this could be very difficult. Because AFAIK the SF search is relatively complicated. Even for Leela implementing an efficient batched search on multiple GPUs seems to be difficult (some improvements coming with Ceres). And Leela is using a much simple MCTS search. That doesn't that Leela's search could not be improved. It does not give higher priority necessarily for forced sequence of moves (at least not explicitly) or high risk moves. Which is IMHO why sometimes she does not see relatively simple tactics.
Generally the GPU vs CPU gap on the evaluation side isn't nearly as big as in training. In theory the gap should remain large, but in practice things like the inability to have data ready for batch evaluation means it is harder to saturate the GPU (which as you note is important).
But I'm not sure how much of this applies to chess engines. I see some comments noting that the search part makes it hard to generate batches, but I'm not an expert in this.
It seems that the strategy is to use the neural network to score various moves and then a search strategy to try to find moves that result in a favourable score. And this post focuses on some of the technical engineering details to design such a scoring network. In particular the scoring is split into two parts: 1. A matrix multiplication by a sparse input vector to get a dense representation of a position, and 2. Some nonlinear and further layers after this first step. And it seems that step 1 is considered more expensive.
The way this is made cheap is by making it incremental: given some board s and output of the first layer b + Ws, it is cheap to compute b + Wt where a t is a board that is similar to s (the difference is W(t-s) but the vector t-s is 0 in almost every element.)
This motivates some of the engineering choices like using integers instead of floats. If you used floats then this incremental update wouldn’t work.
It seems to me that a lot of the smarts of stockfish will be in the search algorithm getting good results, but I don’t know if that just requires a bit of parallelism (surely some kind of work-stealing scheduler) and brute force or if it mostly relies on some more clever strategies. And maybe I’m wrong and the key is really in the scoring of positions.
There is a LOT of cleverness built into the stockfish search as well.
In fact, while neural networks have now won the position evolution game, it is still an open discussion in the chess programming community if the Alpha Zero / Leela search (NN + Monte Carlo) is as good as Stockfish's PVS.
The point is that integer weights allow for incremental updates to be correct. If you used floats then they would drift away from correct as you applied more incremental updates. And neural networks can be quite sensitive to small errors
Maybe the network uses integers for other reasons. Perhaps due to the history of the program. I don’t know how the search works but I would imagine it would consider more than 50 moves with lots of backtracking and trying different possible moves.
It's difficult to organise Leela-Stockfish as a fair fight, because they run on different hardware (CPU vs GPU) and both get substantial improvements by playing on better hardware.
Traditionally this wasn't a big problem as every engine was more or less optimised for a fast Intel CPU with a moderate to large amount of RAM. The organisers would decree the specs of the championship hardware some time in advance. Now, (at least for TCEC, the other major engine tournament) they pick two hardware configurations, one CPU-heavy and one GPU-heavy, and give each team the choice.
How do you balance those? It's been suggested you should make them equal in terms of watts of power, or dollar cost to buy, but neither of those are obviously best. In practice the TCEC organisers pick something close to what they picked last time but shade it against the winning engines, making the contest more even. Chess.com likely do something similar though they're less rigorous about the details.
Yes, there are several computer chess rating lists, but you run into the same problems once you want to compare a CPU engine with a GPU one: how do you account for the hardware? Really a rating, as a measure of playing strength, should apply to a hardware/software combination, but we almost always want to compare the software. The most respected lists used to be the SSDF  and CCRL  lists but they both ignore the GPU issue. There's now the CEGT list  which I don't know anything about, but does seem to use heterogenous hardware.
Note that the numbers aren't calibrated to human players, so there's no claim that an engine with a rating of 3100 on some particular hardware should score 85% against a human rated 2800, or whatever.
The gap between white and black piece performance is massive in these top engines if I'm reading it right. LCZero won 0/ lost 4 as black and won 24 / lost zero as white (with lots of draws)? I had no idea the split was so big now. Do human tournaments look like this too these days?
Not really---similar for sure, but not as bad for black. Consider that computer chess tournaments do not start with the pieces in their initial positions. The first 5-15 moves are predetermined by human master players to achieve positions that are imbalanced while not having a side with a clearer advantage. Otherwise it would be even more of a draw fest.
This makes things a bit worse for black, typically, because playing for the win as black means making some positional concession (that white can exploit after successfully defending) in exchange for the initiative and a chance for an attack.
In human play, black has a little more ability than white to steer the game towards their opening preparation. If they want to win, they will try to get have a position that they know in and out from (computer-assisted) home preparation. But there are still a lot of draws in classical (many hours per game) chess.
 Chess openings that white chooses are typically called "attack" (King's Indian Attack) or "game" (Scotch Game), while those that black chooses are called "defense" (Sicilian defense). You'll find that there are a lot more "defenses" that black can choose from than "attacks".
> Consider that computer chess tournaments do not start with the pieces in their initial positions. The first 5-15 moves are predetermined by human master players to achieve positions that are imbalanced while not having a side with a clearer advantage.
I've never heard this before, and it seems strange! Do you have a link to tornament rules with this?
Computer chess is traditionally really strong in openings because their opening book is just database lookups. What is getting a human to play the same opening book supposed to achieve?
Search for the TCEC opening books or read the FAQ at https://wiki.chessdom.org/Openings_FAQ. Opening books haven't been used for a few years now. They were only needed because computers had issues evaluating opening positions.
Chess matches usually involve multiple games, so you don't need to win every game to win the match. It's not uncommon to play trying to avoid a loss (going for a draw), instead of playing riskier moves in an attempt to force a win.
Because they may be content with drawing as black and seek an (ever so slightly) easier win as white. It depends on many factors: relative player strength, duration of the match, how players are paired in tournament rounds, current standings in the tournament, and so on. A computer won't do this kind of reasoning.
Current divp uses very unbalanced openings for engines to be able to score. Using starting position/more optimal openings inevitably leads to draw. So it's actually exactly the opposite -- the advantage is so low and draw tendencies are so strong that we need to force position where white has advantage to be able to compare two strong engines
Using a previous Stockfish scorer they trained an NN without any labelling effort. This is also similar to how unsupervised translation is done in some methods. They start from word->word dictionary results and iteratively train lang1->lang2 and lang2->lang1 models feeding on each other's output.
One thing I don't understand is why it would be smarter to augment the inputs with some of the missing board information - particularly the availability of castling. Even though this network is a quick-and-dirty evaluation, seems like there's room for a few additional bits in the input and being able to castle very much changes the evaluation of some common positions.
I am not an expert (and don’t have any idea what I am talking about), but wouldn’t this be captured the same way other future positional advantages are when evaluating the next “level” of possible board positions? Ie, the fact that a piece can later give check, or castle, or anything else if you first move it here is a function of that next move’s resulting position score.
This network is just to signal to the final search algorithm how good the board looks in a given spot on tree of possible moves.
Yes, the overall algorithm will do something much more precise and comprehensive, so it's not like this scoring function needs to be perfect. Still, it matters in a quick evaluation so I'm surprised it isn't in there.
A few examples come to mind where, for instance, white has a bishop and knight attacking f7 and it's black to move. If black castles nothing interesting happens and white is likely overextended. But if black cannot castle white will win a rook.
I can make up a similar situation with en passant, and this does come up, but a lot less frequently.
So anyway, surprised you wouldn't toss a few bits in there, at least the 2 for ability to castle by each player.
I don't know a lot about chess but I have one question: isn't chess a solved game (in the sense that given a board state we always can compute the right move)? why use a neural network that can introduce, a very small percentage of the time, mistakes? I guess it is for performance reasons?
No, it's not close to being solved. I made a new estimate recently at there are about 8.7E+45 unique positions: https://github.com/lechmazur/ChessCounter. You don't need to analyze all of them to generate the proof but it's still far beyond our computational and storage capacities to solve chess.
Chess is estimated to have 10^120 possible games. Observable universe is estimated to have between 10^78 to 10^82 atoms. So, unless you had some way of dramatically compressing or simplifying this search space, you'd be needing to store 38 bits+ information, if you had every atom in the observable universe as your memory. Suffice to say, this has not yet been achieved.
You don't need to consider every state individually to solve a game. For example if you have a game on a 100x100 board and the only goal is to move your piece to the other side you can know you have found the optimal solution if you can do it in 99 moves (assuming that you move 1 per turn).
In the chess scenario if you can prove that there is a set of moves where you win no matter what the other colour does then you have solved the game. You don't need to consider every possible game, just the games reachable within this set of moves.
Solving chess would be proving that "From starting position black can win every time" (or the same for white). You don't need to prove how to win from every possible board state.
I agree, and kind of included this in the "so, unless you had some way of dramatically compressing or simplifying this search space..." disclaimer.
Having said this, the nature of many chess endgames suggests that such a proof is not really possible, or at least would not be "simple". As an example, tempo / opposition flips games from draws to wins, etc.
You don't need to care about possible games, but game states, what's the best result and the move to achieve it in each. Another user computed an upper bound at 8.7E+45 positions: https://github.com/lechmazur/ChessCounter . I pointed out that our planet has about 1E+50 atoms. And we can further compress the database by perhaps two orders of magnitude if we don't store symmetric positions or positions close to mate. A Kardashev 2 civilization could play perfect chess.
You can imagine playing chess as a tree. You start with a game state, and every possible move is an edge to a new game state. Unfortunately, the amount of game states grows approximately exponentially (e.g. for every game state there are approximately 40 moves to play, thus every extra move multiplies #(game states) by 40, approximately). This neural engine is a trick so that Stockfish does not have to simulate all 40 moves every game states. The neural net outputs the moves that are most likely to be strong moves, and Stockfish will only consider those moves. Of course, this is a very basic explanation and there are many more optimisations Stockfish uses, though the ultimate goal of almost every optimisation is to reduce the amount of simulation Stockfish has to do.
Chess is not a solved game (though I think the stalemate rules make it finite, there are far too many positions). All computer chess programs have to try to determine if positions will be good without playing out all possible ways the game could go. Usually this is some combination of search strategies and scoring positions (the article describes how stockfish scores positions).
Right, that would make more sense to me. Although it should be stressed that neither threefold repetition nor the 50-move rule work automatically. They need to be claimed by one of the players. If neither player does, the game can go on.
Frankly, I wasn't aware of the 75-move rule! This indeed caps the length of possible games.
The threefold repetition rule doesn't have an equivalent (there's no, say, "fivefold repetition" draw that could be enforced despite the players' will), but it's really irrelevant from the perspective of solving chess, because there's no point in analyzing what happens if game goes through the same loop 1, 10 or 100 times.
And the 75-move rule sets a hard limit anyway, so you couldn't loop infinitely, even if it mattered.
You're correct, in that given a board state the right move can always be exhaustively computed, because the game has no randomness or hidden information. But enough computing power doesn't exist on the planet to iterate through all the possible states.
The word you're looking for is "deterministic", rather than solved. Solved is usually used to mean the exhaustive computation has been done. Deterministic would mean it could be if enough computing power existed, but for chess it doesn't.
A game could be solved even if its state space was too large to enumerate, for example if you could show that there was some latent structure that let you prove things about large parts of the state space without completely exploring it.
I have to disagree with the others.
Since AlphaZero I think chess is closed to being solved.
Technically it is not, but practically is is.
Chess is a hard game to solve completely, and I think
one reason is that there are many states in chess where there is not one superior move, but only a probable optimal move, in the sense that the game-tree for winning against the move is smaller than other moves.
Then the agent/player has to guess what the opponents strategy is given that move, and that depends on the opponent.
I plays are truly creative, and its results speak for itself.
"In AlphaZero's chess match against Stockfish 8 (2016 TCEC world champion), each program was given one minute per move. Stockfish was allocated 64 threads and a hash size of 1 GB, a setting that Stockfish's Tord Romstad later criticized as suboptimal. AlphaZero was trained on chess for a total of nine hours before the match. During the match, AlphaZero ran on a single machine with four application-specific TPUs. In 100 games from the normal starting position, AlphaZero won 25 games as White, won 3 as Black, and drew the remaining 72. In a series of twelve, 100-game matches (of unspecified time or resource constraints) against Stockfish starting from the 12 most popular human openings, AlphaZero won 290, drew 886 and lost 24."
You're making a jump from "AlphaZero plays better than the best programs of its time" to "chess is close to being solved" but that's not justified without much more evidence. There are still a lot of improvements and new ideas added to the top programs and even if they reach a plateau, you'd have to show that increasing time per move 100x doesn't result in a significant improvement in playing strength. We don't know what would be the Elo difference between a 32-piece tablebase and the best existing program but I don't think anybody would bet that it's less than 200 Elo.
A solved game is categorically different from merely teaching a machine to be better at it than humans are presently.
Cepheus  is approximately a solution to Heads-Up (two player) Limit (ie the amount you can bet is specified in the game, you don't get to just bet arbitrary money) Texas Hold 'em poker.
In contrast Libratus is an AI that plays Heads-Up No Limit Hold 'em very well against humans.
You can examine the Cepheus strategy for yourself, if you could memorize it (it's too complicated) and were capable of making truly random decisions (Poker is a game of chance and so your strategy needs random action) you could reproduce it and be exactly as good at the game as Cepheus is. You can examine individual strategy elements and reason about them. For example if Cepheus gets an 8 and a 3 and you open with a bet, it will call your bet if they're the same suit, otherwise it will fold. On the other hand if those suited cards were an 8 and a Queen, it would raise a bit more than nine times out of ten.
You can't do anything about it (you might be thinking surely knowing that e.g. Cepheus folds 8-3 off here is valuable so I can benefit from that, er, no, "hiding" that by sometimes playing it loses more money than Cepheus gives up by folding it that's why this is a perfect strategy), if playing Cepheus, even with an incrementally "more" approximate solution you're not going to win a significant amount of money reliably in reasonable time, that's why it's approximately solved.
But Libratus isn't like that, it's playing some strategy that we know beat world class human players, but a further incrementally better AI might crush Libratus just as badly.
I think you misunderstood, and that the point was that in chess, especially during the mid-game phase, there's very often not a single clearly superior move but rather several moves that apparently improves your position by roughly the same amount.
The uncertainty is due to the inability to scan the entire sub-tree for each move.
The theorem you linked to would only be applicable in certain endgame situations.
> there's very often not a single clearly superior move
Maybe it's not clear to human players or current engines, but that does not mean it doesn't exist, and indeed the theorem mentioned above states that at least one player has an optimal strategy (either forcing a draw or winning). Obviously, that does not necessarily mean that we'll ever be able to design an engine that can compute this strategy in a reasonable time.
I just think that OP meant something other than the usual definition of "solved", maybe they meant that engines are plateauing, but as others have pointed out, there is also little evidence for that for now.