Officers on the GPU
March 8, 2025 / code, cuda, officers
Officers is a two-player “take-and-break” game played with heaps of
coins (or if you prefer, piles of beans, or officers and their
subordinates, or groups of people who form indivisible couples, or
…). (In the equivalent game Couples-are-Forever, a move is to choose
a heap and split it into two, with the proviso that you may not split
a heap of size 2. An Officers heap of size n is equivalent to a
Couples-are-Forever heap of size n+1.) The two players
alternate turns, and every turn consists of removing a coin from a
heap and leaving the remainder in either one or two heaps. In
particular, taking away a lone coin is not a valid move. The winner is
the last player who can make a move.
For example, a game starting with a single pile of 10 coins might proceed
[10] \xrightarrow{\mathcal{L}} [3, 6] \xrightarrow{\mathcal{R}} [3, 1, 4] \xrightarrow{\mathcal{L}} [1, 1, 1, 4] \xrightarrow{\mathcal{R}} [1, 1, 1, 1, 2] \xrightarrow{\mathcal{L}} [1, 1, 1, 1, 1]
at which point player \mathcal{R} is stuck, so player \mathcal{L}
has won. (In fact, he was following the winning strategy.)
Although
the total number of coins decreases by 1 each turn, the outcome of the
game is not determined simply by the parity of the starting number of
coins: the moment that the game ends also depends on how many piles
are created by the players making splitting moves. As we will soon
see, the winning move from any given position is extremely
unpredictable.
We can solve a game of this kind by calculating the Grundy value for each position, and in this post I’m going to discuss my attempt at calculating these values for Officers as fast as possible.
Officers is the only unsolved “single-digit octal game” (Achim
Flammenkamp has the most up-to-date status page, to the best of my
knowledge), and like all finite octal games, its values are
conjectured to be eventually periodic. J. P. Grossman has calculated
140 trillion values, with no signs of periodicity yet! In the first
few sections I’ll recapitulate some of the strategies Grossman used,
before moving to the world of CUDA.
I’m a newcomer to GPU programming, so I would appreciate any comments or ideas.
Sprague–Grundy Theory
Here’s a very compressed outline of Sprague–Grundy theory, just what
we need to work on Officers. (See the references for more details.)
The Grundy value of an Officers position is the size of the equivalent heap in the game of Nim. This value is guaranteed to exist thanks to the Sprague–Grundy theorem, which states that every “impartial game” (both players have the same options from any position) under the “normal play convention” (a player who can’t make a move loses) is equivalent to a one-heap game of Nim. The winning strategy in Nim is easy, so the strategy for Officers becomes: convert the position to Nim, determine the winning move there, and convert back.
Nim is also played with heaps of coins, but the rules are simpler. In Nim, a turn consists of choosing a pile and removing any number of coins, possibly all of them. For a single pile, the strategy for the first player is obvious: take the entire pile and you win. For two equal piles, the second player can win: replicate whatever move is made on one pile to the other, and eventually the first player will run out of options.
To get at the general strategy, suppose we have a “false” Nim heap where all the options are ordinary Nim heaps, say, a “heap” where we can move to one of 0, 1, 4 or 5. This behaves like a heap of size 2, but where we also have the option to increase the heap to 4 or 5. But! These latter two moves cannot be part of a winning strategy, because the opponent can always make a reversing move back to the (true) Nim heap of size 2, and thereby force you to make a move to 0 or 1. And so our false Nim heap is equivalent to the heap of size 2, the minimal excluded value of all of its options. (Sanity check: for a true Nim heap of size n, indeed n is the first excluded option.)
To determine the value x ⊕ y of Nim games with two piles of size x
and y, we apply this rule inductively. From 1 ⊕ 2 we can move to
any of \{0 ⊕ 2, 1 ⊕ 1, 1 ⊕ 0\}, which have known values \{2, 0,
1\}. The minimum excluded value is 3, so 1 ⊕ 2 = 3. Continuing a
little further,
\begin{align*} 1 ⊕ 3 &= \mathrm{mex}(0⊕3, 1⊕2, 1⊕1, 1⊕0) &&= \mathrm{mex}(3, 3, 0, 1) &&= 2 \\ 2 ⊕ 3 &= \mathrm{mex}(1⊕3, 0⊕3, 2⊕2, 2⊕1, 2⊕0) &&= \mathrm{mex}(2, 3, 0, 3, 2) &&= 1 \end{align*}
(Computing these MEXes also reveals the winning strategy for each
sum: move to a position with value 0. So for these sums, 1 ⊕ 3
\rightsquigarrow 1 ⊕ 1 and 2 ⊕ 3 \rightsquigarrow 2 ⊕ 2.)
In general, one can show that the Nim sum x ⊕ y is binary XOR, or if you like, vector addition over \mathbb{F}_2.
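If you’d rather not trust the induction, here is a small standalone check (mine, not part of the post’s code) that builds the Nim-addition table directly from the mex rule and compares it against XOR:

#include <cstdio>
#include <set>

// Build the Nim-addition table by the mex rule and confirm it matches XOR.
int main() {
  const int N = 64;
  static int table[N][N];
  for (int x = 0; x < N; x++) {
    for (int y = 0; y < N; y++) {
      std::set<int> options;
      // From x (+) y we may shrink either heap to any smaller size.
      for (int i = 0; i < x; i++) options.insert(table[i][y]);
      for (int j = 0; j < y; j++) options.insert(table[x][j]);
      int mex = 0;
      while (options.count(mex)) mex++;
      table[x][y] = mex;
      if (table[x][y] != (x ^ y)) {
        std::printf("Mismatch at %d, %d\n", x, y);
        return 1;
      }
    }
  }
  std::printf("mex-defined Nim sum agrees with XOR for all heaps below %d\n", N);
  return 0;
}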
Now how do we apply this to Officers? We consider each Officers heap as a false Nim heap and identify the minimal excluded value of its options.
From Officers heaps of size 0 and 1 there are no moves, so G(0) = G(1) = 0. From position 2, there is a unique move to 1, so G(2) = \mathrm{mex}(G(1)) = \mathrm{mex}(0) = 1. From position 3, we now have a choice after removing a coin: whether or not to split the result. The two options are therefore G(2) or G(1) ⊕ G(1), which evaluate to 1 or 0, so G(3) = 2. Generally,
G(x) = \mathrm{mex}(G(i) ⊕ G(x - i - 1) \mid 0 \leq i \leq x-1)
for x \geq 2, with G(0) = G(1) = 0 as base cases. The first few values are:
0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 0, 3, 4, 2, 1, 3, 2, 1, \dots
This sequence appears on the OEIS as A046695. It quickly climbs out of the single digits and then bounces around, mostly in the range [0, 256].
Calculating Naively
The above formula is easily turned into a naive, O(n^2) algorithm for computing values one by one:
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
  std::vector<int> values;
  values.push_back(0); // G(0) = 0 base case
  values.push_back(0); // G(1) = 0 base case

  int pos = 2;
  while (true) {
    std::unordered_set<int> seen;
    for (int i = 0; i < pos; i++) {
      seen.insert(values[i] ^ values[pos - i - 1]);
    }
    int mex = 0;
    while (seen.count(mex)) mex++;
    values.push_back(mex);
    std::cout << "G(" << pos << ") = " << mex << std::endl;
    pos++;
  }
}
There are a couple of basic improvements we can make. Firstly, we only need to run through around half the loop, because past the midpoint we are computing the XOR of the same values but swapped.
Secondly, all known values (into the trillions) are less than 512, so
we are better off using a std::array<bool, 512>
or similar to record
the values that we are applying MEX to. We are willing to take the
gamble that we will never calculate values for long enough to see them
reach 512.
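Putting those two improvements together, the per-position step might look something like the following sketch (my illustration rather than the post’s exact code; next_value is a hypothetical helper):

#include <array>
#include <vector>

// Sketch: compute the next Grundy value given all previous ones, scanning
// only half the options and recording them in a fixed 512-entry table.
int next_value(const std::vector<int>& values) {
  const int pos = static_cast<int>(values.size());
  std::array<bool, 512> seen{};
  // Options come in symmetric pairs (i, pos - 1 - i), so half the range suffices.
  for (int i = 0; i <= (pos - 1) / 2; i++) {
    seen[values[i] ^ values[pos - i - 1]] = true;
  }
  // Taking the same gamble as above that the mex never reaches 512.
  int mex = 0;
  while (seen[mex]) mex++;
  return mex;
}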
The result is an algorithm that calculates around 11K values per second (and one which slows down as time passes, since each new position still requires a scan over all previous values).
The Rare-Value Algorithm
Empirically, some values appear far more often than others. This is
underselling it: there are a handful of values that occur a finite
number of times and then never seem to appear again. (Notably, the
value 0 is one of these, so there are only finitely many known
positions that are losing for the first player: positions 0, 1, 4,
12, 20, 30, 46, 72, 98, 124, 150, 176, 314 and 408.) There are
exactly 1584 known positions that have “rare” values, the final one
occurring at G(20627)=277.
The sequence of rare values (not positions) begins
0, 1, 6, 7, 10, 11, 12, 13, 16, 17, 22, 23, 26, 27, 28, 29, \dots
The pattern here is not so obvious. It turns out that a rare value is one that has an even number of set bits after throwing out the 0th and 4th bits.
bool is_rare(uint16_t x) {
uint16_t mask = 0b10001;
return (__builtin_popcount(x & ~mask) % 2) == 0;
}
Many octal games have an analogous property that divides rare values from common ones. There is no rhyme or reason to which bits end up being the relevant ones for this property; they seem to emerge from the chaos of the initial handful of values. But once a pattern like this is established it is almost impossible to dislodge, for the following reason. Because the condition is based on parity, it is not hard to check the rules:
\begin{align*} &\text{common} &&\oplus \,\text{common} &&= \text{rare} \\ &\text{rare} &&\oplus \,\text{rare} &&= \text{rare} \\ &\text{rare} &&\oplus \,\text{common} &&= \text{common} \end{align*}
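These rules are easy to confirm exhaustively with the is_rare predicate from above (repeated here so the check is self-contained); this is a quick throwaway check of mine, not part of the original code:

#include <cassert>
#include <cstdint>
#include <cstdio>

// Same predicate as above: even parity after masking out bits 0 and 4.
bool is_rare(uint16_t x) {
  uint16_t mask = 0b10001;
  return (__builtin_popcount(x & ~mask) % 2) == 0;
}

int main() {
  for (uint16_t x = 0; x < 512; x++) {
    for (uint16_t y = 0; y < 512; y++) {
      bool sum_rare = is_rare(x ^ y);
      if (is_rare(x) == is_rare(y)) {
        // common (+) common and rare (+) rare are both rare
        assert(sum_rare);
      } else {
        // rare (+) common is common
        assert(!sum_rare);
      }
    }
  }
  std::printf("parity rules hold for all pairs below 512\n");
  return 0;
}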
Now consider the collection \{G(i) ⊕ G(x - i - 1) \mid 0 \leq i \leq x-1\} that we are taking the MEX of to determine G(x). As x becomes large, the values G(i) ⊕ G(x - i - 1) will be dominated by those of the form (\text{common} \oplus \text{common}), i.e. rare values, because there are vastly more common values than rare ones. The first missing value is therefore likely to be common, because only members of the smaller collection of sums (\text{rare} \oplus \text{common}) are capable of landing on common values.
Knowing that values are likely to be common suggests the following approach. First, calculate all (\text{rare} \oplus \text{common}) sums, and thereby determine the first missing common value. This is overwhelmingly likely to be the correct value in the end, but to verify this we’ll have to check sums not of this form. Here’s the key: we only have to rule out the rare values below this missing common value, and as soon as we’ve done this we can stop. Usually, the rare possibilities can be eliminated after checking a small number of additional sums. No further sums can possibly eliminate that first missing common value.
Here’s how this might look.
#include <array>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Uses is_rare as defined above.
int main() {
  std::vector<uint16_t> values;
  values.push_back(0); // G(0) = 0 base case
  values.push_back(0); // G(1) = 0 base case

  std::vector<std::pair<uint64_t, uint16_t>> rares;
  rares.push_back({0, 0});
  rares.push_back({1, 0});

  uint64_t pos = 2;
  while (true) {
    std::array<bool, 512> seen{};

    // First calculate the sums that result in common values:
    for (auto [rare_pos, rare_value] : rares) {
      uint16_t option = rare_value ^ values[pos - 1 - rare_pos];
      seen[option] = true;
    }

    // Determine the first missing common value:
    uint16_t mex_common = 0;
    while (is_rare(mex_common) || seen[mex_common]) {
      mex_common++;
    }

    // Count how many rare values below that have to be excluded:
    uint16_t missing_rare = 0;
    for (int i = 0; i < mex_common; i++) {
      if (is_rare(i) && !seen[i])
        missing_rare++;
    }

    // Calculate all sums until those rare values are covered:
    for (int i = 0; i < pos; i++) {
      uint16_t option = values[i] ^ values[pos - 1 - i];
      if (option < mex_common && !seen[option]) {
        seen[option] = true;
        missing_rare--;
        if (missing_rare == 0)
          break;
      }
    }

    if (missing_rare == 0) {
      // If we got them all, the result is indeed common:
      values.push_back(mex_common);
      std::cout << "G(" << pos << ") = " << mex_common << std::endl;
    } else {
      // Otherwise, we have found a new rare value.
      // Calculate the true mex:
      uint16_t mex_rare = 0;
      while (seen[mex_rare])
        mex_rare++;

      values.push_back(mex_rare);
      rares.push_back({pos, mex_rare});

      std::cout << "G(" << pos << ") = " << mex_rare
                << " New rare!" << std::endl;
    }

    pos++;
  }
}
This is a big improvement: we are now generating values at a rate of
50K/s. For now, we’re going to put aside the verification step and
focus on calculating the speculative common values as fast as
possible. There is a good parallelisable algorithm for verifying
these common values. (You can do the verification for each value
completely independently, under the assumption that the values that
came before it were correct. Only rarely (i.e. apparently never) is
this assumption violated, at which point you can redo the later ones
that relied on that incorrect value. And we had better hope it never
actually happens, because a new rare value out past 140 trillion will
make solving the game basically impossible.) So if our goal is to
push the calculation further than Grossman, then getting these
candidate values as fast as possible is the whole ballgame.
Values as Bytes
Grossman observes that, because a) all known values are less than 512 and b) common values obey a parity constraint, we can compress common values into 8 bits instead of 9. This is a big win: we can store each value in one byte rather than two. He does this by chopping off the highest bit and reconstructing it when necessary. Importantly, this doesn’t affect the calculation of the Nim sum:
\mathrm{compress}(x \oplus y) = \mathrm{compress}(x) \oplus \mathrm{compress}(y)
Here we can make a minor improvement. The downside of removing the highest bit is that this doesn’t preserve order. The difference has to be accounted for when calculating the MEX, to avoid picking an incorrect “smallest value”.
Instead, we can cut out the 1st bit and shift all higher bits down by one. This does preserve order, and so we only ever have to think about compression when we want to present the output.
uint8_t compress(uint16_t x) {
return ((x >> 1) & ~0b1) | (x & 0b1);
}
uint16_t uncompress(uint8_t x) {
auto pop = __popc(x & 0b11110110);
uint16_t bit1 = (pop % 2 == 0) ? 0b10 : 0;
return ((x & 0b11111110) << 1) | bit1 | (x & 0b1);
}
(Or if you like, that line could be bit1 = (~pop & 1) << 1.)
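As a sanity check (mine, not from the post or Grossman), compression should round-trip on every common value below 512, preserve the order of common values, and commute with XOR. Here is a host-side version of that check, with the __popc intrinsic swapped for __builtin_popcount:

#include <cassert>
#include <cstdint>
#include <cstdio>

bool is_rare(uint16_t x) {
  uint16_t mask = 0b10001;
  return (__builtin_popcount(x & ~mask) % 2) == 0;
}

uint8_t compress(uint16_t x) {
  return ((x >> 1) & ~0b1) | (x & 0b1);
}

// Host-side uncompress: __builtin_popcount instead of the __popc intrinsic.
uint16_t uncompress(uint8_t x) {
  auto pop = __builtin_popcount(x & 0b11110110);
  uint16_t bit1 = (pop % 2 == 0) ? 0b10 : 0;
  return ((x & 0b11111110) << 1) | bit1 | (x & 0b1);
}

int main() {
  for (uint16_t x = 0; x < 512; x++) {
    if (is_rare(x)) continue;
    assert(uncompress(compress(x)) == x);                        // round-trips
    for (uint16_t y = 0; y < 512; y++) {
      if (is_rare(y)) continue;
      if (x < y) assert(compress(x) < compress(y));              // order-preserving
      assert(compress(x ^ y) == (compress(x) ^ compress(y)));    // XOR-compatible
    }
  }
  std::printf("compress/uncompress behave as claimed on common values\n");
  return 0;
}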
Calculating a Block of Values
The above few sections have just been recapitulating previous work; now it’s time to do something GPU-specific.
In CUDA, operations are performed in “blocks” of “threads”, with a
block executed collectively on a “streaming multiprocessor”. (I am
going to stop with the scare quotes now.) For hardware reasons, it is
best for a block to have some multiple of 32 threads; a typical
number might be 256. Our plan is to have each thread compute one
additional Officers value. Let’s focus on a single block to keep
things simple for now.
Suppose we already know all values at positions smaller than n and
we are now trying to simultaneously compute the values for the range
[n, n+B), where B is the number of threads in our block. Our goal
is to run the first loop of the above algorithm simultaneously over
the block. So that all threads can make progress at the same time, we
want the computation to be as uniform as possible across those
threads. For each rare position p, all threads will calculate G(p)
\oplus G(x - 1 - p) for that rare value, and record the result as
ruled out for the MEX. (These rare values and their locations can be
stored in __constant__ memory; luckily it is fast to broadcast a
single value from constant memory to all threads in a block.) In
other words, at each step we are using a B-sized window over the
previous values, whose location is determined by the position of each
rare value. We’ll run through the rare positions largest to smallest,
so that the range of previous values we’re inspecting moves from
oldest to newest, with the window starting on the range [n-M, n-M+B)
and ending on the range [n, n+B), where M=20627 is the position of
the last known rare value.
Of course, we have to be careful: the values in the range [n, n+B)
aren’t actually known yet, they’re what we’re calculating right now.
The computation naturally splits into two phases. (I drew TikZ
diagrams of the windows involved for this page; they may not be very
helpful, but it was satisfying to get TikZ working.)
- Backwards: For all threads simultaneously, compute the sums
  involving a rare value and an already-known common value. So for
  the rare value at position p, we are reading the window of B
  consecutive values starting at position n - 1 - p. If for a thread
  the relevant previous value is still unknown, do nothing. This
  happens when the rare position p is small (p < B) and the current
  thread x has x > p, because then n + x - p - 1 \geq n is a
  still-unknown position, and so the window is shorter.
- Forwards: One thread x \in [0, B) at a time, calculate the value
  for thread x, then announce that value forwards to all the threads
  > x, to cover the cases missed by the backwards phase.
Here’s how this looks in code.
// The position and value of all (known) rare values
constexpr unsigned RARE_VALUE_COUNT = 1584;
__constant__ uint16_t rare_positions[RARE_VALUE_COUNT];
__constant__ uint8_t rare_values[RARE_VALUE_COUNT];

// The first BLOCK_SIZE values, with common values marked by 0xFF
constexpr unsigned BLOCK_SIZE = 256;
__constant__ uint32_t low_rare[BLOCK_SIZE];

// Initialised with the first SHARED_BUFFER_SIZE positions
__shared__ uint8_t buffer[SHARED_BUFFER_SIZE];

// Per-thread mask of options seen so far; position is the position this
// thread is computing, i.e. n + threadIdx.x
uint32_t mex_array[256 / 32] = {};

// The Backward phase:
for (unsigned rare_idx = 0; rare_idx < RARE_VALUE_COUNT; rare_idx++) {
  uint16_t rare_pos = rare_positions[rare_idx];
  uint8_t rare_value = rare_values[rare_idx];
  if (threadIdx.x < rare_pos + 1) {
    uint8_t prev = buffer[(position - 1 - rare_pos) % SHARED_BUFFER_SIZE];
    uint8_t option = prev ^ rare_value;
    set_bit(mex_array, option);
  }
}

// The Forward phase:
for (unsigned done_idx = 0; done_idx < BLOCK_SIZE; done_idx++) {
  // If it's our turn, calculate the MEX
  __shared__ uint8_t final_value;
  if (threadIdx.x == done_idx) {
    final_value = lowest_unset(mex_array);
    buffer[position % SHARED_BUFFER_SIZE] = final_value;
  }
  __syncthreads();

  // Communicate this value to threads computing later positions
  if (threadIdx.x > done_idx) {
    uint8_t rare_value = low_rare[threadIdx.x - done_idx - 1];
    if (rare_value != 0xFF)
      set_bit(mex_array, final_value ^ rare_value);
  }
  __syncthreads();
}
With 256 threads this goes at about 470K/s, which is a nice improvement.
Tracking the MEX in a Thread
Above we used some functions set_bit and lowest_unset; how do we
implement those operations efficiently? Because each value fits into
a byte, we will have each thread keep track of the values it has seen
by storing a 256-bit mask, and setting single bits as we go.
void set_bit(uint32_t array[8], const unsigned i) {
  __builtin_assume(i < (1u << 8));
  array[i / 32] |= 1u << (i % 32);
}
Because the index i/32 is not a compile-time constant, there’s a risk
that this array will end up in local memory rather than in registers,
if the compiler can’t figure out that this can be written as an
unrolled loop. (Despite the name, local memory is actually a
thread-specific piece of global memory. It’s where, for example,
spilled registers get stored.) For reasons mysterious to me,
sometimes the compiler can do this and sometimes not. Just in case,
we’ll do the optimisation by hand.
We can manually write the operation as a loop:
#pragma unroll 8
for (unsigned j = 0; j < 8; j++) {
  if (j == i / 32) {
    array[j] |= 1u << (i % 32);
  }
}
This compiles to a straight-line program, using predicated instructions to operate on the correct entry of the array.
Here’s an alternative trick which comes out slightly ahead and, for reasons again mysterious to me, seems to cause less register pressure.
#pragma unroll 8
for (unsigned j = 0; j < 8; j++) {
  array[j] |= __funnelshift_lc(0, 1, i - (32 * j));
}
We use a “clamped funnel shift” to produce the correct mask for each
array entry directly. For this operation, shifting by more than 32 is
clamped to a shift of exactly 32 rather than wrapping around back to
zero. So, for an array entry j
that is too small, the 0 value is
fully shifted into the result. And when j
is too large, the shift
distance underflows and again the 0 is fully shifted in.
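To make the clamping concrete, here is a host-side model of what I understand __funnelshift_lc(0, 1, s) to compute, together with a check that it produces exactly the per-entry masks we want. This is my own illustration of the documented semantics, not code from the repository:

#include <algorithm>
#include <cassert>
#include <cstdint>

// Model of __funnelshift_lc(lo, hi, shift): concatenate hi:lo into 64 bits,
// shift left by min(shift, 32) (clamped, not wrapped), return the top 32 bits.
uint32_t funnelshift_lc_model(uint32_t lo, uint32_t hi, uint32_t shift) {
  uint32_t s = std::min(shift, 32u);
  uint64_t concat = (uint64_t(hi) << 32) | lo;
  return uint32_t((concat << s) >> 32);
}

int main() {
  for (unsigned i = 0; i < 256; i++) {
    for (unsigned j = 0; j < 8; j++) {
      // The mask contributed to array[j] when setting bit i:
      uint32_t mask = funnelshift_lc_model(0, 1, i - 32 * j);
      uint32_t expected = (j == i / 32) ? (1u << (i % 32)) : 0u;
      assert(mask == expected);
    }
  }
  return 0;
}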
At the end of the main loop, the MEX can be computed from the mask by
using the __ffs
(“find first set”) intrinsic to pick out the first 0
bit.
int lowest_unset(uint32_t array[8]) {
for (unsigned i = 0; i < 256/32; i++) {
if (array[i] != 0xFFFFFFFF) {
int bit_pos = __ffs(~array[i]) - 1; // __ffs returns 1-based position
return i * 32 + bit_pos;
}
}
__builtin_unreachable();
}
While we’re at it, we can shove the MEX array into a struct rather
than using an array directly, and manually unroll the loop into 8
separate struct members. For reasons unclear to me, this does cause a
slight speed improvement.
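I’ll only sketch what such a struct might look like; the member and method names here are my own invention rather than the ones in the repository:

#include <cstdint>

// Hypothetical sketch: the MEX mask as 8 explicit members, so the compiler
// is more inclined to keep every word in a register.
struct MexMask {
  uint32_t w0 = 0, w1 = 0, w2 = 0, w3 = 0, w4 = 0, w5 = 0, w6 = 0, w7 = 0;

  __device__ void set(unsigned i) {
    // Same clamped-funnel-shift trick, written out per member.
    w0 |= __funnelshift_lc(0, 1, i - 0);
    w1 |= __funnelshift_lc(0, 1, i - 32);
    w2 |= __funnelshift_lc(0, 1, i - 64);
    w3 |= __funnelshift_lc(0, 1, i - 96);
    w4 |= __funnelshift_lc(0, 1, i - 128);
    w5 |= __funnelshift_lc(0, 1, i - 160);
    w6 |= __funnelshift_lc(0, 1, i - 192);
    w7 |= __funnelshift_lc(0, 1, i - 224);
  }

  __device__ int lowest_unset() const {
    if (w0 != 0xFFFFFFFF) return 0   + __ffs(~w0) - 1;
    if (w1 != 0xFFFFFFFF) return 32  + __ffs(~w1) - 1;
    if (w2 != 0xFFFFFFFF) return 64  + __ffs(~w2) - 1;
    if (w3 != 0xFFFFFFFF) return 96  + __ffs(~w3) - 1;
    if (w4 != 0xFFFFFFFF) return 128 + __ffs(~w4) - 1;
    if (w5 != 0xFFFFFFFF) return 160 + __ffs(~w5) - 1;
    if (w6 != 0xFFFFFFFF) return 192 + __ffs(~w6) - 1;
    if (w7 != 0xFFFFFFFF) return 224 + __ffs(~w7) - 1;
    return 256; // unreachable as long as values stay below 256
  }
};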
Multiple Blocks
Of course, we don’t want to restrict ourselves to only one block: the GPU has several multiprocessors that can run blocks simultaneously. We might have K blocks going at once, so that block 0 is responsible for computing the range [n, n+B), block 1 is responsible for the range [n+B, n+2B), and so on.
We have to modify the backwards phase, because a block may catch up to values currently being computed by earlier blocks and have to wait for them to complete. On a step with rare position p, the latest position in the current range, i.e. n + B - 1, will need to read from the buffer at position (n + B - 1) - p - 1. We can be certain other blocks aren’t working on that position as long as it is strictly less than n - B(K-1); a little rearrangement shows that we only risk having to wait when p < BK - 1.
We are safe again as soon as p < B-1: all the relevant values are going to be computed local to the current block in the forward phase.
To make this work, we need to synchronise the blocks somehow. Here I am going to simply store a value in global memory that records the overall progress, i.e., up to what position the values are known. Each thread will also know up to what position its block’s shared buffer is correct, and if a still-unknown value is needed, we’ll have the block spin until that value is available.
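For concreteness, the publishing side of this scheme might look something like the sketch below. This is my guess at the mechanism rather than the repository’s actual code; it assumes global_progress is declared volatile so that spinning blocks re-read it, and that blocks publish their ranges in order.

// Hypothetical sketch of how a block could publish its finished range.
// After the forward phase, the block's results are already in global_buffer.
__syncthreads();                     // every thread's result has been written
if (threadIdx.x == 0) {
  __threadfence();                   // make those writes visible device-wide first
  // Publish in order: wait until every earlier block has published.
  while (*global_progress != block_offset) {
  }
  *global_progress = block_offset + BLOCK_SIZE;   // advance the frontier
}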
The backward phase now looks like:
// The Backward phase:
for (unsigned rare_idx = 0; rare_idx < RARE_VALUE_COUNT; rare_idx++) {
  uint16_t rare_pos = rare_positions[rare_idx];
  uint8_t rare_value = rare_values[rare_idx];

  // Test whether we are in the danger zone:
  if (rare_pos < BLOCK_SIZE * BLOCK_COUNT - 1 && rare_pos > BLOCK_SIZE) {
    // Until we have the value we need:
    while (buffer_correct <=
           block_offset + (BLOCK_SIZE - 1) - 1 - rare_pos) {
      // Wait for the next block to be available:
      if (threadIdx.x == 0) {
        while (*global_progress == buffer_correct) {
        }
      }
      __syncthreads();

      // And copy that block of values in:
      buffer[(buffer_correct + threadIdx.x) % SHARED_BUFFER_SIZE] =
          global_buffer[(buffer_correct + threadIdx.x) % GLOBAL_BUFFER_SIZE];
      buffer_correct += BLOCK_SIZE;
    }
    __syncthreads();
  }

  // Now proceed as before:
  if (threadIdx.x < rare_pos + 1) {
    uint8_t prev = buffer[(position - 1 - rare_pos) % SHARED_BUFFER_SIZE];
    uint8_t option = prev ^ rare_value;
    set_bit_8(mex_array, option);
  }
}
With 8 blocks of size 256, this does 942K/s. Getting better!
Speculating in the Forward Phase
As observed by Grossman, the speed of the calculation is ultimately limited by how quickly each block can get through its critical segment, that is, the part where it computes the MEX. While this is happening, all other blocks are likely waiting for it to finish. Our priority now should be to make that forward phase as fast as possible.
Once we’ve completed the backwards phase, we can have each thread
speculate on what its MEX will be given the past values it’s seen so
far. This can be done simultaneously over all threads. Then, we can
use these speculative values to mark the MEX arrays, and recalculate
the MEXes (again simultaneously). (This is so obviously the right
thing to do that it’s embarrassing I didn’t do it to begin with.)
Here’s the trick: if we get the same answer as the first round of
MEXes, then in fact this is the right answer. If our blocks are of
size 256, we’ve replaced 256 sequential MEX calculations and 256
forward broadcasts with 2 parallel MEX calculations and 51 backwards
lookups (the number of rare positions below 256).
If we don’t get the same answer, we simply repeat the process. The
values get monotonically more correct from left to right after each
round, and so this is guaranteed to reach a fixed point. In principle
we could get very unlucky, but most of the time it only takes two or
three rounds. (In a quick test, the chance of a block being done
after n rounds was 0%, 10%, 62%, 84%, 96%, 99%.)
Because the values within the block are changing, we have to store a
second mex_array
for these final values so that it can be easily
reset between rounds.
Here’s how the new forwards phase looks:
// Do the first round of speculation using purely the values from
// the backwards phase.
uint32_t prev_mex = lowest_unset(mex_array);
buffer[position % SHARED_BUFFER_SIZE] = prev_mex;
__syncthreads();

__shared__ bool done;
if (threadIdx.x == 0)
  done = false;

while (!done) {
  // Initialise a mex array just for these final values
  uint32_t intra_mex_array[256 / 32];
  for (unsigned i = 0; i < 256 / 32; i += 1) {
    intra_mex_array[i] = 0;
  }

  // Do a backward phase calculation as before but storing the
  // result in this separate array
  for (unsigned rare_idx = 0; rare_idx < LAST_RARES; rare_idx++) {
    uint16_t rare_pos = rare_positions_rev[RARE_COUNT - 1 - rare_idx];
    uint8_t rare_value = rare_values_rev[RARE_COUNT - 1 - rare_idx];
    uint8_t prev = buffer[(position - 1 - rare_pos) % SHARED_BUFFER_SIZE];
    uint8_t option = prev ^ rare_value;
    set_bit_8(intra_mex_array, option);
  }

  // Calculate the new MEX from the combined arrays
  uint32_t total_mex_array[256 / 32];
  for (unsigned i = 0; i < 256 / 32; i += 1) {
    total_mex_array[i] = mex_array[i] | intra_mex_array[i];
  }
  uint32_t current_mex = lowest_unset(total_mex_array);
  buffer[position % SHARED_BUFFER_SIZE] = current_mex;

  // If any of the values changed, we have to retry
  if (threadIdx.x == 0)
    done = true;
  __syncthreads();
  if (current_mex != prev_mex)
    done = false;
  __syncthreads();

  prev_mex = current_mex;
}

// Copy results out
global_buffer[position % GLOBAL_BUFFER_SIZE]
    = buffer[position % SHARED_BUFFER_SIZE];
This modest change gets us 3.6M positions per second, a 4× improvement. Wow!
Following this reasoning a bit further, it seems reasonable to try to
avoid doing the entire loop over the LAST_RARES rare positions, only
looping over the suffix of the values that are still unconfirmed.
Unfortunately, the extra bookkeeping involved makes this slightly
slower than just doing the whole loop every time. (It is extremely
demoralising when this happens, when making the code smarter ends up
being a total waste.)
Synchronising Properly
This code works on my machine and with these particular parameters, but it possibly shouldn’t. The CUDA programming model makes no guarantees about when blocks get run in relation to each other, so the strategy I’ve used here of synchronising via a global variable could deadlock if for some reason one of the blocks never gets scheduled.
As far as I can tell there are three official methods for synchronising across different blocks, even those that are part of the same kernel:
- Wait for all blocks to run to completion, then launch a new computation.
- Use cuda::barrier<cuda::thread_scope_device>.
- Use the grid.sync() method in the Cooperative Groups library.
It would be nice to implement things “properly” and use one of these
methods. The one that fits best is to use cuda::barrier.
Unfortunately, this ends up significantly slower than the cheating
method I’m using now, so I’ll keep that in my back pocket in case
deadlocks become an actual issue. I haven’t tracked down exactly why
this slowdown occurs.
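For reference, the Cooperative Groups option from the list above looks roughly like this. It is a generic illustration of that API, not the (shelved) implementation; officers_kernel is a placeholder name:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void officers_kernel(/* ... */) {
  cg::grid_group grid = cg::this_grid();
  // ... compute this block's batch of values ...
  grid.sync();  // no block passes this point until every block reaches it
  // ... now safe to read other blocks' results from global memory ...
}

// The kernel must be launched with cudaLaunchCooperativeKernel (and the
// whole grid must fit on the device at once) for grid.sync() to be valid.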
The State of Play
Everything described above is available at
https://github.com/mvr/officers. At the time of writing it
calculates values at 3.6M/s on my (dinky little) machine. This is
pathetic: Grossman’s code gets 5M/s on 8 (CPU) threads, and I’m only
getting 3.6M/s on vastly more (GPU) threads. We should be able to do
better. Clock speed goes some of the way (but not all the way!) to
explaining the speed difference: Grossman was using a 2.66GHz CPU
whereas this GPU runs at 900MHz. The __ffs intrinsic used for
calculating the MEX is also more expensive than you would expect,
with an SM not even able to calculate an entire warp’s worth per
cycle.
We’ll need a better strategy. The next idea, and what I’ll leave for next time, is to scale up speculation to the level of entire blocks, allowing each block to compute a speculative set of values based on unconfirmed previous ones. If those unconfirmed previous values turn out to be correct, the speculative values themselves are also correct.
At worst this would be as fast as the single-block version, because the earliest unconfirmed block never has to actually do any speculation. Again, it will have to be determined empirically how many rounds of speculation each block needs on average before all mistakes are fixed.
PS. My personal suspicion is that Officers is periodic but with a pre-period so long that humans will never live to find out what it is. This is more about the journey than the destination.
References
- Berlekamp, Elwyn R.; Conway, John H.; Guy, Richard K. Winning Ways for Your Mathematical Plays. Vol. 1. Second edition. A K Peters, Ltd., Natick, MA, 2001. xx+276 pp. ISBN: 1-56881-130-6
- Caines, Ian; Gates, Carrie; Guy, Richard K.; Nowakowski, Richard J. Periods in Taking and Splitting Games. Amer. Math. Monthly 106 (1999), no. 4, 359–361.
- Flammenkamp, A. Sprague-Grundy Values of Octal-Games. 2021
- Gangolli, A.; Plambeck, T. A Note on Periodicity in Some Octal Games. Internat. J. Game Theory 18 (1989), no. 3, 311–320
- Grossman, J. P. Searching for periodicity in Officers. Games of No Chance 5, 373–385, Math. Sci. Res. Inst. Publ., 70, Cambridge Univ. Press, Cambridge, 2019.