Advent of Code 2021 in Rust before Python can start
I did Advent of Code for the second time this year. For those of you who haven't heard of it, it's a set of short programming puzzles, one released on each of the first 25 days of December, with each puzzle split into two parts. People tend to have different goals when solving the problems: some race to finish quickly and make the leaderboard, some try to learn new languages, and some do ridiculous things (like the person who solved most of the puzzles in Scratch!). Rather than focusing on quickly submitting solutions, I tried to optimize the runtime of my solutions, with an initial goal of solving all puzzles in 800ms (about the time it takes the JVM to cold start).
As you may have guessed from this article's title, that goal was achieved: I solved all 49 puzzles in ~65ms total, without multithreading, on an AMD 5950x. This greatly exceeded my expectations, and ended up being faster than Python can even start (~70ms on the same machine)! Additionally, three days (23, 25, and 15) account for the vast majority of that time; without them, the remaining solutions run in a handful of milliseconds total.
Note: The title of this article is a bit clickbaity. With Python not resident in the page cache, it takes the 5950x-based system ~70ms to run `time python3 -c "exit()"`, but only ~10ms with a hot page cache.
Show me the code!
I had the idea to try using rustdoc to present my code along with explanations of what I found interesting about each problem and how I optimized it. This format should let you browse through my solutions, see the types I used, read my thoughts, and view the actual code I wrote. It's the first time I've tried using rustdoc like this, so I'd appreciate feedback.
You can find my writeups and solutions here, and you should probably read them before continuing. You'll get the most value out of them if you've already solved AoC 2021, but you can also follow along with the problem descriptions for part one of each day.
Now that you've done that...
Optimizing code
An unexpected problem with writing efficient AoC solutions is that they're difficult to profile, making it hard to make informed decisions about what to improve. Many performance analysis tools take an approach similar to perf, sampling the running program every few milliseconds (totally non-biased shoutout to FunctionTrace as a great non-sampled Python profiler!). Since even the slowest puzzle (day 23) ran in ~31ms, a sampling profiler can't gather enough data points to base optimization decisions on. As a result, I generally optimized based on a few heuristics, then used cargo-criterion to validate my hypotheses.
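To show what that validation loop looked like, here's a minimal cargo-criterion benchmark sketch. The `solve` function and input path are placeholders rather than my actual code, and it assumes the criterion crate is in the project's dev-dependencies:

```rust
// benches/day23.rs - a minimal criterion benchmark (sketch).
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder standing in for a real solution function.
fn solve(input: &str) -> usize {
    input.len()
}

fn bench_day23(c: &mut Criterion) {
    let input = std::fs::read_to_string("input/day23.txt").unwrap();
    // criterion runs the closure many times and reports statistics,
    // which works even when a single run is far below a sampling
    // profiler's resolution.
    c.bench_function("day23", |b| b.iter(|| solve(&input)));
}

criterion_group!(benches, bench_day23);
criterion_main!(benches);
```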
I relied on two main heuristics, which are worth keeping in mind any time you care about performance:
- The fastest code is code that doesn't exist
- CPUs are fast, and memory is very slow
My initial goal was always to find a faster algorithm that let me avoid some computation entirely. For example, on Day 17 Part 2, no amount of optimizing the per-step projectile simulation will beat skipping most of the steps via triangle numbers. Note that doing less work doesn't always require a different algorithm: writing code so the compiler can elide bounds checks, or memoizing the results of some computations, can be just as effective.
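To make the triangle-number idea concrete, here's a small sketch (the function names are mine, not from my actual solution): with drag reducing the x-velocity by 1 each step, the probe's x position after t steps is a difference of triangle numbers, so it can be computed in O(1) instead of by simulating every step.

```rust
/// Triangle number: 1 + 2 + ... + n.
fn triangle(n: i64) -> i64 {
    n * (n + 1) / 2
}

/// X position of a probe launched with x-velocity `vx` after `t` steps.
/// Drag reduces the velocity by 1 each step until it hits 0, so the
/// position is vx + (vx - 1) + ..., a difference of triangle numbers.
fn x_after(vx: i64, t: i64) -> i64 {
    let t = t.min(vx); // after vx steps, horizontal movement stops
    vx * t - triangle(t - 1)
}

fn main() {
    assert_eq!(x_after(7, 3), 7 + 6 + 5);
    // Once drag wins, the probe settles at triangle(vx).
    assert_eq!(x_after(7, 100), triangle(7));
}
```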
After finding an efficient algorithm, I focused on minimizing the amount of time spent waiting on reads from main memory. This is where the processor's caches come in: they transparently cache accesses to main memory, but have a fixed size that tends to be substantially smaller than the amount of RAM on a system (my development machine with 32GB of RAM only has ~1MB of cache!). To use the cache efficiently, it's important both to minimize the size of the data being operated on and to access it in patterns the cache can serve.
Day 23 has a good example of minimizing the size of data. We need to represent and iterate through many different game states, but a naive game state consists of 27 spaces, each requiring a byte to store, which the compiler then pads to 32 bytes. By removing unreachable states and compressing some information into bits, we're able to reduce this to 16 bytes, doubling the number of game states that fit in the cache before taking the long trip to main memory. Removing only this 16-byte optimization makes day 23 more than 30% slower!
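For a rough illustration of the kind of packing involved (this layout is hypothetical, not my actual representation): four rooms of four slots at 3 bits per amphipod fit in one u64, and the 11-space hallway fits in a second, landing on 16 bytes total.

```rust
use std::mem::size_of;

// Naive: one byte per space. In a real state struct with extra
// fields, alignment can pad this out even further.
#[allow(dead_code)]
struct NaiveState {
    spaces: [u8; 27],
}

// Packed (hypothetical layout): 4 rooms x 4 slots x 3 bits = 48 bits
// in one u64, and 11 hallway spaces x 3 bits = 33 bits in another.
#[allow(dead_code)]
struct PackedState {
    rooms: u64,
    hallway: u64,
}

fn main() {
    println!("naive:  {} bytes", size_of::<NaiveState>()); // 27
    println!("packed: {} bytes", size_of::<PackedState>()); // 16
}
```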
While optimizing the size of data structures helps fit more data into the cache, letting the CPU spend its time computing rather than waiting on memory, it's also important to ensure the program accesses memory in cache-friendly patterns. To be feasible to implement in hardware, CPU caches store lines (typically 64 bytes each) rather than individual bytes. This means that after you access a location in memory, accessing anything near it soon afterwards will likely hit the cache, avoiding a round-trip to main memory.

We utilize this in Day 9 by iterating over each row in order rather than using something like BFS. Since each entry is a u16 taking 2 bytes, every cache line we load brings in 32 entries that we're about to use. If we instead iterated over each column, we'd load many different lines into the cache without immediately using them, causing far more trips to main memory. Day 15 is a good example of a case where we can't do this well: with Dijkstra's algorithm, the next node to visit may be located anywhere in the grid (512KB in memory), so we're rarely able to share memory accesses via the cache and instead spend most of our time waiting for data to load from main memory.
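Here's a self-contained sketch of that access-pattern difference on a hypothetical grid of u16s (the size and names are illustrative, not from my solutions):

```rust
const N: usize = 512;

fn sum_row_major(grid: &[u16]) -> u64 {
    let mut sum = 0u64;
    // Adjacent iterations touch adjacent bytes: each 64-byte cache
    // line holds 32 u16s, so most reads hit the cache.
    for row in 0..N {
        for col in 0..N {
            sum += grid[row * N + col] as u64;
        }
    }
    sum
}

fn sum_col_major(grid: &[u16]) -> u64 {
    let mut sum = 0u64;
    // Each iteration jumps N * 2 bytes ahead, landing on a new cache
    // line every time; on a large enough grid, lines are evicted
    // before we return for their other 31 entries.
    for col in 0..N {
        for row in 0..N {
            sum += grid[row * N + col] as u64;
        }
    }
    sum
}

fn main() {
    let grid = vec![1u16; N * N];
    // Same answer, very different memory behavior.
    assert_eq!(sum_row_major(&grid), sum_col_major(&grid));
}
```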
Thoughts on Rust
I continue to really enjoy Rust. Many of my early solutions look fairly similar to the equivalent Python, except with enforced error handling, far fewer ways to be silently wrong, and blazingly fast performance (it wouldn't be a Rust article if I didn't use that phrase somewhere). By letting me rarely debug runtime errors while still getting native performance, Rust has become my favorite practical language.
Most of my solutions went through two phases: first I solved the problem without particularly caring about performance, then I optimized it until its performance was satisfactory.
During the first phase, I freely used fancy data structures like vectors and hash tables; during the second, I converted them to simpler data structures like fixed-size arrays. Rust's strong type system let me refactor with confidence, knowing I couldn't miss a spot and end up with annoying issues at runtime.
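As a made-up example of that kind of refactor (not from any specific day): counting occurrences of small integer keys starts life as a HashMap, then becomes a fixed-size array once the key range is known, removing hashing and allocation entirely.

```rust
use std::collections::HashMap;

// First pass: a HashMap keyed by a small integer (easy to write).
fn count_with_map(values: &[u8]) -> HashMap<u8, u64> {
    let mut counts = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
}

// Second pass: when keys are known to be 0..9, a fixed-size array
// gives the same answer with no hashing and no allocation.
fn count_with_array(values: &[u8]) -> [u64; 9] {
    let mut counts = [0u64; 9];
    for &v in values {
        counts[v as usize] += 1;
    }
    counts
}

fn main() {
    let input = [3u8, 4, 3, 1, 2];
    let map = count_with_map(&input);
    let arr = count_with_array(&input);
    assert!(map.iter().all(|(&k, &c)| arr[k as usize] == c));
}
```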
Performance-wise, it's really nice to be able to control memory usage by choosing specific data types, while also not needing to manually manage memory via malloc() and free(). On many days I chose integer sizes specifically to optimize memory layout and cache efficiency, but I almost never thought about allocations (and even so, they only make up a tiny fraction of the flamegraph for my solutions).
Lifetimes are one of the things that new Rust programmers seem to find most confusing, but the Rust team has done a great job here: I used references all over my code, yet only needed explicit lifetime annotations twice. The compiler does a much better job of getting out of your way than it did in the pre-1.0 days.
I didn't need to touch unsafe at all, and based on some quick profiling in VTune, I don't believe I left any performance on the table as a result. This reinforces my belief that the average programmer reaching for unsafe should instead consider refactoring their code to use a different pattern.
I didn't use nightly Rust at all. This mostly worked out, but there were some unstable APIs I would've loved to have. Off the top of my head, BTreeMap::pop_first would've simplified my hacky priority queue implementation, and there were various unstable features related to const generics that I expected to already exist.
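For illustration, here's roughly what the stable-Rust workaround for the missing BTreeMap::pop_first looks like (a sketch, not my exact code), using a BTreeMap keyed by priority:

```rust
use std::collections::BTreeMap;

// Without pop_first, removing the smallest entry takes two steps:
// peek at the first key (BTreeMap iterates in sorted order), then
// remove it.
fn pop_first(queue: &mut BTreeMap<u32, Vec<usize>>) -> Option<(u32, Vec<usize>)> {
    let key = *queue.keys().next()?;
    let value = queue.remove(&key)?;
    Some((key, value))
}

fn main() {
    let mut queue = BTreeMap::new();
    queue.insert(3, vec![30]);
    queue.insert(1, vec![10]);
    assert_eq!(pop_first(&mut queue), Some((1, vec![10])));
}
```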
Const generics seem like a neat idea, and this was my first time actually applying them in a Rust program. I used them in a few places, but due to some limitations, they never felt like a big improvement over passing an additional argument. In particular, the inability to use expressions involving const generic parameters in types meant I couldn't make the memory-layout optimizations I had expected to.
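For example, something like the following (an illustrative snippet, not from my solutions) still requires the unstable generic_const_exprs feature, so on stable I couldn't size a data structure from an arithmetic expression over a const parameter:

```rust
// Fine on stable: the const parameter is used directly.
fn zeros<const N: usize>() -> [u8; N] {
    [0; N]
}

// Rejected on stable: expressions over const generic parameters in
// types need the unstable `generic_const_exprs` feature.
// fn flat_grid<const N: usize>() -> [u8; N * N] {
//     [0; N * N]
// }

fn main() {
    let buf: [u8; 4] = zeros::<4>();
    assert_eq!(buf.len(), 4);
}
```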
Integer operations seem to be horribly annoying in Rust. If you care about how much memory your integers use, you'll end up casting back and forth between usize (to index collections), isize (to handle potentially negative values), and whatever integer type you actually want. It's a reasonable design decision, but it's incredibly frustrating to litter code with as usize and the like when I know the conversion is safe.
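Here's the kind of cast-littering I mean, in a made-up bounds-checked neighbor lookup on a flattened grid:

```rust
// Illustrative: fetching a neighbor on a flattened grid of u16s.
// Every coordinate bounces between usize (indexing), isize (signed
// offsets), and back again.
fn neighbor(grid: &[u16], width: usize, idx: usize, dx: isize, dy: isize) -> Option<u16> {
    let x = (idx % width) as isize + dx;
    let y = (idx / width) as isize + dy;
    if x < 0 || y < 0 || x >= width as isize {
        return None;
    }
    grid.get(y as usize * width + x as usize).copied()
}

fn main() {
    let grid = vec![0u16; 9]; // a 3x3 grid
    assert_eq!(neighbor(&grid, 3, 4, -1, 0), Some(0)); // left of center
    assert_eq!(neighbor(&grid, 3, 0, -1, 0), None); // off the edge
}
```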
rustdoc is amazing. Being able to browse generated documentation for everything in the ecosystem, including useful features like searching for functions by return type, makes onboarding to new libraries (or even the standard library) a breeze. Something like Hoogle would still be nice to have, but I find Rust's docs a massively more pleasant experience than the standard Python or Java docs.
A challenge appears
I found using Rust to solve Advent of Code 2021 enjoyable, and it was satisfying to write some high-performance solutions with it. Getting initial solutions felt about as fast as it would've been in a dynamic language like Python (though optimization obviously took longer), and it's nice to feel like my CPU is being used effectively.
I'd love to see more people attempt to solve AoC with minimal runtime, and I think a comparison with other high-performance languages like C or C++ would be fun.
From a quick scan of the Advent of Code subreddit, I believe I have the fastest public implementation for all days. If anyone believes that's wrong (particularly if benchmarked on one of my available systems), let me know!