Introduction


With the release of Claude Opus 4.5, Anthropic famously retired “a notoriously difficult take-home exam.” Why? Because the model scored higher than any human candidate ever had, and in just two hours of work. Recently, the company released a version of this assignment along with a blog post describing its design principles.

The author, Tristan Hume, wanted to create "something genuinely engaging that would make candidates excited to participate," not something "filled with generic problems." And boy, did he pull it off: I spent far more time on this task than I planned simply because it was fun to figure out, even though I never even aimed for the performance optimization team.

In this post, I will break down the challenge from start to finish. We will look at the provided processor architecture, the algorithm we need to optimize, and the three key techniques that yield the biggest performance boost (a 65x speedup!). You don't need a background in performance optimization or processor architecture: I will explain everything from scratch, including SIMD, VLIW, and other scary-sounding things.

The Hardware

Of course, the company doesn’t mail you a physical chip. The whole "processor" is just a Python script that models the behavior of a fictional device.

Formally, the task is to optimize code for a fake accelerator whose characteristics resemble a TPU. It is a sandbox that mimics Google's real AI chips, the ones used for training and running inference on neural networks (their usage has expanded well beyond Google; Anthropic and even OpenAI use them now).

The simulated machine includes features that make accelerator optimization interesting, such as SIMD lanes, VLIW-style instruction scheduling, and an explicitly managed scratchpad memory. These features also hint at the types of optimizations we will need to apply.

Don't sweat it if this sounds too complex. We will break it down in detail. Here is a diagram to help:

[Diagram: schematic of the machine. DRAM on the left, connected by a Bus to the Scratchpad inside the accelerator core; Load / Store arrows point at the Scratchpad's free space.]

This is a schematic view of our machine. On the left, we have DRAM, which stores everything. Getting data from this memory is faster than reading from a hard drive. However, there is no hard drive in this task since the data volumes are small, and that is not the point. In the provided simulator code, DRAM is just a list[int].

DRAM is connected to the Scratchpad, the memory living inside the accelerator core.

It is crucial to understand that the Scratchpad is small. We can't keep everything in it at once, so we have to be smart about the order in which data is loaded in and written back out. We get 1,536 cells of 32-bit values. In general these could be floats or integers, but for the rest of this post we will assume we are working only with 32-bit integers.

In the code, the Scratchpad is just a list[int] = [0] * 1536. That's plenty of room if you are working with 2, 3, or 4 elements, but tiny if your algorithm needs to process hundreds of thousands of items.
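For concreteness, here is a minimal sketch of that memory model in plain Python. The variable names and the DRAM size are my assumptions; only the 1,536-cell Scratchpad size comes from the task.

```python
SCRATCHPAD_SIZE = 1536  # given by the task: 1,536 cells of 32-bit values
DRAM_SIZE = 100_000     # assumption: just "big enough" for the inputs

dram: list[int] = [0] * DRAM_SIZE              # large, slower main memory
scratchpad: list[int] = [0] * SCRATCHPAD_SIZE  # small, fast on-core memory
```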

A Bus connects the Scratchpad and DRAM. It is the communication channel that moves data between the two. Even though the diagram shows Load / Store connected to "Free Space," these operations can actually read from and write to any Scratchpad cell, including overwriting existing values.
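Sketched in code, the load/store handling might look something like this. This is a reconstruction under my assumptions, not the simulator's actual source; the Core and Slot shapes simply follow the parenthetical note below.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    kind: str       # "load" or "store"
    sp_addr: int    # which Scratchpad cell to touch
    dram_addr: int  # which DRAM cell to touch

@dataclass
class Core:
    dram: list[int] = field(default_factory=lambda: [0] * 100_000)
    scratchpad: list[int] = field(default_factory=lambda: [0] * 1536)

def execute_memory_slot(core: Core, slot: Slot) -> None:
    if slot.kind == "load":     # bus transfer: DRAM -> Scratchpad
        core.scratchpad[slot.sp_addr] = core.dram[slot.dram_addr]
    elif slot.kind == "store":  # bus transfer: Scratchpad -> DRAM
        core.dram[slot.dram_addr] = core.scratchpad[slot.sp_addr]
```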

(here core and slot are just abstractions for storing the necessary info, including operation arguments like load and store addresses)

A couple of things worth noting: