by Igor Kotenkov · Feb 3rd, ‘26

You can find the accompanying code on GitHub.

With the release of Claude Opus 4.5, Anthropic famously retired “a notoriously difficult take-home exam.” Why? Because after just two hours of work, the model scored higher than any candidate ever. Recently, the company released this assignment publicly, along with a blog post detailing their design principles.
The author, Tristan Hume, wanted to create "something genuinely engaging that would make candidates excited to participate," rather than a test "filled with generic problems." And boy, did he pull it off. I ended up spending way more time on this task than I planned simply because it was fun to solve—even though I never intended to join the performance optimization team.
In this post, I will break down the challenge from start to finish. We will look at the provided processor architecture, the algorithm to optimize, and three key techniques that yield the biggest performance boost (a 65x increase!). You don't need a background in performance optimization or processor architecture to follow along; I’ll explain everything from scratch, including SIMD, VLIW, and other intimidating acronyms.
Of course, the company doesn’t mail you a physical chip. The "processor" is actually a Python script that simulates the behavior of a fictional device.
Formally, the task is to optimize code for a fake accelerator. While a general-purpose CPU is designed to handle any logic you throw at it, an accelerator strips away that flexibility to devote every transistor to raw mathematical throughput.
In this case, the simulator mimics Google’s TPUs. These chips are primarily used for AI training and inference, though they have expanded well beyond Google—Anthropic and even OpenAI have started to use them.
The simulated machine includes several features that make optimizing for accelerators particularly interesting. They also hint at the specific strategies we’ll need to apply:
- SIMD: a single instruction operates on a whole vector of values at once, rather than one value at a time;
- VLIW: each instruction word bundles several operations that execute in the same cycle;
- a small on-chip Scratchpad, which forces us to explicitly manage the movement of data to and from main memory.
Don't worry if this sounds complex right now—we will break it down in detail. Here is a diagram to help visualize the layout:

This is a schematic view of our machine. On the left, we have DRAM, which acts as the main storage. Reading from DRAM is certainly faster than reading from a hard drive, but it’s still slow relative to the processor. (There is no hard drive in this task: the data volumes are small, and storage is not the point of the take-home.) In the provided simulator code, DRAM is just a list[int].
DRAM connects to the Scratchpad, which is the memory residing directly inside the accelerator core.
It’s crucial to understand that the Scratchpad size is limited. We can't store everything there at once, so we have to be strategic about the sequence of loading and offloading data. We get 1,536 cells of 32-bit values. While these could technically be floats or integers, for the rest of this post, we will assume we’re working exclusively with 32-bit integers.
In the code, the Scratchpad is represented the same way: a list[int] initialized to [0] * 1536. While 1,536 registers sound like plenty if you are juggling two or three variables, it’s tiny when your algorithm needs to process hundreds of thousands of items.
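Both memories are plain Python lists in the simulator. Here is a minimal sketch of the model; the variable names and the DRAM size are my own choices for illustration, and only the 1,536-cell scratchpad size comes from the task:

```python
SCRATCHPAD_SIZE = 1536  # given by the task: 1,536 32-bit cells

# Main storage: large, but slow to reach from the core.
# (Its exact size is an assumption made for illustration.)
dram: list[int] = [0] * 100_000

# On-chip memory: fast, but tiny.
scratchpad: list[int] = [0] * SCRATCHPAD_SIZE
```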
A Bus connects the Scratchpad and DRAM, acting as the bridge for data transfer. Although the diagram shows Load and Store connected to "Free Space," they can actually read from and write to any register cell—including overwriting existing values. Here is a sketch of what that looks like in code:
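Names like Slot, scratch_addr, and core here are hypothetical stand-ins for the simulator’s actual abstractions; the logic, a plain element-by-element copy over the bus, is the part that matters.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    dram_addr: int     # where in DRAM the transfer starts
    scratch_addr: int  # where in the scratchpad it starts
    size: int          # number of 32-bit values to move

def execute_load(core, slot: Slot):
    # Pull `size` consecutive values over the bus: DRAM -> scratchpad.
    # `core` is assumed to hold the dram and scratchpad lists.
    for i in range(slot.size):
        core.scratchpad[slot.scratch_addr + i] = core.dram[slot.dram_addr + i]

def execute_store(core, slot: Slot):
    # The reverse direction: scratchpad -> DRAM.
    for i in range(slot.size):
        core.dram[slot.dram_addr + i] = core.scratchpad[slot.scratch_addr + i]
```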

(Note: core and slot are abstractions used here to store the necessary state, including arguments like the load and store addresses.)
A couple of important details: