Introduction


With the release of Claude Opus 4.5, Anthropic famously retired “a notoriously difficult take-home exam.” Why? Because the model scored higher than any human candidate ever had, and in just two hours of work. Recently, the company released a version of this assignment along with a blog post describing its design principles.

The author, Tristan Hume, wanted to create "something genuinely engaging that would make candidates excited to participate," not something "filled with generic problems." And boy, did he pull it off: I spent far more time on this task than I planned simply because it was fun to figure out, even though I never even aimed for the performance optimization team.

In this post, I will break down the challenge from start to finish. We will look at the provided processor architecture, the algorithm we need to optimize, and the three key techniques that yield the biggest performance boost (a 65x speedup!). You don't need a background in performance optimization or processor architecture: I will explain everything from scratch, including SIMD, VLIW, and other scary-sounding things.

The Hardware

Of course, the company doesn’t mail you a physical chip. The whole "processor" is just a Python script that models the behavior of a fictional device.

Formally, the task is to optimize code for a fake accelerator whose characteristics resemble a TPU. It is a sandbox that mimics Google's real AI chips, the ones used for training and running inference on neural networks (their usage has expanded well beyond Google; Anthropic and even OpenAI use them now).

The simulated machine includes features that make accelerator optimization interesting, such as SIMD lanes, VLIW-style instruction scheduling, and an explicitly managed scratchpad memory. These features also hint at the types of optimizations we will need to apply.

Don't sweat it if this sounds too complex. We will break it down in detail. Here is a diagram to help:

[Diagram: schematic of the machine. DRAM on the left, connected by a Bus to the Scratchpad inside the accelerator core; Load / Store arrows point at the Scratchpad's free space.]

This is a schematic view of our machine. On the left, we have DRAM, which stores everything. Getting data from this memory is faster than reading from a hard drive. However, there is no hard drive in this task since the data volumes are small, and that is not the point. In the provided simulator code, DRAM is just a list[int].

DRAM is connected to the Scratchpad, the memory living inside the accelerator core.

It is crucial to understand that the Scratchpad is small. We can't keep everything in it at once, so we have to be smart about the order in which data is loaded in and written back out. We get 1,536 cells of 32-bit values. In general these could be floats or integers, but for the rest of this post we will assume we are working only with 32-bit integers.

In the code, the Scratchpad is just a list[int] = [0] * 1536. That's plenty of room if you are working with 2, 3, or 4 elements, but tiny if your algorithm needs to process hundreds of thousands of items.
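For concreteness, here is a minimal sketch of that memory model in plain Python. The variable names and the DRAM size are my assumptions; only the 1,536-cell Scratchpad size comes from the task.

```python
SCRATCHPAD_SIZE = 1536  # given by the task: 1,536 cells of 32-bit values
DRAM_SIZE = 100_000     # assumption: just "big enough" for the inputs

dram: list[int] = [0] * DRAM_SIZE              # large, slower main memory
scratchpad: list[int] = [0] * SCRATCHPAD_SIZE  # small, fast on-core memory
```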

A Bus connects the Scratchpad and DRAM. It is the communication channel that moves data between the two. Even though the diagram shows Load / Store connected to "Free Space," these operations can actually read from and write to any Scratchpad cell, including overwriting existing values.
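Sketched in code, the load/store handling might look something like this. This is a reconstruction under my assumptions, not the simulator's actual source; the Core and Slot shapes simply follow the parenthetical note below.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    kind: str       # "load" or "store"
    sp_addr: int    # which Scratchpad cell to touch
    dram_addr: int  # which DRAM cell to touch

@dataclass
class Core:
    dram: list[int] = field(default_factory=lambda: [0] * 100_000)
    scratchpad: list[int] = field(default_factory=lambda: [0] * 1536)

def execute_memory_slot(core: Core, slot: Slot) -> None:
    if slot.kind == "load":     # bus transfer: DRAM -> Scratchpad
        core.scratchpad[slot.sp_addr] = core.dram[slot.dram_addr]
    elif slot.kind == "store":  # bus transfer: Scratchpad -> DRAM
        core.dram[slot.dram_addr] = core.scratchpad[slot.sp_addr]
```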

(here core and slot are just abstractions for storing the necessary info, including operation arguments like load and store addresses)

A couple of things worth noting: