LLMs struggle to explain themselves; or, StackBee: a stack language with 4-bit instructions for procedurally generating number sequences

LLMs can sometimes recognize number patterns, but can they explain their reasoning? See for yourself!

The interactive demo below generates a random program and uses that to compute three number sequences. The LLM is given two of those as examples and asked to pick the third out of a lineup. Click expand to see the actual messages sent to/from the LLM. You can run things yourself if you click Settings and enter an Anthropic API key: check out the FAQ for more details!

FAQ

Q. Why didn’t you ask the LLM to write out its reasoning first/give it tools/multiple turns before it outputs its final answer?

If you click expand next to any of the LLM rows, you can all the messages sent to the API. In all of the examples, Alice takes multiple turns and uses the eval_js tool. In some of the examples (e.g. #1) you can see that she even writes out a hypothesis before testing it with code, which I think is pretty clever.

It’s a good idea to experiment with the prompts though! You don’t need to change any of the code to do that: just change the value of prompts key in the settings. See “Q. How can I run this myself” for details.

Q. What’s going on here?

When you click Run:

We generate programs in a compact stack language where each instruction is encoded as 4 bits. See the stack language section for more information. The difficulty option in the settings corresponds to the maximum generated program length. To search for programs, we just iterate over or sample integers from between zero and $2^{4 \times \text{difficulty}}$ (equivalently, all bitstrings with up to $4 \times \text{difficulty}$ bits).
We reject programs that are identified by static analysis to be “obviously” boring or confusing. See the static analysis section for more information. The current analysis is relatively unsophisticated, but it is still able to approximate that e.g. the program - 1 bgtz 8 - will place $\{t_1 \; t_0 \; -\}$ at the top of the stack. $t_i$ denotes a symbolic value corresponding to the $i$-th value to be popped off the top of the stack. The curly braces denote that the top of the stack might correspond to any symbolic expression from a set of possible symbolic expressions. In this case, the analysis is able to detect that the bgtz will always branch because the constant $1$ is pushed immediately beforehand.
We search for a problem (a problem is a program, examples, and a list of choices, some incorrect) by evaluating programs on seeds, which are the first two numbers in each sequence. It’s possible for boring programs to slip through the static analysis, so we take this opportunity to reject programs that e.g. generate constant or almost-constant number sequences. We also reject programs that generate non-integer terms, or terms that are too big; we don’t want to be too mean to the LLM!
We ask an LLM (which I will happily anthropomorphize as Alice) to pick the sequence out of a lineup and to tell us what it thinks the pattern is. The demo uses the API configured under the api settings key (currently Anthropic’s Claude 3.5 Sonnet).
We ask a second LLM (Bob) to pick the sequence out of a lineup, but we only give it the pattern output by Alice. It doesn’t get the two examples.
We ask a third LLM (Charlie, who checks) whether Alice and Bob’s choices (if any) match the two examples and their choices.

Q. How can I run this myself?

First, some big warnings (repeated in the settings panel!):

🚨 Many of the values in the settings can contain arbitrary JavaScript which is run inside of your browser!

🚨 The default eval_js tool allows the LLM itself to run arbitrary JavaScript in your browser!

So it’s probably a good idea not to connect this demo to a malevolent superintelligent LLM that knows zero-day exploits for your browser. Having said that:

To provide an Anthropic API key, go to api.headers.x-api-key and replace the {{TODO: ...}} placeholder. The settings are just YAML. The api section comes after the prompts so you’ll have to scroll a little.

You can change a lot through the settings: not just the prompts, but also the HTTP requests the demo makes (by default it’s configured to use Claude 3.5 Sonnet by Anthropic). For example, to remove the eval_js tool, you can set "tools": [] in the value of api.body.

You should be able to configure the demo to hit e.g. OpenAI or Mistral’s APIs just by changing settings. You’d have to change the JavaScript under api.parse function to handle their output. Just make sure that your function hews to the same format:

Return { action: 'stop', append, output } to return the string output from the call to the LLM and append append to the list of messages.
Return { action: 'reply', append } to append append to the list of messages and then hit the API again. This is for tool calling.

The demo does not save anything, so be careful about refreshing! I’d recommend working on the settings YAML in your editor of choice so you get syntax highlighting etc.

Q. What do you think this all means?!

My takeaway is that although LLMs are powerful, their reasoning ability is limited in ways that are often unintuitive. Even when the LLM picks the right number sequence, the pattern it gives is generally nonsense. And even Bob and Charlie generally recognize that Alice’s pattern is wrong! Also, as usual, the LLM sounds very confident even when it’s dead wrong.

It’s interesting to me that in spite of the fact that the LLM does so poorly with number sequences in general, it does pretty well with variations of the Fibonacci sequence.

I was also surprised that the patterns (in a sense explanations) were so low-quality. Given the architecture of autoregressive LLMs, if the explanation is output after the choice, the text of the explanation has no impact on its choice. But it’s still surprising that the explanations can be so bad:

even after multiple rounds of conversation, where the explanation does appear before its final choice, and
given that the explanation and choice are generated by the same LLM.

Q. Why did you do any of this?

One day in March, I was walking my dog, saw a house numbered 1347, and thought it was a funny pattern. It’s the Fibonacci sequence with a different seed (13 instead of 11).¹ I texted a friend about it and then got interested in the problem of automatically generating number patterns.

At the end of March I was going to try set up a system so Amazon Mechanical Turk workers could compete against LLMs. The infrastructure for that was so soul-crushingly boring I dropped the project. Last week I decided I should just make a simple UI and release it for people to play around with it themselves.

Q. Isn’t it obvious that LLMs wouldn’t be able to do this because…

I’m glad it’s obvious to you!

I heard that LLMs did pretty well on standardized tests like the SAT and GRE and could already reason as well as college students, but I might have just hallucinated that. I’m glad I did: it would have been kinda misleading for people to say things like that!

I am optimistic that computers will be better at this task someday (perhaps soon)! Obviously it’s incredibly impressive that LLMs are able to do any of this at all, but even good science fiction authors discuss their fictional technologies’ limitations. In particular, seeing as the LLM is good at recognizing variations on the Fibonacci sequence, I wonder how well a model trained exclusively to predict number sequences generated by programs like this could do. How would e.g. the maximum difficulty where an LLM could solve at least half of the problems correlate to the model’s size?

It’s worth noting that if you wanted to cheat at this demo, you could trivially brute-force it: at the default difficulty of 5 you would only need to iterate over $2^{4 \times 5} \approx 1\text{M}$ programs. Your smart fridge could do that.

Q. Isn’t it obvious that LLMs wouldn’t be able to recognize these patterns when I, a human, can’t recognize them?

Maybe you should conclude that you aren’t an (A)GI either! Just kidding, ha ha.

It doesn’t make much sense to argue about whether something is intuitive or not! But I’m happy to walk through how I think about the first example:

Program: - 1 bgtz 8 - (51f85)

Here are 2 number sequences that follow a common pattern:

6, 8, -2, 10, -12, 22, -34, 56
8, 2, 6, -4, 10, -14, 24, -38

Which one of the following sequences follows the same pattern?

5, 10, 17, 25, 34, 44, 55, 67
5, 10, 6, 2, -2, -6, -10, -14
5, 10, -5, 15, -20, 35, -55, 90
5, 10, 7, 4, 1, -2, -5, -8

Personally, when I look at the two examples, I notice that:

The terms after the first two terms alternate in sign. Note that the default prompt (click Settings) does tell the LLMs that the first two terms are fixed! (A) and (D) don’t do this.
The absolute values of the terms always increases. (B) and (D) don’t do this.

At this point I still don’t know a mathematical rule for the pattern. But by process of elimination I’m left with (C), which turns out to be correct.

The mathematical rule is that you subtract the last term from the second-to-last term. The program is far from canonical: in fact, everything after the first instruction is a noop. But remember that we’re not asking anyone to come up with a precise mathematical rule or to reverse-engineer the original program!

Among many other potential experiments, it would be interesting to see if it’s useful to tell the LLM about some of these tricks in the prompt. Given that it has an eval_js tool to evaluate arbitrary JavaScript, we could even give it a library of functions that would be useful for evaluating number sequences. Alas, I already dropped this project for months due to getting bored—the backend was already finished in March—so I’m just releasing it as-is.

Q. Didn’t this blog post used to have a shorter title?

Indeed it did! The original title didn’t mention StackBee (the stack language). When I was about to submit this post to Hacker News², I spent some time trying to make the title as short as possible. Recently I discovered that there was a paper that was submitted to ICLR in 2023 (before I started work on StackBee!) with the title “Language Models Struggle to Explain Themselves”! Their approach is also similar! Instead of generating number patterns, they generate sequences of words using rules like:

The word ‘lizard’ appears in the input: matches “dog cat lizard goat sheep”, does not match “cat capybara dog bird koala”
Contains the word ‘cat’ and a digit: matches “pig dog 4 cat lizard”, does not match “pig dog 4 gato lizard”

Stack language

The “top” of the stack is on the right. When I write

$$\cdots \rightarrow \cdots 0$$

I mean that the when the operator/instruction 0 is executed, then $0$ is pushed to the top of the stack. I tried to stick to Forth names and semantics.

Kind	Hex	Operator	Stack semantics
Constants	`0`	`0`	$\cdots \rightarrow \cdots 0$
	`1`	`1`	$\cdots \rightarrow \cdots 1$
	`2`	`2`	$\cdots \rightarrow \cdots 2$
	`3`	`3`	$\cdots \rightarrow \cdots 3$
Arithmetic	`4`	`+`	$\cdots x y \rightarrow \cdots (x + y)$
	`5`	`-`	$\cdots x y \rightarrow \cdots (x - y)$
	`6`	`*`	$\cdots x y \rightarrow \cdots (x \times y)$
	`7`	`/`	$\cdots x y \rightarrow \cdots (x \div y)$
	`8`	`%`	$\cdots x y \rightarrow \cdots (x\mod y)$
Stack	`9`	`dup`	$\cdots x \rightarrow x x$
	`a`	`drop`	$\cdots x \rightarrow \cdots$
	`b`	`swap`	$\cdots x y \rightarrow \cdots y x$
	`c`	`rot`	$y \cdots \rightarrow \cdots y$
	`d`	`unrot`	$\cdots x \rightarrow x \cdots$
	`e`	`len`	$\cdots \rightarrow \cdots \operatorname{len}(\cdots)$
Control	`f`	`bgtz n`	$\cdots x \rightarrow \cdots$

bgtz is the only control flow operator and the only multi-word operator (i.e. the only operator that takes more than 4 bits). The following word in the program defines the branch offset and is interpreted as a number, not an opcode. This means that you can skip up to 15 instructions following the offset; bgtz 0 is a noop. Execution jumps by n instructions when $x > 0$ and continues otherwise.

The sequence generated by a program is not just the stack after the program has executed. To generate the $(n+1)$-th term of a sequence, we initialize the stack with the $n$ previous terms; the $n$-th term is at the top of the stack. After running the program, the top of the stack becomes the $(n+1)$-th term of the sequence. In the demo, the stack is initialized with the same two seed numbers for all of the choices presented to the LLM.

You can play with the interpreter and analyzer in your browser’s developer console. Use stackbee.evaluate("+", "1 2") to evaluate the program + on the stack 1 2. You can write rational numbers too:

stackbee.evaluate("/ 1 +", "1 2")
// '3/2'

// Tracing execution
stackbee.trace("/ 1 +", "1 2")
// stack   ops
// 1 2     / 1 +
// 1/2     1 +
// 1/2 1   +
// 3/2
//
// '3/2'

// Static analysis
stackbee.analyze("/ 1 +")
// stack             ops
// ⋯                / 1 +
// ⋯ {t₁ t₀ /}      1 +
// ⋯ {t₁ t₀ /} {1}  +
// ⋯ {t₁ t₀ / 1 +}
//
// ' ⋯ {t₁ t₀ / 1 +}'

Example programs

Doubling

2 *

This program just doubles whatever’s on top of the stack. Given the seed $0, 4$, for example, it will generate the number sequence: $0, 4, 8, 16, \ldots$

Fibonacci

This program adds the top two numbers on the stack. A simple program for a simple sequence: and one it seems likely that LLMs have learned a specific circuit for, given how often they miscategorize sequences as Fibonacci! Given the seed $1, 2$ we recover the classic Fibonacci sequence: $1, 2, 3, 5, 8, \ldots$

Catalan numbers

len 2 * 1 - 2 * * len 1 + /

Catalan numbers show up a lot when counting things.

When the stack contains the $n-1$ previous Catalan numbers, after this program is executed, the top of the stack will be the $n$-th Catalan number. That means when the seed is $1, 1$ (the first two Catalan numbers), the program will generate the Catalan numbers: $1, 1, 2, 5, \ldots$

We use the following recurrence:

C_n = \frac{2(2n-1)}{n+1}C_{n-1}

The execution trace is:

Stack	Program
$\cdots \quad C_{n-1}$	`len 2 * ⋯`
$\cdots \quad C_{n-1} \quad 2n$	`1 - ⋯`
$\cdots \quad C_{n-1} \quad 2n-1$	`2 * ⋯`
$\cdots \quad C_{n-1} \quad 2(2n-1)$	`* ⋯`
$\cdots \quad 2(2n-1)C_{n-1}$	`len 1 + ⋯`
$\cdots \quad 2(2n-1)C_{n-1} \quad n+1$	`/`
$\cdots \quad C_n$

Technically only the top of the stack needs to be the $(n-1)$-th Catalan number, the other $n-2$ values can be whatever.

Collatz sequence

dup 2 % bgtz 5 2 / 1 bgtz 4 3 * 1 +

To apply the Collatz function $C$ to a number $n$:

if the number is even, divide it by 2
if the number is odd, multiply it by 3 then add 1

The Collatz sequence (or hailstone sequence) for a number $n$ is $n, C(n), C(C(n)), \cdots$. The Collatz conjecture is that the Collatz sequence for every positive integer eventually reaches 1.

The program looks funky but is actually simpler than the Catalan numbers program:

Is even?	Even	(go to end)	Odd
`dup 2 % bgtz 5`	`2 /`	`1 bgtz 4`	`3 * 1 +`

The 1 bgtz 4 jumps to the end of the program so that the even case doesn’t fall through to the odd case. I’m sure a better instruction set is possible; it’s just tough to fit all the instructions in 4 bits.

This program is 14 instructions long (so 56 bits). Unfortunately this is just one over the maximum size of generated programs in the demo (13), which is limited by MAX_SAFE_INTEGER in JavaScript ($2^{53} - 1$). I apologize if you are reading this in the year 2525 and your computer would otherwise be fast enough to generate this program and test it against an LLM. That would have been fun.

Static analysis

The symbolic executor exists for two purposes:

To help humans (namely, me!) understand and verify programs, which—by virtue of being randomly generated—are often inscrutable.
To help the computer generate good problems: namely, ones that aren’t “boring” or “confusing”, which will be defined later.

The language and the programs are both puny, which means that my primitive static analysis does OK. Conveniently, we can represent symbolic expressions as programs themselves! For example, the symbolic expression representing the first unknown value popped from the stack is $t_0$; the symbolic expression representing that value plus two is $t_0 \; 2 \; +$.

SymExpr = SymVal[]
SymVal = Op | Sym
Sym = $t_i, \; b_j, \; \ell_k$
$t_i$ is the $i$-th unknown value popped from the stack
$b_j$ is the $j$-th unknown value from the bottom of the stack
$\ell_k$ is the length of the stack at the $k$-th time the len operator was evaluated

Values on the stack are represented as sets of symbolic expressions; branching execution paths enlarge sets.

SymExprSet = SymExpr[]

So a symbolic expression set of $\{0, 1\}$ means that the concrete value might be either zero or one.

The symbolic stack itself is represented as two lists: a list of known values at the top, and a list of known values at the bottom.

We can write the stack like this:

x y \cdots z w

This means $x$ and $y$ are at the bottom and $z$ and $w$ are at the top. The analysis assumes that the length of the stack is unknown, so there may be zero or more values separating the top and bottom.

An implication of that is that the following stack might have only one value $x$ at both the bottom and the top:

\cdots x

The length of the stack being unknown has a few implications. For example, rot has two obvious cases:

\dfrac{x \cdots \quad {\tt rot}}{\cdots x} \; \text{rot-1}

\dfrac{\cdots \quad {\tt rot}}{\cdots b_i} \; \text{rot-3}

… and then a third case (which I’m calling $\text{rot-2}$ to indicate precedence), which is a little funky:

\dfrac{\cdots x \quad {\tt rot}}{\cdots b_i} \; \text{rot-2}

The first rule says “if the stack starts out with $x$ at the bottom, then after rot, $x$ will be (rotated) to the top.” The second rule says “if we don’t know anything about the stack, then after rot, we generate a new symbolic value $b_i$ to place at the top of the stack.”

In the third rule, why is there an $x$ at the top of the stack? And why does it disappear afterwards? The reason is that $\cdots x$ might actually be a one-value stack $x$; remember that we don’t know anything about the size of $\cdots$. So $x$ is at both the bottom and top of the stack! If we didn’t delete $x$ and instead set the symbolic stack to $\cdots x b_i$ afterwards, we might mistakenly think the stack contains at least two values now.

Boring

We say that a program is boring if any of the following are true:

it’s surely constant, i.e. the top symbolic expression set is something like $\{1\}$ (and not $\{1, 2\}$)
the symbolic stack has no values at the top
the top of the stack has exactly one value with exactly one symbolic expression, e.g. $\cdots \{t_0\}$

There’s no reason there couldn’t be more or fewer rules; this is just what worked OK in testing.

Confusing

We say that a program is confusing if it will definitely underflow the stack. We can only approximate this, but we currently look at the top of the stack and check whether

i + j + 2 > {\tt SEED\_LEN}

where $i$ is the highest index of the symbolic $t_i$ value we see, and $j$ is the highest index of the symbolic $b_j$ values we see. If we see $t_i$ that means at most $i+1$ values were consumed from the top (and likewise with $b_j$ and the bottom), so $i + j + 2$ gives us an estimate of the number of values consumed from the stack.

There is also no particular reason for this rule other than that it works OK. The reason I introduced it is because I noticed that because programs which underflow the stack tend to make for confusing number sequences. pop on an empty stack evaluates to $0$, so you can get number sequences which suddenly seem to change rules some number of terms in, after the stack has had a chance to grow. It’s also a little bit silly because we could track more information about the symbolic stack and avoid having to guess.

Improvements

The static analysis of this program illustrates many of the current system’s shortcomings:

Program: rot * unrot 2 + (c6d24)

Analysis: $\{t_1 \; 2 \; +\}$

Stack	Ops
$\cdots$	`rot * unrot 2 +`
$\cdots \quad \{b_0\}$	`* unrot 2 +`
$\cdots \quad \{t_0 \; b_0 \; *\}$	`unrot 2 +`
$\{t_0 \; b_0 \; *\} \quad \cdots$	`2 +`
$\{t_0 \; b_0 \; * \} \quad \cdots \quad \{2\}$	`+`
$\cdots \quad \{t_1 \; 2 \; + \}$

Example number sequences:

12, 11, 134, 13, 136, 15, 138, 17
6, 6, 38, 8, 40, 10, 42, 12

Because the size of the stack $\cdots$ is unknown, when we see the final + we have to trash the bottom of the stack, viz. $\{t_0 \; b_0 \; * \}$.

Even if the stack size is unknown, we could use a set of two symbolic expressions, {t₀ b₀ * 2 +, t₁ 2 +}, to represent the value at the top of the stack (although this would make the analyses of other programs noisier).

The analysis is still valid; the next entry is always t₁ 2 +, insofar as t₁ just means the second item popped off the top of the stack.

We should arguably detect that this program is confusing because its behavior differs depending on whether the stack is initialized with two vs. three values. We already try to reject programs that underflow the stack, but (again) because the analysis currently doesn’t make any assumptions about the stack size, we approximate this by just looking at the subscripts of $t_i$ and $b_j$ values in the final expression.

Miscellanea

The values on the stack are actually rational numbers $p/q$, but I figured this would make things too complicated for the LLM, so the problems that I generate only involve integers.
We do a little bit of constant folding (e.g. $1 \; 1 \; /$ and even $t_i \; t_i \; /$ evaluate to $1$) which seems to help a bit.
Evaluating bgtz results in two symbolic stacks, which we merge. This can create horrifying symbolic stacks. At the default difficulty of 5 you will very rarely see programs with bgtzs that actually do anything. The symbolic stack for the Collatz program is pretty, though:
```
\cdots \{t_0 \; 3 * 1 \; +, t_0 \; 2 \; /\}
```
It’s disappointing that as you grow the length of the generated programs, their complexity (seems to) grow at a slower rate. By the time you get to programs as long as the Collatz function many programs consist mostly of noops. A better ISA might fix this. I experimented with term-rewriting systems at the beginning of the project but they didn’t seem too promising.

Random reflections

YAML was nice to avoid getting bogged down in writing a settings UI. But YAML’s multiline string sigils are very confusing.
eval for templates is horrible, but it definitely saved a lot of time. It’s cool that browsers are hardened to the point where I’m fine with it.
Using JSON.stringify as a hack for doing structural comparisons (when you can control key ordering) is good for demos. But it really makes you miss BEAM values.
I tried writing the UI without React; it’s been a long time since I’ve used jQuery. It would have helped a little, but the UI is very simple; the main interactivity is with the <details> elements. Having to manually restore whether they are open or not after I re-render is definitely kind of annoying. At least half of the benefit would have been just JSX, even without React.

Time to say goodbye

There’s a lot more I could write about but I figure very few people will read this far anyways. If you did, you’re amazing and I appreciate you!

This was a side project to distract from actual work so I don’t know how much more time I can spend on it, unfortunately.

The source code for the demo (including program execution, analysis, and generation) is here. I’ve licensed it under the BSD 3-clause license. Copyright acknowledgments for libraries used are here.

Lastly, I’m writing this blog post on the third anniversary of the unexpected death of my dad in the year he was planning to retire. I’m embarrassed to foist unsolicited advice upon others, but the anniversary reminds me to think about whether I’d regret it if today were the day I died. If this is my last day and I never retire, I’d prefer to save my energy for the many good people out there.

A helpful Hacker News commenter pointed out that in an earlier version of this post I managed to write “31” instead of “13” twice, which is ironic. Now I’m also unsure whether the house actually was 3147. Then the pattern would have been “add the first term to the last term” (unrot +) instead of the Fibonacci sequence (+). That’s what I get for writing at 3am.

The reason I don’t link to either the Hacker News or Lobsters discussions of this post is because both devolved into flamewars! As far as I can tell, no one in the comments actually tried any prompts of their own.