Short answers revisited

As someone who follows open-weight (and locally-runnable) language model releases fairly closely nowadays, it's been interesting to see how things have developed so quickly. Having had a real motivation to prompt for short answers previously, due to having more limited hardware at the time, I still seem to keep returning to that idea with newer models mostly out of curiosity. It's worth noting that the popular chain-of-thoughts approach for improving response quality by adding context runs completely counter to this; as such, you may not even want a short answer. But there are reasons you might, and if nothing else, I think it's a nice option to have.

Originally, back in the dim and distant past of (checks notes) eight months ago, I had a couple of loosely-indicative prompts that worked well. Stone tablets, telegrams, simple enough stuff - but on many 7B models around that time it would be more effective than a literal approach. For smaller models than that, strongly-defined user/assistant roles with a few-shot prompt giving examples tended to be more effective, but I find few-shot can be troublesome because it seems particularly prone to having the model pick up on some subtle aspect of the examples that you'd rather it didn't. In any case, that approach did seem better for sub-7B models.

After getting modest hardware slightly more suited to running LLMs, I've been using models like Google's Gemma 2 27B, and Qwen 2.5 14B, at pretty standard quantisation levels. Usually with models of that size you don't need to bother with roundabout approaches or few-shot, you can tell it what to do more directly. So a short-answer system prompt (or initial prompt) can just be something as basic as this:

You give the shortest possible neutral responses as a helpful assistant.

Boring but effective. What I find interesting about this now is how it helps illustrate recent developments. A few months ago (as of October 2024), this approach started giving good results on 7B/8B models. Then an occasional 3B model would work. Most recently, I tried it on Meta's Llama 3.2 1B (via llama.cpp with a Q6_K quant). Just 1.23 billion parameters. And... it worked. Sure, the model doesn't always give the most perfect responses to say the least, but it's smaller than one of the GPT-2 models was. Here's how the length looks:

> Describe GPT-3.
GPT-3 is the third version of the company's GPT multimodal large language model. It was developed and tested by OpenAI.
> Explain how leap years work.
A leap year is an extra day added to the calendar every 4 years to keep it in sync with Earth's orbit around the Sun.
> How many Rs are there in r/localllama?
None

(I find "Describe GPT-3" to be a particularly handy way to check response length, as so many models will happily go on and on about it if you let them.)

It's not the accuracy or otherwise of the answers which interests me - probably for the best in this case - more the fact that a tiny 1B model appears to be following that system prompt to a reasonable degree, or at least giving that impression. (The 3B model does too, and gives much better responses as you might imagine.) I distinctly remember that, earlier in the year, a certain high-profile 14B model struggled to do anything sensible with such a prompt.

When I wrote my previous short-answers page, I fully expected it to quickly become dated, and sure enough that now feels about as current as a 19th-century newspaper. But for me, given the implied progress, my "old" prompting workaround becoming completely obsolete in well under a year is oddly satisfying.

(As ever, I should note that this page was probably already outdated by the time I finished writing it. I should also clarify that I'm really referring to the instruct tunes of the models mentioned, as you might imagine given the context.)

Last updated on 2024 Oct 25