Getting a typical large language model (perhaps the most famous form of AI) to give short answers can be quite tricky. They have often been trained and/or prompted to give relatively verbose answers, sometimes including the odd bit of sensible but tedious and repetitive lecturing. When running a language model locally, especially on limited hardware, waiting for such long and unnecessarily wordy responses can get a bit frustrating, so I think it's worth using a prompt which encourages brevity and neutrality.
I've played about with quite a few approaches to this personally, and at the time of writing the system prompt I sometimes use for shorter and more neutral answers is this:
There's an example of this being used later on. Obviously the key sentence is the "Shortest" one, so to speak, but the others do seem to help - even if this means it's essentially saying "short neutral short short short". Other possible options are to refer to e.g. an "actual telegram" rather than a stone tablet, replace the other sentences with "Terse with essentials only", add "Costs you billions of dollars per word", and so on. It all quickly gets very silly, yet there can still be an impact.
The prompt above seems to work fairly well in my experience, and you might even have some luck using it in situations where you can't set a system prompt. That said, there's no guarantee that a prompt like this will always give useful results, as so much depends on the model being used. For example, one model I tried would give multiple lengthy "explanatory notes" after the short answer, defeating the purpose of having it give such an answer in the first place.
The language model I've been using most with this (again at the time of writing, in February 2024) is dolphin-2.6-mistral-7b, which I believe is a fine-tune of the seven-billion-parameter "Mistral 7B" model. I use a heavily-quantised "Q3_K_S" version barely squeezed onto a Raspberry Pi 400, running it with llama.cpp's "main" in interactive and ChatML modes with "-i -cml", and using the "-p" option to supply the prompt in quoted form (as well as "-c 1024" for an enlarged but still quite small context window). The quantisation does affect the quality of the responses given - usually you wouldn't use a quantisation this severe, but it's more or less the only way to run a 7B model on this slightly retro Pi.
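Putting those flags together, the invocation looks roughly like this - a sketch rather than an exact recipe, as the model filename and the prompt text shown are placeholders to be replaced with your own:

```shell
# Rough sketch of the llama.cpp invocation described above.
# The model filename and the prompt text are placeholders -
# substitute your own quantised model and brevity prompt.
#   -i -cml : interactive mode with ChatML prompt formatting
#   -c 1024 : enlarged (but still quite small) context window
#   -p      : supplies the system prompt in quoted form
./main -m dolphin-2.6-mistral-7b.Q3_K_S.gguf \
       -i -cml -c 1024 \
       -p 'Your brevity-encouraging system prompt here.'
```

Appending "--temp 0 -s 1" on top of this gives the deterministic sampling used for the test questions below.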
Here are responses to a few test questions in a single session, in this case using zero temperature and a fixed seed with "--temp 0 -s 1". The responses give some idea of how the (quantised) model performs with the prompt, and may possibly also hint at some of the limits of the training data and fine-tuning:
(As can often happen with language models, some of these answers are incorrect or misleading.)
I don't know if I'd actually want to carve that lot into a stone tablet, but I think generally it's not a bad result.
So, if you end up running an LLM locally on something which can feel like it's giving you about one token per week, maybe try a prompt like the one above if you haven't already. It might help, even if it'll likely need some tweaking to suit the model you're using.
(NB: This page may quickly get outdated, not that I mind having the odd bit of outdated content on here. At any rate, it's possible that the general approach may remain useful at least.)