DISCLAIMER

No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.

Nap's AI/LLM generation FAQ

This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:

  1. It's a really fast-moving target
  2. It's really complicated

None of us are happy about the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.

What's the "best" model? How do I pick one?

There are many relevant considerations:

  • Probably most important is what hardware you have.
    • GPUs are much faster IF you can load your entire model into VRAM. (Don't ask yet; we'll get there.)
    • CPUs usually have access to more RAM, but CPU inference is much slower.

Anecdotal reports

Please create a section for yourself using double-equals and put your notes there.

If your source of truth is someplace else (e.g., your blog or social media feed) we could just add a link, and then your report here never has to get updated and it never goes stale.

naptastic

Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there's no context, generation runs at 50-60 tokens/second. Less-polite interactions run more slowly. More context runs more slowly. Making a standard set of prompts for benchmarking speed is on my to-do list.
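
Until that benchmark to-do gets done, here's a minimal sketch of what such a script could look like. It assumes a local backend that exposes an OpenAI-compatible /v1/completions endpoint on port 5000; the URL, port, prompts, and token counts are all placeholder assumptions, so adjust them for whatever you actually run.

  # Rough tokens/second benchmark against a local OpenAI-compatible
  # /v1/completions endpoint. URL, port, and prompts are placeholders.
  import time
  import requests

  API_URL = "http://127.0.0.1:5000/v1/completions"  # assumed local endpoint

  PROMPTS = [
      "Explain the difference between GPTQ and GGUF in two sentences.",
      "Write a limerick about running out of VRAM.",
      "Summarize why CPU inference is slower than GPU inference.",
  ]

  def bench(prompt, max_tokens=200):
      payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}
      start = time.time()
      r = requests.post(API_URL, json=payload, timeout=600)
      r.raise_for_status()
      elapsed = time.time() - start
      # OpenAI-compatible servers report how many tokens they generated.
      generated = r.json()["usage"]["completion_tokens"]
      return generated / elapsed

  if __name__ == "__main__":
      for p in PROMPTS:
          print(f"{bench(p):6.1f} tok/s  {p[:50]}")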

I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

What hardware works?

  • Nvidia newer than FIXME
    • 30XX and 40XX are known to work. (10XX? 20XX? I don't know.)
  • AMD newer than FIXME
    • Things known not to work: multi-GPU.
  • Intel
    • I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
  • CPU
    • Works, but slowly. (See the sketch after this list to check which backends PyTorch can actually see on your machine.)
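
Not sure what your machine actually exposes? Here's a quick diagnostic sketch using PyTorch. The Intel part assumes the intel-extension-for-pytorch package is installed (that package name and the torch.xpu check are assumptions about Intel's stack, not something tested here); AMD cards on a ROCm build of PyTorch show up through the regular torch.cuda calls.

  # Quick check of which accelerators PyTorch can see on this machine.
  import torch

  if torch.cuda.is_available():
      for i in range(torch.cuda.device_count()):
          props = torch.cuda.get_device_properties(i)
          print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB VRAM")
      # On ROCm builds, torch.version.hip is set and AMD cards appear as "cuda" devices.
      print("ROCm (HIP) build:", torch.version.hip is not None)
  else:
      print("No CUDA/ROCm device visible; CPU only.")

  try:
      # Assumed Intel stack: intel-extension-for-pytorch registers torch.xpu.
      import intel_extension_for_pytorch  # noqa: F401
      print("Intel XPU available:", torch.xpu.is_available())
  except ImportError:
      print("intel-extension-for-pytorch not installed; can't check for an A770.")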

So what's up with LLM formats?

  • HF - “HF means unquantised, float16 format. You almost never want that. It's the format that models come in when first released. It's good for further fine tuning/training, or as a base for quantisation, but isn't really recommended to be used for inference” -TheBloke
  • AWQ - you almost certainly don't want this. FIXME
  • GPTQ - “GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality” -TheBloke
  • GGML - “GGML is a format that was first designed exclusively for CPU-only inference. It is very accessible because it doesn't require a GPU at all, only CPU + RAM. But CPU inference is quite a bit slower than GPU inference. […] Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model. But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds” -TheBloke
  • GGUF - the successor to GGML in llama.cpp. Despite its CPU-first heritage it is not CPU-only: like late GGML, it can offload some or all layers to a GPU (see the sketch after this list).
  • exl2 - the new hotness: ExLlamaV2's own quantization format. (Why it's better is still an open question here.) At time of writing, TheBloke isn't publishing exl2 quants. (Remember this info might be out of date.)
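
To make the GPU-offload point concrete, here's a minimal sketch using llama-cpp-python to load a GGUF file. The model path is a placeholder, and n_gpu_layers only does anything if the package was built with GPU support (CUDA or ROCm); set it to 0 for pure CPU inference.

  # Load a GGUF model and offload part of it to the GPU.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
      n_ctx=4096,       # context window
      n_gpu_layers=35,  # 0 = pure CPU; raise it until you run out of VRAM
  )

  out = llm("Q: What is the GGUF format used for?\nA:", max_tokens=64)
  print(out["choices"][0]["text"])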

How much VRAM do I need?

  • Usually about 1 GB more than you have. (A rough back-of-the-envelope estimate for the weights alone is sketched below.)
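
Slightly more seriously: parameter count times bits-per-weight, divided by 8, gives a floor for the weights alone. The KV cache (which grows with context length) and framework overhead come on top, so treat the numbers from this sketch as a lower bound, not a promise.

  # Back-of-the-envelope VRAM floor for quantized weights only.
  # Real usage is higher: KV cache scales with context, plus overhead.
  def weight_gib(params_billion, bits_per_weight):
      return params_billion * 1e9 * bits_per_weight / 8 / 2**30

  for params, bits in [(7, 4), (7, 16), (13, 4), (70, 4)]:
      print(f"{params:>3}B @ {bits:>2}-bit ~ {weight_gib(params, bits):5.1f} GiB (weights only)")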

How fast does it go?

  • GPUs are fast; CPUs not so much. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves.

How do I make it faster?

Assuming that a hardware upgrade isn't an option:

  • flash-attention (a minimal loading sketch is after this list)
  • option tuning
  • tips for saving context
  • I haven't investigated the --sdp-attention option, but it's claimed to make things faster. Needs testing!
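
For HF-format models loaded through transformers, flash attention is a load-time option. This is a hedged sketch, not a recipe: it assumes a recent transformers plus the flash-attn package, the model ID is just an example, and older transformers versions spelled the flag use_flash_attention_2=True instead of attn_implementation.

  # Load an HF-format model with flash attention enabled (fp16 on GPU only).
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # example model

  tok = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.float16,
      device_map="auto",                          # spills to CPU if VRAM runs out
      attn_implementation="flash_attention_2",    # needs the flash-attn package
  )

  inputs = tok("Hello", return_tensors="pt").to(model.device)
  print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))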