DISCLAIMER
No guarantees are provided as to the accuracy of this information. I do my best, but things move fast and this is hecking hard to figure out.
Nap's AI/LLM generation FAQ
This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:
- It's a really fast-moving target
- It's really complicated
None of us are happy about the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.
What's the "best" model? How do I pick one?
There are many relevant considerations:
- Probably most important is what hardware you have.
- GPUs are much faster IF you can load your entire model into VRAM. (Don't ask yet; we'll get there.)
- CPUs usually have access to more RAM, but inference on them is very slow.
What hardware works?
- Nvidia
  - 30XX and 40XX series are known to work. (10XX? 20XX? Idk)
- AMD
  - Newer cards only; I haven't pinned down the exact cutoff.
  - Known not to work: multi-GPU
- Intel
  - I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
- CPU
  - Works, but it's slow. (A quick check of what your machine detects is sketched after this list.)
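If you just want to see what your machine detects, here is a minimal sketch using PyTorch (assuming you have a GPU-enabled build installed). It only tells you what PyTorch sees, not whether any particular loader will work with the card:
<code python>
# Quick check of what acceleration PyTorch can see on this machine.
# This only reports what your PyTorch build detects; it is not a
# guarantee that any given model loader will work with the card.
import torch

if torch.cuda.is_available():  # True for NVIDIA CUDA builds and AMD ROCm builds
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB VRAM")
else:
    print("No CUDA/ROCm device visible; plan on CPU-only inference.")
</code>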
So what's up with LLM formats?
- GGUF - the format used by llama.cpp and related loaders. Runs on the CPU, and can optionally offload some or all layers to a GPU if the build supports it. (Loading example after this list.)
- GPTQ - GPU only; if the model doesn't fit in VRAM, you can't load it. Works with AutoGPTQ and ExLlama-based loaders.
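As a concrete example of the GGUF side, here is a minimal sketch using llama-cpp-python. The model path is a placeholder, and GPU offload only does anything if your build was compiled with it:
<code python>
# Minimal sketch of loading a GGUF model with llama-cpp-python.
# The model path is a placeholder; n_gpu_layers only has an effect if
# your llama-cpp-python build was compiled with GPU support.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    n_ctx=2048,        # context window
    n_gpu_layers=0,    # raise this to offload layers to the GPU
)

out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
</code>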
How much VRAM do I need?
- Usually about 1 GB more than you have.
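A slightly more useful rule of thumb: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes, and the KV cache plus loader overhead come on top of that. A sketch (the function name and the 13B / 4-bit example are just illustrations):
<code python>
# Back-of-the-envelope size of just the weights for a quantized model.
# Real VRAM use is higher: the KV cache grows with context length, and
# every loader adds its own overhead, so treat this as a lower bound.
def rough_weight_size_gib(n_params_billion: float, bits_per_weight: float) -> float:
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Example: a 13B model quantized to roughly 4 bits per weight
print(f"{rough_weight_size_gib(13, 4):.1f} GiB")  # about 6.1 GiB before overhead
</code>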
How fast does it go?
- GPUs are fast; CPUs not so much. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves.
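If you want to contribute numbers, the measurement itself is simple: time a generation and divide tokens by seconds. Here is a sketch using llama-cpp-python (the model path and prompt are placeholders); results are only comparable if everyone holds the model, quantization, prompt, context size, and generation length constant:
<code python>
# Crude tokens-per-second measurement with llama-cpp-python.
# Only comparable across systems if the model, quantization, prompt,
# context size, and max_tokens are all held constant.
import time
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048)  # placeholder path

start = time.perf_counter()
out = llm("Write a limerick about VRAM.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tokens/s")
</code>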
How do I make it faster?
Assuming that a hardware upgrade isn't an option:
- flash-attention (one example sketched below)
- option tuning
- tips for saving context
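For the flash-attention item, one concrete knob I know of is in Hugging Face transformers: recent versions accept attn_implementation="flash_attention_2" if the flash-attn package is installed and your GPU supports it. The model id below is just an example; swap in whatever you actually run:
<code python>
# Sketch: asking transformers to use FlashAttention-2. Needs a recent
# transformers, the flash-attn package, accelerate (for device_map), and a
# supported GPU; it raises an error rather than silently falling back.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tok("Hello", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
</code>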