DISCLAIMER
No guarantees are provided as to the accuracy of this information. I do my best, but things move fast and this is hecking hard to figure out.
Nap's AI/LLM generation FAQ
This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:
- It's a really fast-moving target
- It's really complicated
None of us are happy about the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.
What's the "best" model? How do I pick one?
There are many relevant considerations:
- Probably most important is what hardware you have.
- GPUs are much faster IF you can load your entire model into VRAM. (Don't ask yet; we'll get there.)
- CPUs usually have access to more RAM, but inference on them is very slow.
What hardware works?
- Nvidia
  - 30XX and 40XX series are known to work. (10XX? 20XX? Idk)
- AMD
  - Newer cards only; I haven't pinned down the exact cutoff.
  - Known not to work: multi-GPU
- Intel
  - I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
- CPU
  - Works, but it's slow. (A quick check of what your machine detects is sketched after this list.)
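If you just want to see what your machine detects, here is a minimal sketch using PyTorch (assuming you have a GPU-enabled build installed). It only tells you what PyTorch sees, not whether any particular loader will work with the card:
<code python>
# Quick check of what acceleration PyTorch can see on this machine.
# This only reports what your PyTorch build detects; it is not a
# guarantee that any given model loader will work with the card.
import torch

if torch.cuda.is_available():  # True for NVIDIA CUDA builds and AMD ROCm builds
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB VRAM")
else:
    print("No CUDA/ROCm device visible; plan on CPU-only inference.")
</code>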
So what's up with LLM formats?
- GGUF - the format used by llama.cpp and related loaders. Runs on the CPU, and can optionally offload some or all layers to a GPU if the build supports it. (Loading example after this list.)
- GPTQ - GPU only; if the model doesn't fit in VRAM, you can't load it. Works with AutoGPTQ and ExLlama-based loaders.
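As a concrete example of the GGUF side, here is a minimal sketch using llama-cpp-python. The model path is a placeholder, and GPU offload only does anything if your build was compiled with it:
<code python>
# Minimal sketch of loading a GGUF model with llama-cpp-python.
# The model path is a placeholder; n_gpu_layers only has an effect if
# your llama-cpp-python build was compiled with GPU support.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    n_ctx=2048,        # context window
    n_gpu_layers=0,    # raise this to offload layers to the GPU
)

out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
</code>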
How much VRAM do I need?
- Usually about 1 GB more than you have.
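A slightly more useful rule of thumb: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes, and the KV cache plus loader overhead come on top of that. A sketch (the function name and the 13B / 4-bit example are just illustrations):
<code python>
# Back-of-the-envelope size of just the weights for a quantized model.
# Real VRAM use is higher: the KV cache grows with context length, and
# every loader adds its own overhead, so treat this as a lower bound.
def rough_weight_size_gib(n_params_billion: float, bits_per_weight: float) -> float:
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Example: a 13B model quantized to roughly 4 bits per weight
print(f"{rough_weight_size_gib(13, 4):.1f} GiB")  # about 6.1 GiB before overhead
</code>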
How fast does it go?
- GPUs are fast; CPUs not so much. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves.
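If you want to contribute numbers, the measurement itself is simple: time a generation and divide tokens by seconds. Here is a sketch using llama-cpp-python (the model path and prompt are placeholders); results are only comparable if everyone holds the model, quantization, prompt, context size, and generation length constant:
<code python>
# Crude tokens-per-second measurement with llama-cpp-python.
# Only comparable across systems if the model, quantization, prompt,
# context size, and max_tokens are all held constant.
import time
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048)  # placeholder path

start = time.perf_counter()
out = llm("Write a limerick about VRAM.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tokens/s")
</code>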
How do I make it faster?
Assuming that a hardware upgrade isn't an option:
- flash-attention (one example sketched below)
- option tuning
- tips for saving context
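For the flash-attention item, one concrete knob I know of is in Hugging Face transformers: recent versions accept attn_implementation="flash_attention_2" if the flash-attn package is installed and your GPU supports it. The model id below is just an example; swap in whatever you actually run:
<code python>
# Sketch: asking transformers to use FlashAttention-2. Needs a recent
# transformers, the flash-attn package, accelerate (for device_map), and a
# supported GPU; it raises an error rather than silently falling back.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tok("Hello", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
</code>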