
Welcome to naptastic's LLM/Generative AI FAQ

DISCLAIMER

No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.

This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:

  1. It's a really fast-moving target
  2. It's really complicated

None of us are happy about the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.

PHILOSOPHY

I very much subscribe to the “Stone Soup” philosophy of open-source. The problem is that everyone wants to be the person bringing the stone. But stone soup only needs one stone! We need tables, utensils, meat, vegetables, seasonings, firewood, and people to tend the fire and stir the pot and cut up the ingredients…

Please consider how many people have put how much time into generating and assembling this information. Yes, it's naptastic organizing the page (at least right now), but all the info is coming from other people. I do not want to scare people off from asking questions; without questions, I don't know what to put in the FAQ! But if you are going to bring questions, please also be willing to put some time in to test things and report back when you have successes.

Important note: YOU CAN USE AI TO HELP YOU WITH THIS!!! It's not cheating!

How can I help?

  1. SUCCEED!!! Get something working, even if it's not working as well as you want. Getting better and faster results is part of this FAQ too.
  2. Tell me about it! What hardware worked? What models? What problems did you encounter and how did you solve them? How fast does it generate?
  3. Contribute to the actual open-source projects. That is where the most work needs to be done.

What's the "best" model? How do I pick one?

There are many relevant considerations.

  • Probably most important is what hardware you have.
    • GPUs are much faster IF you can load your entire model into VRAM. (Don't ask yet; we'll get there. A quick way to check how much VRAM you have is sketched after this list.)
    • CPUs usually have access to more RAM, but CPU inference is much slower.
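
If you have an NVIDIA card and a CUDA build of PyTorch installed, a few lines of Python will tell you how much VRAM you're working with. This is just a quick sketch, not part of any particular loader:

  # Quick check: how much VRAM does each GPU have?
  # Assumes an NVIDIA card and a CUDA-enabled PyTorch install.
  import torch

  if torch.cuda.is_available():
      for i in range(torch.cuda.device_count()):
          props = torch.cuda.get_device_properties(i)
          print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
  else:
      print("No CUDA GPU detected; you'll be doing CPU inference.")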

Anecdotal reports

naptastic

Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there's no context, generation runs at 50-60 tokens/second. Less-polite interactions run more slowly. More context runs more slowly. Making a standard set of prompts for benchmarking speed is on my to-do list.

I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

What hardware works?

  • Nvidia newer than FIXME
    • 30XX, 40XX. (10XX? 20XX? Idk)
  • AMD newer than FIXME
    • things known not to work: multi-GPU
  • Intel
    • I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
  • CPU

LLM Format Comparison

Different formats vary in a handful of relevant ways:

  • Loader compatibility: not all loaders support all formats.
  • Disk space: something not yet quantized, or quantized to a higher precision, will take more space on-disk.
  • VRAM usage (calculating this is highly non-trivial)
  • Quantization: More bits per weight (bpw) usually gives better results; the current consensus (2023-11) seems to be that 5-6 bpw is the best option. Below 5 bpw, quality drops sharply; above 6, quality doesn't improve much. (There's a Reddit post with more information; I need to find it.) A rough bits-to-gigabytes estimate is sketched after this list.
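
A quick back-of-the-envelope way to see what bpw means for file size: the weights take roughly (parameter count × bpw / 8) bytes, plus some extra for metadata and any tensors kept at higher precision. This is a rough sketch with example numbers, not a measurement:

  # Rough size estimate for quantized weights: params * bpw / 8 bytes.
  # Real files run somewhat larger (metadata, some tensors kept at higher precision).
  def weight_size_gib(n_params_billions, bpw):
      return n_params_billions * 1e9 * bpw / 8 / 1024**3

  for bpw in (4, 5, 6, 8, 16):
      print(f"7B model at {bpw} bpw: roughly {weight_size_gib(7, bpw):.1f} GiB")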

Choosing the best format to download for your use case depends on all the same factors: what hardware will run the model, what you're going to do with it, and so on. Here are the model formats:

  • HF - “HF means unquantised, float16 format. You almost never want that. It's the format that models come in when first released. It's good for further fine tuning/training, or as a base for quantisation, but isn't really recommended to be used for inference” -TheBloke
  • AWQ (Activation-aware Weight Quantization) - you almost certainly don't want this. FIXME
  • GPTQ - “GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality” -TheBloke
    • GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  • GGML - Generally not used anymore; for CPU-only inference, GGUF is the new standard. (2023-11)
    • “GGML is a format that was first designed exclusively for CPU-only inference. It is very accessible because it doesn't require a GPU at all, only CPU + RAM. But CPU inference is quite a bit slower than GPU inference. […]
    • Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model. But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds” -TheBloke
  • GGUF - Replaces GGML as the llama.cpp format. Primarily for CPU inference, but layers can optionally be offloaded to a GPU. (A minimal loading example is sketched after this list.)
  • exl2 - GPU only. The native quantization format of ExLlamaV2, which supports flexible (including fractional) bits-per-weight; that's the new hotness. At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants.
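
As one concrete example of the GPU offload that GGUF inherits from GGML, here's a minimal sketch of loading a GGUF file with llama-cpp-python. The file name and parameter values are placeholders, not recommendations; check the project's docs for the options your version supports.

  # Minimal llama-cpp-python example for a GGUF model.
  # model_path is a hypothetical local file; n_gpu_layers=0 means pure CPU.
  from llama_cpp import Llama

  llm = Llama(
      model_path="models/mistral-7b-instruct.Q5_K_M.gguf",  # placeholder path
      n_ctx=4096,       # context window to allocate
      n_gpu_layers=20,  # raise until you run out of VRAM; 0 for CPU-only
  )

  out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
  print(out["choices"][0]["text"])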

How much VRAM do I need?

  • Usually about 1 GB more than you have.
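
Slightly more seriously: VRAM has to hold the quantized weights (see the size sketch above) plus the KV cache, which grows linearly with context length, plus some framework overhead. Here's a rough sketch using architecture numbers I believe are right for Mistral-7B-style models (32 layers, 8 KV heads, head dim 128), but double-check them against the model's config.json:

  # KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
  # Architecture numbers below are assumptions; verify against the model's config.json.
  def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
      return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

  weights = 7e9 * 4 / 8 / 1024**3           # 7B params at 4 bpw
  cache = kv_cache_gib(32, 8, 128, 16384)   # 16k context, fp16 cache
  print(f"weights ~{weights:.1f} GiB + KV cache ~{cache:.1f} GiB + overhead")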

How fast does it go?

  • GPUs are fast; CPUs are not. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves.
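
Until proper benchmarks exist, you can at least get a rough tokens-per-second number out of whatever backend you're running. This sketch assumes a local server exposing an OpenAI-compatible /v1/completions endpoint (text-generation-webui with the API enabled, llama.cpp's server, etc.); the URL and response fields are assumptions, so adjust for your setup:

  # Very rough tokens/second measurement against a local OpenAI-compatible server.
  import time, requests

  URL = "http://127.0.0.1:5000/v1/completions"  # placeholder; use your server's address
  payload = {"prompt": "Write a short story about a teapot.", "max_tokens": 200}

  start = time.time()
  resp = requests.post(URL, json=payload, timeout=300).json()
  elapsed = time.time() - start

  # Many servers report a "usage" block; fall back to a crude word count if not.
  tokens = resp.get("usage", {}).get("completion_tokens") or len(resp["choices"][0]["text"].split())
  print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")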

How do I make it faster?

Assuming that a hardware upgrade isn't an option:

  • option tuning
  • tips for saving context (RAG?)
  • I haven't investigated the --sdp-attention option, but it's supposed to make things faster. Needs testing! (A tiny illustration of what it enables is sketched below.)
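
For what it's worth, --sdp-attention in text-generation-webui appears to switch attention over to PyTorch's built-in fused kernel (torch.nn.functional.scaled_dot_product_attention, available since PyTorch 2.0), which can dispatch to flash or memory-efficient implementations. The sketch below only illustrates that call with made-up shapes; it is not the webui's actual code:

  # Illustration of the fused attention call that --sdp-attention seems to enable.
  # Shapes are made up: (batch, heads, seq_len, head_dim).
  import torch
  import torch.nn.functional as F

  q = torch.randn(1, 8, 128, 64)
  k = torch.randn(1, 8, 128, 64)
  v = torch.randn(1, 8, 128, 64)

  # One fused call replaces the manual softmax(q @ k^T / sqrt(d)) @ v sequence.
  out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  print(out.shape)  # torch.Size([1, 8, 128, 64])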