DISCLAIMER

No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.

This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:

  1. It's a really fast-moving target
  2. It's really complicated

None of us are thrilled with the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.

Anecdotal reports

naptastic

Using an RTX 3070, I can load Mistral 7B models with ExLlamaV2 at 4-bit precision and a context of 16k before running out of VRAM. With an empty context, generation runs at 50-60 tokens/second. By the time the context reaches 2k, generation is down to 10-20 tokens/second. Making a standard set of prompts for benchmarking speed is on my to-do list. (2023-11-13)

Still on the 3070, I got a ~12% increase by installing flash-attention. The error message when it fails to load says that it must be installed from source, but I found it was already in pip. With an empty context I now get 70-73 tokens/second.

I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

Your Report Here!

FIXME

What hardware works?

Hofmann's Iron Law applies here. You can choose at most two of three desirable properties: cheap, fast, easy. Nvidia is the easiest to use, best-supported, and most expensive. AMD cards are generally less well-supported but still useful, and less expensive. Intel GPUs cost the least but (as of 2023-12 at least) can do the least. CPU-based inference frequently requires no additional investment.

How much DRAM/VRAM do I need?

Just to keep things clear, I will use the term DRAM to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and VRAM to refer to GPU-attached RAM, which is generally GDDR. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect over a much faster, wider bus. High-bandwidth memory (HBM) goes even faster and wider.
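To put rough numbers on "faster, wider", here is a minimal sketch of the usual peak-bandwidth arithmetic: bus width in bytes times transfers per second. The function name is made up for illustration. The example figures (dual-channel DDR4-3200, and the RTX 3070's 256-bit GDDR6 at 14 GT/s) come from spec sheets, not from measurements on this page, and real-world throughput will be lower than these theoretical peaks.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_gt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s.

    bus_width_bits: total width of the memory bus in bits
    transfers_gt_s: transfer rate per pin in gigatransfers per second
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_gt_s


# Dual-channel DDR4-3200: two 64-bit channels at 3.2 GT/s (spec-sheet assumption).
ddr4_dual_channel = peak_bandwidth_gb_s(128, 3.2)

# RTX 3070: 256-bit GDDR6 at 14 GT/s (spec-sheet assumption).
rtx_3070 = peak_bandwidth_gb_s(256, 14.0)

print(f"DDR4-3200 dual channel: {ddr4_dual_channel:.1f} GB/s")
print(f"RTX 3070 GDDR6:         {rtx_3070:.1f} GB/s")
print(f"Ratio:                  {rtx_3070 / ddr4_dual_channel:.2f}x")
```

Under those assumptions the GPU's memory comes out almost an order of magnitude ahead of the desktop DRAM.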

(Or, “what models am I limited to?”)
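Until the per-size sections get filled in, a back-of-the-envelope estimate goes a long way. The sketch below adds up two things: the quantized weights and a 16-bit KV cache. Everything in it is an assumption for illustration: the function names are hypothetical, and the Mistral 7B figures (about 7.24 billion parameters, 32 layers, 8 KV heads, head dimension 128) are taken from the published model config rather than from this page. Real quantized formats spend a bit more than their nominal bits per weight on scales and metadata, and activations and framework overhead come on top, so treat the result as a lower bound.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate size of the KV cache at a given context length, in GiB.

    Each layer stores one key tensor and one value tensor (hence the factor
    of 2), with one head_dim-sized vector per KV head per token.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / GIB


# Assumed Mistral 7B figures (not from this page): 7.24B parameters,
# 32 layers, 8 KV heads, head dimension 128.
weights = weights_gib(7.24, 4)                # nominal 4-bit quantization
cache = kv_cache_gib(32, 8, 128, 16 * 1024)   # 16k context, 16-bit cache

print(f"weights:  {weights:.2f} GiB")
print(f"KV cache: {cache:.2f} GiB")
print(f"total:    {weights + cache:.2f} GiB (before activations and overhead)")
```

With these assumptions the total lands somewhere above 5 GiB, which is at least in the same ballpark as the RTX 3070 report above (4-bit Mistral 7B topping out around 16k context on an 8 GB card).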

What fits in 4GB?
What fits in 6GB?
What fits in 8GB?
What fits in 12GB?
What fits in 16GB?
What fits in 20GB?

Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations.

Can I use my CPU and GPU together?

Yes. Please someone write this section.

LLM Format Comparison

This information is honestly as much for my own use as anyone else's. But here's a tl;dr:

“have plenty vram on modern nvidia = use exllamav2+gptq/exl models. for any other scenario, including offloading, its llama.cpp+gguf” -TheLeastMost, 2023-11

Different formats vary in a handful of relevant ways:

So choosing the best format to download for your use case is going to depend on all the same factors: what hardware will be running the model, what you're going to be doing with it, etc. Here are the model formats: