=====DISCLAIMER=====
No guarantees are provided as to the accuracy of this information. I do my best, but things move fast and this is hecking hard to figure out.

This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:
  - It's a really fast-moving target
  - It's really complicated

None of us is thrilled with the state of things, but compatibility gets broken and documentation gets invalidated constantly just to keep up. **Information older than about a week should be considered obsolete and re-checked.**

=====Anecdotal reports=====
==naptastic==
Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there's …

Still on the 3070, I got a ~12% speed increase by installing flash-attention. The error message when it fails to load says that it must be installed from source; I found it was already available through pip. With an empty context I get 70~73 tokens/second.
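
For the curious, here is a rough sketch of how a tokens/second number like the 70~73 above gets measured. The ''generate'' callable is just a stand-in for whatever loader you use, and the pip distribution name is assumed to be ''flash-attn''; treat this as a sketch, not a benchmark harness.
<code python>
import time
from importlib.metadata import version, PackageNotFoundError

# Is the flash-attn wheel from pip present? (Distribution name assumed to be "flash-attn".)
try:
    print("flash-attn version:", version("flash-attn"))
except PackageNotFoundError:
    print("flash-attn is not installed")


def tokens_per_second(generate, prompt, max_new_tokens=128):
    """Time one generation call and return tokens/second.

    `generate` is a placeholder for your loader's generation function;
    it should return the number of tokens it actually produced.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


# Dummy stand-in so the sketch runs on its own; swap in a real loader to get real numbers.
def fake_generate(prompt, max_new_tokens):
    time.sleep(0.5)            # pretend generation took half a second
    return max_new_tokens      # pretend every requested token was produced


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate, 'Hello'):.1f} tokens/second (dummy numbers)")
</code>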

I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

==Your Report Here!==
FIXME

=====What hardware works?=====
Hoffman's …

  * CPU
    * Your CPU **must** support certain SIMD extensions (AVX? SSE4.x?) even if you do GPU-only inference. (A quick check is sketched after this list.)
    * Intel: newer than (???)
    * AMD: Zen architecture.
    * William Schaub has [[https:// …
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
  * Nvidia newer than FIXME
    * I know one person got a 6GB 1060ti to run a 7b Mistral model.
    * Oddly, no reports of success on 20XX cards have reached my eyes.
    * 3070, 3080, 3090 all work. 3090 seems to be the current …
    * 40XX.
  * AMD newer than FIXME
    * Things known not to work: multi-GPU.
    * Not all quantization formats are supported.
  * Intel
    * I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's …
  * Partial offload (CPU + GPU)
    * This is possible; there's …
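
Here is the quick check mentioned in the CPU bullet above. It just reads /proc/cpuinfo, so it's Linux-only, and the list of flags to look for is a guess on my part, since exactly which extensions are actually required is still an open question.
<code python>
# Check /proc/cpuinfo for the SIMD extensions commonly wanted by CPU inference
# back ends. Linux only; the WANTED list is a guess, not an authoritative requirement.
WANTED = ["sse4_1", "sse4_2", "avx", "avx2", "avx512f"]


def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
for flag in WANTED:
    print(f"{flag:8} {'yes' if flag in flags else 'no'}")
</code>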

=====How much DRAM/VRAM do I need?=====

**Just to keep things clear:** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect with a much faster, wider bus. [[https:// …

(Or, "what models am I limited to?")
  * If you're using the same GPU for your OS, find out how much VRAM that takes and subtract it from what's available.
  * The model's weights require roughly (bits per weight × number of weights) ÷ 8 bytes of RAM. (A worked example follows this list.)
  * Context requires RAM on top of the weights. If a model that "…
  * Memory requirements work the same way for CPU-attached DRAM.
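
A worked example of the arithmetic above, as a sketch. The architecture numbers are my assumptions for Mistral 7B (32 layers, 8 KV heads, head size 128, fp16 KV cache); check your own model's config before trusting the output.
<code python>
# Rough memory estimate: quantized weights plus KV cache for the context.

def weight_bytes(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8


def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for the K and V tensors; fp16 elements by default.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


GIB = 1024 ** 3

# Example numbers only: a ~7B model at 4-bit, 16k context, Mistral-7B-like shape.
weights = weight_bytes(7.24e9, 4)
kv = kv_cache_bytes(16_384, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights : {weights / GIB:.2f} GiB")
print(f"16k KV  : {kv / GIB:.2f} GiB")
print(f"total   : {(weights + kv) / GIB:.2f} GiB  (plus loader overhead)")
</code>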

==What fits in 4GB?==
  * Nothing larger than 3b. I have a few; probably worth testing.
==What fits in 6GB?==
  * Anything in the Mistral 7b GPTQ family at 4-bit quantization. Context will be very limited.
==What fits in 8GB?==
  * Mistral 7b GPTQ at 4-bit quantization. I've tested context sizes up to 16k.
==What fits in 12GB?==
==What fits in 16GB?==
==What fits in 20GB?==
Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations.
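
If you do want the calculation anyway, here is the same arithmetic run backwards: given a VRAM budget, roughly how many parameters fit at 4-bit. The 2 GiB of headroom for context, activations, and loader overhead is a guess, not a measurement, so treat the output as a ballpark.
<code python>
# Rough upper bound on model size (parameter count) for a given VRAM budget.
# headroom_gib is a guess that has to cover context, activations, and loader overhead.
GIB = 1024 ** 3


def max_params(vram_gib, bits_per_weight=4.0, headroom_gib=2.0):
    usable_bytes = max(vram_gib - headroom_gib, 0) * GIB
    return usable_bytes * 8 / bits_per_weight


for vram in (4, 6, 8, 12, 16, 20, 24):
    print(f"{vram:>2} GiB VRAM -> roughly {max_params(vram) / 1e9:.1f}B params at 4-bit")
</code>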

=====Can I use my CPU and GPU together?=====
Yes. Please someone write this section.
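
Until someone does, here is a minimal sketch of partial offload using llama-cpp-python, assuming a build compiled with GPU support. The model filename is a placeholder, and 20 offloaded layers is an arbitrary starting point, not a recommendation.
<code python>
from llama_cpp import Llama  # assumes a llama-cpp-python build with GPU (e.g. CUDA) support

# n_gpu_layers sets the CPU/GPU split: that many layers go to VRAM,
# the rest stay in DRAM and run on the CPU.
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=20,   # raise until you run out of VRAM, then back off
    n_ctx=4096,
)

out = llm("Q: What does partial offload do?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
</code>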

=====LLM Format Comparison=====
This information is honestly as much for my own use as anyone else's.

> "have plenty vram on modern nvidia = use exllamav2+gptq/…"

Different formats vary in a handful of relevant ways:
  * Loader compatibility: not every loader can load every format.
  * Disk space: something quantized with more bits (or not quantized at all) will take more disk space.
  * RAM usage: ditto.
  * Quantization: fewer bits per weight means a smaller, faster model, at some cost in output quality.
So choosing the best format to download for your use case is going to depend on all the same factors:
  * HF - //"HF means unquantised, …"//
  * AWQ - you almost certainly don't want this. FIXME
  * GPTQ - //"…"//
    * GPTQ models are already quantized when they get to you. AFAICT the format only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
    * //"…"//
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * .gguf doesn't …
    * Quantization can vary from tensor to tensor within the same model.
    * The filename tells you what you need to know about a model before you download it. The end will be something like the following (a small parsing sketch comes after this list):
      * Q♣_0 quants use the same bpw as they say on the tin for all tensors.
      * Q♣_K_S ("small") …
      * Q♣_K_M ("medium") …
      * Q♣_K_L ("large") …
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), …
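
Here is the parsing sketch promised above: a tiny bit of Python that pulls the quantization tag out of a GGUF filename. The filenames below are made-up examples, and the pattern only covers the common Q♣_0 / Q♣_1 / Q♣_K_{S,M,L} suffixes.
<code python>
import re

# Matches the usual llama.cpp quant suffixes, e.g. Q4_0, Q5_1, Q8_0, Q4_K_M, Q3_K_S.
QUANT_RE = re.compile(r"(Q\d+_(?:0|1|K(?:_[SML])?))", re.IGNORECASE)


def quant_tag(filename):
    """Return the quantization tag from a GGUF filename, or None if not found."""
    match = QUANT_RE.search(filename)
    return match.group(1).upper() if match else None


# Made-up filenames, just to exercise the naming convention described above.
for name in ("mistral-7b-instruct-v0.1.Q4_K_M.gguf",
             "llama-2-13b.Q5_0.gguf",
             "some-model.Q3_K_S.gguf"):
    print(f"{name} -> {quant_tag(name)}")
</code>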