====DISCLAIMER====
No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.
- | |||
- | ====Nap' | ||
This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:

  - It's really complicated
None of us are thrilled with the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. **Information older than about a week should be considered obsolete and re-checked.**
=====Anecdotal reports=====
==naptastic==
Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there'…

Still on the 3070, I got a ~12% speed increase by installing flash-attention. The error message when it fails to load says that it must be installed from source; I found it was already in pip. With an empty context I get 70~73 tokens/second.
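If you want to reproduce this kind of number yourself, here is a minimal timing sketch, assuming the Hugging Face transformers library and a model that actually fits in your VRAM. The model name is a placeholder and the flash-attention option is left commented out; this is not the exact setup that produced the numbers above.

<code python>
# Rough tokens/second measurement sketch. Assumes the "transformers" library
# and a model small enough for your VRAM; the model id is a placeholder.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder, swap in whatever you run
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # try this if flash-attn is installed
)

inputs = tok("Hello, my name is", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
</code>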
+ | |||
+ | I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done. | ||
+ | |||
+ | ==Your Report Here!== | ||
+ | FIXME | ||
+ | |||
=====What hardware works?=====
Hoffman'…

* CPU
  * Your CPU **must** support (AVX? SSE4.X?) even if you do GPU-only inference. (See the check sketch after this list.)
  * Intel: Newer than (???)
  * AMD: Zen architecture.
  * William Schaub has [[https://…
  * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
  * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
* Nvidia newer than FIXME
  * I know one person who got a 6GB 1060ti to run a 7b Mistral model
  * Oddly, no reports of success on 20XX cards have reached my eyes
  * 3070, 3080, 3090 all work. 3090 seems to be the current…
  * 40XX.
* AMD newer than FIXME
  * Things known not to work: multi-GPU
  * Not all quantization formats are supported.
* Intel
  * I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel'…
* Partial offload (CPU + GPU)
  * This is possible; there'…
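Since the exact instruction-set requirement above is still an open question, here is a small Linux-only sketch that lists which of the usual suspects your CPU advertises. The set of flags checked is my guess at candidates, not a confirmed requirement of any particular loader.

<code python>
# Linux-only sketch: list which SIMD instruction sets the CPU advertises,
# by reading /proc/cpuinfo. Which of these a given loader actually needs
# is still an open question (see the list above).
from pathlib import Path

flags: set[str] = set()
for line in Path("/proc/cpuinfo").read_text().splitlines():
    if line.startswith("flags"):
        flags = set(line.split(":", 1)[1].split())
        break

for isa in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'missing'}")
</code>
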
=====How much DRAM/VRAM do I need?=====
**Just to keep things clear**, I will use the term **DRAM** for CPU-attached memory and **VRAM** for GPU-attached memory.

(Or, "what models am I limited to?")
* If you're using the same GPU for your OS, find out how much that takes and subtract it from your available VRAM.
* The model itself requires roughly (bits per weight × number of weights) ÷ 8 bytes of RAM. (See the sketch after this list.)
* Context requires RAM. If a model that "…
* Memory requirements work the same way for CPU-attached RAM.
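A back-of-the-envelope sketch of the weight-memory part of that calculation. The weight count is approximate and context/KV-cache overhead is not modeled, so treat the output as a lower bound rather than a promise.

<code python>
# Back-of-the-envelope estimate of how much memory the *weights* need.
# Context (KV cache), activation scratch space, and whatever your desktop
# is already using on the GPU all come on top of this.
GiB = 1024 ** 3

def weight_gib(n_weights: float, bits_per_weight: float) -> float:
    return n_weights * bits_per_weight / 8 / GiB   # bits -> bytes -> GiB

n_7b = 7.24e9   # "7b" models are a bit over 7 billion weights (approximate)

print(f"7b at 4-bit : {weight_gib(n_7b, 4):.1f} GiB")    # ~3.4 GiB
print(f"7b at 8-bit : {weight_gib(n_7b, 8):.1f} GiB")    # ~6.7 GiB
print(f"7b at fp16  : {weight_gib(n_7b, 16):.1f} GiB")   # ~13.5 GiB
</code>

Those numbers line up with the sections below: a 4-bit 7b fits on 6-8 GB cards with a little room left over for context.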
+ | |||
+ | ==What fits in 4GB?== | ||
+ | * Nothing larger than 3b. I have a few; probably worth testing. | ||
+ | ==What fits in 6GB?== | ||
+ | * Anything in the Mistral 7b GPTQ family at 4-bit quantization. Context will be very limited. | ||
+ | ==What fits in 8GB?== | ||
+ | * Mistral 7b GPTQ at 4-bit quantization. I've tested context sizes up to 16k. | ||
+ | ==What fits in 12GB?== | ||
+ | ==What fits in 16GB?== | ||
+ | ==What fits in 20GB?== | ||
+ | Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations. | ||
+ | |||
+ | =====Can I use my CPU and GPU together? | ||
+ | Yes. Please someone write this section. | ||
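Until someone writes it properly, here is a minimal sketch of partial offload using llama-cpp-python with a GGUF file. The model path, layer count, and context size are placeholders to tune for your own hardware.

<code python>
# Minimal partial-offload sketch with llama-cpp-python (needs a build with
# GPU support, e.g. CUDA). Path and numbers below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # layers pushed to VRAM; raise until you run out of memory
    n_ctx=4096,        # context window; bigger costs more memory
)

out = llm("Q: Why offload only some layers? A:", max_tokens=64)
print(out["choices"][0]["text"])
</code>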
+ | |||
+ | =====LLM Format Comparison===== | ||
+ | This information is honestly as much for my own use as anyone else' | ||
+ | |||
+ | > "have plenty vram on modern nvidia = use exllamav2+gptq/ | ||
+ | |||
+ | Different formats vary in a handful of relevant ways: | ||
+ | * Loader compatibility: | ||
+ | * Disk space: something quantized with more bits (or not quantized) will take more disk space. | ||
+ | * RAM usage: Ditto | ||
+ | * Quantization: | ||
+ | |||
+ | So choosing the best format to download for your use case is going to depend on all the same factors: what hardware | ||
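Purely as an illustration of the rule of thumb in the quote above (not an authoritative decision tree), the logic boils down to something like this; the threshold test and labels are my own rough restatement:

<code python>
# Toy restatement of the quoted rule of thumb. The comparison is my own
# rough assumption, not a measured cut-off.
def suggest_format(vram_gib: float, modern_nvidia: bool, model_gib: float) -> str:
    if modern_nvidia and vram_gib >= model_gib:
        return "exl2 or 4-bit GPTQ (ExLlama/ExLlamaV2)"
    return "GGUF (llama.cpp), offloading as many layers as fit"

print(suggest_format(vram_gib=8, modern_nvidia=True, model_gib=5))    # fits: exl2/GPTQ
print(suggest_format(vram_gib=8, modern_nvidia=True, model_gib=20))   # too big: GGUF
</code>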
+ | |||
+ | * HF - //"HF means unquantised, | ||
+ | * AWQ - you almost certainly don't want this. FIXME | ||
+ | * GPTQ - //" | ||
+ | * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have. | ||
* GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
  * //"GGML is a format that was first designed exclusively for CPU-only inference. Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration."//
* GGUF - CPU-based with optional GPU offload. Successor to GGML.
  * .gguf doesn'…
  * Quantization can vary within the same model.
  * The filename tells you what you need to know about a model before you download it. The end will be something like the following (see the parsing sketch after this list):
    * Q♣_0 quants use the same bpw as they say on the tin for all tensors.
    * Q♣_K_S ("small") …
    * Q♣_K_M ("medium") …
    * Q♣_K_L ("large") …
* exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants. (Remember this info might be out of date.)
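To make the GGUF naming scheme above concrete, here is a tiny parsing sketch; the filenames are made-up examples, not recommendations:

<code python>
# Tiny sketch: pull the quantization tag off a GGUF filename, following the
# naming convention described above. Filenames here are made-up examples.
import re

def quant_tag(filename: str) -> str | None:
    m = re.search(r"\.(Q\d+_(?:0|K_[SML]))\.gguf$", filename, re.IGNORECASE)
    return m.group(1) if m else None

print(quant_tag("mistral-7b-instruct-v0.1.Q4_K_M.gguf"))  # Q4_K_M
print(quant_tag("some-model-13b.Q5_0.gguf"))              # Q5_0
print(quant_tag("unquantized-model.safetensors"))         # None
</code>

The same tags show up in most quantized repos, so once you can read them you know the bits per weight (and therefore roughly the memory cost) before downloading anything.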