====DISCLAIMER====
No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.
None of us are thrilled with the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. **Information older than about a week should be considered obsolete and re-checked.**

=====Anecdotal reports=====
==naptastic==
Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there'
+ | |||
+ | Still on the 3070, I got ~12% increase by installing flash-attention. The error message when it fails to load says that it must be installed from source; I found it was already in pip. With an empty context I get 70~73 tokens/ | ||
I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

==Your Report Here!==
FIXME

=====What hardware works?=====
Hoffman'

  * CPU
    * Your CPU **must** support (AVX? SSE4.X?) even if you do GPU-only inference. (See the check sketch after this list.)
    * Intel: Newer than (???)
    * AMD: Zen architecture.
    * William Schaub has [[https://
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
  * Nvidia newer than FIXME
    * I know one person got a 6GB 1060ti to run a 7b Mistral model.
    * Oddly, no reports of success on 20XX cards have reached my eyes.
    * 3070, 3080, 3090 all work. 3090 seems to be the current
    * 40XX.
  * AMD newer than FIXME
    * Things known not to work: multi-GPU.
    * Not all quantization formats are supported.
  * Intel
    * I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel'
  * Partial offload (CPU + GPU)
    * This is possible; there'

=====How much DRAM/VRAM do I need?=====

**Just to keep things clear:** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect with a much faster, wider bus. [[https://

+ | (Or, "what models am I limited to?" | ||
+ | * If you're using the same GPU for your OS, find out how much that takes and subtract it from your available VRAM. | ||
+ | * The model itself requires (bits * weights) RAM. | ||
+ | * Context requires RAM. If a model that " | ||
+ | * Memory requirements work the same way for CPU-attached RAM. | ||
+ | |||
==What fits in 4GB?==
  * Nothing larger than 3b. I have a few; probably worth testing.
==What fits in 6GB?==
  * Anything in the Mistral 7b GPTQ family at 4-bit quantization. Context will be very limited.
==What fits in 8GB?==
  * Mistral 7b GPTQ at 4-bit quantization. I've tested context sizes up to 16k.
==What fits in 12GB?==
==What fits in 16GB?==
==What fits in 20GB?==
Seriously, c'mon. If you have this much VRAM, you probably don't need help with the calculations.
+ | |||
+ | =====Can I use my CPU and GPU together? | ||
+ | Yes. Please someone write this section. | ||
+ | |||
=====LLM Format Comparison=====
This information is honestly as much for my own use as anyone else's.

> "have plenty vram on modern nvidia = use exllamav2+gptq/

Different formats vary in a handful of relevant ways:
  * Loader compatibility:
  * Disk space: something quantized to fewer bits per weight takes up correspondingly less disk space.
  * RAM usage: ditto.
  * Quantization:
So choosing the best format to download for your use case is going to depend on all the same factors: what hardware will be running the model, what you're going to be doing with it, etc. Here are the model formats:
  * GPTQ - //"
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
    * //"
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * .gguf doesn'
    * Quantization can vary within the same model.
    * The filename tells you what you need to know about a model before you download it. The end will be something like the following (see the parsing sketch after this list):
      * Q♣_0 use the same bpw as they say on the tin for all tensors.
      * Q♣_K_S ("small")
      * Q♣_K_M ("medium")
      * Q♣_K_L ("large")
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11),
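
Since that quantization tag is the main thing to read before downloading a .gguf, here's a small sketch that pulls it out of a filename. The regex is my own guess at the common Q♣_0 / Q♣_1 / Q♣_K_S|M|L patterns, not an official parser, and plenty of filenames in the wild won't match it.

<code python>
# Rough sketch: decode the quantization tag at the end of a GGUF filename,
# e.g. "mistral-7b-instruct-v0.2.Q4_K_M.gguf" -> 4-bit k-quant, medium variant.
# Only covers the common Q*_0 / Q*_1 / Q*_K_{S,M,L} naming patterns.
import re

QUANT_RE = re.compile(r"\.(?P<tag>Q(?P<bits>\d)_(?:0|1|K_(?P<size>[SML])))\.gguf$", re.IGNORECASE)
SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}

def describe(filename: str) -> str:
    """Return a human-readable guess at the quantization encoded in the name."""
    m = QUANT_RE.search(filename)
    if not m:
        return "unrecognized quantization tag"
    bits, size = m.group("bits"), m.group("size")
    if size:
        return f"{bits}-bit k-quant ({SIZE_NAMES[size.upper()]} variant)"
    return f"{bits}-bit, same bpw for all tensors"

if __name__ == "__main__":
    for name in ("mistral-7b-instruct-v0.2.Q4_K_M.gguf", "llama-2-13b.Q8_0.gguf"):
        print(name, "->", describe(name))
</code>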
- | |||
- | ====How much VRAM do I need?==== | ||
- | * Usually about 1 GB more than you have. | ||
- | |||
- | ====How fast does it go?==== | ||
- | * GPUs are fast; CPUs are not. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves. |