ai:formats-faq
=====DISCLAIMER=====
No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.
This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:
  - It's really complicated
None of us are thrilled with the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. **Information older than about a week should be considered obsolete and re-checked.**
=====Anecdotal reports=====
==naptastic==
Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there'

Still on the 3070, I got a ~12% increase by installing flash-attention. The error message when it fails to load says that it must be installed from source; I found it was already in pip. With an empty context I get 70~73 tokens/second.
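If you want to check what your own environment has before fiddling with source builds, here is a minimal sketch. It assumes the PyPI package //flash-attn// (which imports as ''flash_attn''); some loaders ship their own build, so treat a failure here as a hint rather than a verdict.

<code python>
# Minimal check: is flash-attention importable, and which version?
# Assumes the PyPI package "flash-attn" (imports as flash_attn).
try:
    import flash_attn
    print("flash-attention found, version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash-attention not importable; `pip install flash-attn` (needs a CUDA toolchain)"
          " or a source build per the loader's error message may be required.")
</code>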
I have many more observations to put here, but most of them aren't going to be relevant until the A770 work is done.

==Your Report Here!==
FIXME

=====What hardware works?=====
Hoffman'

  * CPU
    * Your CPU **must** support (AVX? SSE4.X?) even if you do GPU-only inference. (There's a quick check sketched after this list.)
    * Intel: Newer than (???)
    * AMD: Zen architecture.
    * William Schaub has [[https://
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
  * Nvidia newer than FIXME
    * I know one person got a 6GB 1060ti to run a 7b Mistral model
    * Oddly, no reports of success on 20XX cards have reached my eyes
    * 3070, 3080, 3090 all work. 3090 seems to be the current
    * 40XX.
  * AMD newer than FIXME
    * things known not to work: multi-GPU
    * Not all quantization formats are supported.
  * Intel
    * I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel'
  * Partial offload (CPU + GPU)
    * This is possible; there'
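The CPU-flags question above is easy to answer on Linux. This is a Linux-only sketch (it reads ''/proc/cpuinfo''); which flags you actually need depends on how your loader was built.

<code python>
# Linux-only sketch: list the SIMD flags the CPU advertises. Prebuilt inference
# binaries are commonly compiled for AVX/AVX2, hence the "must support" note above.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for want in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
    print(f"{want}: {'present' if want in flags else 'missing'}")
</code>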
=====How much DRAM/VRAM do I need?=====
**Just to keep things clear** I will use the term **DRAM** for CPU-attached memory and **VRAM** for GPU-attached memory.

(Or, "what models am I limited to?")
  * If you're using the same GPU for your OS, find out how much that takes and subtract it from your available VRAM.
  * The model itself requires roughly (bits * weights) / 8 bytes of RAM.
  * Context requires RAM. If a model that "
  * Memory requirements work the same way for CPU-attached RAM. (There's a worked example of the arithmetic after this list.)
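As a rough illustration of that arithmetic (not an exact accounting; loaders add overhead, and the layer/head numbers below are ballpark figures for a Mistral-7B-class model):

<code python>
# Rough memory estimate: quantized weights plus an fp16 KV cache.
# Assumes a Mistral-7B-like architecture (32 layers, 8 KV heads, head dim 128);
# treat the result as a ballpark, not a promise.
GIB = 1024 ** 3

def weights_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V, one vector per layer, per KV head, per token of context
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / GIB

weights = weights_gib(7.2e9, 4)   # ~7b model at 4-bit
cache = kv_cache_gib(16 * 1024)   # 16k context
print(f"weights ~{weights:.1f} GiB + kv cache ~{cache:.1f} GiB = ~{weights + cache:.1f} GiB")
</code>

That comes out to roughly 3.4 GiB of weights plus about 2 GiB of cache, which lines up with the 8GB report below: a 4-bit 7b with 16k of context fits, with little to spare.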
==What fits in 4GB?==
  * Nothing larger than 3b. I have a few; probably worth testing.
==What fits in 6GB?==
  * Anything in the Mistral 7b GPTQ family at 4-bit quantization. Context will be very limited.
==What fits in 8GB?==
  * Mistral 7b GPTQ at 4-bit quantization. I've tested context sizes up to 16k.
==What fits in 12GB?==
==What fits in 16GB?==
==What fits in 20GB?==
Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations.
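For the empty sizes above, the weights-only half of the estimate is quick to tabulate (4-bit, nominal parameter counts; context and loader overhead come on top, as in the sketch above):

<code python>
# Weights-only footprint at 4 bits per weight; context and runtime overhead are extra.
GIB = 1024 ** 3
for name, params in [("3b", 3e9), ("7b", 7e9), ("13b", 13e9), ("34b", 34e9), ("70b", 70e9)]:
    print(f"{name}: ~{params * 4 / 8 / GIB:.1f} GiB of weights at 4-bit")
</code>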
=====Can I use my CPU and GPU together?=====
Yes. Please someone write this section.
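Until someone does, here is a minimal sketch of one way to do it: llama.cpp-based loaders can keep some layers in VRAM and the rest in DRAM. With the ''llama-cpp-python'' bindings the knob is ''n_gpu_layers'' (it only has an effect if the package was built with GPU support); the model filename below is just a placeholder.

<code python>
# Partial offload with the llama-cpp-python bindings: n_gpu_layers says how many
# transformer layers go to VRAM; the rest stay in DRAM. Filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # any GGUF you actually have
    n_ctx=4096,
    n_gpu_layers=20,  # raise until VRAM runs out; -1 attempts to offload every layer
)
out = llm("Q: What does partial offload mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
</code>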
=====LLM Format Comparison=====
This information is honestly as much for my own use as anyone else's.

> "have plenty vram on modern nvidia = use exllamav2+gptq/
Different formats vary in a handful of relevant ways:
  * Loader compatibility: not every loader can load every format.
  * Disk space: something quantized with more bits (or not quantized) will take more disk space.
  * RAM usage: Ditto
  * Quantization: formats differ in which bit depths and quantization schemes they support.
So choosing the best format to download for your use case is going to depend on all the same factors: what hardware you have, how much DRAM/VRAM you have, and which loader you want to use.
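The rule of thumb quoted above, restated as a toy helper so the logic is explicit. This is a hypothetical function, not any loader's API; the threshold is just the weights-plus-context estimate from the memory section.

<code python>
# Toy restatement of the quoted rule of thumb: if the whole quantized model plus
# its context fits in VRAM on a modern NVIDIA card, prefer a GPU-native format;
# otherwise fall back to GGUF with partial offload. Hypothetical helper only.
def suggest_format(vram_gib, model_plus_ctx_gib, modern_nvidia=True):
    if modern_nvidia and vram_gib >= model_plus_ctx_gib:
        return "exl2 or 4-bit GPTQ (ExLlama/ExLlamaV2)"
    return "GGUF via llama.cpp, offloading as many layers as fit"

print(suggest_format(8, 5.4))  # the 7b-at-16k estimate above on an 8 GiB card
print(suggest_format(8, 9.0))  # something bigger than the card: spill to DRAM
</code>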
  * HF - //"HF means unquantised,
  * AWQ - you almost certainly don't want this. FIXME
  * GPTQ - //"GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality."//
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
    * //"GGML is a format that was first designed exclusively for CPU-only inference. Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration."//
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * .gguf doesn'
    * Quantization can vary within the same model.
    * The filename tells you what you need to know about a model before you download it. (There's a tiny filename parser sketched after this list.)
      * Q♣_0 quants use the same bpw as they say on the tin for all tensors.
      * Q♣_K_S ("small")
      * Q♣_K_M ("medium")
      * Q♣_K_L ("large")
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants. (Remember this info might be out of date.)
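Since the quantization tag is right there in the filename, a tiny parser is enough to see what you'd be downloading. The filenames below are only examples of the common ''<model>.<quant>.gguf'' naming, not specific releases.

<code python>
# Pull the quantization tag out of a GGUF filename. Assumes the common
# "<model>.<quant>.gguf" naming convention; example names are illustrative.
import re

def quant_tag(filename):
    m = re.search(r"\.(Q\d+_[A-Z0-9_]+)\.gguf$", filename, re.IGNORECASE)
    return m.group(1) if m else None

print(quant_tag("mistral-7b-instruct-v0.1.Q4_K_M.gguf"))  # Q4_K_M
print(quant_tag("some-model.Q5_0.gguf"))                  # Q5_0
</code>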