====DISCLAIMER====
No guarantees are provided as to the accuracy of this information. I do my best, but things move fast and this is hecking hard to figure out.
  
  - It's really complicated
  
None of us are thrilled with the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. **Information older than about a week should be considered obsolete and re-checked.**
  
=====Anecdotal reports=====
==naptastic==
Using an RTX 3070, I can load Mistral 7b models with ExLlama2 at 4-bit precision and a context of 16k before running out of VRAM. If there's no context, generation runs at 50-60 tokens/second. By the time the context reaches 2k, generation is down to 10-20 tokens/second. Making a standard set of prompts for benchmarking speed is on my to-do list. (2023-11-13)
  
Still on the 3070, I got a ~12% increase by installing [[https://github.com/Dao-AILab/flash-attention#installation-and-features|flash-attention]]. The error message when it fails to load says it must be installed from source, but I found it was already available via pip. With an empty context I now get 70-73 tokens/second.
  
I have many more observations to put here, but most of them aren't going to be relevant until the A770 work is done.
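Until that standard prompt set exists, here's a rough, **untested** sketch of how one could measure tokens/second with Hugging Face transformers. The model name and prompt are placeholders, not recommendations; adapt it to whatever loader you actually use.

<code python>
# Rough tokens/second measurement sketch (untested; placeholders throughout).
# Requires: pip install torch transformers accelerate (plus whatever your model format needs).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # placeholder model; swap in your own
PROMPT = "Write one paragraph explaining what a token is."  # placeholder prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s = {new_tokens / elapsed:.1f} tokens/second")
</code>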
  
==Your Report Here!==
FIXME
  
=====What hardware works?=====
Hoffman's Iron Law applies here. You can choose at most two of three desirable properties: cheap, fast, easy. Nvidia is the easiest to use, best-supported, and most expensive. AMD cards are generally less well-supported but still useful, and less expensive. Intel GPUs cost the least but (as of 2023-12 at least) can do the least. CPU-based inference frequently requires no additional investment.
  
  * CPU
    * Your CPU **must** support certain vector instructions (AVX? SSE4.x?) even if you do GPU-only inference. (A quick way to check is sketched just below this list.)
      * Intel: newer than (???)
      * AMD: Zen architecture.
      * William Schaub has [[https://blog.longearsfor.life/blog/2023/11/26/building-pytorch-for-systems-without-avx2-instructions/|this blog post]] for people who don't have AVX2 support. He adds: "I ended up doing the same for torchaudio and torchvision because it turns out that the C++ API ended up mismatched from the official packages. It's the same process except no changes needed in the cmake config."
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
  * Nvidia newer than FIXME
    * I know one person who got a 6GB 1060ti to run a 7b Mistral model.
    * Oddly, no reports of success on 20XX cards have reached my eyes.
    * 3070, 3080, and 3090 all work. The 3090 seems to be the current (2023-12) favorite.
    * 40XX cards work.
  * AMD newer than FIXME
    * Things known not to work: multi-GPU.
    * Not all quantization formats are supported.
  * Intel
    * I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
  * Partial offload (CPU + GPU)
    * This is possible; there's going to be a section about it below, but it's not written yet.
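Regarding the CPU requirement above: if you're not sure which instruction sets your CPU reports, here's a quick sketch (Linux-only assumption) that reads the flags line from /proc/cpuinfo.

<code python>
# Quick check for SSE4/AVX/AVX2 support by reading /proc/cpuinfo (Linux-only assumption).
FLAGS_OF_INTEREST = ("sse4_1", "sse4_2", "avx", "avx2", "avx512f")

with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

# Every logical CPU repeats the same "flags" line; the first one is enough.
flag_lines = [line for line in cpuinfo.splitlines() if line.startswith("flags")]
flags = set(flag_lines[0].split(":", 1)[1].split()) if flag_lines else set()

for flag in FLAGS_OF_INTEREST:
    print(f"{flag:8} {'yes' if flag in flags else 'no'}")
</code>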
=====How much DRAM/VRAM do I need?=====

**Just to keep things clear:** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect over a much faster, wider bus. [[https://en.wikipedia.org/wiki/High_Bandwidth_Memory|High-bandwidth memory]] goes even faster and wider.

(Or, "what models am I limited to?")
  * If you're using the same GPU for your OS, find out how much VRAM that takes and subtract it from your available VRAM.
  * The model weights themselves require roughly (bits per weight * number of weights) / 8 bytes of RAM; see the worked example just after this list.
  * Context requires RAM too. If a model that "should" fit fails to load with an out-of-memory error, try reducing the context size (max_seq_length IIRC).
  * Memory requirements work the same way for CPU-attached DRAM.
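For example (rough arithmetic, weights only): a 7b model at 4 bpw needs about 7e9 * 4 / 8 bytes, which is roughly 3.5 GB, before you add context or loader overhead. A tiny sketch of the same calculation, with illustrative numbers:

<code python>
# Back-of-the-envelope memory estimate for the weights alone.
# Ignores the KV cache (context) and loader overhead, which add more on top.
def weight_memory_gb(num_weights, bits_per_weight):
    """Approximate gigabytes needed just to hold the quantized weights."""
    return num_weights * bits_per_weight / 8 / 1e9

# Illustrative numbers, not benchmarks:
for params, bpw in [(3e9, 4), (7e9, 4), (7e9, 5), (13e9, 4)]:
    print(f"{params / 1e9:.0f}b @ {bpw} bpw ~= {weight_memory_gb(params, bpw):.1f} GB + context + overhead")
</code>

Treat these as lower bounds: the KV cache grows with context length and adds to this.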
==What fits in 4GB?==
  * Nothing larger than 3b. I have a few; probably worth testing.
==What fits in 6GB?==
  * Anything in the Mistral 7b GPTQ family at 4-bit quantization. Context will be very limited.
==What fits in 8GB?==
  * Mistral 7b GPTQ at 4-bit quantization. I've tested context sizes up to 16k.
==What fits in 12GB?==
==What fits in 16GB?==
==What fits in 20GB?==
Seriously, c'mon. If you have this much VRAM, you probably don't need help with the calculations.

=====Can I use my CPU and GPU together?=====
Yes. Please, someone, write this section.
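Until someone does, here's a minimal, untested sketch of the usual partial-offload approach with llama.cpp via the llama-cpp-python bindings. The model path and layer count are placeholders to tune for your own VRAM.

<code python>
# Partial offload sketch: some layers on the GPU, the rest in CPU/DRAM.
# Untested; requires llama-cpp-python built with GPU support and a GGUF model on disk.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # layers pushed to VRAM; 0 = pure CPU, raise until you run out of VRAM
    n_ctx=4096,       # context window; larger costs more memory
)

result = llm("Q: What does partial offload mean?\nA:", max_tokens=128)
print(result["choices"][0]["text"])
</code>

The knob that matters is n_gpu_layers: whatever doesn't fit on the GPU stays in DRAM, at the cost of speed.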
=====LLM Format Comparison=====
This information is honestly as much for my own use as anyone else's. But here's a tl;dr:

> "have plenty vram on modern nvidia = use exllamav2+gptq/exl models. for any other scenario, including offloading, its llama.cpp+gguf" -TheLeastMost, 2023-11
  
Different formats vary in a handful of relevant ways:
  * Loader compatibility: not all loaders support all formats.
  * Disk space: something quantized with more bits per weight (or not quantized at all) will take more disk space.
  * RAM usage: ditto (more bits means more RAM).
  * Quantization: more bits per weight (bpw) usually gives better results; the current consensus (2023-11) seems to be that 5-6 bpw is the best option. Below 5 bpw, quality drops sharply; above 6, quality doesn't improve much. (There's a Reddit post with more information; I need to find it.) Not all formats support all quantization options.
  
 So choosing the best format to download for your use case is going to depend on all the same factors: what hardware will be running the model, what you're going to be doing with it, etc. So here are the model formats: So choosing the best format to download for your use case is going to depend on all the same factors: what hardware will be running the model, what you're going to be doing with it, etc. So here are the model formats:
  * GPTQ - //"GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality"// -TheBloke
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
    * //"GGML is a format that was first designed exclusively for CPU-only inference. It is very accessible because it doesn't require a GPU at all, only CPU + RAM. But CPU inference is quite a bit slower than GPU inference.// [...] //Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model. But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds"// -TheBloke
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * .gguf doesn't currently have a published specification, but it has had one in the past. It's designed to be future-proof. It also keeps everything in a single file.
    * Quantization can vary within the same model.
    * The filename tells you what you need to know about a model before you download it. The end will be something like the following (a quick filename parser is sketched after this list):
      * Q♣_0 uses the ♣ bpw it says on the tin for all tensors.
      * Q♣_K_S ("small") uses ♣ bpw for all tensors.
      * Q♣_K_M ("medium") uses more bpw for some specific tensors.
      * Q♣_K_L ("large") uses more bpw for a larger set of tensors.
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants.
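Going back to the GGUF filename convention above: here's a small sketch of pulling the quantization tag out of a .gguf filename. It assumes the common Q<bits>[_K][_S/M/L] (or F16/F32) pattern; filenames in the wild vary, and the example names are illustrative, not download recommendations.

<code python>
# Pull the quantization tag (Q4_K_M, Q5_0, F16, ...) out of a GGUF filename.
# Assumes the common naming convention; real filenames do vary.
import re

QUANT_RE = re.compile(r"(Q\d+(?:_\d+|_K(?:_[SML])?)?|F16|F32)\.gguf$", re.IGNORECASE)

def quant_tag(filename):
    match = QUANT_RE.search(filename)
    return match.group(1).upper() if match else None

# Illustrative names only:
for name in ("mistral-7b-instruct-v0.1.Q4_K_M.gguf",
             "llama-2-13b.Q5_0.gguf",
             "some-model.f16.gguf",
             "weird-name.gguf"):
    print(f"{name:40} -> {quant_tag(name)}")
</code>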