  * Intel: Newer than (???)
  * AMD: Zen architecture. (A quick way to check what your CPU supports is sketched just after this list.)
  * William Schaub has [[https://
  * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
  * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
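The exact CPU cutoff above is still an open question, but in practice the common CPU inference back ends lean heavily on AVX2, so checking for that flag is a reasonable first test. That's my assumption about what the "(???)" boils down to, not something settled here. A Linux-only sketch (it just reads /proc/cpuinfo):

<code python>
# Linux-only sketch: report whether the CPU advertises AVX2 (and AVX-512),
# which CPU inference back ends generally rely on for usable speed.
def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX2   :", "yes" if "avx2" in flags else "no")
print("AVX-512:", "yes" if any(f.startswith("avx512") for f in flags) else "no")
</code>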
=====How much DRAM/VRAM do I need?=====
**Just to keep things clear** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the processor doing the work, VRAM is considerably faster than DRAM. (A rough sizing sketch follows at the end of this section.)
(Or, "what models am I limited to?") | (Or, "what models am I limited to?") | ||
Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations.
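A rough rule of thumb (my own sketch, not from any loader's documentation): the weights alone take roughly parameter count × bits-per-weight ÷ 8 bytes, plus some overhead for context and activations. The ~20% overhead factor below is an assumption; real usage varies with context length and loader.

<code python>
def estimate_ram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough DRAM/VRAM estimate for loading a model.

    params_billions: model size, e.g. 7 for a 7B model
    bits_per_weight: 16 for fp16, 4 for a 4-bit quant, etc.
    overhead:        fudge factor for context/activations (assumed ~20%)
    """
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9  # gigabytes

print(f"7B fp16 : ~{estimate_ram_gb(7, 16):.1f} GB")  # ~16.8 GB
print(f"7B 4-bit: ~{estimate_ram_gb(7, 4):.1f} GB")   # ~4.2 GB
</code>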
=====Can I use my CPU and GPU together?=====
Yes. Please someone write this section.
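Until someone writes this up properly, here's a minimal sketch of the approach most people use: llama.cpp (here via its llama-cpp-python bindings) keeps the model in DRAM and offloads some layers to VRAM. The model filename is a placeholder, and this assumes the bindings were built with GPU support.

<code python>
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-7b.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=20,  # how many layers to offload to VRAM; -1 tries to offload them all
    n_ctx=2048,       # context window
)

out = llm("Q: What is a GGUF file? A:", max_tokens=64)
print(out["choices"][0]["text"])
</code>

Tune n_gpu_layers until the model stops spilling out of VRAM; whatever doesn't fit stays in DRAM and runs on the CPU.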
=====LLM Format Comparison=====
This information is honestly as much for my own use as anyone else's.

> "have plenty vram on modern nvidia = use exllamav2+gptq/
Different formats vary in a handful of relevant ways:
  * Loader compatibility: not every loader can load every format.
  * Disk space: something quantized with more bits (or not quantized) will take more disk space.
  * RAM usage: more bits per weight means more DRAM or VRAM is needed to hold the model.
  * Quantization: fewer bits generally means some loss of output quality.
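Putting the quote and the factors above together, the usual advice boils down to a very small decision rule. The sketch below is my own simplification: the "fits in VRAM" test and the function name are mine, not anything the loaders prescribe.

<code python>
def suggest_format(model_size_gb: float, vram_gb: float, modern_nvidia: bool) -> str:
    """Rough rule of thumb: plenty of VRAM -> exl2/GPTQ on ExLlamaV2,
    otherwise GGUF on llama.cpp with optional partial GPU offload."""
    if modern_nvidia and vram_gb >= model_size_gb:
        return "exl2 or GPTQ (ExLlamaV2, model fully in VRAM)"
    return "GGUF (llama.cpp, CPU inference with optional GPU offload)"

print(suggest_format(model_size_gb=4.5, vram_gb=8, modern_nvidia=True))
print(suggest_format(model_size_gb=40.0, vram_gb=8, modern_nvidia=True))
</code>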
Choosing the best format to download for your use case is going to depend on all the same factors: what hardware will be running the model, what you're going to be doing with it, etc. Here are the model formats:
  * GPTQ - GPU-oriented quantized format.
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * The .gguf extension by itself doesn't tell you how (or whether) the model is quantized.
    * Quantization is done before the model is published; GGUF quants range from 2-bit to 8-bit, including the newer "K-quant" variants.
    * The filename tells you what you need to know about a model before you download it (a small parsing sketch follows after this format list). The end will be something like (♣ standing in for the number of bits):
      * Q♣_0 (the older, non-K quantization)
      * Q♣_K_S ("small")
      * Q♣_K_M ("medium")
      * Q♣_K_L ("large")
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), as far as I know only ExLlamaV2 can load exl2 models.
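Since the quantization level is encoded in the GGUF filename, here's a small sketch that pulls it out. The regular expression only covers the common suffixes listed above; the function name and the exact pattern are my own, not any official naming spec.

<code python>
import re

# Common llama.cpp quant tags at the end of GGUF filenames,
# e.g. "model.Q4_K_M.gguf" or "model.Q8_0.gguf".
QUANT_RE = re.compile(r"Q(\d+)_(K_[SML]|K|0|1)", re.IGNORECASE)

def gguf_quant_info(filename: str) -> str:
    """Describe the quantization encoded in a GGUF filename, if any."""
    m = QUANT_RE.search(filename)
    if not m:
        return "no quantization tag found in filename"
    bits, variant = int(m.group(1)), m.group(2).upper()
    size_names = {"K_S": "small", "K_M": "medium", "K_L": "large"}
    desc = size_names.get(variant, "basic" if variant in ("0", "1") else "K-quant")
    return f"{bits}-bit, {desc} variant ({m.group(0)})"

# Hypothetical filenames, just to show the parsing:
print(gguf_quant_info("llama-2-7b.Q4_K_M.gguf"))  # 4-bit, medium variant (Q4_K_M)
print(gguf_quant_info("mistral-7b.Q8_0.gguf"))    # 8-bit, basic variant (Q8_0)
</code>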