======LLM Formats FAQ======
=====What hardware works?=====
  * Intel: Newer than (???)
  * AMD: Zen architecture.
  * It might be possible to work around some of this by compiling things from source; William Schaub has [[https:// |more on this]]. (See the CPU-flag check sketched after this list.)
  * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
  * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
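If you're not sure where your CPU falls, the check below is just my own sketch (it assumes the practical cut-off for prebuilt CPU-inference builds is AVX2, with AVX-512 as a bonus) and reads the flags Linux reports in /proc/cpuinfo.

<code python>
# Sketch: report whether this CPU advertises the SIMD flags that most
# prebuilt CPU-inference binaries assume. Linux-only (reads /proc/cpuinfo).
def cpu_flags(path="/proc/cpuinfo"):
    flags = set()
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    return flags

if __name__ == "__main__":
    have = cpu_flags()
    for want in ("avx", "avx2", "avx512f"):
        print(f"{want}: {'yes' if want in have else 'NO'}")
</code>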
=====How much DRAM/VRAM do I need?=====
**Just to keep things clear** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the processor that uses it, VRAM has far higher bandwidth, which is a big part of why GPU inference is so much faster.
(Or, "what models am I limited to?") | (Or, "what models am I limited to?") | ||
=====LLM Format Comparison=====
  * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
- | * //" | + | * //" |
- | * //Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. | + | * GGUF - CPU-based with optional GPU offload. |
- | * GGUF - CPU only. (wait, it might support partial | + | * .gguf doesn' |
- | * //" | + | * Quantization can vary within the same model. |
- | * The .gguf filename tells you its quantization, | + | * The filename tells you what you need to know about a model before you download it. The end will be something like: |
+ | * Q♣_0 use the same bpw as they say on the tin for all tensors. | ||
+ | * Q♣_K_S (" | ||
+ | * Q♣_K_M (" | ||
+ | * Q♣_K_L (" | ||
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), ...
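To make the "optional GPU offload" bullet above concrete, here is a minimal sketch using llama-cpp-python (assuming that's how you run your .gguf files; the file name and layer count are placeholders I made up). ''n_gpu_layers'' decides how many layers live in VRAM; whatever doesn't fit stays in DRAM and runs on the CPU.

<code python>
# Minimal llama-cpp-python sketch of partial GPU offload for a .gguf file.
# The path and n_gpu_layers value are placeholders; Q4_K_M is the quant
# naming scheme described above.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-7b-model.Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=20,  # layers pushed to VRAM; 0 = pure CPU, -1 = all layers
    n_ctx=4096,       # context window to allocate
)

out = llm("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
</code>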