      * Intel: Newer than (???)
      * AMD: Zen architecture.
      * William Schaub has [[https://blog.longearsfor.life/blog/2023/11/26/building-pytorch-for-systems-without-avx2-instructions/|this blog post]] for people who don't have AVX2 support. He adds: "I ended up doing the same for torchaudio and torchvision because it turns out that the C++ API ended up mismatched from the official packages. It's the same process except no changes needed in the cmake config." (A quick way to check whether your CPU has AVX2 is sketched below.)
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
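
If you're not sure whether your CPU has AVX2 (which the official PyTorch packages apparently require; see the blog post above if yours doesn't), here's a minimal, Linux-only sketch that reads ''/proc/cpuinfo''. On other operating systems, your system information tool will show the same flag.

<code python>
# Minimal sketch: check whether the CPU advertises AVX2 (Linux only).
def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            return any("avx2" in line for line in f if line.startswith("flags"))
    except OSError:
        return False  # couldn't read cpuinfo; check with lscpu instead

print("AVX2 supported:", has_avx2())
</code>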
=====How much DRAM/VRAM do I need?=====
  
**Just to keep things clear**, I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect with a much faster, wider bus. [[https://en.wikipedia.org/wiki/High_Bandwidth_Memory|High-bandwidth memory]] goes even faster and wider.
  
(Or, "what models am I limited to?")
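
As a very rough rule of thumb (my own back-of-the-envelope estimate, not an official formula), the weights alone need about (parameters × bits per weight ÷ 8) bytes, plus some overhead for context and the loader. A sketch:

<code python>
# Back-of-the-envelope sketch, not an official formula: weights need roughly
# parameters * bpw / 8 bytes. The 1.2 overhead factor for context/KV-cache and
# loader bookkeeping is an assumption; real usage varies with context length,
# loader, and batch size.
def rough_model_gib(params_billion, bpw, overhead=1.2):
    weight_bytes = params_billion * 1e9 * bpw / 8
    return weight_bytes * overhead / 2**30

print(f"13B at 5 bpw: ~{rough_model_gib(13, 5):.1f} GiB")  # about 9 GiB with these assumptions
print(f"70B at 4 bpw: ~{rough_model_gib(70, 4):.1f} GiB")  # about 39 GiB
</code>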
Seriously c'mon. If you have this much VRAM, you probably don't need help with the calculations.
  
=====Can I use my CPU and GPU together?=====
Yes. Please someone write this section.
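
In the meantime, here's one hedged sketch of offloading with llama-cpp-python (llama.cpp's Python bindings), where ''n_gpu_layers'' sets how many layers go to VRAM and the rest stay in DRAM. The model path below is just an example filename, not a recommendation.

<code python>
# Hedged sketch of CPU+GPU offloading via llama-cpp-python.
# n_gpu_layers = how many transformer layers to push to VRAM; the rest run on
# the CPU from DRAM. The model path is only an example filename.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example.Q5_K_M.gguf",  # any GGUF file you already have
    n_gpu_layers=20,   # raise until VRAM runs out; 0 = pure CPU
    n_ctx=4096,        # context window; bigger contexts need more memory
)

out = llm("Q: What is VRAM? A:", max_tokens=64)
print(out["choices"][0]["text"])
</code>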
  
=====LLM Format Comparison=====

This information is honestly as much for my own use as anyone else's. But here's a tldr:

> "have plenty vram on modern nvidia = use exllamav2+gptq/exl models. for any other scenario, including offloading, its llama.cpp+gguf" -TheLeastMost, 2023-11
Different formats vary in a handful of relevant ways:
  * Loader compatibility: not all loaders support all formats.
  * Disk space: something quantized with more bits (or not quantized) will take more disk space.
  * RAM usage: likewise, more bits per weight means more DRAM/VRAM needed to load and run the model.
  * Quantization: More bits (bpw) usually gives better results; the current consensus (2023-11) seems to be that 5-6bpw is the best option. Below 5bpw, quality drops sharply; above 6, quality doesn't improve much. (There's a Reddit post with more information; I need to find it.) Not all formats support all quantization options.
  
Choosing the best format to download for your use case depends on all the same factors: what hardware will run the model, what you're going to do with it, and so on. So here are the model formats:
  * GPTQ - //"GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality"// -TheBloke
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have. (A loading sketch appears after this list.)
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
    * //"GGML is a format that was first designed exclusively for CPU-only inference. It is very accessible because it doesn't require a GPU at all, only CPU + RAM. But CPU inference is quite a bit slower than GPU inference.// [...] //Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model. But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds"// -TheBloke
  * GGUF - CPU-based with optional GPU offload. Successor to GGML.
    * .gguf doesn't currently have a published specification, but it has had one in the past. It's designed to be future-proof. It also keeps everything in a single file.
    * Quantization can vary from tensor to tensor within the same model.
    * The filename tells you what you need to know about a model before you download it. The end will be something like:
      * Q♣_0 uses the same bpw as it says on the tin for all tensors.
      * Q♣_K_S ("small") uses ♣ bpw for all tensors.
      * Q♣_K_M ("medium") uses more bpw for some specific tensors.
      * Q♣_K_L ("large") uses more bpw for a larger set of tensors.
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants.
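
For the GPTQ loading sketch promised above: this uses AutoGPTQ plus a Hugging Face tokenizer. The repository name is a placeholder in TheBloke's naming style, not something I've verified.

<code python>
# Hedged sketch: load a 4-bit GPTQ model with AutoGPTQ and generate a few tokens.
# The repo name is a placeholder; substitute a real GPTQ repo you actually want.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/Example-7B-GPTQ"  # placeholder name
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

inputs = tokenizer("GPTQ is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
</code>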

=====How fast does it go?=====
  * GPUs are fast; CPUs are not. Due to how LLMs work, direct speed comparisons are difficult or impossible. (Is it "fair" to say "it ran the same speed" if it generated different output?)
  * There are a few decent speed comparisons out there; please feel free to send links, graphs, data... whatever.
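
If you want to generate your own numbers, here's a rough tokens-per-second measurement sketch using llama-cpp-python; the model path is an example, and results will swing wildly with hardware, quantization, and context length.

<code python>
# Rough sketch: measure generation speed (tokens/second) with llama-cpp-python.
# The model path is an example; numbers depend heavily on hardware and settings.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/example.Q5_K_M.gguf", n_gpu_layers=0)  # 0 = CPU only

start = time.perf_counter()
out = llm("Write one sentence about RAM.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} tok/s")
</code>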