      * Intel: Newer than (???)
      * AMD: Zen architecture.
      * William Schaub has [[https://blog.longearsfor.life/blog/2023/11/26/building-pytorch-for-systems-without-avx2-instructions/|this blog post]] for people who don't have AVX2 support. He adds: "I ended up doing the same for torchaudio and torchvision because it turns out that the C++ API ended up mismatched from the official packages. It's the same process except no changes needed in the cmake config." (A quick way to check whether your CPU has AVX2 is sketched right after this list.)
    * Most users have more CPU-attached DRAM than GPU-attached VRAM, so more models can run via CPU inference.
    * CPU/DRAM inference is orders of magnitude slower than GPU/VRAM inference. (More info needed.)
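
If you're not sure whether your CPU actually has AVX2, here is a minimal check. It assumes a Linux system where /proc/cpuinfo is readable; other platforms need a different approach.

<code python>
# Minimal sketch: check for AVX2 support by reading /proc/cpuinfo (Linux only).
def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return "avx2" in line.split()
    return False

print("AVX2 supported:", has_avx2())
</code>
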
=====How much DRAM/VRAM do I need?=====
  
**Just to keep things clear:** I will use the term **DRAM** to refer to CPU-attached RAM, which is generally DDR4 or DDR5, and **VRAM** to refer to GPU-attached RAM. Being much closer to the GPU itself, and on the same circuit board, GDDR can connect with a much faster, wider bus. [[https://en.wikipedia.org/wiki/High_Bandwidth_Memory|High-bandwidth memory]] goes even faster and wider.
  
 (Or, "what models am I limited to?") (Or, "what models am I limited to?")
    * GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  * GGML - Generally not used anymore; GGUF is the new standard. (2023-12)
-    * //"GGML is a format that was first designed exclusively for CPU-only inference.  It is very accessible because it doesn't require a GPU at all, only CPU + RAM.  But CPU inference is quite a bit slower than GPU inference.// [...] +    * //"GGML is a format that was first designed exclusively for CPU-only inference.  It is very accessible because it doesn't require a GPU at all, only CPU + RAM.  But CPU inference is quite a bit slower than GPU inference.// [...] //Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration.  It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model.  But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds"// -TheBloke 
-    * //Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration.  It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model.  But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds"// -TheBloke +  * GGUF - CPU-based with optional GPU offload. Successor to GGML. 
-  * GGUF - CPU only. (wait, it might support partial offload. FIXME ) Replaces GGML. +    * .gguf doesn't currently have a published specificationbut it has had in the pastIt's designed to be future-proofIt also keeps everything in a single file
-    * //"[GGML is] the backend that llama.cpp usesso "GGML quantizations" are the type of quantizations files supported by llama.cpp can use, currently .gguf files."// - kerfuffle +    * Quantization can vary within the same model. 
-    * The .gguf filename tells you its quantization, and FIXME What do _0, _K, _K_M, _K_S, and _K_L mean?+    * The filename tells you what you need to know about a model before you download it. The end will be something like: 
 +      * Q♣_0 use the same bpw as they say on the tin for all tensors. 
 +      * Q♣_K_S ("small") uses ♣ bpw for all tensors. 
 +      * Q♣_K_M ("medium") uses more bpw for some specific tensors. 
 +      * Q♣_K_L ("large") uses more bpw for a larger set of tensors.
  * exl2 - GPU only. New hotness. (Why though?) At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants.
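
Since GGUF supports partial GPU offload, here's a minimal sketch of what that looks like, assuming the llama-cpp-python bindings (built with GPU support) and a made-up local model path:

<code python>
# Sketch of GGUF partial offload via llama-cpp-python (an assumption about your
# setup, not the only way to do it). n_gpu_layers controls how many layers go
# to VRAM; everything else stays in DRAM and runs on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-13b.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=35,   # 0 = pure CPU; raise it until you run out of VRAM
    n_ctx=2048,
)
out = llm("Q: What is the difference between DRAM and VRAM? A:", max_tokens=48)
print(out["choices"][0]["text"])
</code>
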
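
The quantization suffix is right there in the filename, so you can check it before downloading anything. A minimal sketch (the filenames are made-up examples, not specific releases):

<code python>
import re

# Minimal sketch: pull the quantization suffix (Q4_K_M, Q8_0, Q6_K, ...) out of
# a .gguf filename. The example filenames below are hypothetical.
def quant_of(filename):
    m = re.search(r"\.(Q[0-9][A-Za-z0-9_]*)\.gguf$", filename)
    return m.group(1) if m else None

print(quant_of("llama-2-13b-chat.Q4_K_M.gguf"))   # Q4_K_M
print(quant_of("mistral-7b-instruct.Q8_0.gguf"))  # Q8_0
</code>
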