Welcome to naptastic's LLM/Generative AI FAQ

DISCLAIMER

No guarantees are provided as to the accuracy of this information. I do my best but things move fast and this is hecking hard to figure out.

PHILOSOPHY

I very much subscribe to the “Stone Soup” philosophy of open-source. The problem is that everyone wants to be the person bringing the stone. But it only takes one stone. We need tables, utensils, meat, vegetables, seasonings, firewood, and people to tend the fire and stir the pot and cut up the ingredients…

Please consider how many people have put how much time into generating and assembling this information. Yes, it's naptastic organizing the page (at least right now), but all the info is coming from other people. I do not want to scare people off from asking questions; otherwise I don't know what to put in the FAQ! But if you are going to bring questions, please also be willing to put some time in to test things and report back when you have successes.

How can I help?

First off, SUCCEED!!! Get something working, even if it's not working as well as you want. Getting better and faster results is part of this FAQ too.

Second, tell me about it! What hardware worked? What models? What problems did you encounter and how did you solve them? How fast does it generate?

Third, if you want to help with this FAQ, it would be nice to have someone in another part of the world (because timezones) helping to keep this updated. The information moves so fast.

Lastly, if you can contribute to the actual open-source projects making this all possible, please do. That is where the most work needs to be done.

Important note: YOU CAN USE AI TO HELP YOU WITH THIS!!! It's not cheating!

Nap's AI/LLM generation FAQ

This page is going to be a mess. I wish there were an easy way to just say ${model} works with ${hardware} using ${software}, but:

  1. It's a really fast-moving target
  2. It's really complicated

None of us are happy about the state of things, but we have to break compatibility and invalidate documentation constantly just to keep up. Information older than about a week should be considered obsolete and re-checked.

What's the "best" model? How do I pick one?

There are many relevant considerations:

  • Probably most important is what hardware you have.
    • GPUs are much faster IF you can load your entire model into VRAM. (Don't ask how much VRAM yet; we'll get there below.)
    • CPUs usually have access to more RAM, but CPU inference is much slower.

Anecdotal reports

Please create a section for yourself using double-equals and put your notes there.

If your source of truth is someplace else (e.g., your blog or social media feed) we could just add a link, and then your report here never has to get updated and it never goes stale.

naptastic

Using an RTX 3070, I can load Mistral 7B models with ExLlama2 at 4-bit precision with up to about 16k of context before running out of VRAM. With no context, generation runs at 50-60 tokens/second. Less-polite interactions run more slowly. More context runs more slowly. Making a standard set of prompts for benchmarking speed is on my to-do list; a rough timing sketch is below.
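
If you want to reproduce a number like the one above, here is a minimal sketch of loading a quantized model with the exllamav2 library and timing generation. The model path, sampling settings, and token count are assumptions (use your own), and the exllamav2 API changes quickly, so check the project's own examples before trusting this.

  # Minimal sketch: load a 4-bit quantized model with exllamav2 and time generation.
  # The model directory is a hypothetical local path; adjust to your setup.
  import time

  from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
  from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

  config = ExLlamaV2Config()
  config.model_dir = "/models/Mistral-7B-4bit"   # hypothetical path to a quantized model
  config.prepare()

  model = ExLlamaV2(config)
  cache = ExLlamaV2Cache(model, lazy=True)
  model.load_autosplit(cache)                    # load weights, splitting across GPUs if needed
  tokenizer = ExLlamaV2Tokenizer(config)

  generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
  settings = ExLlamaV2Sampler.Settings()
  settings.temperature = 0.8

  prompt = "Explain why VRAM matters for LLM inference."
  max_new_tokens = 200

  generator.warmup()                             # avoid counting one-time CUDA setup
  start = time.time()
  output = generator.generate_simple(prompt, settings, max_new_tokens)
  elapsed = time.time() - start

  print(output)
  print(f"~{max_new_tokens / elapsed:.1f} tokens/second")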

I have many more observations to put here but most of them aren't going to be relevant until the A770 work is done.

What hardware works?

  • Nvidia newer than FIXME
    • 30XX, 40XX. (10XX? 20XX? Idk)
  • AMD newer than FIXME
    • things known not to work: multi-GPU
  • Intel
    • I don't have anything working yet, but it's supposed to be possible to run things on an A770 with Intel's SDK.
  • CPU

So what's up with LLM formats?

  • HF - “HF means unquantised, float16 format. You almost never want that. It's the format that models come in when first released. It's good for further fine tuning/training, or as a base for quantisation, but isn't really recommended to be used for inference” -TheBloke
  • AWQ (Activation-aware Weight Quantization) - you almost certainly don't want this. FIXME
  • GPTQ - “GPTQ is a format designed for GPU inference. If you have a decent GPU with enough VRAM for the model you want to use, it will provide the fastest inference at a decent quality” -TheBloke
    • GPTQ models are already quantized when they get to you. AFAICT it only supports 4-bit and 8-bit quantization. Right now (2023-11-11) ExLlama won't load 8-bit quantized GPTQ models; it only supports 4-bit models. I don't know if this affects ExLlama2 or _HF variations; I'm not going to waste bandwidth downloading a model I already know won't work with what I have.
  • GGML - “GGML is a format that was first designed exclusively for CPU-only inference. It is very accessible because it doesn't require a GPU at all, only CPU + RAM. But CPU inference is quite a bit slower than GPU inference. […]
    • Recently the situation has been complicated by the fact that GGML now also supports some GPU acceleration. It's still not as fast as GPTQ in situations where you have enough VRAM to fully load a model. But it can enable you to load a model larger than your GPU could support on its own, but still at decent speeds” -TheBloke
  • GGUF - the successor to GGML in llama.cpp, not CPU-only. Like GGML it runs on CPU + RAM and can offload some or all layers to a GPU; newer llama.cpp builds load GGUF instead of GGML. (See the loading sketch after this list.)
  • exl2 - new hotness: ExLlamaV2's own quantization format. The draw is that it supports variable bit rates (mixed precision across layers), so you can target an arbitrary average bits-per-weight to fit your VRAM. At time of writing (2023-11-11), TheBloke isn't publishing exl2 quants.
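
To make the "format determines the loader" point concrete, here is a hedged sketch of loading a GGUF file with llama-cpp-python (with optional GPU offload) and a GPTQ repo with Hugging Face transformers. The file paths and repo ID are examples, not recommendations, and both libraries change their options frequently.

  # GGUF: llama.cpp via llama-cpp-python. Runs on CPU; n_gpu_layers > 0 offloads
  # that many layers to the GPU if llama-cpp-python was built with GPU support.
  from llama_cpp import Llama

  llm = Llama(
      model_path="/models/mistral-7b-instruct.Q4_K_M.gguf",  # example local file
      n_gpu_layers=35,   # 0 = pure CPU; raise until you run out of VRAM
      n_ctx=4096,
  )
  print(llm("Q: What is GGUF? A:", max_tokens=64)["choices"][0]["text"])

  # GPTQ: transformers can load GPTQ repos directly when the auto-gptq /
  # optimum packages are installed.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # example repo id
  tok = AutoTokenizer.from_pretrained(repo)
  model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")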

How much VRAM do I need?

  • Usually about 1 GB more than you have. A rough way to estimate the real number is sketched below.
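
More seriously, a back-of-envelope estimate: quantized weights take roughly (parameters × bits per weight / 8) bytes, and the KV cache grows with context length. The sketch below plugs in Mistral 7B's published architecture numbers as an example; real usage adds framework and activation overhead on top of this.

  # Back-of-envelope VRAM estimate (rule of thumb, not an exact number).

  def weight_gib(params_billion: float, bits_per_weight: float) -> float:
      """Approximate size of the quantized weights in GiB."""
      return params_billion * 1e9 * bits_per_weight / 8 / 2**30

  def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> float:
      """Approximate KV-cache size in GiB (the 2 is one K and one V tensor per layer)."""
      return (2 * n_layers * n_kv_heads * head_dim
              * context_tokens * bytes_per_value) / 2**30

  # Example: Mistral 7B (~7.2B params, 32 layers, 8 KV heads, head_dim 128)
  # at 4 bits with a 16k context and an fp16 cache.
  weights = weight_gib(7.2, 4.0)
  cache = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=16384)
  print(f"weights ~{weights:.1f} GiB, KV cache ~{cache:.1f} GiB")
  # Roughly 3.4 + 2.0 GiB here, which is consistent with a 7B model at 16k
  # context just about fitting on an 8 GB card once overhead is added.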

How fast does it go?

  • GPUs are fast; CPUs not so much. There is a serious shortage of information about inference speed; AFAIK nobody is doing apples-to-apples benchmarks of different systems against each other. Doing so will have to be a community effort, since probably none of us has the hardware or time to do all the testing ourselves. A minimal timing harness is sketched below as a starting point.
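
As a starting point for apples-to-apples numbers, here is a framework-agnostic sketch: a fixed prompt list and a timing loop you can wrap around whatever backend you use. The prompts and token counts are placeholders I made up, not an agreed-upon benchmark suite.

  # Framework-agnostic timing harness sketch. `generate(prompt, max_new_tokens)`
  # is whatever your backend exposes; it should return the generated text.
  import time
  from typing import Callable

  PROMPTS = [
      "Summarize the plot of Hamlet in three sentences.",
      "Write a Python function that reverses a linked list.",
      "Explain the difference between RAM and VRAM to a beginner.",
  ]

  def benchmark(generate: Callable[[str, int], str], max_new_tokens: int = 200) -> None:
      for prompt in PROMPTS:
          start = time.time()
          generate(prompt, max_new_tokens)
          elapsed = time.time() - start
          # Assumes the backend really produced max_new_tokens; count the
          # returned tokens yourself if it can stop early.
          print(f"{max_new_tokens / elapsed:6.1f} tok/s  |  {prompt[:40]}")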

How do I make it faster?

Assuming that a hardware upgrade isn't an option:

  • flash-attention
  • option tuning
  • tips for saving context
  • I haven't investigated the --sdp-attention option, but it's supposed to make things faster. Needs testing! (A hedged example of enabling faster attention kernels is sketched below.)
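
If you call models through Hugging Face transformers rather than a web UI, recent versions (roughly 4.36+) let you pick the attention backend directly; this is a sketch under that assumption, and "flash_attention_2" additionally needs the flash-attn package and a supported GPU. The model ID is just an example.

  # Sketch: pick a faster attention implementation in transformers (>= ~4.36).
  import torch
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained(
      "mistralai/Mistral-7B-v0.1",        # example model id
      torch_dtype=torch.float16,
      device_map="auto",
      attn_implementation="sdpa",          # PyTorch scaled-dot-product attention;
                                           # or "flash_attention_2" if flash-attn is installed
  )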