Frequently Asked Questions About Hosting Your Own AI

1. WHAT MODEL DO I USE?!

While this guide will attempt to point you in the right direction and save you some time finding a model that works for you, it is literally impossible to give a definitive answer. There is no “best for”, “best right now”, “best among”, or really any other kind of “best”. It's “best” to let go of “best”.

To pick a model, you must consider:

Compatibility: which formats you can use.
Resources: depending on your situation, you might be limited by GPU speed, VRAM, CPU speed, DRAM, hard disk space, or (less likely) bandwidth.
Tradeoffs: Fast, Cheap, Good: choose at most two. (“Good” and “Easy” draw from the same well.)
Surprises: probably some other considerations.
And finally, your use case.

This grumpy pile of text is gradually turning into a guide (hopefully not too misguided) for selecting models.

How do I know if my model is compatible with my system?

There's probably no 100% guarantee, but… we can reduce the chances of a wasted download.

The situation is really complicated but this is a FAQ so I'll keep it simple:

Read the model card. If it doesn't have one, don't download it. The model card is also the most likely place to find reasons a model might not work for you.
If you know the model will fit completely in VRAM, the best performance comes from GPTQ models. (2023-12; I haven't personally verified this.)
If the model will not fit completely in VRAM, you cannot use GPTQ; use GGUF instead.
- GGUF comes in multiple quantization formats. You only need to download one. Use Q5_K_M if you're not sure.
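
The rule of thumb above boils down to one comparison. Here is a minimal sketch in Python, not an official tool: the function name `choose_format`, its parameters, and the default 1.5 GB of headroom for context and overhead are all assumptions made up for illustration, so adjust them to your own setup.

```python
def choose_format(model_size_gb: float, vram_gb: float, headroom_gb: float = 1.5) -> str:
    """Suggest a model format using the FAQ's rule of thumb.

    model_size_gb: size of the model download, in GB.
    vram_gb:       VRAM on your GPU, in GB.
    headroom_gb:   assumed extra room for context and overhead (a guess).
    """
    # Fits completely in VRAM (with some headroom): GPTQ is the suggestion.
    if model_size_gb + headroom_gb <= vram_gb:
        return "GPTQ"
    # Otherwise GPTQ is out; fall back to GGUF, Q5_K_M if you're not sure.
    return "GGUF (Q5_K_M if unsure)"
```

For instance, a 4 GB model on an 8 GB card comes back as `GPTQ`, while a 20 GB model on the same card comes back as GGUF. The model card still gets the final say.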

More details on the formats can be found here.

Are there at least comparisons?

Sure. If you find a good one, send it to me and I'll add a link here.

The HuggingFace leaderboard contains all kinds of scores you might care about.
- NOTE: There is currently (2023-12) controversy about how useful the leaderboard is. This has to do with model contamination. (TODO: add “contamination” to the glossary and maybe make a page about it)
u/WolframRavenwolf is the only Redditor I see posting in-depth comparisons of models. Their testing has a narrow focus and might not match your use case.
NSFW Chatbot Leaderboard exists

Can I at least try one before I download it?

Yes. Nap does not know how. Please ask for edit permission and fill this section in. <3

Are there any other shortcuts worth taking?

I only know of one more: Use a model that someone else in your situation is already using, and they already know it works well. I'd like to collect a few (dozen) such reports here, if possible. Also what hardware, software, and speed you get, if possible.

brain: Passes almost every AGI test given over a 40-year period. 90b tensors, 100t weights, runs on a completely proprietary stack. When it's thinking hard, it generates about 14-16 tokens/second. (It has almost been discovered once.)
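
If you want to send in a report like the ones requested above, something along these lines would capture the useful bits. This is only a sketch: the `Report` fields, the helper names, and the output layout are invented for illustration, and the values in the usage note are placeholders rather than real measurements.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One 'this works for me' report (hypothetical structure)."""
    model: str
    hardware: str
    software: str
    tokens_per_second: float


def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Generation speed: tokens produced divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def format_report(report: Report) -> str:
    """Render a report as a single line suitable for pasting into the wiki."""
    return (
        f"{report.model} | {report.hardware} | {report.software} | "
        f"{report.tokens_per_second:.1f} tokens/second"
    )
```

Time a generation yourself (most front ends show the token count), then call `tokens_per_second(300, 20.0)` to get 15.0 and pass that into a `Report` along with your model, hardware, and software.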

Please read these few short paragraphs before diving into The Answers.

Philosophy

I very much subscribe to the “Stone Soup” philosophy of open-source. The problem is that everyone wants to be the person bringing the stone. But stone soup only needs one stone! We need tables, utensils, meat, vegetables, seasonings, firewood, and people to tend the fire and stir the pot and cut up the ingredients…

Please consider how many people have put how much time into generating and assembling this information. Yes, it's naptastic organizing the page (at least right now), but all the info is coming from other people. I do not want to scare people off from asking questions; otherwise I don't know what to put in the FAQ! But if you are going to bring questions, please also be willing to put some effort into figuring it out yourself, and report back when you have successes.

Important note: YOU CAN USE AI TO HELP WRITE STUFF!!! It's not cheating!

Conduct

Keep content on this Wiki professionally appropriate. Remember, we are all responsible for how our own behavior affects others, including people who are different from us.

How can I help?

SUCCEED!!! Get something working, even if it's not working as well as you want. Getting better and faster results is part of this FAQ too.
Tell me about it! What hardware worked? What models? What problems did you encounter and how did you solve them? How fast does it generate?
Contribute to the actual open-source projects. That is where the most work needs to be done.
Improve the FAQ. Specifically, consider this question: What would have been helpful to know earlier? What do you wish someone had explained before you spent a bunch of time learning it the hard way? That's what this FAQ is about: every one of us who self-hosts should make it easier for anyone who does it next. Wiki-specific items:
- If somebody wants to set up SSO for DokuWiki so we can just use our Google accounts or whatever… that's on my to-do list, but it's so far down I'll probably never get to it.
- (I'm not gonna switch to MediaWiki.)

And now, without further ado:

The Answers

Getting Started

Know your goals. It is critical that you know what you want your AI to do for you. Even better if you have it written down.

What Things Can AI Do Right Now?

LLMs generate text and code
- They can integrate with… (fill this in plz)
Diffusers generate images
- Upscaling
- Fill-in and fill-out
- Video
- (anything else?)
Data format conversions
- OCR (“optical character recognition”, which is just a fancy way of saying “image-to-text”.)
- Speech-to-text (partially a classification problem; might be better served with other tools.)
- Text-to-speech (though this might be better served by other tools)
Lots of other stuff.

What CAN'T AI do right now?

LLMs are still pretty bad with math.
Music generation is in its infancy.
OCR for music transcription is still a hilariously impractical idea.
Lots of other stuff.

What kind of hardware should I buy?

It depends. (I know, I know… I hate that answer too but it's the truth.)

Buying a CPU for inference is folly. The only advantage a CPU has is that it usually has more DRAM than the GPU has VRAM, so it can load larger models. The difference in inference speed is at least an order of magnitude in the GPU's favor. When choosing a GPU, the most important factor is how much VRAM it has.

For maximum ease and speed, buy Nvidia GPUs. They are really expensive, though.
For a reduced cost, more headaches, and fewer applications that currently support it, buy AMD. They're still pretty expensive.
Intel GPUs have the best price/VRAM ratio of the bunch, but there is almost no support. Getting them to work is nearly impossible, even for experienced system administrators.
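
Since VRAM is the deciding factor, it helps to guess how much room a model's weights will need before buying anything. A rough sketch: the weights take about (parameter count × bits per weight ÷ 8) bytes. The function names here are hypothetical, and the estimate ignores context cache and runtime overhead, so treat the result as a lower bound rather than a promise.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits each / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit; real usage needs extra room on top."""
    return weights_size_gb(params_billions, bits_per_weight) <= vram_gb
```

For example, `weights_size_gb(7, 16)` gives 14.0 and `weights_size_gb(7, 4)` gives 3.5, so a 7-billion-parameter model that is hopeless on an 8 GB card at 16 bits becomes plausible once quantized to 4 bits.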

What do all these terms mean?

(nap definitely needs help with this)

need a glossary

How do I do the thing?

Start with README.MD for the software you want to use. Seriously.
Links to how-to's

How do I get help with the thing?

Read README.MD for the software you want to use again. Seriously.
Discord servers
subreddits
Links to other resources
If you get help, please give back. Update the documentation. Help other people where you can. Fix code where you can.

Next Steps

Better Environment

You Need A Better Environment. (We all do. IMO there isn't a good environment out there, and… that's a rant for another day.)

Faster generation

flash-attention - Needs testing!
option tuning - nap knows nothing; any input would be appreciated
The --sdp-attention option claims to make things faster. Needs testing!

Better Results

Retrieval-Augmented Generation (RAG)

What is RAG and how do I use it?

Other Possibilities

What else can my AI do?

Models

link to formats-faq for now

Applications

Oobabooga text-generation-webui
- Perhaps the most popular starting point. It's (relatively) easy to deploy and use, but also provides a pretty full feature-set including support for plugins.
- Plugin to add image generation by integrating
sooooo much belongs here and I don't even know where to begin

Plugins

I want to establish some ground rules before listing plugins, so that it doesn't turn into a free-for-all of
