Start
Ubuntu Server 22.04, normal (not minimal) install. Main drive has 4k sectors, so grub is EFI.
The environment might have some Infiniband-related pollution. It shouldn't matter.
I run things as root so I don't have to | sudo tee and other such nonsense. Think before you press enter!!!
Conda
The apt list:
cat > /etc/apt/sources.list.d/conda.list
deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main
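(If you're ignoring the run-as-root note above, the non-root equivalent of that is the usual echo piped into sudo tee, e.g.:)
echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main' | sudo tee /etc/apt/sources.list.d/conda.list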
The GPG key:
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > /usr/share/keyrings/conda-archive-keyring.gpg
Install the thing:
apt update
apt -y install conda
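The package doesn't put conda on the default PATH. To get a usable conda command in a fresh shell, something like this should do it (a sketch, assuming the deb installs under /opt/conda):
source /opt/conda/etc/profile.d/conda.sh
conda --version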
Intel
The apt repository for the GPU drivers:
cat > /etc/apt/sources.list.d/intel-gpu-jammy.list
deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client
The apt repository for OneAPI:
cat > /etc/apt/sources.list.d/oneAPI.list
deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main
You need both these GPG keys:
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg
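If the apt update below complains about signatures, it's worth checking that both keyrings dearmored into something gpg can actually read (a sketch; --show-keys just prints what's in the file without importing anything):
gpg --show-keys /usr/share/keyrings/intel-graphics.gpg
gpg --show-keys /usr/share/keyrings/oneapi-archive-keyring.gpg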
Install ALL THE THINGS!!!
apt update
apt -y install \
    intel-basekit intel-aikit intel-oneapi-pytorch intel-oneapi-tensorflow \
    intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
    libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
    libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
    mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
apt -y install libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
apt -y install intel-fw-gpu intel-i915-dkms xpu-smi
A reboot is required.
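After the reboot, a quick sanity pass to confirm the driver stack actually sees the card (all three tools were installed above; exact output will vary):
ls -l /dev/dri
clinfo | grep -i 'device name'
xpu-smi discovery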
A venv for Ooba
With Intel's packages installed we have Conda environments:
root@sadness:~# ln -s /opt/intel/oneapi/setvars.sh setvars.sh
root@sadness:~# conda info --envs
# conda environments:
#
base                  *  /opt/intel/oneapi/intelpython/python3.9
pytorch                  /opt/intel/oneapi/intelpython/python3.9/envs/pytorch
pytorch-gpu              /opt/intel/oneapi/intelpython/python3.9/envs/pytorch-gpu
tensorflow               /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow
tensorflow-2.13.0        /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-2.13.0
tensorflow-gpu           /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-gpu
                         /opt/intel/oneapi/pytorch/latest
                         /opt/intel/oneapi/tensorflow/latest
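None of that is visible in a fresh shell until the oneAPI environment is sourced, which is what the setvars.sh symlink is for; roughly:
source ~/setvars.sh    # i.e. /opt/intel/oneapi/setvars.sh
conda activate pytorch-gpu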
The pytorch sanity check fails with a familiar error:
conda activate pytorch-gpu
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
(...snip...)
ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory
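A quick check on whether libmkl_sycl is even on disk, and whether the loader path knows about it (assuming the standard /opt/intel/oneapi prefix):
find /opt/intel/oneapi -name 'libmkl_sycl*' 2>/dev/null
echo "$LD_LIBRARY_PATH"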
There are some wheels missing. Let's leave pytorch-gpu pristine. Make a copy and muck around with that.
conda create --name textgen --clone pytorch-gpu
Holy crap this takes a long time. But now I can inst– hang on, I'm not sure why this is necessary:
conda install intel-extension-for-pytorch=2.1.10 pytorch=2.1.0 -c intel -c conda-forge
Holy crap this takes a long time. IPEX and pytorch weren't already in the pytorch-gpu env? Ok…
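With the clone patched up, the sanity check from earlier is worth re-running against it; it's the same one-liner, just in the new env:
conda activate textgen
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"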
llama-cpp-python
This is definitely wrong, but I don't know what's wrong with it. This is just what I did to get llama-cpp-python installed (from the correct venv, of course), and it reported success:
export CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
pip install llama-cpp-python
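A minimal smoke test that the wheel at least imports and reports its version (this says nothing about whether the SYCL backend actually works):
python -c "import llama_cpp; print(llama_cpp.__version__)"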
Ooba
Cloning Ooba and placing models in the models/ directory is an exercise left to the reader.
Requirements:
pip install rich accelerate gradio==3.50.* markdown transformers datasets peft
State
I can launch Ooba and it behaves as expected:
(textgen) david@sadness:~/code/oobabooga/text-generation-webui$ python server.py --listen
01:49:35-734416 INFO     Starting Text generation web UI
01:49:35-740908 WARNING  You are potentially exposing the web UI to the entire internet without any access password. You can create one with the "--gradio-auth" flag like this: --gradio-auth username:password Make sure to replace username:password with your own.
01:49:35-745314 INFO     Loading the extension "gallery"
Running on local URL: http://0.0.0.0:7860
(Don't worry, it's behind 7 proxies.) I select a .gguf file from my models folder and load it with llama.cpp, and as long as I have --cpu enabled, it works. If I turn off the CPU checkbox, inference will start, spin its wheels for a long time, then crash.
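From the command line the two states look roughly like this (flag names as in text-generation-webui's llama.cpp loader; the UI checkboxes map onto them):
# works: force the CPU-only code path
python server.py --listen --loader llama.cpp --model tinyllama-1.1b-chat-v1.0.Q6_K.gguf --cpu
# crashes during prompt evaluation: let the SYCL build drive the GPU
python server.py --listen --loader llama.cpp --model tinyllama-1.1b-chat-v1.0.Q6_K.gguf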
Error Wall Of Text
00:59:01-612664 INFO Loading "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
00:59:01-864875 INFO llama.cpp weights detected: "models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: yes
ggml_init_sycl: SYCL_USE_XMX: yes
found 3 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
  Device 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, compute capability 3.0, max compute_units 8, max work group size 8192, max sub group size 64, global mem size 33517965312
  Device 2: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = llama
llama_model_loader: - kv  1: general.name str = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv  2: llama.context_length u32 = 2048
llama_model_loader: - kv  3: llama.embedding_length u32 = 2048
llama_model_loader: - kv  4: llama.block_count u32 = 22
llama_model_loader: - kv  5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv  6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv  7: llama.attention.head_count u32 = 32
llama_model_loader: - kv  8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv  9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 18
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q6_K: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 860.86 MiB (6.56 BPW)
llm_load_print_meta: general.name = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/23 layers to GPU
llm_load_tensors: CPU buffer size = 860.86 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 44.00 MiB
llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.02 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 144.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '2048', 'general.name': 'tinyllama_tinyllama-1.1b-chat-v1.0', 'llama.embedding_length': '2048', 'llama.feed_forward_length': '5632', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '64', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '22', 'llama.attention.head_count_kv': '4', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '18'}
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}
Using chat eos_token: </s>
Using chat bos_token: <s>
00:59:02-182271 INFO LOADER: "llama.cpp"
00:59:02-183834 INFO TRUNCATION LENGTH: 2048
00:59:02-185281 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
00:59:02-186983 INFO Loaded the model in 0.57 seconds.
Prompt evaluation: 0%| | 0/1 [00:00<?, ?it/s]
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES)
-5 (PI_ERROR_OUT_OF_RESOURCES)Exception caught at file:/tmp/pip-install-1g4vflw6/llama-cpp-python_a217762ea5e14fb997940c76ade3bb52/vendor/llama.cpp/ggml-sycl.cpp, line:12271
It doesn't matter whether I have any layers offloaded or not.
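To take Ooba out of the equation, the same load can be attempted with llama-cpp-python directly. A sketch (same GGUF as above; n_gpu_layers is the only knob being poked):
python - <<'EOF'
from llama_cpp import Llama

# n_gpu_layers=-1 asks for full offload; 0 would keep everything on the CPU.
llm = Llama(model_path="models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf", n_gpu_layers=-1)
print(llm("Hello", max_tokens=8))
EOF
If that dies with the same PI_ERROR_OUT_OF_RESOURCES, the problem lives in the llama-cpp-python SYCL build rather than in Ooba.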