Ubuntu Server 22.04, normal (not minimal) install. Main drive has 4k sectors, so grub is EFI.
The environment might have some Infiniband-related pollution. It shouldn't matter.
I run things as root so I don't have to | sudo tee and other such nonsense. Think before you press enter!!!
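If you'd rather not run as root, every redirection below needs the sudo tee treatment instead; as a rough sketch, the conda key import from further down would become:
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor | sudo tee /usr/share/keyrings/conda-archive-keyring.gpg > /dev/null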
The apt list:
cat > /etc/apt/sources.list.d/conda.list <<EOF
deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main
EOF
The GPG key:
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > /usr/share/keyrings/conda-archive-keyring.gpg
Install the thing:
apt update
apt -y install conda
The apt repository for the GPU drivers:
cat > /etc/apt/sources.list.d/intel-gpu-jammy.list <<EOF
deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client
EOF
The apt repository for OneAPI:
cat > /etc/apt/sources.list.d/oneAPI.list <<EOF
deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main
EOF
You need both these GPG keys:
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg
Install ALL THE THINGS!!!
apt update
apt -y install \
  intel-basekit intel-aikit intel-oneapi-pytorch intel-oneapi-tensorflow \
  intel-opencl-icd intel-level-zero-gpu level-zero level-zero-dev \
  intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo \
  libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev \
  intel-fw-gpu intel-i915-dkms xpu-smi
A reboot is required.
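After the reboot, a couple of quick sanity checks; this assumes the DKMS module built cleanly and that the xpu-smi package installed above provides the discovery subcommand:
dkms status        # the intel-i915 module should show as installed/built
xpu-smi discovery  # should list the A770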
With Intel's packages installed we have Conda environments:
root@sadness:~# ln -s /opt/intel/oneapi/setvars.sh setvars.sh
root@sadness:~# conda info --envs
# conda environments:
#
base               *  /opt/intel/oneapi/intelpython/python3.9
pytorch               /opt/intel/oneapi/intelpython/python3.9/envs/pytorch
pytorch-gpu           /opt/intel/oneapi/intelpython/python3.9/envs/pytorch-gpu
tensorflow            /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow
tensorflow-2.13.0     /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-2.13.0
tensorflow-gpu        /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-gpu
                      /opt/intel/oneapi/pytorch/latest
                      /opt/intel/oneapi/tensorflow/latest
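The symlink is just a convenience: nothing from oneAPI is on PATH or LD_LIBRARY_PATH until setvars.sh has been sourced in the current shell, so each new session starts with something like:
source ~/setvars.sh  # assumes the symlink above sits in root's home directory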
The pytorch sanity check fails with a familiar error:
conda activate pytorch-gpu
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
(...snip...)
ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory
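If you want to check whether that library is even present somewhere on the system:
find /opt/intel/oneapi -name 'libmkl_sycl*' 2>/dev/null
ldconfig -p | grep libmkl_sycl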
There are some wheels missing. Let's leave pytorch-gpu pristine. Make a copy and muck around with that.
conda create --name textgen --clone pytorch-gpu
Holy crap this takes a long time. But now I can inst– hang on, I'm not sure why this is necessary:
conda activate textgen
conda install intel-extension-for-pytorch=2.1.10 pytorch=2.1.0 -c intel -c conda-forge
Holy crap this takes a long time. IPEX and pytorch weren't already in the pytorch-gpu env? Ok… well, that got the A770 to show up in the sanity check anyway.
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];" 2.1.0a0+cxx11.abi 2.1.10+xpu [0]: _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=243MB, max_compute_units=512, gpu_eu_count=512)
Interestingly, clinfo -l shows two OpenCL devices in the system, not just the A770; I don't know if that's a problem or not. Perhaps I should disable the iGPU?
(textgen) root@sadness:~# clinfo -l
Platform #0: Intel(R) OpenCL
 `-- Device #0: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Platform #1: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics
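One way to experiment without touching the BIOS might be the oneAPI runtime's device filter; recent releases document ONEAPI_DEVICE_SELECTOR, which hides devices from SYCL-based applications. A sketch, assuming you only want the Level-Zero view of the discrete GPU:
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu  # hide the OpenCL CPU/GPU entries from SYCL programs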
And llama.cpp… I have no explanation for this output.
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: yes
ggml_init_sycl: SYCL_USE_XMX: yes
found 3 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
  Device 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, compute capability 3.0, max compute_units 8, max work group size 8192, max sub group size 64, global mem size 33517965312
  Device 2: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
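For comparison, sycl-ls from the Base Kit lists the same devices grouped by backend, which at least shows where each entry is coming from (run it with setvars.sh sourced):
sycl-ls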
This is definitely wrong, but I don't know what's wrong with it. This is just what I did to get llama-cpp-python installed (from the textgen env, of course), and it reported success:
export CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
pip install llama-cpp-python
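For what it's worth, that build can only find icx/icpx if the oneAPI environment is loaded in the same shell, and pip will happily reuse a previously built wheel, so a more defensive version would look something like:
source /opt/intel/oneapi/setvars.sh
export CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
pip install --force-reinstall --no-cache-dir llama-cpp-python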
Cloning Ooba and placing models in the models/ directory is an exercise left to the reader.
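For completeness, it amounts to roughly this (repo URL is the upstream oobabooga project; adjust the path to taste):
git clone https://github.com/oobabooga/text-generation-webui ~/code/oobabooga/text-generation-webui
cd ~/code/oobabooga/text-generation-webui
# drop your .gguf files into models/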
Requirements:
pip install rich accelerate gradio==3.50.* markdown transformers datasets peft
Installing from the project's requirements.txt instead creates a dependency conflict; I haven't dug into it yet.
I can launch Ooba and it behaves as expected:
(textgen) david@sadness:~/code/oobabooga/text-generation-webui$ python server.py --listen
01:49:35-734416 INFO     Starting Text generation web UI
01:49:35-740908 WARNING  You are potentially exposing the web UI to the entire internet without any access password.
                         You can create one with the "--gradio-auth" flag like this:
                         --gradio-auth username:password
                         Make sure to replace username:password with your own.
01:49:35-745314 INFO     Loading the extension "gallery"
Running on local URL:  http://0.0.0.0:7860
(Don't worry, it's behind 7 proxies.) I select a .gguf file from my models folder and load it with llama.cpp, and as long as I have --cpu enabled, it will work. If I turn off the CPU check box, then inference will start, spin its wheels for a long time, then crash.
00:59:01-612664 INFO Loading "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
00:59:01-864875 INFO llama.cpp weights detected: "models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: yes
ggml_init_sycl: SYCL_USE_XMX: yes
found 3 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
  Device 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, compute capability 3.0, max compute_units 8, max work group size 8192, max sub group size 64, global mem size 33517965312
  Device 2: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0, max compute_units 512, max work group size 1024, max sub group size 32, global mem size 255012864
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = llama
llama_model_loader: - kv  1: general.name str = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv  2: llama.context_length u32 = 2048
llama_model_loader: - kv  3: llama.embedding_length u32 = 2048
llama_model_loader: - kv  4: llama.block_count u32 = 22
llama_model_loader: - kv  5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv  6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv  7: llama.attention.head_count u32 = 32
llama_model_loader: - kv  8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv  9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 18
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 45 tensors
llama_model_loader: - type q6_K: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 860.86 MiB (6.56 BPW)
llm_load_print_meta: general.name = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/23 layers to GPU
llm_load_tensors: CPU buffer size = 860.86 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 44.00 MiB
llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.02 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 144.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '2048', 'general.name': 'tinyllama_tinyllama-1.1b-chat-v1.0', 'llama.embedding_length': '2048', 'llama.feed_forward_length': '5632', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '64', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '22', 'llama.attention.head_count_kv': '4', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '18'}
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}
Using chat eos_token: </s>
Using chat bos_token: <s>
00:59:02-182271 INFO LOADER: "llama.cpp"
00:59:02-183834 INFO TRUNCATION LENGTH: 2048
00:59:02-185281 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
00:59:02-186983 INFO Loaded the model in 0.57 seconds.
Prompt evaluation: 0%| | 0/1 [00:00<?, ?it/s]
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
Exception caught at file:/tmp/pip-install-1g4vflw6/llama-cpp-python_a217762ea5e14fb997940c76ade3bb52/vendor/llama.cpp/ggml-sycl.cpp, line:12271
It doesn't matter if I have any layers offloaded or not.
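To take Ooba out of the equation entirely, the same crash should be reproducible with llama-cpp-python on its own; a minimal sketch, using the same model file as above:
python -c "from llama_cpp import Llama; llm = Llama(model_path='models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf', n_gpu_layers=0); print(llm('Hello', max_tokens=8))"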