Start

Ubuntu Server 22.04, normal (not minimal) install. Main drive has 4k sectors, so grub is EFI.

The environment might have some Infiniband-related pollution. It shouldn't matter.

I run things as root so I don't have to | sudo tee and other such nonsense. Think before you press enter!!!

Conda

The apt list:

  cat > /etc/apt/sources.list.d/conda.list
  deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main
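
The deb line above is the file's entire contents; type it after the cat and press Ctrl-D to finish. If you'd rather not do that interactively, the same thing as a single command (and the same trick works for the Intel lists below):

  # non-interactive equivalent of the cat-then-Ctrl-D dance above
  echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main' \
      > /etc/apt/sources.list.d/conda.list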

The GPG key:

  curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > /usr/share/keyrings/conda-archive-keyring.gpg

Install the thing:

  apt update
  apt -y install conda

Intel

The apt repository for the GPU drivers:

  cat > /etc/apt/sources.list.d/intel-gpu-jammy.list
  deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client

The apt repository for OneAPI:

  cat > /etc/apt/sources.list.d/oneAPI.list
  deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main

You need both these GPG keys:

  wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
  wget -O - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg

Install ALL THE THINGS!!!

  apt update
  apt -y install \
      intel-basekit intel-aikit intel-oneapi-pytorch intel-oneapi-tensorflow \
      intel-opencl-icd intel-level-zero-gpu level-zero level-zero-dev \
      intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
      libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
      libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
      mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo \
      libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev \
      intel-fw-gpu intel-i915-dkms xpu-smi

A reboot is required.
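
After rebooting, a quick way to confirm the kernel module and runtimes can actually see the card (both tools come from the package list above):

  # confirm the i915/Arc stack is alive before going any further
  xpu-smi discovery    # should list the A770
  clinfo -l            # OpenCL view of the same hardware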

A venv for Ooba

With Intel's packages installed we have Conda environments:

  root@sadness:~# ln -s /opt/intel/oneapi/setvars.sh setvars.sh
  root@sadness:~# conda info --envs
  # conda environments:
  #
  base                  *  /opt/intel/oneapi/intelpython/python3.9
  pytorch                  /opt/intel/oneapi/intelpython/python3.9/envs/pytorch
  pytorch-gpu              /opt/intel/oneapi/intelpython/python3.9/envs/pytorch-gpu
  tensorflow               /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow
  tensorflow-2.13.0        /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-2.13.0
  tensorflow-gpu           /opt/intel/oneapi/intelpython/python3.9/envs/tensorflow-gpu
                           /opt/intel/oneapi/pytorch/latest
                           /opt/intel/oneapi/tensorflow/latest
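
For the record, the setvars.sh symlink above is just so the oneAPI environment is easy to source from root's home; I'm assuming every shell below has done this first:

  # pull the oneAPI compilers, MKL and Level Zero libraries into the current shell
  source ~/setvars.sh    # symlink to /opt/intel/oneapi/setvars.sh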

The pytorch sanity check fails with a familiar error:

  conda activate pytorch-gpu
  python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
      (...snip...)
      ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory

There are some wheels missing. Let's leave pytorch-gpu pristine. Make a copy and muck around with that.

  conda create --name textgen --clone pytorch-gpu

Holy crap this takes a long time. But now I can inst– hang on, I'm not sure why this is necessary:

  conda activate textgen
  conda install intel-extension-for-pytorch=2.1.10 pytorch=2.1.0 -c intel -c conda-forge

Holy crap this takes a long time. IPEX and pytorch weren't already in the pytorch-gpu env? Ok… well, that got the A770 to show up in the sanity check anyway.

  python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
  2.1.0a0+cxx11.abi
  2.1.10+xpu
  [0]: _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=243MB, max_compute_units=512, gpu_eu_count=512)

Interestingly, clinfo -l shows the CPU and the A770 as separate OpenCL devices; I don't know if that's a problem or not. Perhaps I should disable the iGPU?

  (textgen) root@sadness:~# clinfo -l
  Platform #0: Intel(R) OpenCL
   `-- Device #0: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  Platform #1: Intel(R) OpenCL Graphics
   `-- Device #0: Intel(R) Arc(TM) A770 Graphics

And llama.cpp… I have no explanation for this output.

  GGML_SYCL_DEBUG=0
  ggml_init_sycl: GGML_SYCL_F16:   yes
  ggml_init_sycl: SYCL_USE_XMX: yes
  found 3 SYCL devices:
    Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
      max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 255012864
    Device 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz,	compute capability 3.0,
      max compute_units 8,	max work group size 8192,	max sub group size 64,	global mem size 33517965312
    Device 2: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
      max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 255012864
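
My suspicion (unverified) is that the A770 appears twice because the same card is visible through both the Level Zero and OpenCL backends. If that ever turns out to matter, the stock oneAPI runtime variable ONEAPI_DEVICE_SELECTOR can narrow what SYCL exposes; untested here:

  # expose only Level Zero GPU devices to the SYCL runtime,
  # hiding the OpenCL CPU and duplicate GPU entries
  export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
  sycl-ls    # should now list just the A770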

llama-cpp-python

FIXME This is definitely wrong, but I don't know what's wrong with it. This is just what I did to get llama-cpp-python installed (from the correct venv, of course), and it reported success:

  export CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
  pip install llama-cpp-python
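
If this turns out to be the broken step, my best guess at a fuller recipe is the same CMAKE_ARGS but run from a shell with the oneAPI compilers loaded, plus FORCE_CMAKE so pip actually builds from source instead of reusing a cached wheel. A sketch, not verified to fix anything:

  source /opt/intel/oneapi/setvars.sh
  conda activate textgen
  export CMAKE_ARGS="-DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
  # force a from-source build and skip any previously cached wheel
  FORCE_CMAKE=1 pip install --no-cache-dir --force-reinstall llama-cpp-python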

Ooba

Cloning Ooba and placing models in the models/ directory is an exercise left to the reader.
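
(For completeness, that amounts to roughly the following; adjust the checkout path to taste:)

  git clone https://github.com/oobabooga/text-generation-webui ~/code/oobabooga/text-generation-webui
  cd ~/code/oobabooga/text-generation-webui
  # drop your .gguf files into models/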

Requirements:

  pip install rich accelerate gradio==3.50.* markdown transformers datasets peft

Using the actual requirements.txt creates a dependency conflict. I haven't dug into it yet.
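
A guess at a middle ground I haven't tried: install everything from requirements.txt except the lines that pin torch and llama-cpp-python, since those are already provided by the setup above:

  # untested: skip the torch / llama-cpp-python pins and take the rest as-is
  grep -viE 'torch|llama[-_]cpp' requirements.txt | pip install -r /dev/stdin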

State

I can launch Ooba and it behaves as expected:

  (textgen) david@sadness:~/code/oobabooga/text-generation-webui$ python server.py --listen
  01:49:35-734416 INFO     Starting Text generation web UI                                                                                                     
  01:49:35-740908 WARNING                                                                                                                                      
                           You are potentially exposing the web UI to the entire internet without any access password.                                         
                           You can create one with the "--gradio-auth" flag like this:                                                                         
                                                                                                                                                               
                           --gradio-auth username:password                                                                                                     
                                                                                                                                                               
                           Make sure to replace username:password with your own.                                                                               
  01:49:35-745314 INFO     Loading the extension "gallery"                                                                                                     
  
  Running on local URL:  http://0.0.0.0:7860

(Don't worry, it's behind 7 proxies.) I select a .gguf file from my models folder and load it with llama.cpp, and as long as I have --cpu enabled, it will work. If I turn off the CPU check box, then inference will start, spin its wheels for a long time, then crash.
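
For the record, the combination that works today boils down to forcing CPU inference:

  # works: llama.cpp on CPU only; unchecking "cpu" in the UI leads to the crash below
  python server.py --listen --cpu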

Error Wall Of Text

  00:59:01-612664 INFO     Loading "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
  00:59:01-864875 INFO     llama.cpp weights detected: "models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
  GGML_SYCL_DEBUG=0
  ggml_init_sycl: GGML_SYCL_F16:   yes
  ggml_init_sycl: SYCL_USE_XMX: yes
  found 3 SYCL devices:
    Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
      max compute_units 512,  max work group size 1024,   max sub group size 32,  global mem size 255012864
    Device 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz,   compute capability 3.0,
      max compute_units 8,    max work group size 8192,   max sub group size 64,  global mem size 33517965312
    Device 2: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
      max compute_units 512,  max work group size 1024,   max sub group size 32,  global mem size 255012864
  Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
  llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v1.0.Q6_K.gguf (version GGUF V3 (latest))
  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  llama_model_loader: - kv   0:                       general.architecture str              = llama
  llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
  llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
  llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
  llama_model_loader: - kv   4:                          llama.block_count u32              = 22
  llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
  llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
  llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
  llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
  llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
  llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
  llama_model_loader: - kv  11:                          general.file_type u32              = 18
  llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
  llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
  llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
  llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
  llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
  llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
  llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
  llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
  llama_model_loader: - kv  22:               general.quantization_version u32              = 2
  llama_model_loader: - type  f32:   45 tensors
  llama_model_loader: - type q6_K:  156 tensors
  llm_load_vocab: special tokens definition check successful ( 259/32000 ).
  llm_load_print_meta: format           = GGUF V3 (latest)
  llm_load_print_meta: arch             = llama
  llm_load_print_meta: vocab type       = SPM
  llm_load_print_meta: n_vocab          = 32000
  llm_load_print_meta: n_merges         = 0
  llm_load_print_meta: n_ctx_train      = 2048
  llm_load_print_meta: n_embd           = 2048
  llm_load_print_meta: n_head           = 32
  llm_load_print_meta: n_head_kv        = 4
  llm_load_print_meta: n_layer          = 22
  llm_load_print_meta: n_rot            = 64
  llm_load_print_meta: n_embd_head_k    = 64
  llm_load_print_meta: n_embd_head_v    = 64
  llm_load_print_meta: n_gqa            = 8
  llm_load_print_meta: n_embd_k_gqa     = 256
  llm_load_print_meta: n_embd_v_gqa     = 256
  llm_load_print_meta: f_norm_eps       = 0.0e+00
  llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
  llm_load_print_meta: f_clamp_kqv      = 0.0e+00
  llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  llm_load_print_meta: n_ff             = 5632
  llm_load_print_meta: n_expert         = 0
  llm_load_print_meta: n_expert_used    = 0
  llm_load_print_meta: pooling type     = 0
  llm_load_print_meta: rope type        = 0
  llm_load_print_meta: rope scaling     = linear
  llm_load_print_meta: freq_base_train  = 10000.0
  llm_load_print_meta: freq_scale_train = 1
  llm_load_print_meta: n_yarn_orig_ctx  = 2048
  llm_load_print_meta: rope_finetuned   = unknown
  llm_load_print_meta: model type       = 1B
  llm_load_print_meta: model ftype      = Q6_K
  llm_load_print_meta: model params     = 1.10 B
  llm_load_print_meta: model size       = 860.86 MiB (6.56 BPW)
  llm_load_print_meta: general.name     = tinyllama_tinyllama-1.1b-chat-v1.0
  llm_load_print_meta: BOS token        = 1 '<s>'
  llm_load_print_meta: EOS token        = 2 '</s>'
  llm_load_print_meta: UNK token        = 0 '<unk>'
  llm_load_print_meta: PAD token        = 2 '</s>'
  llm_load_print_meta: LF token         = 13 '<0x0A>'
  llm_load_tensors: ggml ctx size =    0.08 MiB
  llm_load_tensors: offloading 0 repeating layers to GPU
  llm_load_tensors: offloaded 0/23 layers to GPU
  llm_load_tensors:        CPU buffer size =   860.86 MiB
  ..........................................................................................
  llama_new_context_with_model: n_ctx      = 2048
  llama_new_context_with_model: freq_base  = 10000.0
  llama_new_context_with_model: freq_scale = 1
  llama_kv_cache_init:        CPU KV buffer size =    44.00 MiB
  llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
  llama_new_context_with_model:        CPU input buffer size   =     9.02 MiB
  llama_new_context_with_model:  SYCL_Host compute buffer size =   144.00 MiB
  llama_new_context_with_model: graph splits (measure): 1
  AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
  Model metadata: {'tokenizer.chat_template': "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '2048', 'general.name': 'tinyllama_tinyllama-1.1b-chat-v1.0', 'llama.embedding_length': '2048', 'llama.feed_forward_length': '5632', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '64', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '22', 'llama.attention.head_count_kv': '4', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '18'}
  Using gguf chat template: {% for message in messages %}
  {% if message['role'] == 'user' %}
  {{ '<|user|>
  ' + message['content'] + eos_token }}
  {% elif message['role'] == 'system' %}
  {{ '<|system|>
  ' + message['content'] + eos_token }}
  {% elif message['role'] == 'assistant' %}
  {{ '<|assistant|>
  '  + message['content'] + eos_token }}
  {% endif %}
  {% if loop.last and add_generation_prompt %}
  {{ '<|assistant|>' }}
  {% endif %}
  {% endfor %}
  Using chat eos_token: </s>
  Using chat bos_token: <s>
  00:59:02-182271 INFO     LOADER: "llama.cpp"                                                                                                                 
  00:59:02-183834 INFO     TRUNCATION LENGTH: 2048                                                                                                             
  00:59:02-185281 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                                       
  00:59:02-186983 INFO     Loaded the model in 0.57 seconds.                                                                                                   
  Prompt evaluation:   0%|                                                                                                               | 0/1 [00:00<?, ?it/s]
  Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)Exception caught at file:/tmp/pip-install-1g4vflw6/llama-cpp-python_a217762ea5e14fb997940c76ade3bb52/vendor/llama.cpp/ggml-sycl.cpp, line:12271

It doesn't matter whether I offload any layers or not.