Open
Description
Hello,
If I load the main model and the draft model onto the same GPU (for example, GPU.0), the problem does not occur. If I load the main model onto GPU.0 and the draft model onto GPU.1, loading fails with the error shown in the server log.
Linux xpu 6.19.3-061903-generic #202602191659 SMP PREEMPT_DYNAMIC Sat Feb 21 08:17:10 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
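Since the report depends on which card each `GPU.N` id refers to, it may help to list the GPU devices that OpenVINO sees on this machine. The following is a minimal sketch, not part of OpenArc: the helper names `describe_gpus` and `print_gpus` are hypothetical, and it assumes the `openvino` Python package from the environment listed in this report is importable.

```python
def describe_gpus(core):
    """Map each GPU device id reported by `core` to its full device name.

    `core` is expected to behave like openvino.Core: it exposes
    `available_devices` and `get_property(device, name)`.
    """
    return {
        dev: core.get_property(dev, "FULL_DEVICE_NAME")
        for dev in core.available_devices
        if dev.startswith("GPU")
    }


def print_gpus():
    # Imported lazily so that describe_gpus can be used without OpenVINO installed.
    import openvino as ov

    for dev, name in describe_gpus(ov.Core()).items():
        print(f"{dev}: {name}")
```

Calling `print_gpus()` on the affected machine would show which physical cards the ids in the description and in the config refer to.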
Config
"Qwen3-14B-int4-ov-spec": {
    "model_name": "Qwen3-14B-int4-ov-spec",
    "model_path": "/mnt/data2/models/OpenVINO/Qwen3-14B-int4-ov",
    "device": "GPU.1",
    "model_type": "llm",
    "engine": "ovgenai",
    "draft_model_path": "/mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov",
    "draft_device": "GPU.2",
    "num_assistant_tokens": 7,
    "runtime_config": {
        "PERFORMANCE_HINT": "LATENCY"
    }
},
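To check whether the failure comes from OpenVINO GenAI itself rather than from OpenArc, the same load could be attempted directly with `openvino_genai`. This is a hedged sketch that mirrors the config entry above: `CONFIG_ENTRY`, `device_split`, and `load_pipeline` are hypothetical names, the `runtime_config` block is deliberately not passed through, and it has not been verified that this reproduces the same error.

```python
CONFIG_ENTRY = {
    "model_path": "/mnt/data2/models/OpenVINO/Qwen3-14B-int4-ov",
    "device": "GPU.1",
    "draft_model_path": "/mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov",
    "draft_device": "GPU.2",
}


def device_split(entry):
    """Return (main_device, draft_device, same_device) for a config entry."""
    main = entry["device"]
    draft = entry.get("draft_device", main)
    return main, draft, main == draft


def load_pipeline(entry):
    """Build a speculative-decoding pipeline outside of OpenArc."""
    # Imported lazily so that device_split works without OpenVINO GenAI installed.
    import openvino_genai as ov_genai

    # The draft model is compiled for its own device.
    draft = ov_genai.draft_model(entry["draft_model_path"], entry["draft_device"])
    # The main model is compiled for entry["device"] and uses the draft model.
    return ov_genai.LLMPipeline(
        entry["model_path"], entry["device"], draft_model=draft
    )
```

If `load_pipeline(CONFIG_ENTRY)` raises the same `RuntimeError` while an entry with identical `device` and `draft_device` loads, that would suggest the problem lies below OpenArc.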
OpenARC server log
2026-02-22 12:41:58,202 - ERROR - [DEBUG] draft_model_loaded: True
2026-02-22 12:41:58,203 - ERROR - [DEBUG] self.model_num_assistant_tokens: 3
2026-02-22 12:41:58,203 - ERROR - [DEBUG] generation_kwargs.num_assistant_tokens: 3
2026-02-22 12:41:58,203 - ERROR - [DEBUG] generation_kwargs.assistant_confidence_threshold: 0.0
2026-02-22 12:42:17,029 - INFO - [LLM Worker: Qwen3-14B-int4-ov-spec] Metrics: {'load_time (s)': 28.29, 'ttft (s)': 0.37, 'tpot (ms)': 54.28816, 'prefill_throughput (tokens/s)': 2000.81, 'decode_throughput (tokens/s)': 18.42022, 'decode_duration (s)': 18.82504, 'input_token': 731, 'new_token': 341, 'total_token': 1072, 'stream': True, 'stream_chunk_tokens': 1}
2026-02-22 12:42:17,758 - INFO - Request received: POST /v1/chat/completions from 127.0.0.1
2026-02-22 12:42:17,765 - INFO - "Qwen3-8B-int4-ov" request received
2026-02-22 12:42:17,766 - INFO - Request completed: POST /v1/chat/completions status=400 duration=0.007s
2026-02-22 12:42:33,721 - INFO - Request received: POST /openarc/unload from 127.0.0.1
2026-02-22 12:42:34,434 - INFO - [Qwen3-14B-int4-ov-spec] unloaded successfully
2026-02-22 12:42:34,435 - INFO - Request completed: POST /openarc/unload status=200 duration=0.714s
2026-02-22 12:42:41,835 - INFO - Request received: POST /openarc/load from 127.0.0.1
2026-02-22 12:42:41,837 - INFO - Qwen3-14B-int4-ov-spec loading...
2026-02-22 12:42:41,837 - INFO - ModelType.LLM on GPU.1 with {}
2026-02-22 12:42:42,245 - INFO - Loaded draft model from /mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov on GPU.2
2026-02-22 12:43:09,562 - ERROR - Model loading failed for Qwen3-14B-int4-ov-spec
Traceback (most recent call last):
  File "/home/arc/OpenArc/src/server/model_registry.py", line 145, in _load_task
    model_instance = await create_model_instance(load_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arc/OpenArc/src/server/model_registry.py", line 254, in create_model_instance
    await asyncio.to_thread(model_instance.load_model, load_config)
  File "/usr/local/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arc/OpenArc/src/engine/ov_genai/llm.py", line 306, in load_model
    self.model = LLMPipeline(
                 ^^^^^^^^^^^^
RuntimeError: Exception from src/inference/src/cpp/core.cpp:110:
Exception from src/inference/src/dev/plugin.cpp:54:
Check 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:163:
[GPU] ProgramBuilder build failed!
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_common.hpp:40:
[GPU] clEnqueueNDRangeKernel, error code: -52 CL_INVALID_KERNEL_ARGS
2026-02-22 12:43:09,669 - INFO - Request completed: POST /openarc/load status=500 duration=27.834s
UV PIP LIST
(openarc) (openarc) arc@xpu:~/OpenArc$ uv pip list
Package Version Editable project location
-------------------------- ---------------------- -------------------------
about-time 4.2.1
addict 2.4.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.14
aiosignal 1.4.0
alive-progress 3.2.0
annotated-types 0.7.0
anyio 4.9.0
asttokens 3.0.0
attrs 25.3.0
audioread 3.0.1
autograd 1.8.0
babel 2.17.0
blis 1.3.0
brotli 1.1.0
catalogue 2.0.10
certifi 2025.7.14
cffi 2.0.0
charset-normalizer 3.4.2
click 8.2.1
cloudpathlib 0.22.0
cma 4.2.0
colorama 0.4.6
comm 0.2.3
confection 0.1.5
contourpy 1.3.2
cryptography 46.0.3
csvw 3.6.0
curated-tokenizers 0.0.9
curated-transformers 0.1.1
cycler 0.12.1
cymem 2.0.11
datasets 4.0.0
ddgs 9.6.1
debugpy 1.8.17
decorator 5.2.1
deprecated 1.2.18
dill 0.3.8
distro 1.9.0
dlinfo 2.0.0
docopt 0.6.2
espeakng-loader 0.2.4
evdev 1.9.2
executing 2.2.1
fastapi 0.116.1
filelock 3.18.0
fonttools 4.58.5
frozenlist 1.7.0
fsspec 2025.3.0
grapheme 0.6.0
griffe 1.14.0
h11 0.16.0
h2 4.3.0
hf-xet 1.1.5
hpack 4.1.0
httpcore 1.0.9
httpx 0.28.1
httpx-sse 0.4.3
huggingface-hub 0.33.4
hyperframe 6.1.0
idna 3.10
iniconfig 2.3.0
inquirerpy 0.3.4
ipykernel 7.0.1
ipython 9.6.0
ipython-pygments-lexers 1.1.1
ipywidgets 8.1.7
isodate 0.7.2
jedi 0.19.2
jinja2 3.1.6
jiter 0.11.0
joblib 1.5.1
jsonschema 4.24.0
jsonschema-specifications 2025.4.1
jupyter-client 8.6.3
jupyter-core 5.9.1
jupyterlab-widgets 3.0.15
kiwisolver 1.4.8
kokoro 0.9.4
langcodes 3.5.0
language-data 1.3.0
language-tags 1.2.0
lazy-loader 0.4
librosa 0.11.0
llvmlite 0.45.0
loguru 0.7.3
lxml 6.0.2
marisa-trie 1.3.1
markdown-it-py 3.0.0
markupsafe 3.0.2
matplotlib 3.10.3
matplotlib-inline 0.1.7
mcp 1.20.0
mdurl 0.1.2
misaki 0.9.4
mpmath 1.3.0
msgpack 1.1.1
multidict 6.6.3
multiprocess 0.70.16
murmurhash 1.0.13
natsort 8.4.0
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
nncf 2.17.0
num2words 0.5.14
numba 0.62.0
numpy 2.2.6
onnx 1.18.0
openai 2.2.0
openai-agents 0.4.2
openarc 2.0 /home/arc/OpenArc
openvino 2026.1.0.dev20260221
openvino-genai 2026.1.0.0.dev20260221
openvino-telemetry 2025.2.0
openvino-tokenizers 2026.1.0.0.dev20260221
optimum 1.27.0
optimum-intel 1.25.2
packaging 25.0
pandas 2.2.3
parso 0.8.5
pexpect 4.9.0
pfzy 0.3.4
phonemizer-fork 3.3.2
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pluggy 1.6.0
pooch 1.8.2
preshed 3.0.10
primp 0.15.0
prompt-toolkit 3.0.52
propcache 0.3.2
protobuf 6.31.1
psutil 7.0.0
ptyprocess 0.7.0
pure-eval 0.2.3
pyarrow 20.0.0
pycparser 2.23
pydantic 2.11.7
pydantic-core 2.33.2
pydantic-settings 2.11.0
pydot 3.0.4
pygments 2.19.2
pyjwt 2.10.1
pymoo 0.6.1.5
pynput 1.8.1
pyparsing 3.2.3
pytest 8.4.2
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-multipart 0.0.20
python-xlib 0.33
pytz 2025.2
pyyaml 6.0.2
pyzmq 27.1.0
rdflib 7.2.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.4
rfc3986 1.5.0
rich 14.0.0
rich-click 1.8.9
rpds-py 0.26.0
safetensors 0.5.3
scikit-learn 1.7.0
scipy 1.16.0
segments 2.3.0
setuptools 80.9.0
shellingham 1.5.4
six 1.17.0
smart-open 7.3.1
smolagents 1.22.0
sniffio 1.3.1
socksio 1.0.0
sounddevice 0.5.2
soundfile 0.13.1
soxr 1.0.0
spacy 3.8.7
spacy-curated-transformers 0.3.1
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.5.1
sse-starlette 3.0.3
stack-data 0.6.3
starlette 0.47.1
sympy 1.14.0
tabulate 0.9.0
termcolor 3.1.0
thinc 8.3.6
threadpoolctl 3.6.0
tokenizers 0.21.2
torch 2.8.0+cpu
torchvision 0.23.0+cpu
tornado 6.5.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.52.4
typer 0.19.2
types-requests 2.32.4.20250913
typing-extensions 4.14.1
typing-inspection 0.4.1
tzdata 2025.2
uritemplate 4.2.0
urllib3 2.5.0
uvicorn 0.35.0
wasabi 1.1.3
wcwidth 0.2.14
weasel 0.4.1
widgetsnbextension 4.0.14
wrapt 1.17.2
xxhash 3.5.0
yarl 1.20.1