OpenAI interface - Home Assistant is crashing model on the second prompt #63

@DrazorV

Description

I always get this error on the second prompt sent from Home Assistant to OpenArc:

-- Boot dd3d3c10b9dd428c913f1f4007148098 --
Feb 06 10:09:56 openarc systemd[1]: Started openarc.service - OpenArc Inference Server (venv).
Feb 06 10:09:56 openarc openarc[550]: Configuration saved to: /root/OpenArc/openarc_config.json
Feb 06 10:09:56 openarc openarc[550]: Starting OpenArc server on 0.0.0.0:8000
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO - Launching  0.0.0.0:8000
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO - --------------------------------
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO - OpenArc endpoints:
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /openarc/load           Load a model
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /openarc/unload         Unload a model
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - GET    /openarc/status         Get model status
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO - --------------------------------
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO - OpenAI compatible endpoints:
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - GET    /v1/models
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /v1/chat/completions
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /v1/audio/transcriptions: Whisper only
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /v1/audio/speech: Kokoro only
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /v1/embeddings
Feb 06 10:09:56 openarc openarc[550]: 2026-02-06 10:09:56,656 - INFO -   - POST   /v1/rerank
Feb 06 10:10:02 openarc openarc[550]: 2026-02-06 10:10:02,560 - INFO - Started server process [550]
Feb 06 10:10:02 openarc openarc[550]: 2026-02-06 10:10:02,560 - INFO - Waiting for application startup.
Feb 06 10:10:02 openarc openarc[550]: 2026-02-06 10:10:02,561 - INFO - qwen3-8b loading...
Feb 06 10:10:02 openarc openarc[550]: 2026-02-06 10:10:02,561 - INFO - ModelType.LLM on GPU.1 with {'GPU_ENABLE_LARGE_ALLOCATIONS': True, 'GPU_ENABLE_SDPA_OPTIMIZATION': False}
Feb 06 10:10:04 openarc openarc[550]: WARNING: Resizable BAR not detected for device 0000:31:00.0
Feb 06 10:10:19 openarc openarc[550]: 2026-02-06 10:10:19,678 - INFO - qwen3-8b loaded successfully
Feb 06 10:10:19 openarc openarc[550]: 2026-02-06 10:10:19,679 - INFO - Startup: loaded 'qwen3-8b'
Feb 06 10:10:19 openarc openarc[550]: 2026-02-06 10:10:19,679 - INFO - [qwen3-8b LLM Worker] Started, waiting for packets...
Feb 06 10:10:19 openarc openarc[550]: 2026-02-06 10:10:19,679 - INFO - Application startup complete.
Feb 06 10:10:19 openarc openarc[550]: 2026-02-06 10:10:19,679 - INFO - Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Feb 06 10:11:38 openarc openarc[550]: 2026-02-06 10:11:38,381 - INFO - Request received: GET /v1/models from 127.0.0.1
Feb 06 10:11:38 openarc openarc[550]: 2026-02-06 10:11:38,382 - INFO - Request completed: GET /v1/models status=200 duration=0.001s
Feb 06 10:12:29 openarc openarc[550]: 2026-02-06 10:12:29,405 - INFO - Request received: GET /openarc/status from 127.0.0.1
Feb 06 10:12:29 openarc openarc[550]: 2026-02-06 10:12:29,406 - INFO - Request completed: GET /openarc/status status=200 duration=0.000s
Feb 06 10:13:25 openarc openarc[550]: 2026-02-06 10:13:25,207 - INFO - Request received: GET /openarc/status from 127.0.0.1
Feb 06 10:13:25 openarc openarc[550]: 2026-02-06 10:13:25,208 - INFO - Request completed: GET /openarc/status status=200 duration=0.001s
Feb 06 10:18:34 openarc openarc[550]: 2026-02-06 10:18:34,956 - INFO - Request received: GET /v1/models from 172.20.20.200
Feb 06 10:18:34 openarc openarc[550]: 2026-02-06 10:18:34,957 - INFO - Request completed: GET /v1/models status=200 duration=0.001s
Feb 06 10:18:36 openarc openarc[550]: 2026-02-06 10:18:36,631 - INFO - Request received: GET /v1/models from 172.20.20.200
Feb 06 10:18:36 openarc openarc[550]: 2026-02-06 10:18:36,632 - INFO - Request completed: GET /v1/models status=200 duration=0.001s
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,728 - INFO - Request received: POST /v1/chat/completions from 172.20.20.200
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,734 - INFO - Request completed: POST /v1/chat/completions status=200 duration=0.006s
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,800 - ERROR - [DEBUG] draft_model_loaded: False
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,800 - ERROR - [DEBUG] self.model_num_assistant_tokens: None
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,800 - ERROR - [DEBUG] generation_kwargs.num_assistant_tokens: 0
Feb 06 10:18:46 openarc openarc[550]: 2026-02-06 10:18:46,800 - ERROR - [DEBUG] generation_kwargs.assistant_confidence_threshold: 0.0
Feb 06 10:18:53 openarc openarc[550]: 2026-02-06 10:18:53,275 - INFO - [qwen3-8b LLM Worker] Metrics: {'load_time (s)': 16.81, 'ttft (s)': 2.98, 'tpot (ms)': 26.37698, 'prefill_throughput (tokens/s)': 2854.96, 'decode_throughput (tokens/s)': 37.91185, 'decode_duration (s)': 6.46583, 'input_token': 8508, 'new_token': 133, 'total_token': 8641, 'stream': True, 'stream_chunk_tokens': 1}
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,956 - INFO - Request received: POST /v1/chat/completions from 172.20.20.200
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,960 - INFO - Request completed: POST /v1/chat/completions status=200 duration=0.005s
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,974 - ERROR - [DEBUG] draft_model_loaded: False
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,974 - ERROR - [DEBUG] self.model_num_assistant_tokens: None
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,974 - ERROR - [DEBUG] generation_kwargs.num_assistant_tokens: 0
Feb 06 10:18:56 openarc openarc[550]: 2026-02-06 10:18:56,974 - ERROR - [DEBUG] generation_kwargs.assistant_confidence_threshold: 0.0
Feb 06 10:18:57 openarc openarc[550]: 2026-02-06 10:18:57,011 - ERROR - LLM inference failed!
Feb 06 10:18:57 openarc openarc[550]: Traceback (most recent call last):
Feb 06 10:18:57 openarc openarc[550]:   File "/root/OpenArc/src/server/worker_registry.py", line 87, in infer_llm
Feb 06 10:18:57 openarc openarc[550]:     async for item in llm_instance.generate_type(packet.gen_config):
Feb 06 10:18:57 openarc openarc[550]:   File "/root/OpenArc/src/engine/ov_genai/llm.py", line 200, in generate_stream
Feb 06 10:18:57 openarc openarc[550]:     result = await gen_task
Feb 06 10:18:57 openarc openarc[550]:              ^^^^^^^^^^^^^^
Feb 06 10:18:57 openarc openarc[550]:   File "/root/OpenArc/src/engine/ov_genai/llm.py", line 179, in _run_generation
Feb 06 10:18:57 openarc openarc[550]:     return await asyncio.to_thread(
Feb 06 10:18:57 openarc openarc[550]:            ^^^^^^^^^^^^^^^^^^^^^^^^
Feb 06 10:18:57 openarc openarc[550]:   File "/root/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/asyncio/threads.py", line 25, in to_thread
Feb 06 10:18:57 openarc openarc[550]:     return await loop.run_in_executor(None, func_call)
Feb 06 10:18:57 openarc openarc[550]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Feb 06 10:18:57 openarc openarc[550]:   File "/root/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/concurrent/futures/thread.py", line 58, in run
Feb 06 10:18:57 openarc openarc[550]:     result = self.fn(*self.args, **self.kwargs)
Feb 06 10:18:57 openarc openarc[550]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Feb 06 10:18:57 openarc openarc[550]: RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:224:
Feb 06 10:18:57 openarc openarc[550]: Check 'intermediates_memories.size() > sequential_gws_subseq_mapping_idx' failed at src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa/paged_attention_opt.cpp:1745:
Feb 06 10:18:57 openarc openarc[550]: [GPU] Unexpected number of intermediates buffers for Paged Attention for mixed stage
Feb 06 10:18:57 openarc openarc[550]: 2026-02-06 10:18:57,012 - ERROR - [qwen3-8b LLM Worker] Inference failed, triggering model unload...
Feb 06 10:18:57 openarc openarc[550]: 2026-02-06 10:18:57,226 - INFO - [qwen3-8b] unloaded successfully
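For reference, the crash should be reproducible without Home Assistant by sending two consecutive requests to the OpenAI-compatible endpoint. This is a minimal sketch: the endpoint path, model name (`qwen3-8b`), and `stream=True` are taken from the log above; the host/port and prompt text are assumptions you would adjust for your setup.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed OpenArc host:port; adjust as needed

def chat_payload(prompt: str) -> bytes:
    """Build the request body an OpenAI-compatible client would send."""
    return json.dumps({
        "model": "qwen3-8b",  # model name as shown in the startup log
        "stream": True,       # the worker metrics show 'stream': True
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def send(prompt: str) -> None:
    """POST one chat completion request and drain the streamed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=chat_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for _line in resp:  # consume the SSE stream to completion
            pass

# To reproduce (requires a running OpenArc server with qwen3-8b loaded):
#   send("first prompt")   # completes normally
#   send("second prompt")  # server logs the Paged Attention RuntimeError and unloads the model
```

If this two-request sketch triggers the same `paged_attention_opt.cpp` check failure, that would rule out Home Assistant itself and point at state carried over in the GPU pipeline between generations.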
