Releases: SearchSavior/OpenArc
2.0.3
What's Changed
- added Docker build system
- added inference cancel for LLM/VLM. Cancel streaming mid-flight by closing the client connection; the KV cache is reset to its state before that request. Previously, cancelling mid-flight left inference running in a background thread until it completed on its own, which in practice burns GPU cycles. Not anymore! Note: cancel does not work for non-streaming requests (stream=False). In the stream=False case, generation continues until you reach max_tokens, hit an OOM, or inference completes on its own for whatever reason.
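On the client side, cancelling is just a matter of stopping the read and closing the connection. A minimal sketch of the consuming side, with hypothetical helper names (OpenArc only requires that the connection actually closes; in practice you would wrap this around `requests.post(url, json=body, stream=True)` and feed it `r.iter_lines()`, letting the `with` block close the socket on early exit):

```python
def consume_stream(sse_lines, max_tokens=None):
    """Collect token chunks from an SSE stream, optionally stopping early.

    Breaking out of the loop (and closing the underlying HTTP connection)
    is what triggers the server-side cancel for streaming requests.
    """
    tokens = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # normal end of stream
        tokens.append(payload)
        if max_tokens is not None and len(tokens) >= max_tokens:
            break  # early exit -> client closes connection -> server cancels
    return tokens
```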
2.0.2
2.0.1
What's Changed
- add lazy loading of dependencies for CLI speedup and remove heavy src/init by @SearchSavior in #42
- add readme getting started, add to cli reference by @SearchSavior in #46
- 2.0.1 by @SearchSavior in #47
Full Changelog: 2.0...v2.0.1
2.0
What's new
2.0 is a full rewrite of OpenArc that took ~3 months of work on evenings and weekends.
The README documents all the changes I could think to write.
Better MoE support should arrive with the OpenVINO Q4 release, which usually lands in December.
PRs
- Added embedding service by @mwrothbe in #33
- Added reranker api by @mwrothbe in #36
- 1.0.6 rerank by @SearchSavior in #39
New Contributors
Full Changelog: 1.05...2.0
1.0.5
What's Changed
- 1.0.5 by @SearchSavior in #31
Full Changelog: v1.0.4...1.05
OpenArc 1.0.5
Codebase
- conda has been deprecated; OpenArc now uses uv. See the README for details.
- The CLI tool has been integrated into pyproject.toml. It can now be invoked with 'openarc'.
- OV_ModelPool introduced to the API
  - Stores loaded models as class instances
  - Groundwork for the request queuing/internal scheduling I have been working on
- Added OpenVINO GenAI examples to scripts/examples:
  - continuous batching with OpenVINO GenAI
  - another version which uses batching to speed up the single-user case
  - a speculative decoding example
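To make the OV_ModelPool idea concrete, here is a toy sketch of the shape such a pool might take. The class and method names here are hypothetical illustrations, not the actual OpenArc implementation (which also carries load metadata and hooks for the scheduler):

```python
class ModelPool:
    """Toy model pool: loaded models kept as class instances, keyed by
    model id, so a request queue/scheduler can dispatch to them later."""

    def __init__(self):
        self._models = {}  # model_id -> loaded model instance

    def load(self, model_id, model):
        if model_id in self._models:
            raise ValueError(f"{model_id} is already loaded")
        self._models[model_id] = model

    def unload(self, model_id):
        # Dropping the reference lets the runtime free device memory.
        return self._models.pop(model_id, None)

    def get(self, model_id):
        return self._models[model_id]

    def loaded(self):
        return list(self._models)
```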
Discussion
- OpenVINO GenAI integration
  - Since 1.0.4 I have been exploring other inference engines with help from the Discord community. They are great. Between that and digging into MLOps literature, I think we can improve the usability of OpenVINO GenAI's 'continuous batching', which is Intel's implementation of a vLLM-like internal scheduling mechanism.
  - These classes from OpenVINO GenAI let us control how the KV cache gets manipulated at and during inference time, paving the way for better performance in many more scenarios: small models now, and medium models post-B60.
  - In my own projects I have gotten excellent performance on models up to 14B on GPU and past 30B on CPU only, across several different devices; with ov_genai_continuous_batch.py on a laptop with an i7-6700HQ, Qwen-1.7B-int4_asym clocks nearly 13 t/s. Results on larger machines were even better.
- Concurrency and High Throughput with OpenVINO GenAI
OpenVINO GenAI has vLLM-like features that enable performance similar to vLLM when inputs are run in parallel. I have done extensive testing on this for a synthetic data project with hundreds of concurrent requests, where inputs vary and context engineering manipulates the internal scheduler. Here are some results from Llama-3.1-Nemotron-8B-int4_asym. Hard-coded input tokens are ~400; input can vary up to ~600; the batch runs for a single turn.
Batch size: 600
Processing time: 82.43 seconds (per batch)
Average TTFT: 2.463 seconds
Average Throughput: 110.2 tokens/s
Not bad!
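From the client's point of view, a batch like this is just requests fanned out concurrently; the server's continuous-batching scheduler interleaves them. A minimal sketch of the fan-out, where `send_fn` is a stand-in for whatever function posts a single prompt to the endpoint (names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, send_fn, max_workers=64):
    """Fan prompts out concurrently; results come back in input order.

    The server-side scheduler, not the client, decides how requests are
    batched into the shared KV cache.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(send_fn, prompts))
```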
Multi GPU
Parallelism has been a goal since the beginning of OpenArc, way back when I first got my GPUs in July 2024. Back then the landscape was barren, filled with people who wrote their own SYCL drivers on stone tablets, when llama.cpp Vulkan was our only hope, requiring live sacrifice. Anyone who followed the IPEX Ollama surely drowned in the Euphrates beside Achilles. Times were hard.
With help from the OpenArc Discord community we have made progress piecing together this puzzle.
@savvadesogle shared findings with the IPEX backend of vLLM. With his help, I got that running and, of course, began studying their paper on paged attention. Since the beginning of OpenArc I have been struggling to implement what the vLLM paper describes into an API; that paper helped me understand features in OpenVINO GenAI which are otherwise poorly documented.
@HumerousGorgon did a deep dive into the pipeline parallel preview over at OpenVINO Model Server (OVMS). His findings pushed me to investigate OpenVINO GenAI more deeply, which OVMS uses heavily; it turns out the "PIPELINE_PARALELL" feature is coming to the Python API in the next release, enabling many new and exciting scenarios, including ones where the NPU works alongside the GPU. So, work has begun on expanding OpenArc in preparation for this capability.
Thanks to my friends at OpenArc Discord for contributing!
Broadly, the plan is to replicate our existing capability exactly and then go from there.
Anyone interested in contributing should join discord so we can talk shop OR open an issue so others can participate.
v1.0.4
Full Changelog: v1.0.4...v1.0.4
- added cli tool
- added lots of documentation
- added a logo, many changes to readme
- updated the HF repo to model the unsloth structure
- DOOM ascii art style was used
- rip and tear
1.0.3: The Vision Update
OpenArc 1.0.3: The Vision Update
New features
- Vision support (!)
- OpenArc takes a dynamic approach to how images are processed
- Received messages are checked for base64 and passed to the appropriate tokenization method, enabling text-to-text as well as image-to-text in the same chat/input
- There are no normalization steps for images. We don't shrink to 100dpi, or apply a zoom, or anything like that; bring your own logic for preprocessing.
- stream=False is not supported yet
- Load multiple models at once on different devices with the "Model Manager" tab. Unload models from here as well.
- Added model metadata. Loaded models now store data about how they were loaded; we use this throughout inference to track models in memory across devices. You can now:
  - Load both vision and text models into memory
    - Be careful though; we don't have any safety measures in place. Usually these situations would cause a stalled load or a memory error.
  - For those with multiple GPUs, run multiple models at once
- Added model_type field. When loading a model you now specify either TEXT or VISION; this routes requests to the appropriate class and will be extended to other architectures/tasks in the future
- Updated the model conversion tool to the latest version, which adds a ton of experimental datatypes/quant types.
- Dashboard has been refactored and is less of a mess
- And many more changes to the codebase that communicate project direction.
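The dynamic image handling above comes down to inspecting each message part for embedded base64 image data and routing it to the right tokenization path. A hedged sketch of that idea; the data-URL check is an assumption about how clients commonly encode images, not necessarily OpenArc's exact test:

```python
import base64
import binascii

def looks_like_image(part):
    """Heuristic: treat a message part as an image if it is a data URL
    carrying valid base64; otherwise route it to the text path."""
    if not part.startswith("data:image/"):
        return False
    _, _, payload = part.partition(";base64,")
    try:
        base64.b64decode(payload, validate=True)
        return True
    except binascii.Error:
        return False

def route_parts(parts):
    """Split message parts into text vs image so mixed text-to-text and
    image-to-text can coexist in the same chat input."""
    text = [p for p in parts if not looks_like_image(p)]
    images = [p for p in parts if looks_like_image(p)]
    return text, images
```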
Issues
- Right now gemma3 has specific requirements for inference. We are working out the right set of parameters to load with, and it needs better documentation.
- Inference doesn't usually fail gracefully, i.e. it needs better thread handling so the API doesn't become inaccessible and crash when a thread fails for whatever reason.
- Concurrent requests to multiple loaded models is not yet implemented, and we don't have queuing yet
- There are probably other things, so report what you encounter in the Discord or on GitHub.