Releases: SearchSavior/OpenArc
2.0.3
What's Changed
- added Docker build system
- added inference cancel for LLM/VLM. Cancel streaming mid-flight by closing the client connection; the KV cache is reset to its state before that request. Previously, cancelling mid-flight left inference running in a background thread until it completed on its own, which in practice burns GPU cycles. Not anymore! Note: cancel does not work for non-streaming requests (stream=False). In the stream=False case, generation continues until you reach max_tokens, hit an OOM, or inference completes on its own for whatever reason.
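On the client side, cancelling is just a matter of stopping the read and closing the connection. A minimal sketch of the consuming side, with hypothetical helper names (OpenArc only requires that the connection actually closes; in practice you would wrap this around `requests.post(url, json=body, stream=True)` and feed it `r.iter_lines()`, letting the `with` block close the socket on early exit):

```python
def consume_stream(sse_lines, max_tokens=None):
    """Collect token chunks from an SSE stream, optionally stopping early.

    Breaking out of the loop (and closing the underlying HTTP connection)
    is what triggers the server-side cancel for streaming requests.
    """
    tokens = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # normal end of stream
        tokens.append(payload)
        if max_tokens is not None and len(tokens) >= max_tokens:
            break  # early exit -> client closes connection -> server cancels
    return tokens
```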
2.0.2
2.0.1
What's Changed
- add lazy loading of dependencies for CLI speedup and remove heavy src/init by @SearchSavior in #42
- add readme getting started, add to cli reference by @SearchSavior in #46
- 2.0.1 by @SearchSavior in #47
Full Changelog: 2.0...v2.0.1
2.0
What's new
2.0 is a full rewrite of OpenArc that took ~3 months of work on evenings and weekends.
The README documents all the changes I could think to write.
Better MoE support should arrive with the OpenVINO Q4 release, which usually lands in December.
PRs
- Added embedding service by @mwrothbe in #33
- Added reranker api by @mwrothbe in #36
- 1.0.6 rerank by @SearchSavior in #39
New Contributors
Full Changelog: 1.05...2.0
1.0.5
What's Changed
- 1.0.5 by @SearchSavior in #31
Full Changelog: v1.0.4...1.05
OpenArc 1.0.5
Codebase
- conda has been deprecated; OpenArc now uses uv. See the README for details.
- The CLI tool has been integrated into pyproject.toml. It can now be invoked with 'openarc'.
- OV_ModelPool introduced to the API
  - Stores loaded models as class instances
  - Groundwork for the request queuing/internal scheduling I have been working on
- Added OpenVINO GenAI examples to scripts/examples:
  - continuous batching with OpenVINO GenAI
  - another version which uses batching to speed up the single-user case
  - a speculative decoding example
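To make the OV_ModelPool idea concrete, here is a toy sketch of the shape such a pool might take. The class and method names here are hypothetical illustrations, not the actual OpenArc implementation (which also carries load metadata and hooks for the scheduler):

```python
class ModelPool:
    """Toy model pool: loaded models kept as class instances, keyed by
    model id, so a request queue/scheduler can dispatch to them later."""

    def __init__(self):
        self._models = {}  # model_id -> loaded model instance

    def load(self, model_id, model):
        if model_id in self._models:
            raise ValueError(f"{model_id} is already loaded")
        self._models[model_id] = model

    def unload(self, model_id):
        # Dropping the reference lets the runtime free device memory.
        return self._models.pop(model_id, None)

    def get(self, model_id):
        return self._models[model_id]

    def loaded(self):
        return list(self._models)
```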
Discussion
- OpenVINO GenAI integration
  - Since 1.0.4 I have been exploring other inference engines with help from the Discord community. They are great. Between that and digging into MLOps literature, I think we can improve the usability of OpenVINO GenAI's 'continuous batching', which is Intel's implementation of a vLLM-like internal scheduling mechanism.
  - These classes from OpenVINO GenAI let us control how the KV cache gets manipulated at and during inference time, paving the way for better performance in many more scenarios: small models now, and medium models post-B60.
  - In my own projects I have gotten excellent performance on models up to 14B on GPU and past 30B on CPU only, across several different devices; with ov_genai_continuous_batch.py on a laptop with an i7-6700HQ, Qwen-1.7B-int4_asym clocks nearly 13 t/s. Results on larger machines were even better.
- Concurrency and High Throughput with OpenVINO GenAI
OpenVINO GenAI has vLLM-like features that enable performance similar to vLLM when inputs are run in parallel. I have done extensive testing on this for a synthetic data project with hundreds of concurrent requests, where inputs vary and context engineering manipulates the internal scheduler. Here are some results from Llama-3.1-Nemotron-8B-int4_asym. Hard-coded input tokens are ~400; input can vary up to ~600; the batch runs for a single turn.
Batch size: 600
Processing time: 82.43 seconds (per batch)
Average TTFT: 2.463 seconds
Average Throughput: 110.2 tokens/s
Not bad!
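From the client's point of view, a batch like this is just requests fanned out concurrently; the server's continuous-batching scheduler interleaves them. A minimal sketch of the fan-out, where `send_fn` is a stand-in for whatever function posts a single prompt to the endpoint (names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, send_fn, max_workers=64):
    """Fan prompts out concurrently; results come back in input order.

    The server-side scheduler, not the client, decides how requests are
    batched into the shared KV cache.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(send_fn, prompts))
```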
Multi GPU
Parallelism has been a goal since the beginning of OpenArc, way back when I first got my GPUs in July 2024. Back then the landscape was barren, filled with people who wrote their own SYCL drivers on stone tablets, when llama.cpp Vulkan was our only hope, requiring live sacrifice. Anyone who followed the IPEX Ollama surely drowned in the Euphrates beside Achilles. Times were hard.
With help from the OpenArc Discord community we have made progress piecing together this puzzle.
@savvadesogle shared findings with the IPEX backend of vLLM. With his help, I got that running and, of course, began studying their paper on paged attention. Since the beginning of OpenArc I have been struggling to implement what the vLLM paper describes into an API; that paper helped me understand features in OpenVINO GenAI which are otherwise poorly documented.
@HumerousGorgon did a deep dive into the pipeline parallel preview over at OpenVINO Model Server (OVMS). His findings pushed me to investigate OpenVINO GenAI more deeply, which OVMS uses heavily; it turns out the "PIPELINE_PARALELL" feature is coming to the Python API in the next release, enabling many new and exciting scenarios, including ones where the NPU works alongside the GPU. So, work has begun on expanding OpenArc in preparation for this capability.
Thanks to my friends at OpenArc Discord for contributing!
Broadly, the plan is to replicate our existing capability exactly and then go from there.
Anyone interested in contributing should join discord so we can talk shop OR open an issue so others can participate.
v1.0.4
Full Changelog: v1.0.4...v1.0.4
- added cli tool
- added lots of documentation
- added a logo, many changes to readme
- updated the HF repo to model the unsloth structure
- DOOM ascii art style was used
- rip and tear
1.0.3: The Vision Update
OpenArc 1.0.3: The Vision Update
New features
- Vision support (!)
- OpenArc takes a dynamic approach to how images are processed
- Received messages are checked for base64 and passed to the appropriate tokenization method, enabling text-to-text as well as image-to-text in the same chat/input
- There are no normalization steps for images. We don't shrink to 100dpi, or apply a zoom, or anything like that; bring your own logic for preprocessing.
- stream=False is not supported yet
- Load multiple models at once on different devices with the "Model Manager" tab. Unload models from here as well.
- Added model metadata. Loaded models now store data about how they were loaded; we use this throughout inference to track models in memory across devices. You can now:
  - Load both vision and text models into memory
    - Be careful though; we don't have any safety measures in place. Usually these situations would cause a stalled load or a memory error.
  - For those with multiple GPUs, run multiple models at once
- Added model_type field. When loading a model you now specify either TEXT or VISION; this routes requests to the appropriate class and will be extended to other architectures/tasks in the future
- Updated the model conversion tool to the latest version, which adds a ton of experimental datatypes/quant types.
- Dashboard has been refactored and is less of a mess
- And many more changes to the codebase that communicate project direction.
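The dynamic image handling above comes down to inspecting each message part for embedded base64 image data and routing it to the right tokenization path. A hedged sketch of that idea; the data-URL check is an assumption about how clients commonly encode images, not necessarily OpenArc's exact test:

```python
import base64
import binascii

def looks_like_image(part):
    """Heuristic: treat a message part as an image if it is a data URL
    carrying valid base64; otherwise route it to the text path."""
    if not part.startswith("data:image/"):
        return False
    _, _, payload = part.partition(";base64,")
    try:
        base64.b64decode(payload, validate=True)
        return True
    except binascii.Error:
        return False

def route_parts(parts):
    """Split message parts into text vs image so mixed text-to-text and
    image-to-text can coexist in the same chat input."""
    text = [p for p in parts if not looks_like_image(p)]
    images = [p for p in parts if looks_like_image(p)]
    return text, images
```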
Issues
- Right now gemma3 has specific requirements for inference. We are working out the right set of parameters to load with, and it needs better documentation.
- Inference doesn't usually fail gracefully, i.e. it needs better thread handling so the API doesn't become inaccessible and crash when a thread fails for whatever reason.
- Concurrent requests to multiple loaded models is not yet implemented, and we don't have queuing yet
- There are probably other things, so report what you encounter in the Discord or on GitHub.