Fix Neuron engine for SDK 2.28+ and add Trainium (trn2) support #659

jimburtoft wants to merge 5 commits into michaelfeil:main
Conversation
- Handle removal of optimum.bettertransformer in optimum >= 2.0 by wrapping the import in try/except (acceleration.py)
- Include token_type_ids in Neuron tokenizer output to match compiled model expectations (neuron.py)
- Update Dockerfile.neuron to remove pinned SDK versions
- Rewrite README.md with tested instructions for AWS DLAMI setup

Tested on inf2.xlarge with Neuron SDK 2.28, optimum-neuron 0.4.3 and 0.4.5. Performance: ~210 embeddings/sec throughput, ~25 ms latency (bge-small-en-v1.5).
- Fix NameError when BetterTransformerManager is not available in check_if_bettertransformer_possible() (affects both torch and neuron engines on optimum >= 2.0)
- Update README with benchmark results across g5, inf2, trn2 instances
Compile models for a single NeuronCore (num_cores=1) instead of sharding across all cores. This gives 5% better single-core performance and enables linear scaling via multiple processes. With 2 processes on inf2.xlarge (2 cores):

- Before (tensor parallel): 206 emb/s
- After (data parallel): 425 emb/s (+106%)

Update README with multi-process deployment instructions and revised benchmark results.
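The multi-process pattern described above can be sketched as a small launcher script. This is a dry run that only prints one launch command per NeuronCore; the `infinity_emb v2` invocation, core count, ports, and model id are illustrative and should be adjusted to the actual instance:

```shell
# Dry-run sketch: emit one server launch command per NeuronCore, each
# pinned to a single core via NEURON_RT_VISIBLE_CORES with its own port.
# Remove the surrounding "echo" quoting to actually start the servers.
NUM_CORES=2       # inf2.xlarge exposes 2 NeuronCores
BASE_PORT=7997
LAUNCHED=""
i=0
while [ "$i" -lt "$NUM_CORES" ]; do
  CMD="NEURON_RT_VISIBLE_CORES=$i infinity_emb v2 --engine neuron --model-id BAAI/bge-small-en-v1.5 --port $((BASE_PORT + i)) &"
  echo "$CMD"
  LAUNCHED="$LAUNCHED $CMD"
  i=$((i + 1))
done
```

A load balancer (or round-robin in the client) then spreads requests across the per-core ports.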
- Add trn2.3xlarge (4 NeuronCores) benchmark results: 19 ms latency, 753 emb/s
- Add trn2 data parallelism example (4 processes on 4 cores)
- Restructure throughput table to show scaling across instance types
- Note: trn2 has ~30% lower per-core latency than inf2
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the Neuron engine's functionality and performance within Infinity, ensuring compatibility with the latest Neuron SDKs and Optimum versions. It addresses critical bugs that previously prevented the engine from operating correctly and introduces a more efficient data-parallelism strategy for scaling inference across AWS Inferentia and Trainium hardware. The changes also expand hardware support to include Trainium (trn2), providing documentation and benchmarks for optimal deployment.

Highlights
Changelog
Code Review
This pull request introduces important fixes for running on AWS Neuron with SDK 2.28+ and adds support for Trainium instances. The changes correctly handle the removal of bettertransformer in recent optimum versions, fix an input mismatch for Neuron models, and improve throughput by switching to data parallelism. The documentation and Dockerfile updates are also valuable additions. My review includes a few suggestions: one to improve the Dockerfile build process, a minor correction in the new documentation, and a more robust fix for checking an optional dependency to prevent a potential runtime error.
```python
if not CHECK_OPTIMUM.is_available:
    return False
```
There was a problem hiding this comment.
The added check if not CHECK_OPTIMUM.is_available: might not be sufficient to prevent a NameError. The is_available property on OptionalImports is a cached_property. It will return True if optimum was available at first check, even if the subsequent import of bettertransformer fails and mark_dirty is called. This means check_if_bettertransformer_possible will proceed and fail when accessing BetterTransformerManager.
A more robust way to check if BetterTransformerManager was imported successfully is to check for its presence in the module's scope.
Suggested change:

```python
if "BetterTransformerManager" not in globals():
    return False
```
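As a self-contained illustration of this pattern: the name is bound only when the sub-import succeeds, and the guard tests for its presence in the module namespace. The `BETTERTRANSFORMER_IMPORTED` flag is added here purely so the sketch can report which path it took; it is not part of the real code, and the function body is reduced to the guard:

```python
# Bind BetterTransformerManager only when the import succeeds; on
# optimum >= 2.0 the bettertransformer module is gone and this raises.
try:
    from optimum.bettertransformer import BetterTransformerManager  # noqa: F401
    BETTERTRANSFORMER_IMPORTED = True
except ImportError:
    BETTERTRANSFORMER_IMPORTED = False


def check_if_bettertransformer_possible() -> bool:
    # Guard suggested in the review: proceed only if the import above
    # actually bound the name in this module's globals().
    if "BetterTransformerManager" not in globals():
        return False
    return True  # the real implementation continues with further checks
```

The guard works regardless of which optimum version (if any) is installed, because it inspects what the try/except actually bound rather than a cached availability flag.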
infra/aws_neuron/Dockerfile.neuron
```dockerfile
RUN pip3 install --no-deps sentence_transformers
RUN pip3 install --upgrade neuronx-cc torch-neuronx torchvision libneuronxla protobuf optimum-neuron optimum
```
To reduce the number of layers in the Docker image and improve build efficiency, it's a good practice to combine consecutive RUN commands. You can merge the two pip3 install commands into a single RUN instruction. Also, consider using the --no-cache-dir option with pip to prevent caching and reduce the final image size.
```dockerfile
RUN pip3 install --no-cache-dir --no-deps sentence_transformers && \
    pip3 install --no-cache-dir --upgrade neuronx-cc torch-neuronx torchvision libneuronxla protobuf optimum-neuron optimum
```
infra/aws_neuron/README.md
| Instance | Cores | Peak emb/s | Concurrency |
|----------|-------|-----------|-------------|
| g5.xlarge (GPU) | 1 GPU | 536 | 8 concurrent |
| inf2.xlarge | 1 core | 216 | 4 concurrent |
| inf2.xlarge | 2 cores | 427 | 4 concurrent |
| trn2.3xlarge | 1 core | 348 | 4 concurrent |
| trn2.3xlarge | 4 cores | 753 | 4 concurrent |
The Concurrency column in the throughput table contains redundant text (concurrent). For better clarity and consistency, it would be better to have just the number, similar to other columns. This makes the table easier to parse for humans and potentially for automated tools.
Suggested change:

| Instance | Cores | Peak emb/s | Concurrency |
|----------|-------|-----------|-------------|
| g5.xlarge (GPU) | 1 GPU | 536 | 8 |
| inf2.xlarge | 1 core | 216 | 4 |
| inf2.xlarge | 2 cores | 427 | 4 |
| trn2.3xlarge | 1 core | 348 | 4 |
| trn2.3xlarge | 4 cores | 753 | 4 |
Greptile Summary

This PR fixes three bugs that prevented --engine neuron from working with Neuron SDK 2.28+.

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Server as infinity_emb (one process per core)
    participant Embedder as NeuronOptimumEmbedder
    participant Tok as AutoTokenizer
    participant Model as NeuronModelForFeatureExtraction
    Client->>Server: POST /embeddings
    Server->>Embedder: encode_pre(sentences)
    Embedder->>Tok: tokenize — returns input_ids, attention_mask, token_type_ids
    Tok-->>Embedder: input_dict
    Embedder-->>Server: input_dict
    Server->>Embedder: encode_core(input_dict)
    Embedder->>Embedder: pad_up_to_size(batch_size, each tensor)
    Embedder->>Model: forward(**input_dict)
    Model-->>Embedder: last_hidden_state
    Embedder-->>Server: token_embeddings + attention_mask
    Server->>Embedder: encode_post(embedding)
    Embedder->>Embedder: pooling + normalize
    Embedder-->>Server: float32 embeddings
    Server-->>Client: response
    Note over Client,Server: One server process per NeuronCore (data parallelism)
```
Last reviewed commit: d9fe9f0
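The pad_up_to_size step in the diagram exists because a Neuron-compiled graph only accepts the fixed shapes it was traced with, so short batches must be padded up to the compiled batch size. A minimal sketch of the idea, using plain Python lists rather than the torch tensors the real engine pads (the function name comes from the diagram; the signature here is assumed):

```python
def pad_up_to_size(batch_size: int, rows: list) -> list:
    """Pad a batch of token-id rows with zero rows until it reaches the
    compiled batch size. Sketch only: the real engine pads torch tensors.
    """
    if len(rows) > batch_size:
        raise ValueError("batch larger than compiled size")
    width = len(rows[0]) if rows else 0
    # Append all-zero rows so the model always sees a full batch; the
    # extra rows are discarded after the forward pass.
    return rows + [[0] * width for _ in range(batch_size - len(rows))]
```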
```python
# Compile for a single NeuronCore. For data-parallel scaling across
# multiple cores, run separate server processes pinned to individual
# cores via NEURON_RT_VISIBLE_CORES (see README).
compiler_args = {"num_cores": 1, "auto_cast_type": "fp16"}
```
Hard-coded num_cores=1 silently breaks models requiring tensor parallelism
Setting num_cores=1 works well for small/medium embedding models (as documented in the README), but it will silently produce a worse result — or fail to compile entirely — for large models that don't fit within a single NeuronCore's on-chip memory. The previous get_nc_count() behavior at least used all available cores for a single model instance.
Consider making this configurable (e.g. via an EngineArgs field or an environment variable) so that users with large models can opt back in to tensor parallelism without changing source code:
```python
compiler_args = {
    "num_cores": int(os.environ.get("NEURON_NUM_CORES", "1")),
    "auto_cast_type": "fp16",
}
```

At a minimum, it would be worth adding a log warning when num_cores=1 is used on an instance with more than one NeuronCore, so operators are aware they are under-utilising the chip with a single process.
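The env-var override and the warning could be combined in a small helper along these lines. This is a sketch: `resolve_num_cores` and the way `available_cores` is obtained are illustrative, not existing infinity_emb API:

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_num_cores(available_cores: int) -> int:
    """Read the proposed NEURON_NUM_CORES override (default 1) and warn
    when a multi-core instance would be left under-utilised by a
    single-core compile."""
    num_cores = int(os.environ.get("NEURON_NUM_CORES", "1"))
    if num_cores == 1 and available_cores > 1:
        logger.warning(
            "Compiling for 1 of %d NeuronCores; start one server process "
            "per core (NEURON_RT_VISIBLE_CORES) for data-parallel scaling.",
            available_cores,
        )
    return num_cores
```

Large models that need tensor parallelism can then set `NEURON_NUM_CORES=2` (or higher) without touching source code.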
infra/aws_neuron/Dockerfile.neuron
```dockerfile
RUN pip3 install --no-deps sentence_transformers
RUN pip3 install --upgrade neuronx-cc torch-neuronx torchvision libneuronxla protobuf optimum-neuron optimum
```
Unpinned dependency versions may cause non-reproducible builds
Removing the version pins for neuronx-cc, torch-neuronx, libneuronxla, optimum-neuron, and optimum means every docker build will pull the latest available versions. While this avoids the old hard-coded versions going stale, it can cause silent breakage whenever a new SDK release ships an incompatible change.
Consider at minimum pinning major versions (e.g. optimum-neuron>=0.4,<1.0) or adding a comment documenting the tested version matrix from the README so that the file serves as a reproducible baseline.
Additional Comments (2)
The new guard on lines 49–50 only checks whether the top-level optimum import succeeded. Because is_available is cached, the guard passes, execution reaches line 58, and a NameError is raised when BetterTransformerManager is accessed. The guard needs to also test the dirty flag. The same issue affects both the torch and neuron engine paths.
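The failure mode can be reproduced in a few lines. This is a stripped-down stand-in for the real OptionalImports class that models only the caching behaviour; `is_usable` shows one shape a dirty-aware guard could take:

```python
from functools import cached_property


class OptionalImports:
    """Stand-in for the real class: is_available is cached, so a later
    mark_dirty() cannot flip it back to False."""

    def __init__(self, importable: bool):
        self._importable = importable
        self._dirty = False

    @cached_property
    def is_available(self) -> bool:
        # Evaluated once, then cached on the instance.
        return self._importable

    def mark_dirty(self) -> None:
        # Called when a later sub-import fails; does NOT invalidate
        # the cached is_available value.
        self._dirty = True

    @property
    def is_usable(self) -> bool:
        # A guard that also honours the dirty flag.
        return self.is_available and not self._dirty
```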
…_NUM_CORES env var

- acceleration.py: Use globals() check instead of CHECK_OPTIMUM.is_available to correctly detect when the bettertransformer import failed (mark_dirty does not invalidate the cached is_available property)
- neuron.py: Add NEURON_NUM_CORES env var (default 1) so large models can opt in to tensor parallelism without source changes
- neuron.py: Remove dead get_nc_count() function and unused imports
- Dockerfile: Merge RUN layers, add --no-cache-dir, add tested-version comment
- README: Clean up redundant text in throughput table
michaelfeil left a comment
Did you test this out? Do you need help publishing an image or release version?
Yes, tested on inf2.xlarge (SDK 2.27), trn2.3xlarge (SDK 2.28), and g5.xlarge (GPU). Fresh clone → install → run → embedding requests all working. Data parallelism gives 427 emb/s on inf2 (vs 206 old) and 750 emb/s on trn2 with 4 processes.
Related Issue
closes #408
Summary
Fixes 3 bugs that prevent --engine neuron from working with Neuron SDK 2.28 / optimum-neuron >= 0.4, switches from tensor parallelism to data parallelism for a 2-4x throughput improvement, and adds Trainium (trn2) support with benchmarks.

Bug Fixes
- acceleration.py: optimum.bettertransformer was removed in optimum >= 2.0. The unconditional import crashes at module load time, breaking both --engine torch and --engine neuron. Fixed with a try/except wrapper that marks the optional import as dirty.
- acceleration.py: check_if_bettertransformer_possible() calls BetterTransformerManager without checking if the import succeeded. Added a CHECK_OPTIMUM.is_available guard.
- neuron.py: Tokenizer was called with return_token_type_ids=False, but the compiled Neuron model expects 3 inputs (including token_type_ids). Removing that flag fixes the input mismatch.

Performance Improvement
Changed num_cores from get_nc_count() (tensor parallelism — shards the model across all cores) to 1 (data parallelism — one model per core, multiple server processes). For small/medium embedding models, tensor parallelism wastes cores.

To scale, users run one infinity_emb process per NeuronCore, each pinned via NEURON_RT_VISIBLE_CORES. Throughput scales linearly.

Benchmarks (bge-small-en-v1.5, batch_size=4)
Latency (serial requests, P50)
Throughput (data parallelism)
Files Changed
- libs/infinity_emb/infinity_emb/transformer/acceleration.py — bettertransformer import fix + CHECK_OPTIMUM guard
- libs/infinity_emb/infinity_emb/transformer/embedder/neuron.py — remove return_token_type_ids=False, change num_cores to 1
- infra/aws_neuron/Dockerfile.neuron — remove pinned SDK versions
- infra/aws_neuron/README.md — rewritten with setup guide, data parallelism instructions, trn2 examples, and benchmarks

Tested on inf2.xlarge (SDK 2.27), trn2.3xlarge (SDK 2.28), and g5.xlarge (GPU baseline).
Checklist
Additional Notes
Tested independently, but I'm not sure if you have the build path for testing on Inferentia.