Introduce `_FortranAAssignSimple` and other `_FortranAAssign` related changes for compilation time improvement. by bhandarkar-pranav · Pull Request #3068 · ROCm/llvm-project

bhandarkar-pranav · 2026-06-25T20:16:27Z

No description provided.

…arguments Fixes a compile-time regression for firstprivate on simple arrays. Fixes llvm#200922 Root cause: When InlineHLFIRAssign checks whether to inline hlfir.assign operations in omp.private copy regions, alias analysis conservatively returns MayAlias for the copy region block arguments (%arg0 mold vs %arg1 private). This causes InlineHLFIRAssign to fail and fall back to fir.call @_FortranAAssign, which pulls in 89 runtime functions via LTO, creating 550K+ abstract attributes in OpenMPOpt. Solution: Add special-case handling in FIR alias analysis to recognize omp.private copy region block arguments. The OpenMP spec guarantees that the mold and private arguments reference different memory (private copy is freshly allocated in the init region), so they cannot alias. The fix allows InlineHLFIRAssign to inline the assignment into an element-wise loop, avoiding the runtime call entirely. Test: Added alias-analysis-omp-private-copy-region.mlir Assisted by: Claude Sonnet

…compile time Limit ShallowCopy rank specializations from 1-15 to 1-4, using a generic fallback for higher ranks. This significantly reduces code size and LTO compile time when the Fortran runtime is linked into GPU offloading code. Before: 15 ranks × 7 types × 3 scenarios = 315 template instantiations After: 4 ranks × 7 types × 3 scenarios = 84 template instantiations + fallbacks Impact on LTO compile time (measured with OpenMP offloading test case): - LTO time: 17s → 13s (24% improvement) - Abstract Attributes created: 396K → 276K (30% reduction) - Attributor worklist entries: 1.36M → 929K (32% reduction) Trade-off: Arrays with rank > 4 use a generic runtime loop instead of compile-time specialized iteration. This is acceptable because: - Most real-world Fortran arrays are rank 1-3 - Rank 5-15 arrays are rare in practice - The generic loop is still efficient (just not unrollable)

…y points This commit adds infrastructure for splitting assignment runtime calls into simple and complex paths to reduce LTO compile time for GPU offload. Problem: When compiling firstprivate(allocatable_array) for GPU, _FortranAAssign causes LTO to pull in runtime functions and templates. OpenMPOpt analyzes these functions during optimization, causing significant compile-time overhead (7.9s out of 19.7s total, or 40.7% of compile time). Solution (Phase 2.0 - Infrastructure): Add two new runtime entry points: - _FortranAAssignSimple: Fast path with direct memmove for intrinsic types (includes runtime assertions for validation - to be removed after testing) - _FortranAAssignComplex: Full machinery (same as current _FortranAAssign) This commit adds the entry points and builder functions but does NOT change any call sites. No functional changes - validation suite confirms 6/8 tests pass (same as baseline). Files modified: - flang-rt/lib/runtime/assign.cpp: Runtime implementations with assertions - flang/include/flang/Runtime/assign.h: Public API declarations - flang/lib/Optimizer/Builder/Runtime/Assign.cpp: Builder implementations - flang/include/flang/Optimizer/Builder/Runtime/Assign.h: Builder declarations Note: AssignSimple includes RUNTIME_CHECK assertions to catch misuse during development. These will be removed after thorough validation in later phases. Next steps (Phase 2.1): - Modify ConvertToFIR.cpp to use Simple path for allocatable scalars - Validate correctness before expanding to arrays Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…types This is a change to improve the compilation time when _FortranAAssign is called. This change routes allocatable scalar assignments with trivial (intrinsic) types to the new AssignSimple runtime path. **Note:** Most scalar allocatables are already inlined by the optimizer so this change primarily affects edge cases that don't meet inline criteria (polymorphic, explicit-length character, temporary LHS, or CUDA attributes). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ble arrays with trivial types Extend the AssignSimple runtime path to handle both allocatable and non-allocatable arrays with intrinsic types to achieve an improvement in compilation time. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…AAssign Critical fix for GPU device firstprivate correctness. Problem: - AssignSimple was allocating when `!to.IsAllocated() || to.Elements() != from.Elements()` - For OpenMP device firstprivate, the runtime pre-allocates device memory and sets up the descriptor with IsAllocated() = true - AssignSimple's aggressive reallocation check caused conflicts with OpenMP runtime's device allocation, leading to GPU memory access faults Solution: - Match _FortranAAssign allocation logic: only allocate when !to.IsAllocated() - Do NOT reallocate if already allocated - Respect OpenMP runtime's device memory management Root cause: - OpenMP runtime allocates device memory for firstprivate variables - Sets descriptor's IsAllocated() = true pointing to device memory - Runtime assignment should respect existing allocations - Only allocate when descriptor shows unallocated state Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ple path Critical correctness fix: scalar-to-array assignments require broadcasting, not simple memmove. Problem: - AssignSimple uses memmove for array copying - Scalar-to-array (e.g., i32 → array<?xi32>) requires element-wise broadcast - We were routing all trivial-type assignments to Simple path - Missing rank check allowed scalar-to-array to incorrectly use memmove Solution: - Add lhs.getRank() == rhs.getRank() check before using AssignSimple - Scalar-to-array (rank mismatch) now routes to Assign/AssignComplex - Both allocatable and non-allocatable paths updated Impact: - Prevents generating incorrect code for scalar-to-array assignments - Lit tests expecting scalar-to-array to use _FortranAAssign will pass - No performance impact - only affects mixed-rank assignments Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…nComplex Critical fix for OpenMP workdistribute lowering to support new runtime assignment variants introduced on this branch. Problem: - This branch introduced _FortranAAssignSimple and _FortranAAssignComplex - OpenMP workdistribute lowering only knew about _FortranAAssign - All 6 workdistribute-* lit tests failed with: "Runtime call _FortranAAssignSimple lowering not supported for workdistribute yet." Solution: - Add helper function isFortranAssignCall() to check for all Assign variants - Update 5 locations in LowerWorkdistribute.cpp to use the helper - Treat AssignSimple/AssignComplex the same as Assign for OpenMP purposes Locations Updated: 1. shouldParallelize() - determines if op should be parallelized 2. verifyTargetTeamsWorkdistribute() - validates supported runtime calls 3. workdistributeRuntimeCallLower() - lowers runtime calls in workdistribute 4. moveToHost() - two locations handling runtime call hoisting Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…le path Critical correctness fix: volatile assignments require special memory ordering semantics that cannot be handled by simple memmove. Problem: - AssignSimple uses memmove for array copying - Volatile variables require memory barriers and proper ordering - We were routing all trivial-type assignments to Simple path - Missing volatile check allowed volatile to incorrectly use memmove Solution: - Add !fir::isa_volatile_type(lhs.getType()) check before using AssignSimple - Volatile assignments now route to Assign (which handles memory ordering) - Both allocatable and non-allocatable paths updated Impact: - Prevents generating incorrect code for volatile assignments - All 5 volatile lit tests now pass (volatile1-4.fir) - No performance impact - only affects volatile assignments Test Results: - PASS: HLFIR/volatile1.fir - PASS: HLFIR/volatile2.fir - PASS: HLFIR/volatile3.fir - PASS: HLFIR/volatile4.fir - PASS: HLFIR/volatile.fir Total: 5/5 volatile tests passing (100%) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Update test expectations to match changed behavior where simple array-to-array assignments with intrinsic types use _FortranAAssignSimple instead of _FortranAAssign. Tests Updated: 1. HLFIR/fir-local-alloca-block.fir - Assignment: !fir.box<!fir.array<1xi32>> (allocatable array) - Element type: i32 (intrinsic) - Ranks match: 1 = 1 - Non-volatile - Correctly uses AssignSimple 2. HLFIR/assign-codegen.fir - Assignment: array<?xi32> to fixed array<100xi32> - Element type: i32 (intrinsic) - Ranks match: 1 = 1 - Non-volatile - Correctly uses AssignSimple 3. HLFIR/assign-codegen.fir - Assignment: allocatable heap array<?xi32> - Element type: i32 (intrinsic) - Ranks match: 1 = 1 - Non-volatile - Correctly uses AssignSimple Test Results: - PASS: HLFIR/fir-local-alloca-block.fir - PASS: HLFIR/assign-codegen.fir Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…ove LTO compile time" This reverts commit 4a0a63e.

Remove references to _FortranAAssignComplex which was never implemented. The implementation uses only two runtime functions: - _FortranAAssignSimple for simple intrinsic-type assignments - _FortranAAssign for all other cases (complex path) There is no need for a separate _FortranAAssignComplex function since _FortranAAssign already handles all complex cases (derived types, polymorphic, volatile, scalar-to-array, etc.). Changes: - Remove FortranAssignComplexStr constant from OpenMP code - Update isFortranAssignCall() to only check for Assign and AssignSimple - Update comment to reflect only two variants No functional change - the code already uses _FortranAAssign for complex cases via genAssign(). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…signments Fix for 92 gfortran test suite failures where non-contiguous assignments (strided slices, transformational intrinsics, etc.) were incorrectly routed to _FortranAAssignSimple instead of _FortranAAssign. Changes: - flang/lib/Optimizer/HLFIR/Transforms/ConvertToFIR.cpp: Added lhs.isSimplyContiguous() && rhs.isSimplyContiguous() check at line ~196 for non-allocatable array assignments. - Note: Allocatable assignments (line ~154) do NOT have this check because whole allocatable assignments are always contiguous. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…nSimple Fix allocatable assignment when LHS and RHS have different shapes. Per Fortran 2018 10.2.1.3(3), allocatable LHS must be deallocated and reallocated if it has a different shape than RHS. Changes: - flang-rt/lib/runtime/assign.cpp: Added shape comparison loop for allocatable arrays Deallocate LHS if shapes differ Reallocate with RHS shape (existing logic lines ~887-903) Implementation: - Check each dimension extent: to.Extent() vs from.Extent() - If any dimension differs, set needsReallocation flag - Deallocate existing allocation if needed - Allocate with shape from RHS descriptor Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Fixes remaining 2 gfortran test failures (PR95331.f90, class_transformational_2.f90) where allocatable assignments have non-contiguous RHS. Changes: 1. flang/lib/Optimizer/HLFIR/Transforms/ConvertToFIR.cpp: - Removed contiguity check from line ~196 (non-allocatable arrays) - Now routes all intrinsic array assignments to AssignSimple - Contiguity handling moved to runtime 2. flang-rt/lib/runtime/assign.cpp: - Removed RUNTIME_CHECK(from.IsContiguous()) at line 863 - Added runtime contiguity check and dual-path logic: * Contiguous: fast memmove (existing path) * Non-contiguous: element-wise copy loop (new path) - Element-wise loop uses GetLowerBounds/IncrementSubscripts/Element (same proven logic as _FortranAAssign) Implementation (~25 lines): - Runtime if/else based on to.IsContiguous() && from.IsContiguous() - Element-wise loop for non-contiguous cases handles: * Strided slices (arr(1:10:2)) * Transformational intrinsic results (eoshift, cshift, etc.) * Derived type component references * Any non-contiguous descriptor layout Performance impact: - Test 02 (allocatable, contiguous): 1.73s -> 1.98s (+14% acceptable) - Still 90% better than baseline (19.7s) and full Assign (19.62s) - Non-contiguous cases now work correctly (gfortran tests pass) Validation: - Host correctness: 9/9 passing (all tests including shape mismatch) - Gfortran tests: PR95331.f90 and class_transformational_2.f90 now pass - No RUNTIME_CHECK failures Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…o AssignSimple AssignSimple's element-wise copy loop could produce incorrect results when the LHS and RHS reference overlapping memory with non-contiguous access patterns (e.g., a(9:5:-1) = a(1:5:1)). The loop iterates in forward subscript order, which can overwrite source elements before they are read when strides cause the destination and source ranges to overlap. Fix this by calling MayAlias() before the copy to detect overlapping memory ranges. When aliasing is detected, RHS data is copied into a contiguous temporary buffer before any modifications to the LHS. This matches the aliasing handling in _FortranAAssign (AssignTicket::Begin), but without pulling in WorkQueue/ShallowCopy template machinery that would increase LTO compile time. Also add a conformability check for non-allocatable array assignments to crash on shape mismatch, matching _FortranAAssign behavior. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…sign lowering Adds flang/test/HLFIR/assign-simple-routing.fir with 17 test functions covering every routing boundary in ConvertToFIR.cpp's AssignOp conversion: - Positive cases (8): allocatable i32/f64/complex<f32>/logical<4> rank-1, allocatable i32 rank-2, non-allocatable i32/f32/logical<4> — all route to _FortranAAssignSimple. - Negative cases (6): derived type (allocatable and non-alloc), volatile (allocatable and non-alloc), scalar-to-array rank mismatch, polymorphic — all route to _FortranAAssign. - Other variants (3): temporary_lhs → _FortranAAssignTemporary, polymorphic allocatable → _FortranAAssignPolymorphic, keep_lhs_len → _FortranAAssignExplicitLengthCharacter. Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

…nction Add 5 unit tests to validate AssignSimple's runtime behavior: 1. AliasedReverseStride - Tests aliasing detection and temporary buffer creation for overlapping memory (a(5:1:-1) = a(1:5)). Verifies MayAlias() prevents data corruption during reverse-stride copy. 2. ReallocateUnallocated - Tests allocatable reallocation from unallocated state. Verifies destination becomes allocated with correct bounds and data. 3. ReallocateShapeMismatch - Tests allocatable reallocation when extents differ (3 → 5 elements). Verifies reallocation triggers on extent mismatch. 4. NonContiguousToContiguous - Tests strided (non-contiguous) source to contiguous destination. Creates 8-element array with stride 2 to produce [1,3,5,7], verifies element-wise copy loop. 5. ZeroSizeArray - Tests zero-size array edge case, verifies no crash. The tests use StaticDescriptor with Establish pattern (learned from Descriptor.cpp tests) to create descriptors with controlled memory layout, including overlapping regions for aliasing tests. All tests pass and validate the key code paths in AssignSimple: - Step 1: Aliasing detection via MayAlias() - Step 2: Allocatable reallocation logic - Step 3: Data copy (contiguous and non-contiguous) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Update the doc comment for genAssignSimple to accurately reflect the runtime implementation's capabilities: - Changed "Assumes:" to "Preconditions enforced at call site:" to clarify that these are compile-time guarantees, not runtime assumptions - Expanded the list to include all preconditions: intrinsic type, matching ranks, same element byte size, and non-volatile - Added "Runtime handles:" section to document what the runtime function actually handles: contiguous and non-contiguous layouts, aliasing detection, and allocatable reallocation - Fixed typo: "continguous" → "contiguous" The old comment was misleading because it implied the function only handles contiguous arrays, when in fact the runtime implementation explicitly handles non-contiguous arrays via element-wise loops (assign.cpp:1038-1044). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Add test case @alloc_scalar_to_array to verify that allocatable assignments with rank mismatch (scalar RHS to array LHS) correctly route to _FortranAAssign instead of _FortranAAssignSimple. The test fills a gap in coverage: while @nonalloc_scalar_to_array tested the non-allocatable scalar-to-array case, there was no corresponding test for the allocatable path. This is important because the allocatable path has its own rank-match guard that must be verified. The rank mismatch (scalar i32 RHS vs array<?xi32> LHS) violates the lhs.getRank() == rhs.getRank() precondition, so despite having a trivial element type, this assignment requires the full _FortranAAssign path to handle scalar broadcasting semantics. Without this test, a regression in the allocatable routing logic could incorrectly route scalar-to-array assignments to AssignSimple, which would fail at runtime with a rank mismatch error. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Add INTERNAL_CHECK assertion to verify that the source and destination descriptors have matching ranks at the entry point of ShallowCopyRank. While the template specialization code already checks both ranks during rank matching, adding an explicit precondition check at the function entry improves code clarity and catches rank mismatches earlier with a clearer error message. This addresses the concern that the function signature accepts two descriptors without explicitly documenting or enforcing that they must have the same rank. The assertion makes this precondition explicit and prevents potential silent data corruption from rank mismatches. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…nd not _FortranAAssign

bhandarkar-pranav and others added 27 commits June 25, 2026 12:08

fix clang-format-issues

d350463

fix clang-format issues

2da2898

Fix clang-format issues

81b7615

Revert "[flang-rt] Reduce ShallowCopy template instantiations to impr…

290983f

…ove LTO compile time" This reverts commit 4a0a63e.

Make changes based on code review.

4801870

Fix clang-format issues

f44a5a4

Fix CHECK statements in lit tests to look for _FortranAAssignSimple a…

573b021

…nd not _FortranAAssign

bhandarkar-pranav added flang testing only labels Jun 25, 2026

bhandarkar-pranav changed the title ~~Introduce _FortranAAssignSimple~~ Introduce _FortranAAssignSimple and other _FortranAAssign related changes for compilation time improvement. Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Introduce `_FortranAAssignSimple` and other `_FortranAAssign` related changes for compilation time improvement.#3068

Introduce `_FortranAAssignSimple` and other `_FortranAAssign` related changes for compilation time improvement.#3068
bhandarkar-pranav wants to merge 27 commits into
amd-stagingfrom
flang/aomp-issue-200922_ph2_2

bhandarkar-pranav commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bhandarkar-pranav commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant