
Merge upstream/llvm into amd-debug #2863

Merged
mariusz-sikora-at-amd merged 184 commits into amd-debug from amd/dev/masikora/amd-debug-merge-candidate
Jun 11, 2026

@mariusz-sikora-at-amd

Merge with all Scott's upstream changes

Please run git show --remerge-diff on top of this PR

arsenm and others added 30 commits May 26, 2026 22:26
…cases. (llvm#199158)

This is a follow-up on Jean's comment
llvm#198933 (comment)

This patch makes use of the descriptor strides when `fir.array_coor`'s
memref is a `fir.box` that is not a `fir.embox` result.
This patch enables pulling slicing `fir.rebox` operations
into `fir.array_coor`. This helps preserve information about
the original rank of the array being accessed.
`FIRToMemRef` and later passes may benefit from this.

Assisted by: Claude
…199137)

Problem: `hasNearbyPairedStore` uses
`stripAndAccumulateInBoundsConstantOffsets` to decompose store pointers
into (base, offset) pairs and check whether two stores are 16 bytes
apart. This fails when LSR has rewritten pointer arithmetic into
non-inbounds GEPs because the function refuses to look through them. The
two stores then appear to have different base pointers and the check
returns false. When this happens, `lowerInterleavedStore` proceeds to
emit `ST2` for a pattern that would be more profitable as `zip+stp`,
since the load-store optimizer can pair adjacent stores into `STP` but
cannot merge `ST2` with anything. On a bf16-to-fp32 NEON conversion loop
this causes a regression from 11 to 17 instructions per iteration.
Note: Interleaved stores support was added for RISCV in
llvm#115354. Turning this off
produces the desired STP instructions.

https://godbolt.org/z/1afsjPd3e

Fix: Switch to `stripAndAccumulateConstantOffsets` with
`AllowNonInbounds=true`. The function is a bail-out heuristic doing pure
address arithmetic, so the inbounds semantic guarantee is not needed for
correctness.
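
The failure mode described above can be sketched with a toy model. This is not LLVM code: `Gep`, `strip_constant_offsets`, and `stores_are_16_bytes_apart` are hypothetical stand-ins that only mimic the idea of accumulating constant offsets through a GEP chain and stopping at a GEP that may not be looked through.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gep:
    """Hypothetical constant-offset GEP: pointer = parent + offset bytes."""
    parent: object   # another Gep, or a string naming the base pointer
    offset: int
    inbounds: bool


def strip_constant_offsets(ptr, allow_non_inbounds: bool) -> Tuple[object, int]:
    """Walk up the GEP chain, summing offsets, until a GEP we may not look through."""
    total = 0
    while isinstance(ptr, Gep) and (ptr.inbounds or allow_non_inbounds):
        total += ptr.offset
        ptr = ptr.parent
    return ptr, total


def stores_are_16_bytes_apart(p, q, allow_non_inbounds: bool) -> bool:
    """Toy version of the paired-store check: same base, offsets 16 bytes apart."""
    base_p, off_p = strip_constant_offsets(p, allow_non_inbounds)
    base_q, off_q = strip_constant_offsets(q, allow_non_inbounds)
    return base_p == base_q and abs(off_p - off_q) == 16


first = Gep("dst", 0, inbounds=True)
second = Gep("dst", 16, inbounds=False)  # models a GEP rewritten without inbounds

# Inbounds-only stripping stops at `second`, so the bases look different.
strict = stores_are_16_bytes_apart(first, second, allow_non_inbounds=False)
# Looking through non-inbounds GEPs recovers the shared base and the 16-byte gap.
relaxed = stores_are_16_bytes_apart(first, second, allow_non_inbounds=True)
```

In this model the strict variant misses the pair and the relaxed variant finds it, which is the behavior change the fix is after.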

---------

Co-authored-by: Kunal Pathak <kupathak@fb.com>
The [WebAssembly Component
Model](https://component-model.bytecodealliance.org/) has added support
for [cooperative
multithreading](WebAssembly/component-model#557).
This has been implemented in the [Wasmtime
engine](bytecodealliance/wasmtime#11751) and is
part of the wider project of [WASI preview
3](https://wasi.dev/roadmap#upcoming-wasi-03-releases), which is
currently tracked
[here](https://github.com/orgs/bytecodealliance/projects/16).

These changes require updating the way that `__stack_pointer` and
`__tls_base` work purely for a new `wasm32-wasip3` target; other targets
will not be touched. Specifically, rather than using a Wasm global for
tracking the stack pointer and TLS base, the new
[`context.get/set`](https://github.com/WebAssembly/component-model/blob/main/design/mvp/CanonicalABI.md#-canon-contextget)
component model builtin functions will be used (the intention being that
runtimes will need to aggressively optimize these calls into single
load/stores). For justification on this choice rather than switching out
the global at context-switch boundaries, see [this
comment](WebAssembly/wasi-libc#691 (comment))
and [this
comment](WebAssembly/wasi-libc#691 (comment)).

This PR adds support for using library calls instead of globals for
holding the stack pointer and TLS base. When used, this thread context
ABI emits calls to `__wasm_{get,set}_{stack_pointer,tls_base}` when
needed. These functions can then be implemented in `libc`. This is
enabled only for the WASIp3 target.

There is a temporary macro define for `__wasm_libcall_thread_context__`
which can be removed once `wasi-libc` has fully migrated to the new ABI
for the WASIp3 target.
In Python 3.0 and later it is no longer necessary to explicitly derive
from `object` to opt into "new-style" classes; they are the default.

Since the current minimum Python version is 3.8, this is no longer
required. This patch removes `object` from the base class lists of all
affected classes in lit.
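
A quick way to see that the removal changes nothing in Python 3 is to compare the bases of both spellings. The class names `Implicit` and `Explicit` are made up for this illustration and do not come from lit.

```python
class Implicit:
    """No explicit base class."""


class Explicit(object):
    """Old spelling with an explicit `object` base."""


# In Python 3 both spellings produce a class whose only base is `object`.
implicit_bases = Implicit.__bases__
explicit_bases = Explicit.__bases__
same_kind = type(Implicit) is type(Explicit) is type
```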
llvm#199786)

This patch removes future statements from lit for features that are
mandatory in Python 3.0 and later.

Specifically, it removes future statements for
[`absolute_import`](https://docs.python.org/3/library/__future__.html#future__.absolute_import)
and
[`print_function`](https://docs.python.org/3/library/__future__.html#future__.print_function),
since both became mandatory in Python 3.0.
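
The mandatory release of each feature can be checked from the standard `__future__` module itself. A minimal sketch, assuming it runs on any Python 3 interpreter:

```python
import __future__
import sys

# Each feature object records the release in which it became mandatory.
mandatory = {
    name: getattr(__future__, name).getMandatoryRelease()
    for name in ("absolute_import", "print_function")
}

# True when the running interpreter is at or past every mandatory release,
# in which case the corresponding future statements have no effect.
already_mandatory = all(sys.version_info >= rel for rel in mandatory.values())
```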
…2457)

This patch adds basic support for partial alias masking, which allows
entering the vector loop even when there is aliasing within a single
vector iteration. It does this by clamping the VF to the safe distance
between pointers. This allows the runtime VF to be anywhere from 2 to
the "static" VF.

Conceptually, this transform looks like:

```
  // `c` and `b` may alias.
  for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
```

->

```
  svbool_t alias_mask = loop.dependence.war.mask(b, c);
  int num_active = num_active_lanes(alias_mask);
  if (num_active >= 2) {
    for (int i = 0; i < n; i += num_active) {
      // ... vector loop masked with `alias_mask`
    }
  }
  // ... scalar tail
```

This initial patch has a number of limitations:

- The loop must be tail-folded
  * We intend to follow-up with full alias-masking support for loops
    without tail-folding
- The mask and transform are only valid for IC = 1
  * Some recipes may not handle the "ClampedVF" correctly at IC > 1
  * Note: On AArch64, we also only have native alias mask instructions 
    for IC = 1
- Reverse iteration is not supported
  * The mask reversal logic is not correct for the alias mask (or 
    clamped ALM)
- First order recurrences are not supported
  * The `splice.right` is not lowered correctly for clamped VFs
- Reductions are not supported 
  * The final horizontal reduction needs to set lanes past the 
    "ClampedVF" to the identity value
- This style of vectorization is not enabled by default/costed
  * It can be enabled with `-force-partial-aliasing-vectorization`
  * When enabled, alias masking is used instead of the standard diff
    checks (when legal to do so)

This PR supersedes llvm#100579 (closes llvm#100579).
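
The clamping idea can be simulated outside the compiler. The sketch below is an assumption-laden model, not the vectorizer's implementation: memory is a flat Python list, offsets are in elements, `a` is assumed not to alias `c`, and `war_mask_lanes` is a hypothetical helper that returns the number of leading lanes that are safe when `c` starts `dist` elements after `b`.

```python
def scalar_loop(mem, a_off, b_off, c_off, n):
    """Reference semantics: c[i] = a[i] + b[i], one element at a time."""
    mem = list(mem)
    for i in range(n):
        mem[c_off + i] = mem[a_off + i] + mem[b_off + i]
    return mem


def war_mask_lanes(b_off, c_off, vf):
    """Safe lane count: all lanes if c is not ahead of b, else at most the distance."""
    dist = c_off - b_off
    return vf if dist <= 0 else min(vf, dist)


def clamped_vector_loop(mem, a_off, b_off, c_off, n, vf):
    """Model a tail-folded vector loop that steps by the clamped lane count."""
    num_active = war_mask_lanes(b_off, c_off, vf)
    if num_active < 2:
        return scalar_loop(mem, a_off, b_off, c_off, n)  # scalar fallback
    mem = list(mem)
    for i in range(0, n, num_active):
        lanes = range(i, min(i + num_active, n))  # tail folding: drop lanes >= n
        # A vector iteration performs all of its loads before any of its stores.
        values = [mem[a_off + k] + mem[b_off + k] for k in lanes]
        for k, v in zip(lanes, values):
            mem[c_off + k] = v
    return mem


memory = list(range(32))
# b occupies elements 10..17 and c occupies 13..20, so they overlap at distance 3.
expected = scalar_loop(memory, 0, 10, 13, 8)
clamped = clamped_vector_loop(memory, 0, 10, 13, 8, vf=4)
```

With a static VF of 4 and a distance of 3 elements, the model steps by 3 lanes per iteration and matches the scalar result.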
…e unique linkage names. (llvm#198667)

Use normalized path from the macro prefix map to generate the unique ids
for the internal linkage names. That allows a reproducible hash on any
build system. Regularly the macro prefix map gets normalized in favor of
the target system before the path substitution.
This allows us to keep GUIDs consistent across compilation phases which
may change the name or linkage type.

See https://discourse.llvm.org/t/rfc-keep-globalvalue-guids-stable/84801

This is a large change since the addition of metadata breaks many tests.
The test changes are mostly just trivial changes to checks to get them
passing.
Fix the column `0` for the `<total>` row in llvm-mca's `Average Wait times` report. The `total`
row now represents the total dynamic execution count used to normalize the averages, 
instead of the per-instruction iteration count. Update the timeline view docs and autogenerated
test expectations accordingly.

Co-authored-by: liuxiaodong <liuxiaodong@sunmmio.com>
)

The comment in getOutputSectionName has always called the second-dot
stripping "for MinGW" (e.g. .ctors.NNNN), but the code applied it on
every target. This hides a split-dwarf bug llvm#199616.

Take an isMinGW gate and skip the stripping when it is false.
Move out `setHasProfileAvailable` into `markFunctionsWithProfile`.
This also allows extracting per-pre-aggregated type handling in
`parseAggregatedLBREntry` into a switch statement.

Test Plan:
NFC

Processing time change (wall time):
* 10MB pre-aggregated profile:
  - Parsing aggregated branch events: 0.16s -> 0.05s
  - Pre-process profile data (parsing+marking): 0.18s -> 0.16s

* 6GB perf.data file:
  - Parsing branch events: 29.06s -> 28.55s
  - Pre-process profile data (excluding perf script): 29.47s -> 29.13s

Reviewers:
rafaelauler, yota9, maksfb, ayermolo, yozhu, yavtuk, paschalis-mpeis

Pull Request: llvm#199320
…99797)

Sink the `let Defs = [VXSAT]` statements into the classes.

This makes the encoding based structure of this file more consistent.
…=0. NFC (llvm#199798)

We had a let outside the class and inside.
Since SegInstSEW is only used by segment load/store, no need to keep it
for other builtins.
I broke this test in llvm#199739. As a result of that change, the start of
the CODE section in the linked WASM file shifted from 0x41 to 0x37 (a
shift of -10 bytes).

I was not aware that `wasm-ld` had testing outside of `lld/test/wasm`.
GCC released a new version, so we should bump the versions installed in
the CI so we can upgrade.
…vm#196462)

As suggested by @jmorse and @efriedma-quic in llvm#196223.

---------

Co-authored-by: Corentin Jabot <corentinjabot@gmail.com>
Fixes llvm#177852.

The reproducer has two `.cfi_startproc` directives separated by a
`.popsection`. The first is never closed; the second is properly paired
with `.cfi_endproc`. `MCStreamer::finish()` only inspects the last entry
of `DwarfFrameInfos`, so the unfinished earlier frame slips through and
crashes `finishImpl()` when it emits frame data with a null End label.

Use `hasUnfinishedDwarfFrameInfo()` instead, which walks the full
`FrameInfoStack` and catches every unfinished frame.
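
The difference between the two checks can be shown with a small model. `FrameInfo`, `last_frame_unfinished`, and `any_frame_unfinished` are hypothetical names for this sketch; they only mirror the idea of inspecting the last entry versus every entry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameInfo:
    begin: str
    end: Optional[str] = None  # None models a frame with no .cfi_endproc


def last_frame_unfinished(frames: List[FrameInfo]) -> bool:
    """Old-style check: only the most recent frame is inspected."""
    return bool(frames) and frames[-1].end is None


def any_frame_unfinished(frames: List[FrameInfo]) -> bool:
    """Stricter check: every recorded frame must have an end label."""
    return any(frame.end is None for frame in frames)


# Shape of the reproducer: the first frame is never closed, the second one is.
frames = [FrameInfo("first_start"), FrameInfo("second_start", "second_end")]
missed_by_old_check = not last_frame_unfinished(frames)
caught_by_new_check = any_frame_unfinished(frames)
```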

---------

Co-authored-by: Fangrui Song <i@maskray.me>
… sources (llvm#199604)

PR llvm#179924 and llvm#179925 added optimized assembly implementations for ARM
double-precision and single-precision FP comparisons (arm/cmpdf2.S,
arm/gedf2.S, arm/unorddf2.S, arm/cmpsf2.S, arm/gesf2.S, arm/unordsf2.S)
but only added SUPERSEDES annotations for the thumb1 variants. The arm
variants were missing these annotations, causing both the generic and
optimized implementations to be included in libclang_rt.builtins.a.

For double-precision, the archive contains:
  - comparedf2.c.obj (pos 28): defines __unorddf2, __aeabi_dcmpun, ...
  - divdc3.c.obj (pos 32): defines __divdc3; refs __aeabi_dcmpun
  - unorddf2.S.obj (pos 126): defines __unorddf2, __aeabi_dcmpun
  - aeabi_dcmp.S.obj (pos 158): defines __aeabi_dcmpeq; refs __eqdf2

When linking divdc3_test.c, the linker loads divdc3.c.obj which
introduces __aeabi_dcmpun as undefined. BFD-like linkers (GNU ld, ELD)
continue scanning forward and resolve __aeabi_dcmpun from unorddf2.S.obj
(pos 126). Later, aeabi_dcmp.S.obj introduces __eqdf2 as undefined,
which is resolved by comparedf2.c.obj (pos 28) on the next pass. Since
both comparedf2.c.obj and unorddf2.S.obj define __unorddf2, the linker
reports a duplicate symbol error.
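
The scan order described above can be replayed with a small simulation. This is a simplified model of a forward-scanning archive search, not the algorithm of any particular linker. The member list follows the positions quoted above, and the initial undefined set (`__divdc3` and `__aeabi_dcmpeq`) is an assumption about what the test program references.

```python
# (member name, symbols it defines, symbols it references), in archive order.
ARCHIVE = [
    ("comparedf2.c.obj", {"__unorddf2", "__aeabi_dcmpun", "__eqdf2"}, set()),
    ("divdc3.c.obj", {"__divdc3"}, {"__aeabi_dcmpun"}),
    ("unorddf2.S.obj", {"__unorddf2", "__aeabi_dcmpun"}, set()),
    ("aeabi_dcmp.S.obj", {"__aeabi_dcmpeq"}, {"__eqdf2"}),
]


def forward_scan_link(archive, initial_undefined):
    """Repeatedly scan the archive front to back, loading members that
    define a currently undefined symbol, until a pass loads nothing."""
    defined = {}                       # symbol -> member that first defined it
    undefined = set(initial_undefined)
    loaded, duplicates = [], []
    progress = True
    while progress:
        progress = False
        for name, defs, refs in archive:
            if name in loaded or not (defs & undefined):
                continue
            loaded.append(name)
            progress = True
            for sym in sorted(defs):
                if sym in defined:
                    duplicates.append((sym, defined[sym], name))
                else:
                    defined[sym] = name
            undefined |= refs
            undefined -= set(defined)
    return loaded, duplicates


loaded, duplicates = forward_scan_link(ARCHIVE, {"__divdc3", "__aeabi_dcmpeq"})
duplicate_symbols = {sym for sym, _, _ in duplicates}
```

In this model `comparedf2.c.obj` is only loaded on the second pass, after `unorddf2.S.obj` has already supplied `__unorddf2`, which is where the clash appears.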

lld does not encounter this because of the difference in the way it
resolves symbol references. This causes comparedf2.c.obj (pos 28) to be
selected first for __aeabi_dcmpun, making unorddf2.S.obj unnecessary.

The same pattern exists for single-precision where arm/comparesf2.S and
arm/unordsf2.S both define __unordsf2 and __aeabi_fcmpun.

The fix adds SUPERSEDES annotations so that the generic implementations
(comparedf2.c for double-precision and arm/comparesf2.S for single-
precision) are removed from the source list when the optimized assembly
replacements are present. The assembly files together provide all
symbols that the generic implementations define.

The surrounding code was reviewed, and this PR was developed with the
assistance of AI.
…#196906)

Move the bit name list of BBAddrMap::Features and BBAddrMap::BBEntry::Metadata
into a new BBAddrMap.def and derive the enum, bitfield, encode(), decode(),
and operator== from it. Adding a new bit now only requires one line in the
.def file.

Also expose BBAddrMap::Features::KnownMask for future use.
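
The single-source-of-truth idea can be sketched in a few lines. This is only an analogy to the `.def` approach: the bit names below are hypothetical placeholders, not the actual `BBAddrMap::Features` bits, and rejecting unknown bits in `decode` is an assumption of this sketch.

```python
# One ordered list of bit names plays the role of the .def file.
FEATURE_BITS = ("feature_a", "feature_b", "feature_c")  # hypothetical names

# Mask of every bit the list knows about.
KNOWN_MASK = (1 << len(FEATURE_BITS)) - 1


def encode(flags: dict) -> int:
    """Pack named booleans into an integer, bit i for the i-th name."""
    return sum(1 << i for i, name in enumerate(FEATURE_BITS) if flags.get(name, False))


def decode(value: int) -> dict:
    """Unpack an integer into named booleans, rejecting unknown bits."""
    if value & ~KNOWN_MASK:
        raise ValueError(f"unknown feature bits set: {value & ~KNOWN_MASK:#x}")
    return {name: bool((value >> i) & 1) for i, name in enumerate(FEATURE_BITS)}


packed = encode({"feature_a": True, "feature_c": True})
round_trip = decode(packed)
```

Adding a name to `FEATURE_BITS` extends `encode`, `decode`, and `KNOWN_MASK` at once, which is the property the `.def` file provides on the C++ side.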
…vm#198192)

Simplify LitConfig initialization and setter to allow None values.
TestingConfig.maxIndividualTestTime is initialized to 0 (or resolved to
0 if None) strictly during initialization.

This fixes an issue where the aggressive BOLT timeout of 60s (previously
set globally on lit_config) was leaking and affecting libc++ tests. By
moving the timeout configuration from the global lit_config to the
individual test suite config, we ensure that timeouts are isolated and
respect suite-local settings without leaking.

PR Stack:
* ➤ llvm#198192
* llvm#198193

Assisted-by: Gemini
hvdijk and others added 14 commits May 28, 2026 03:13
Debug labels did not exist in LLVM 3.7 and have no equivalent.
…ardian argument (llvm#198695)

A function parameter of type RefPtr<T>& should not be used as a guardian
variable of a raw pointer/reference variable if the function body
contains an assignment to it since such an assignment can shorten the
lifetime of the guarded object.
…lvm#200091)

Patches reverted:

commit c315c66
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Date:   Wed May 27 12:51:13 2026

    [AMDGPU] Fix codesize estimate after llvm#198005 (llvm#200033)

    This fixes failure in libc tests which checks the exact encoding
    size. Encoding is now shorter, but it did not recognize fp16
    immediates as an inlinable constant and assumes literal encoding.

    Shorter encodings were created here:
    llvm#198005

commit 2b3bc03
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Date:   Wed May 27 10:55:36 2026

    [AMDGPU] Use shorter form for i16 operands (llvm#198005)

    For 16-bit operands an inline constant is zero extended
    which in particular allows to use FP constants. These
    will have 16 bits of zeroes in the high half and FP16
    value in the low 16 bits.

    The patch changes semantics of the FP literal argument
    used in i16 context in the asm parser to fp16.

Apparently this breaks some libc tests with bf16. I do not know
why; these were not supposed to be affected. Reverting for now.

Failed tests: https://lab.llvm.org/buildbot/#/builders/10/builds/29005
…lvm#199917)

With a constant lane index, split the vector and recurse on the
single-GPR half containing Idx (already Custom-lowered).
…lvm#200000)

The `RISCVMoveMerger` pass was incorrectly forming
`CM_MVSA01/QC_CM_MVSA01` when `Zdinx` was enabled. The pass attempted CM
merge for copy pairs even when the first copy was not an `a0/a1-based`
CM candidate.

Fix by only running `findMatchingInst` when the current copy is a valid
CM candidate.
…lvm#199963)

ExpandAMDGPUPredicateBuiltIn synthesized an IntegerLiteral typed
_Bool/bool — a shape no other producer creates, and one that
StmtPrinter::VisitIntegerLiteral has no case for. -ast-print on the
resulting if-condition hit llvm_unreachable.

Emit the canonical boolean literal instead:

- C++, C23, OpenCL, HIP: CXXBoolLiteralExpr 'bool'
- pre-C23 C: IntegerLiteral 'int'

In the C case this matches what <stdbool.h>'s true/false macros expand
to.

Fixes llvm#199563
The commit added a dep from profile -> interception, so define that
target too

Fixes 5db1364
PR llvm#177665 added an unconditional `extern` reference to
`__llvm_profile_hip_collect_device_data` from `InstrProfilingFile.c`,
which forces `InstrProfilingPlatformROCm.o` (and its sanitizer_common /
interception dependencies) out of `libclang_rt.profile.a` in every PGO
binary. That breaks bots without `-lpthread` and races dlsym/PLT state
in non-HIP programs via the interceptor constructor.

Fix:
- Declare the hook `COMPILER_RT_WEAK` and gate the call on its address.
No `COMPILER_RT_VISIBILITY`: a hidden weak-undef function would be
non-preemptible and the address test would fold to true.
- Gate `installHipModuleInterceptors` on `dlsym(hipModuleLoad)` so the
constructor is a no-op if `ROCm.o` is still pulled in.

Fixes:
- https://lab.llvm.org/buildbot/#/builders/66/builds/31311
- https://lab.llvm.org/buildbot/#/builders/174/builds/36180

Verified:
- `check-profile` 134/134 pass.
- `nm` on a non-HIP `clang -fprofile-generate` binary: zero
`installHip`/`ROCm`/`sanitizer`/`hip_collect` symbols.
- HIP offload PGO end-to-end on gfx1101 (compile → run → `llvm-profdata
merge` → `llvm-cov`) still works; interceptor installs, device profile
collected via shared API.
SplitDebugName checked -o and /o but not /Fo, so clang-cl /Fo<path> /c
fell through to the cwd-relative fallback and every .dwo landed in cwd
under <source-stem>.dwo regardless of the .obj location.
This adds those test cases while llvm#111561 gathers dust.
`loop-fusion` treats any loop-invariant scalar non-anti dependence as
safe to fuse. In the linked issue, it incorrectly allows scalar flow
dependences where the first loop writes a loop-invariant location and
the second loop later reads that same location. Fusion interleaves the
producer and consumer and this changes the value observed by the second
loop.

Example C source would look like:
```C
for (int i = 0; i < N; i++) {
    ptr[0] = i;
}
for (int j = 0; j < N; j++) {
    out[j] = ptr[0];
}
// => after fusion:
for (int i = 0; i < N; i++) {
    ptr[0] = i;
    out[i] = ptr[0];
}
```

This patch makes the DA scalar-dependence shortcut **_more
conservative_**: it no longer accepts every scalar non-anti dependence and
allows only input and output dependences. This preserves the existing safe
read and write cases while preventing the miscompile above.
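
The miscompile is easy to reproduce by running both loop shapes side by side. The sketch below ports the C example to Python; `unfused` and `fused` are names chosen for this illustration.

```python
def unfused(n):
    """Original program: the second loop only sees the final value of ptr[0]."""
    ptr, out = [0], [0] * n
    for i in range(n):
        ptr[0] = i
    for j in range(n):
        out[j] = ptr[0]
    return out


def fused(n):
    """After the invalid fusion: each read sees the value written in the same iteration."""
    ptr, out = [0], [0] * n
    for i in range(n):
        ptr[0] = i
        out[i] = ptr[0]
    return out


before = unfused(4)
after = fused(4)
```

For `n = 4` the unfused loops fill `out` with the last value written, while the fused loop stores a different value in each slot, so the transformation is not semantics-preserving.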

The patch also updates the `loop-fusion` debug message to reflect the
narrower accepted case, updates the existing regression to check the new
debug message, and adds a new regression from the linked issue.

Fixes llvm#191238
…vm#198872)

Add `-fcoverage-mapping`, `-fno-coverage-mapping`,
`-fcoverage-compilation-dir=`, `-ffile-compilation-dir=`, and
`-fcoverage-prefix-map=` to the LinkerWrapper `CompilerOptions`
forwarding list. Without this, passing `-fprofile-instr-generate
-fcoverage-mapping` to clang for a HIP program silently omits the
coverage mapping flags from the embedded device recompilation, so
`__llvm_covmap`/`__llvm_covfun` sections are never emitted for device
code.
@rocm-cciapp

rocm-cciapp Bot commented Jun 11, 2026


@dstutt dstutt left a comment

Primarily just using the downstream version, right? Were the changes already cherry-picked?

Either way, the resolution looks ok to me.

@mariusz-sikora-at-amd

Author

> Primarily just using the downstream version, right?
>
> Either way, the resolution looks ok to me.

I used mostly upstream version.

> Were the changes already cherry-picked?

You are referring to #2636?

@dstutt

dstutt commented Jun 11, 2026


> > Primarily just using the downstream version, right?
> > Either way, the resolution looks ok to me.
>
> I used mostly upstream version.
>
> > Were the changes already cherry-picked?
>
> You are referring to #2636?

Not specifically, it just looked that way.
Even if that isn't the case, the changes LGTM.

@mariusz-sikora-at-amd mariusz-sikora-at-amd merged commit b719329 into amd-debug Jun 11, 2026
29 checks passed
@mariusz-sikora-at-amd mariusz-sikora-at-amd deleted the amd/dev/masikora/amd-debug-merge-candidate branch June 11, 2026 08:43