For launch patch, not merge by LifengWang · Pull Request #3 · yanbing-j/pytorch

LifengWang · 2025-07-21T07:24:11Z

update xeon launch script for Intel(R) Xeon 6 support

* [CPUBlas] add brgemm API for fp8 GEMM * Update code

Fix CI failures

…rch#165479) These happen when building with CMAKE_BUILD_TYPE=RelWithAssert This should fix two types of failures that started with pytorch#163665 Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now) The first one type is Truncated: ``` default_pg, _ = _new_process_group_helper( File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper backend_class = creator_fn(dist_backend_opts, backend_options) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg return FakeProcessGroup._create_internal( RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero. Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 pytorch#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0 pytorch#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0 pytorch#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0 pytorch#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 ``` and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR The other one type is ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import self.assertEqual(out, "") File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != '' - /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode. - if is_available() and not torch._C._c10d_init(): To execute this test, run the following from the base repo dir: python test/test_testing.py TestImports.test_no_warning_on_import ``` which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one? Pull Request resolved: pytorch#165479 Approved by: https://github.com/ezyang

If another static object (like `g_device_config_parse_hook_registry_instance` created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (operator=) can fail, which leads to a runtime error. When I use a compilation optimization of ` -O1` I see this issue: ``` [src/libcxx/include/__functional/function.h:496]:14: runtime error: member access within null pointer of type 'const __policy' #0 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:496]:14 #1 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:483]:19 #2 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:727]:8 #3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) [torch/c10/core/AllocatorConfig.h:263]:32 #4 0x563224e28e9d in DeviceConfigParserHookRegistry [torch/c10/core/AllocatorConfig.h:369]:5 #5 0x563224e28e9d in __cxx_global_var_init.34 [torch/c10/cuda/CUDAAllocatorConfig.cpp:195]:1 #6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp pytorch#7 0x5632459709ac in __libc_csu_init /[usr/grte/v5/debug-src/src/csu/elf-init.c:88]:7 pytorch#8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d) pytorch#9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120 ``` Pull Request resolved: pytorch#172581 Approved by: https://github.com/guangyey, https://github.com/albanD

…nces between x86 vs aarch64 (pytorch#176085) In the test: ``` python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));` which got the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace is different between `x86` vs `aarch64` due perhaps in these issues: - pytorch#119905 - pytorch#134387 In the current setup when getting a stack trace string: - x86 contains `C++ CapturedTraceback:` - aarch64 contains `Exception raised from` + `frame #` An example of the full string from an aarch64 system when : ``` AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe pytorch#7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe pytorch#11: /usr/bin/python() [0x504a34]\nframe pytorch#12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe pytorch#13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe pytorch#14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe pytorch#15: /usr/bin/python() [0x52a070]\nframe pytorch#16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe pytorch#17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe pytorch#18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe pytorch#19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe pytorch#20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe pytorch#21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe pytorch#22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe pytorch#23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe pytorch#25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n' To execute this test, run the following from the base repo dir: python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` Pull Request resolved: pytorch#176085 Approved by: https://github.com/eqy

… mode + replication padding (pytorch#177166) Fixes pytorch#170079 ## Context `torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`. The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic). ## Dynamo limitations encountered Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly: ### 1. `importlib.import_module` is marked as skipped ```python @torch.compile(fullgraph=True) def fn(x): import importlib return importlib.import_module("torch").sin(x) fn(torch.randn(3)) # Unsupported: function marked as skipped ``` ### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`) ```python @torch.compile(fullgraph=True) def fn(x): from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT) return x.to(dt) fn(torch.randn(3)) # Unsupported: torch.* op returned non-Tensor ``` ### 3. `torch._check` with closure lambda ```python @torch.compile(fullgraph=True) def fn(x): dim = x.dim() torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D") return x + 1 fn(torch.randn(3, 3)) # Unsupported: Can't extract message from torch._check() ``` ## Iteration log | # | Approach | Who | Tests | Reviewer pushback | Why it failed | |---|----------|-----|-------|-------------------|---------------| | 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` | | 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) | | 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure | | 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A | ## Key insight The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary. Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode. Authored with Claude. Pull Request resolved: pytorch#177166 Approved by: https://github.com/mlazos, https://github.com/williamwen42

Xia-Weiwen and others added 3 commits July 21, 2025 07:19

Add fp8 brgemm (#1)

506f4d0

* [CPUBlas] add brgemm API for fp8 GEMM * Update code

Upgrade oneDNN to v3.9

3055110

Fix CI failures

update xeon launch script for Intel(R) Xeon 6 support

f588fad

yanbing-j force-pushed the yanbing/tf32_dev_branch_for_test branch 2 times, most recently from 70589bb to 6661f7b Compare July 30, 2025 05:50

yanbing-j force-pushed the yanbing/tf32_dev_branch_for_test branch from a55d505 to d68002e Compare August 20, 2025 05:48

yanbing-j force-pushed the yanbing/tf32_dev_branch_for_test branch from d68002e to b31822d Compare September 8, 2025 02:57

Xia-Weiwen force-pushed the yanbing/tf32_dev_branch_for_test branch from b31822d to 93c07c9 Compare September 8, 2025 06:30

yanbing-j force-pushed the yanbing/tf32_dev_branch_for_test branch from 93c07c9 to d51eb60 Compare October 15, 2025 01:37

yanbing-j force-pushed the yanbing/tf32_dev_branch_for_test branch from d51eb60 to b66e7cf Compare November 24, 2025 02:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For launch patch, not merge#3

For launch patch, not merge#3
LifengWang wants to merge 3 commits into
yanbing-j:yanbing/tf32_dev_branch_for_testfrom
LifengWang:185_pr_new_bkc

LifengWang commented Jul 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

LifengWang commented Jul 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants