Test Mock GPU sends expected packets #37
Open
imyixinw wants to merge 72 commits into
This patch implements the changes needed to allow a GPU plug-in to detect that it is loadable and return a connection URL back to LLDB. LLDB will attach to the GPU process automatically: lldb-server listens on port 0, figures out the real port, and sends it back to LLDB. ProcessGDBRemote then creates a new target and attaches to it automatically.
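The "listen on port 0" step can be illustrated with a minimal sketch using plain sockets. This is not the lldb-server implementation; the helper name and the `connect://` URL shape are assumptions made for illustration.

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Bind to port 0 so the OS picks a free port, then report the real port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 means "any free port"
    server.listen(1)
    real_port = server.getsockname()[1]  # the port the OS actually assigned
    return server, real_port


server, port = listen_on_ephemeral_port()
# Hypothetical URL that a plug-in could hand back to the debugger.
connect_url = f"connect://127.0.0.1:{port}"
server.close()
```

Binding to port 0 avoids hard-coding a port that may already be in use, which matters when several debug sessions run on one machine.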
This patch formalizes the plug-ins and moves the code into the LLDB library lldbPluginProcessGDBRemote. This allows the server object GDBRemoteCommunicationServerLLGS to own a list of GPU plug-ins. The process stopped callback was removed; we now iterate over all of the installed GPU plug-ins in the GDBRemoteCommunicationServerLLGS class itself. This will allow more functionality to be implemented for GPU plug-ins.
GPU plugins can now initialize a GPUPluginInfo structure by overriding LLDBServerPlugin::InitializePluginInfo(). This info will be sent to LLDB. LLDB will now set any breakpoints that are requested in this structure. Plugins can specify a breakpoint by name with a shared library basename, and can specify any symbols whose values are desired when the breakpoint gets hit. When a breakpoint gets hit, the function:

void LLDBServerPlugin::BreakpointWasHit(GPUPluginBreakpointHitArgs &args)

will get called with any symbol values that were requested.
The response to jGPUPluginBreakpointHit is now JSON and can:
- disable the breakpoint if disable_bp is set to true
- set new breakpoints if there is an array of new breakpoints to set
- return a connect URL to allow reverse connection

The functionality for disabling the breakpoint, setting new breakpoints and connecting to the URL is still TODO.
Many GDB remote packets send JSON. This patch adds new Put methods to StreamGDBRemote to facilitate sending objects that are converted to JSON and sent as escaped JSON text.
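JSON text needs escaping because the GDB remote protocol reserves `#`, `$`, `}` and `*` inside packet payloads: each is sent as `}` followed by the byte XOR 0x20. The sketch below shows that escaping applied to serialized JSON. The function names are hypothetical and do not mirror the StreamGDBRemote methods.

```python
import json

ESCAPE = 0x7D  # '}' introduces an escaped byte in the GDB remote protocol
NEEDS_ESCAPE = {ord("#"), ord("$"), ord("}"), ord("*")}


def escape_payload(data: bytes) -> bytes:
    """Escape reserved bytes as '}' followed by (byte XOR 0x20)."""
    out = bytearray()
    for b in data:
        if b in NEEDS_ESCAPE:
            out.append(ESCAPE)
            out.append(b ^ 0x20)
        else:
            out.append(b)
    return bytes(out)


def unescape_payload(data: bytes) -> bytes:
    """Reverse escape_payload()."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESCAPE:
            out.append(next(it) ^ 0x20)
        else:
            out.append(b)
    return bytes(out)


def put_as_escaped_json(obj) -> bytes:
    """Serialize obj to compact JSON and escape it for use in a packet."""
    text = json.dumps(obj, separators=(",", ":"))
    return escape_payload(text.encode("utf-8"))
```

Since every JSON object ends in `}`, any JSON payload placed directly in a packet body needs at least one escape.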
- Added the ability to encode JSON as hex ASCII so JSON can be used in stop reply packets
- Added exe_path to GPUPluginConnectionInfo
- Added StringExtractorGDBRemote::GetFromJSONText to decode objects from JSON text.
- Added StringExtractorGDBRemote::GetFromJSONHexASCII to decode objects from JSON hex ASCII.
- Change GDBRemoteCommunicationClient::GPUBreakpointHit to return an optional GPUPluginBreakpointHitResponse and moved TODO work over into caller function where it makes more sense.
- Change stop reply GPU connections to use JSON encoding of a GPUPluginConnectionInfo object.
- Remove LLDBServerPlugin::GetConnectionURL() and replace with a function that is designed to always create a connection.
- Added LLDBServerPlugin::NativeProcessIsStopping() to let the GPU plug-in know that the native process is stopping in case the plug-in wants to start a connection on stop. The Mock GPU plug-in isn't using this feature anymore.
- Handle disabling the breakpoint in ProcessGDBRemote::GPUBreakpointHit and also setting any new breakpoints and connecting to LLDB if the breakpoint hit response struct GPUPluginBreakpointHitResponse asks for it.
- Rename "gpu-url" in the stop reply packet to "gpu-connection" and convert it to use hex ASCII encoded JSON version of GPUPluginConnectionInfo
- Added ProcessGDBRemote::HandleGPUBreakpoints() to handle any breakpoints requested by the GPU plug-ins since those requests can now be done by the plugin info or a breakpoint hit response
- Added ProcessGDBRemote::HandleConnectionRequest() to handle reverse connecting from stop reply or from breakpoint hit response.
- Fix the size of RegisterContextMockGPU::g_gpr_regnums and g_vec_regnums
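Stop reply packets are built from `key:value;` pairs, so raw JSON (which contains `:`) would presumably be ambiguous there; hex ASCII encoding sidesteps that. The following is a rough sketch of the encoding, not the LLDB code. The helper names are hypothetical, and only `exe_path` and the `gpu-connection` key are taken from the list above.

```python
import json


def to_json_hex_ascii(obj) -> str:
    """Serialize obj to JSON, then encode each byte as two hex digits."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8").hex()


def from_json_hex_ascii(text: str):
    """Decode hex ASCII back into bytes and parse the JSON."""
    return json.loads(bytes.fromhex(text).decode("utf-8"))


# Illustrative connection info; the path value is made up.
info = {"exe_path": "/tmp/a.out"}
stop_reply_field = "gpu-connection:" + to_json_hex_ascii(info) + ";"
```

The encoded value contains only hex digits, so it cannot collide with the `:` and `;` delimiters of the stop reply.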
This patch allows GPU plug-ins to return GPUActions structs at various times:
- Each plugin must return a GPUActions in response to the function GPUActions LLDBServerPlugin::GetInitializeActions(), which replaces the old and deprecated void LLDBServerPlugin::InitializePluginInfo()
- Optionally in response to a LLDBServerPlugin::NativeProcessIsStopping() call each time the process stops
- In a breakpoint was hit response.

This allows many ways to set breakpoints at various times. The GPUActions struct also allows a connection to be made at all three of the above times.

Changes:
- The SymbolValue class now has its "value" as an optional uint64_t. If the symbol value that is sent to the GPU plugin during a breakpoint was hit method call doesn't have a value, then the symbol is not available or can't be resolved to a load address.
- Removed the GPUPluginInfo class, plugins now return a GPUActions object.
- Added a GPUActions object that has the plug-in name, any breakpoints that need to be set, and an optional connection info struct. This allows plug-ins to return a GPUActions in response to the initializing of the plugin, breakpoint was hit or each time the native process stops.
- Added GDBRemoteCommunicationServerLLGS::GetCurrentProcess() to allow plug-ins to access the current NativeProcessProtocol.
- Removed LLDBServerPlugin::GetPluginInfo() and LLDBServerPlugin::InitializePluginInfo(). Clients now must implement the GPUActions LLDBServerPlugin::GetInitializeActions()
- Removed the virtual CreateConnection() from LLDBServerPlugin.
- Updated the Mock GPU to set a new breakpoint on the third stop to test that we can return a GPUActions from a native process stop.
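To make the shape of these objects concrete, here is a small Python model of a GPUActions-like structure and its JSON form. It is only a sketch. The real types are C++ structs, and every field name here (`plugin_name`, `breakpoints`, `connect_info`, `function_name`, `shlib_basename`, `symbol_names`, `connect_url`) is an assumption based on the description above, not the actual wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SymbolValue:
    name: str
    # None models the "optional uint64_t": the symbol is unavailable
    # or could not be resolved to a load address.
    value: Optional[int] = None


@dataclass
class ConnectionInfo:
    connect_url: str
    exe_path: Optional[str] = None


@dataclass
class Breakpoint:
    function_name: str
    shlib_basename: Optional[str] = None
    symbol_names: List[str] = field(default_factory=list)


@dataclass
class GPUActions:
    plugin_name: str
    breakpoints: List[Breakpoint] = field(default_factory=list)
    connect_info: Optional[ConnectionInfo] = None


def to_json(actions: GPUActions) -> str:
    """Convert the nested dataclasses to plain dicts and serialize them."""
    return json.dumps(asdict(actions))
```

Using one structure for initialization, native stops and breakpoint hits means the debugger side needs only a single handler for all three cases.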
Cleaned up the register context code to auto generate the register offset in the RegisterContext class. Also rename m_regs_valid to m_reg_value_is_valid to be more clear that this represents if the value of the register is valid in the data structures.
Added a "jGPUPluginGetDynamicLoaderLibraryInfo" packet that gets GPU dynamic loader information from a GPU. It does this when the stop reason for a thread is set to eStopReasonDynammicLoader. The NativeProcessProtocol::GetGPUDynamicLoaderLibraryInfos() will get called with the arguments to get the full list of shared libraries or a partial list.

The shared libraries can be specified in many ways:
- path to a library on disk for self-contained object files
- path to a file on disk that contains an object file at a file offset and file size
- native process memory location where the object file is loaded and should be read from

Shared libraries can then specify how to get loaded by:
- Specifying a load address where all sections will be slid evenly
- Specifying each section within an ELF file and exactly where it should be loaded; this can include subsections being loaded
The dynamic loader for GPUs is implemented by the NativeProcessProtocol function:

std::optional<GPUDynamicLoaderResponse> NativeProcessProtocol::GetGPUDynamicLoaderLibraryInfos(const GPUDynamicLoaderArgs &args);

Most of the functionality is here; we still need to make the sections load correctly.
Changed the definition of GPUSectionInfo to not have children. It can now have an array of section names that define the hierarchy to use when finding the section by name. If there is only one section name, then we find it regardless of the hierarchy. Hooked up the loading of sections and the loading of the entire file.
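The lookup rule described above (a list of names walks the hierarchy, a single name matches anywhere) could look roughly like the sketch below. The `Section` class and both helper functions are hypothetical stand-ins, not LLDB's section classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    name: str
    children: List["Section"] = field(default_factory=list)


def find_anywhere(sections: List[Section], name: str) -> Optional[Section]:
    """Depth-first search for the first section with this name."""
    for section in sections:
        if section.name == name:
            return section
        found = find_anywhere(section.children, name)
        if found is not None:
            return found
    return None


def find_by_names(top_level: List[Section], names: List[str]) -> Optional[Section]:
    """One name: match at any depth. Several names: follow them as a path."""
    if not names:
        return None
    if len(names) == 1:
        return find_anywhere(top_level, names[0])
    current = top_level
    section = None
    for name in names:
        section = next((s for s in current if s.name == name), None)
        if section is None:
            return None
        current = section.children
    return section
```

For example, `[".data"]` finds `.data` wherever it lives, while `["PT_LOAD[0]", ".text"]` only matches a `.text` that sits directly under `PT_LOAD[0]`.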
This patch makes it so that a breakpoint in the native process can cause shared libraries to be loaded in the GPU process. When GPU plug-ins respond to the LLDBServerPlugin method:

GPUPluginBreakpointHitResponse LLDBServerPlugin::BreakpointWasHit(GPUPluginBreakpointHitArgs &args);

they can now set GPUPluginBreakpointHitResponse.load_libraries to true. The GPU plug-in should already have notified LLDB that it is stopped prior to sending this, to ensure that no progress is made on the GPU while shared libraries are loaded and breakpoints get resolved.

Code was added to Target.h that allows a target to know about all of the installed GPU plug-ins. This allows a native process to make calls on the GPU process, and also allows the GPU process to get the native target. This will allow code to resume both targets when one gets resumed, if that is the preferred methodology for the native process and GPU. It also allows us to stop the native target from the GPU target and vice versa.

The LLDBServerPluginMockGPU now tests that breakpoints by address work by setting a breakpoint by address from the "gdb_shlib_load" symbol that is requested by the "gpu_initialize" breakpoint.
The idea behind this change is a GDB server can return a "reason:dyld;" by having its NativeThreadProtocol class return a stop reason of lldb::eStopReasonDynammicLoader. This allows the GPU to halt its process and return this stop reason during a LLDBServerPlugin::BreakpointWasHit() call. When the GPU reports this stop the shared libraries will be requested and a stop StopReasonDyld() will be created in LLDB that will auto resume the GPU process.
Adding the this pointer to each log line for sending and receiving packets allows us to see the log for the native process and GPU plug-in so we can tell which communication is being used.
Only allow the native process linux to claim it has GPU plug-ins. Prior to this change the GPU GDB server connection was claiming to have GPU plug-ins as well.
Anytime a GPUActions is returned, clients can now set GPUActions.resume_gpu_process = true if they want to resume the GPU process from the native process.
This patch enables a ModuleSpec to specify the file offset and size with: void SBModuleSpec::SetObjectOffset(uint64_t offset); void SBModuleSpec::SetObjectSize(uint64_t size); And allows LLDB to load the right file for AMD.
If a GPU plug-in needs to halt the process so that it receives a LLDBServerPlugin::NativeProcessIsStopping(...) call, then it can call this function. This allows synchronization with the native process, and the GPU plug-in can return GPUActions to perform.
resume. GPUActions now has a new "bool wait_for_gpu_process_to_resume" member variable. If this is set then when the native process gets GPUActions, they will wait for the GPU process to resume.
…the logs

This allows differentiating GPU and CPU gdb-remote logs emitted by the server. E.g.

```
1749485814.359431028 [754988/755010] nvidia-gpu.server < 19> read packet: $QStartNoAckMode#b0
1749485814.359459400 [754988/755010] nvidia-gpu.server < 1> send packet: +
1749485813.658699989 [754988/754988] gdb-server < 29> read packet: $qXfer:auxv:read::0,131071#d7
1749485813.658788681 [754988/754988] gdb-server < 341> send packet: $l!
```
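With the channel name on every line, a log can be split per server after the fact. The snippet below is a minimal sketch of that, assuming the line layout matches the sample above; the regular expression and function name are written for this example and are not part of LLDB.

```python
import re
from collections import defaultdict

# timestamp [pid/tid] channel <length> direction packet: payload
LOG_LINE = re.compile(
    r"^(?P<time>\d+\.\d+) \[(?P<pid>\d+)/(?P<tid>\d+)\] (?P<channel>\S+) "
    r"<\s*(?P<length>\d+)> (?P<direction>read|send) packet: (?P<packet>.*)$"
)


def group_packets_by_channel(lines):
    """Return {channel: [(direction, packet), ...]} for every parsable line."""
    grouped = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            grouped[match["channel"]].append((match["direction"], match["packet"]))
    return dict(grouped)


sample = [
    "1749485814.359431028 [754988/755010] nvidia-gpu.server < 19> read packet: $QStartNoAckMode#b0",
    "1749485814.359459400 [754988/755010] nvidia-gpu.server < 1> send packet: +",
    "1749485813.658699989 [754988/754988] gdb-server < 29> read packet: $qXfer:auxv:read::0,131071#d7",
]
by_channel = group_packets_by_channel(sample)
```

In the sample, the number in angle brackets equals the packet length in characters (19 for `$QStartNoAckMode#b0`), which gives a quick sanity check when parsing.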
This will make rebases for the nvidia branch easier.
* Add tests for amdgpu lldb-server plugin

This commit adds some basic unit tests for the AMDGPU plugin. It modifies the test configuration to expose the HIPCC compiler to the makefile when the AMDGPU plugin is enabled. It also modifies the dotest framework so that we can compile .hip files using the Makefile.rules like we do for other tests. The unit tests compile a simple HIP program and validate that we create the GPU target successfully and that we can hit a breakpoint in the GPU code.

* Use dynamic line number for breakpoint
* Use LLDB_ENABLE_AMDGPU_PLUGIN configuration variable
* Rename getRocmArgs to getHipccArgs
It was written with mm instead of m.
The function Target::SetGPUPluginTarget() was not setting the back link to the CPU target correctly which meant that Target::GetNativeTargetForGPU() was not working. This patch fixes the issue.
GPUActions could previously be added to the CPU process' stop reply packets by having the native process iterate over all installed GPU plug-ins and ask each one if it has GPUActions to add to the stop reply packet. But we can also allow GPU NativeProcessProtocol subclasses to add actions to the GPU stop reply packets. This patch adds the ability for NativeProcessProtocol subclasses to add GPUActions to stop reply packets, which allows GPU processes to add actions. These actions will be performed on the CPU process. This will allow us to create a fake stop reason which the GPU reports, and it can include actions for the CPU process. AMD currently has its GPU debug driver send events that notify that it wants to set a native breakpoint, but the GPU process doesn't have any way to communicate this to LLDB. Now we can create a fake stop reason and inject GPUActions into the GPU stop reply packet, which will be executed on the CPU process, and then the GPU process can auto continue.
This removes a check that ensures that at most one plugin was enabled for a build configuration. With this change, multiple plugins can be enabled in the build, which allows for running all nvidia and amd tests, for example.
I'm experimenting with initializing asynchronously and I think it's a good idea to make sync initialization an option in the GPUActions. Given that AMD was using this by default, I modified it accordingly.
Currently we set the GPU loader breakpoint by address in the `amd_dbgapi_insert_breakpoint_callback` callback. This means we need to halt the CPU in order to correctly set the breakpoint. When we halt the CPU it shows up as a public stop to the user and interrupts the debugging flow. This PR changes it so that we set the loader breakpoint by name when we first create the GPU connection. We set the breakpoint on the `rocr::_loader_debug_state` function, which was discovered by looking up the address passed into the callback. This is a bit of a hack since the function name could potentially change over time or may be unavailable if the runtime was statically linked. But we use it for now to make the debugger experience better until we can find a more permanent solution.
We rely on this value to determine whether to connect to the gpu process synchronously or not. It was not being serialized in the json which is a problem for the amdgpu plugin since we currently require it to be set to true.
The base class has some useful helper functions that can be used to simplify writing tests. This PR is simply a refactoring of the existing test cases but future PRs will also make use of this functionality.
* [amd] Add tests for reading and writing gpr registers

This commit adds two tests for registers:
* Verify we can read and write all vgprs (v0-v255) and sgprs (s0-s101)
* Verify that updating a subset of lanes in a vector register works

To check the reading and writing of gprs we initialize them to a known value using inline assembly. Then we verify that we read the expected value after hitting the initial GPU breakpoint and that we can write an updated value and read it back.

The second test (`test_vector_lane_read_write`) makes sure that we can update the elements of a vector register to different values. It is needed because the first test splats the same value to all lanes in a wave.

Since we have such a large number of registers I added a script (`gen.py`) to help with generating the assembly to initialize the values and the Python code to hold the expected values. This is much easier than writing it all out by hand.

We did not test the agpr registers in this commit because we seem to have a bug reading the correct values. Will fix the bug and add tests in a future patch.
Added full ArchSpec support for all GPU variants. This fixes 32 bit r600 support and also creates new ArchSpec::Core definitions for every AMD GPU. I modified ObjectFileELF.cpp to decode the exact GPU core by parsing the information in the ELF header.
This generalizes clayborg#22 by creating a base test class for all GPU tests. I changed some getters to properties to match the patterns already used by LLDB tests.
We needed to give names to the GDBRemoteCommunication and GDBRemoteCommunicationServer constructors after Walter's changes.
This adds tests that create minimal ELF files, each with the right cpu type, and verify that they are created as 32 or 64 bit files and have the right target triple.
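For illustration, a minimal ELF header can be produced with a few lines of Python. This sketch is not the test code from the patch. It only builds the file header for a given class and machine and reads those two fields back, and the helper names are hypothetical. `EM_AMDGPU` (224) and the header layout come from the ELF specification.

```python
import struct

ELFCLASS32, ELFCLASS64 = 1, 2
ELFDATA2LSB = 1  # little-endian
ET_REL = 1
EM_AMDGPU = 224


def make_minimal_elf_header(elf_class: int, machine: int, flags: int = 0) -> bytes:
    """Build only the ELF file header (52 bytes for ELF32, 64 for ELF64)."""
    # e_ident: magic, class, data encoding, version, then 9 zero bytes
    # covering OS ABI, ABI version and padding.
    ident = b"\x7fELF" + bytes([elf_class, ELFDATA2LSB, 1]) + bytes(9)
    if elf_class == ELFCLASS64:
        fmt, ehsize = "<HHIQQQIHHHHHH", 64
    else:
        fmt, ehsize = "<HHIIIIIHHHHHH", 52
    # e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
    # e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
    rest = struct.pack(fmt, ET_REL, machine, 1, 0, 0, 0, flags, ehsize, 0, 0, 0, 0, 0)
    return ident + rest


def describe(header: bytes):
    """Return (address size in bits, e_machine) from an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {ELFCLASS32: 32, ELFCLASS64: 64}[header[4]]
    (machine,) = struct.unpack_from("<H", header, 18)  # e_machine follows e_type
    return bits, machine
```

A real test would go further and load the file to check the resulting target triple. This sketch stops at the two header fields that distinguish 32 bit from 64 bit AMD GPU objects.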
This patch adds the ability to read memory from address spaces. Important parts of this patch include:
- Add a new AddressSpec class that can be used to contain all information needed to read from an address space.
- Address spaces can be enabled by the lldb-server connections by overriding NativeProcessProtocol::GetMemorySpaceInfo() and also specifying the Extension::memory_spaces bit in Manager::GetSupportedExtensions().
- Add a "--space <name>" argument to the "memory read" command that plumbs through to the new read memory variant that uses the new AddressSpec class.
- Add support for the address space reading into the Mock GPU plug-in.
This patch cleans up the address space commit and adds the needed public API to read from memory spaces:
- Add lldb::SBProcess::ReadMemoryFromSpec(...)
- Add lldb::SBAddressSpec
- Modify any needed python/swig stuff to make this work.
The AMDGPU disassembler needs to know the precise cpu in order to disassemble correctly. This is because the triple for these targets is not enough to identify the target architecture. The disassembler in lldb was already passing down the target cpu correctly so we just need to hook it up for the amd plugin.

1. Query the target's architecture for the `GetClangTargetCPU()` value and use that for disassembly if it has a valid value.
2. Pass the cpu type and cpu sub-type back from the gdbserver for amdgpu and use it to create the target architecture. The cpu type and sub-type are currently only enabled for MachO, but we need them for amdgpu elf files as well.

There was an existing target property (`GetDisassemblyCPU`) that is used to control the cpu for the disassembler. We wanted to keep the property code simple to just read the property value. So now when the target's arch spec is updated we will use that to set the target property if there is not an existing value.
This commit changes the `GetGPUDynamicLoaderLibraryInfos` callback so that we only return the new modules when `full` is false. I initially thought it would be a simple change, but after investigating I decided to create a new class to manage the loaded module state. The state management is a bit complicated because the gpu debug library only returns the full set of loaded code objects, but lldb wants to be able to just get the list of modules that have changed. Additionally, we need to be able to track when a module has been unloaded. There is no separate event for that so we have to track it by detecting when a previously loaded module is no longer in the current list of active code objects returned by the debug library. The GpuModuleManager class hides those details and provides a convenient interface that works well to bridge the gap between lldb and the debug library. As part of this change I enabled unit tests for the lldb-server plugins using the same cmake variables that we use to enable the plugins. I also fixed a memory leak where we were previously not freeing the code object uri string.
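The core of this bookkeeping is a set difference between the previous and current snapshots. The sketch below shows that idea only. `ModuleTracker` is a hypothetical name and does not reflect the actual GpuModuleManager interface.

```python
class ModuleTracker:
    """Turn successive full snapshots of loaded modules into deltas."""

    def __init__(self):
        self._loaded = set()

    def update(self, current_uris):
        """Take the full current list; return (newly loaded, unloaded)."""
        current = set(current_uris)
        added = sorted(current - self._loaded)
        # No unload event exists, so anything that was loaded before and is
        # missing from the current snapshot is treated as unloaded.
        removed = sorted(self._loaded - current)
        self._loaded = current
        return added, removed
```

For example, after `update(["a", "b"])`, a call to `update(["b", "c"])` returns `(["c"], ["a"])`: `c` is new and `a` has been unloaded.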
This patch adds:
- The ability to wait for the GPU process to stop in a GPUActions
- The ability to wait for the GPU process to resume successfully
- Modified the Mock GPU to test all the new features.
- Changed breakpoint identifiers from a string to a uint32_t. This allows clients to do comparisons on integers instead of strings.
- Move code that adds GPU actions into GDBRemoteCommunicationServerLLGS::SendStopReplyPacketForThread()
- Remove extra stop reply args from many GDBRemoteCommunicationServerLLGS methods.
- Add a new SyncState class to ProcessGDBRemote to help with process synchronization with complete logging.
- Improve logging in ProcessGDBRemote::HandleGPUActions().
- Add an optional uint32_t stop_id into GPUActions to help with process synchronization, and add bool wait_for_gpu_process_to_stop to allow the CPU process to wait for the GPU to stop.
…m#31) Update log to use the int format instead of string.
Deleting some code from the plugin that is not actually being called from anywhere.
walter-erquinigo
approved these changes
Sep 25, 2025
7442326 to
c06a83d
Compare
This PR adds a unit test that checks that the correct messages are sent between the Mock GPU server and the native server. It ensures that packets are sent and received with the expected content, validating key plugin behaviors such as breakpoint disabling and library loading flags.
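When comparing packets against expected content, it helps to know how a packet is framed: `$`, the payload, `#`, then a two-digit hex checksum equal to the sum of the payload bytes modulo 256. The helpers below are a hypothetical sketch of that framing in Python. They are not the test added in this PR.

```python
def checksum(payload: str) -> int:
    """GDB remote checksum: sum of the payload bytes modulo 256."""
    return sum(payload.encode("latin-1")) % 256


def frame_packet(payload: str) -> str:
    """Wrap a payload as $<payload>#<two hex digit checksum>."""
    return f"${payload}#{checksum(payload):02x}"


def parse_packet(packet: str) -> str:
    """Validate the framing and checksum, then return the payload."""
    if not packet.startswith("$") or len(packet) < 4 or packet[-3] != "#":
        raise ValueError("malformed packet")
    payload = packet[1:-3]
    if int(packet[-2:], 16) != checksum(payload):
        raise ValueError("checksum mismatch")
    return payload
```

For instance, `frame_packet("QStartNoAckMode")` produces `$QStartNoAckMode#b0`, a packet that every gdb-remote session sends when it turns off acknowledgements.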
Testing