We need to hold a lock when we're setting the done flag, and we also need to hold a lock when we are checking the done flag in the processing thread.
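The locking pattern described in this comment can be sketched as follows. This is a minimal illustration, not the code under review: `EventWorker`, `push`, `stop`, and `total` are hypothetical names, and the "processing" is just a running sum. The point is that `done_` is written in `stop()` and read in the worker's wait predicate with the same mutex held.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical worker (not the PR's actual class): sums queued values on a
// background thread and exits once done_ is set and the queue is drained.
class EventWorker {
public:
    EventWorker() : thread_([this] { run(); }) {}

    ~EventWorker() { stop(); }

    // Called from application threads to queue one item of work.
    void push(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(value);
        }
        cv_.notify_one();
    }

    // Sets the done flag while holding the lock, then joins the worker.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    // Only meaningful after stop() has returned.
    std::uint64_t total() const { return total_; }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // The predicate reads done_ and queue_ with the lock held.
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
                std::uint64_t value = queue_.front();
                queue_.pop_front();
                lock.unlock();  // do the actual work outside the lock
                total_ += value;
                lock.lock();
            }
            // The lock is held again here, so this read of done_ is safe.
            if (done_) return;
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::uint64_t> queue_;
    bool done_ = false;
    std::uint64_t total_ = 0;
    std::thread thread_;  // declared last so the other members exist first
};
```

Without the lock in `stop()`, the worker could evaluate the predicate, see `done_ == false`, and then miss the notification before it blocks, which would hang the join.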
Description of Changes
Adds support for processing device timing events on a separate thread. This can significantly reduce overhead because the application threads may continue executing while prior device timing events are processed. Chrome trace flushing is also moved to a separate thread, which may improve performance as well, though additional work is needed to eliminate locks and take full advantage of multiple threads for chrome tracing.
For applications that do not want to create additional threads, this PR also adds a control to continue processing device timing events in the application threads. This control can be set via `cliloader` by passing the `--no-threads` or `-nt` command line options.
Finally, adds support for building with the thread sanitizer enabled.
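As a rough illustration of the two modes described above, the sketch below runs submitted work either on a background thread or inline in the calling thread. It is an assumption-laden example, not the implementation in this PR: `TaskDispatcher`, `submit`, `finish`, and the `useThreads` constructor argument are hypothetical names standing in for the real control.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical dispatcher: runs tasks on a background thread, or inline in
// the calling thread when useThreads is false (similar in spirit to a
// "no threads" control).
class TaskDispatcher {
public:
    explicit TaskDispatcher(bool useThreads) : threaded_(useThreads) {
        if (threaded_) worker_ = std::thread([this] { run(); });
    }

    ~TaskDispatcher() { finish(); }

    void submit(std::function<void()> task) {
        if (threaded_) {
            std::unique_lock<std::mutex> lock(mutex_);
            if (!done_) {
                tasks_.push(std::move(task));
                lock.unlock();
                cv_.notify_one();
                return;  // the caller continues without waiting
            }
        }
        task();  // inline: the calling thread pays the processing cost
    }

    // Drains any queued tasks and joins the worker, if there is one.
    void finish() {
        if (!threaded_) return;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
            if (tasks_.empty()) return;  // only reachable once done_ is set
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            lock.unlock();  // run the task without holding the lock
            task();
            lock.lock();
        }
    }

    const bool threaded_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;
};
```

With `TaskDispatcher d(false)`, each `submit` call returns only after the task has run. With `TaskDispatcher d(true)`, `submit` returns immediately and the work completes no later than `finish()`.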
Testing Done
Tested with an OpenVINO benchmark app. Prior to this change, the benchmark app reported ~1450 fps without the OpenCL Intercept Layer and ~975 fps with device performance timing enabled (note: with max enqueue set to 2M). After this change, the benchmark app reported ~1440 fps with device performance timing enabled, meaning that device performance timing was essentially free.