Triage guide
The overall goal of doing triage is to make sure we don’t miss something important, completely or until “late”, and also notice any trends we may have with crashes or intermittent failures, or in any particular areas of the code. The idea is to categorize the bugs as they come in so that we know which ones need a jump on, which ones can wait a bit, maybe ask for some information that is missing, maybe CC the right people, etc.
We will cover these components: Canvas: 2D, Canvas: WebGL, GFX: Color Management, Graphics, Graphics: Layers, Graphics: Text, Image Blocking, ImageLib, WebRender.
Some guidelines:
- A good guideline is ~15 minutes per bug, but some may need more time.
- This isn’t about finding a cause, and it isn’t about the full prioritization.
- This is about noticing things sooner.
- This is about asking the bug author for info that may be missing or would help with the triage.
- This is about asking for a regression range, or even getting one if you can reproduce the problem and you have time.
- This is about finding people who might know more/be able to help if something is important
See the triage schedule to get an overview of who is responsible for triage this week, and how we did with past bugs.
Note: if you ever need to change your scheduled week, you can reach out to team members and find someone to swap with. The system checks the "Graphics Triage" Google calendar for "[Incoming] *FullName*", so all that is required is changing the calendar event's name.
As of May 2020, Firefox has changed the approach around triaging bugs to be focused on severity rather than priority. The idea is this will help give a better indication of the health of a release.
| Severity | Description |
|---|---|
| S1 | Catastrophic - Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill, and no workaround available. Useful ways to consider framing this: |
| S2 | Serious - Major functionality/product severely impaired and a satisfactory workaround doesn't exist. Useful ways to consider framing this: |
| S3 | Normal - Blocks non-critical functionality and a workaround exists. Useful ways to consider framing this: |
| S4 | Small/Trivial - minor significance, cosmetic issues, low or no impact to users |
| N/A | Not applicable - Only valid for bugs of type Task or Enhancement. |
Tracking status indicates in which releases the defect is present.
The options are:
| Status | Description |
|---|---|
| — | we don’t know whether Firefox N is affected |
| ? | we don’t know whether Firefox N is affected, but we want to find out. |
| affected | present in this release |
| unaffected | not present in this release |
| fixed | a contributor has landed a fix to address the issue |
| verified | the fix has been verified by QA or other contributors |
| disabled | the fix or the feature has been backed out or disabled |
| verified disabled | QA or other contributors confirmed the fix or the feature has been backed out or disabled |
| wontfix | we have decided not to accept/uplift a fix for this release cycle (it is not the same as the bug resolution WONTFIX). This can also mean that we don’t know how to fix that and will ship with this bug |
| fix-optional | we would take a fix for the current release but don’t consider it as important/blocking for the release |
It will not always be clear who should be the first person to investigate an issue after triage. These alphabetically ordered tables represent a best effort for suggestions for who might be able to help out. It may not be immediately obvious what sorts of issues fall into these categories at first, but you will develop a sense of this over time.
If you don't know, just ask Jessie or a random person on the team to help you choose, or worst case, just choose Jeff M (sorry Jeff). Members of the team are encouraged to add themselves and/or update entries as they see fit, but particularly when this list was insufficient and you sought help in the selection. This list is expected to evolve as our projects and priorities evolve.
In case you'd like extra eyes on a bug, you can make it block on the gfx-triage metabug, which gets reviewed once per week on Thursdays.
| Area | Suggestions | Examples |
|---|---|---|
| ANGLE | Kelsey Gilbert | |
| APZ (Scrolling) | Botond Ballo | |
| Android | Jamie Nicol | |
| Canvas | Lee Salzman, Andrew Osmond (Offscreen) | |
| Color Management | Jeff Muizelaar, Andrew Osmond | 1284225 1358664 |
| Compositing | Sotaro Ikeda, Markus Stange | |
| Display Lists | Timothy Nikkel | 1533815 |
| Images | Timothy Nikkel, Andrew Osmond | 1550523 |
| SVG | Nicolas Silva | |
| Linux | Martin Stransky, Robert Mader | |
| MacOS | Brad Werth, Jeff Muizelaar, Markus Stange | |
| Protocols (IPDL) | Sotaro Ikeda, Andrew Osmond | 1533296 1400637 |
| Skia | Lee Salzman | |
| Text | Jonathan Kew, Lee Salzman | |
| Videos | Sotaro Ikeda | 1558100 |
| WebGL | Kelsey Gilbert | |
| WebGPU | Jim Blandy | |
| Windows | Sotaro Ikeda | |
| Area | Suggestions | Examples |
|---|---|---|
| General | Glenn Watson, Ashley Hale | |
| Blob images | Jeff Muizelaar, Nicolas Silva | 1565231 |
| Picture Caching | Glenn Watson | 1567472 |
| Snapping | Glenn Watson, Andrew Osmond | 1540200 1533292 |
| Text | Lee Salzman | 1544895 |
| Area | Former Suggestions | Examples |
|---|---|---|
| Printing | | |
These are some shared queries that may help you find related bugs, duplicates, aid in determining priority or the best person to investigate further:
- Closed as duplicates
- Meta bugs
- Unassigned S1 bugs
- Unassigned S2 bugs
- Unassigned S3 bugs
- Severity changed
- Regression window wanted
- Explicitly requested triage review
- Release blocking bugs: 78 79 80 81 82 83 84 85
- Target blocking bugs: Linux, Linux MVP, Mac, Mac MVP
- Blocks old/unsupported release metabug
- New in last 15 days
- New in last 8 days
- Not blocked against another bug, new in the last 30 days
- Open S1 and S2 bugs
- Bugs with severity not set, older than 7 days
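The shared queries above are saved searches, but a comparable search can also be expressed against Bugzilla's REST API. The following is a minimal sketch that only builds the URL for something like the "Unassigned S1 bugs" query. The product name `Core`, the use of `nobody@mozilla.org` as the unassigned placeholder, and the helper name `unassigned_bugs_url` are assumptions for illustration and are not the definition of the shared queries.

```python
from urllib.parse import urlencode

BUGZILLA_REST = "https://bugzilla.mozilla.org/rest/bug"


def unassigned_bugs_url(severity, components):
    """Build a REST search URL for open, unassigned bugs of one severity."""
    params = [
        ("product", "Core"),                    # assumption: components live under Core
        ("severity", severity),
        ("resolution", "---"),                  # "---" selects bugs that are still open
        ("assigned_to", "nobody@mozilla.org"),  # assumption: unassigned placeholder
        ("include_fields", "id,summary,component"),
    ]
    # One "component" parameter per component we triage.
    params += [("component", name) for name in components]
    return f"{BUGZILLA_REST}?{urlencode(params)}"


url = unassigned_bugs_url("S1", ["Graphics", "Graphics: Layers"])
print(url)
```

Swapping the severity argument gives the S2 and S3 variants; the resulting URL can be opened in a browser or fetched with any HTTP client.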
Most people receive a significant amount of bugmail, and may well subscribe to most changes within the component. It is easy to lose track of merely being cc'ed on a bug. Thus for S1 or S2 bugs, one should always complete triage with a needinfo. If you believe an issue should be an S3 (normal) or lower, but have some doubt on whether it should actually be higher severity, or you are otherwise certain about who needs to investigate, then a needinfo is also appropriate.
If you think that something is an S1 or an S2, also needinfo jbonisteel so she can help push the bug along.
Add relevant keywords to the bug. There is a list of keywords and definitions on Bugzilla.
Some particularly relevant/common keywords:
- "crash" if it's a crash;
- "hang" if it's a hang;
- "perf" if it's a performance related issue;
- "reproducible" if it's reproducible
- "feature" if it's new code, doing something that wasn't done before; note that a "feature" can block a "crash", we want a wide definition;
- "regression"
- "regressionwindow-wanted" if it is a regression and where it happens needs to be tracked down
If you are uncertain about how to move forward with a bug, it can sometimes help to wait a few days. Others (in and outside the team) may discover it and provide additional context allowing it to proceed. While we need to be timely, there is no need to be unnecessarily aggressive in clearing out your queue as soon as a bug enters it.
The caveat is that if any of our components are on the Pending Untriaged sections of the leaderboard, we would like to get off it as soon as possible. There are regular reviews of said lists, and it invites scrutiny if we find ourselves on them.
Submissions do not always include a clear STR. Even if the STR is clear, it isn't always obvious what exactly caused them to file the issue, nor may we even be able to reproduce. In such cases, a screen recording may be appropriate to request from the submitter.
The about:support page contains diagnostic information about the user's hardware and configuration. If the submitter did not provide it at filing, it is a good first bit of information to needinfo them for, preferably as a text file attachment. The most commonly used information will be:
- The version and build ID
- Extensions
- Selected compositor
- GPU and driver identifiers, versions and supported features
- Color profile
- Decision log
- Modified preferences
Here is a quick template you can use to request the reporter's about:support:
1. Enter about:support into the address bar.
2. Click the Copy text to clipboard button.
3. Paste the clipboard contents into a text editor like Notepad, and save the file.
4. Click the Attach New File button above the description here to upload the file.
Often a bug is filed that is actually a duplicate of an existing bug. You should search Bugzilla for similar entries. One trick is to begin filing a new bug yourself against the component(s) in which you believe duplicates may be found; there will be no need to submit it. When you enter keywords into the Summary field, it will present a list of potential duplicates. This allows you to quickly iterate on sets of keywords, and is a good way to discover other possible keywords we use when filing related issues.
If the problem is with a particular element on the page, it may be useful to flag or confirm what kind of element it is, and what styles are applied to it. This is pretty easy to confirm using the inspector. Just right click the element, and select "Inspect Element." Some properties to take note of:
- Transforms
- Animations
- Size
- Margins
- Padding
The submitter sometimes mentions whether or not it happens to them in another browser. This is a good point of comparison. If you can reproduce in Firefox, does it reproduce with Chrome or Safari? If you can, this does not mean we don't have a bug -- consistency may be a sign that either the standard or web platform tests fell short, or the submitter is mistaken.
- WebRender vs Basic vs Direct2D drawing vs OpenGL/Direct3D compositing
- WebRender picture caching vs not
- WebRender document splitting vs not
- If accelerated and you have multiple GPUs, NVIDIA vs Intel vs AMD/ATI
- Linux vs Windows vs OSX
A large number of bug reports are actually regressions. Identifying the regressing commit is important for finding the root cause and the person best suited to investigate further.
Mozregression is a tool to bisect our build history to find what build introduced a regression (or a fix). If you are able to reproduce the problem, then quickly check to see if it can be reproduced in an older build using Mozregression -- if not, then finding the offending change will greatly increase your odds of sending the bug to the right person.
It is also fair to ask the submitter of the bug to perform this as well. Link to the quickstart when you needinfo them, and set the regressionwindow-wanted keyword.
Keep in mind that preferences and default settings can change over time. For example, it is not safe to assume that WebRender will remain on by default for your system, and you may need to force gfx.webrender.all to true.
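As a rough illustration of combining a bisection with a forced preference, the sketch below assembles a mozregression command line using its `--good`, `--bad` and `--pref` options. The dates are hypothetical placeholders and the helper name `mozregression_command` is made up for this example; the command is only printed, not executed.

```python
import shlex


def mozregression_command(good, bad, prefs=None):
    """Return the argv list for a mozregression bisection run."""
    argv = ["mozregression", "--good", good, "--bad", bad]
    if prefs:
        # Preferences are passed as name:value pairs after a single --pref flag.
        argv.append("--pref")
        argv += [f"{name}:{value}" for name, value in prefs.items()]
    return argv


# Hypothetical last-known-good and first-known-bad dates.
argv = mozregression_command(
    "2019-07-01", "2019-07-11", {"gfx.webrender.all": "true"}
)
print(shlex.join(argv))
```

The printed line can be pasted into a needinfo comment so the submitter runs the bisection with the same preference forced on every build.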
| Process | Description |
|---|---|
| Content | Each tab is associated with a content process (it may be shared). This is where web content is parsed appropriately and we build the display lists. Depending on the compositor in use, we may perform almost none, or most, of the painting inside this process. |
| GPU | If in use, this is known as the compositor process. Painting the display list and/or compositing the layers produced in the content process is performed here. If this process crashes, it should recover without more than a flicker from the user's perspective. If it crashes too many times, we will fall back to the main process. Users are not prompted to submit crash reports for the GPU process, so crashes may be underreported. Typically only users who automatically submit reports will send them. |
| Main / Browser | If the GPU process is disabled, this is also known as the compositor process and will perform similar functions. This is the most privileged process, and thus performs many tasks on behalf of the content and GPU processes. In particular it manages the chrome / UI integration and (re)spawns the GPU and content processes. |
There can be multiple unique signatures for the same crash. This may be because we crash in different places for the same reason, or because platforms produce different signatures. Signatures may also morph from build to build as the code related to the crash evolves. Given a crash report, one can examine the stack trace and search for methods higher in the call stack using the proto signature filter.
For example, in bug 1565231, we have the crash signature
webrender_bindings::moz2d_renderer::{{impl}}::update observed only on Windows with the following simplified stack trace:
GeckoCrash
static void gkrust_shared::panic_hook(struct core::panic::PanicInfo*)
static void core::ops::function::Fn::call<fn(..), (..)>(..)
static void std::panicking::rust_panic_with_hook()
static void std::panicking::continue_panic_fmt()
void std::panicking::begin_panic_fmt()
static void webrender_bindings::moz2d_renderer::{{impl}}::update(..)
static void webrender::resource_cache::ResourceCache::pre_scene_building_update(..)
static bool webrender::render_backend::RenderBackend::process_api_msg(..)
static void webrender::render_backend::RenderBackend::run(struct webrender::profiler::BackendProfileCounters)
static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure, ()>(struct closure)
static void core::ops::function::FnOnce::call_once<closure, ()>(struct closure*)
static void alloc::boxed::{{impl}}::call_once<(), FnOnce<()>>()
In this case, webrender::resource_cache::ResourceCache::pre_scene_building_update would be a good candidate to search for matches to see if there is a signature morph. The search results lead us to discover that <webrender_bindings::moz2d_renderer::Moz2dBlobImageHandler as webrender_api::image::BlobImageHandler>::update is a related signature, but this time for Linux.
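One way to pick candidates for a proto signature search is to drop the frames that are only panic and runtime plumbing and look at what remains. The sketch below applies that idea to the simplified stack trace from bug 1565231. The prefix list and the helper names are assumptions chosen to fit this particular trace, not a general rule.

```python
# Frames that belong to panic/runtime plumbing rather than to the code that failed.
# Assumption: this prefix list is tuned to the trace below.
PLUMBING_PREFIXES = ("GeckoCrash", "gkrust_shared::panic_hook", "std::", "core::", "alloc::")

TRACE = """\
GeckoCrash
static void gkrust_shared::panic_hook(struct core::panic::PanicInfo*)
static void core::ops::function::Fn::call<fn(..), (..)>(..)
static void std::panicking::rust_panic_with_hook()
static void std::panicking::continue_panic_fmt()
void std::panicking::begin_panic_fmt()
static void webrender_bindings::moz2d_renderer::{{impl}}::update(..)
static void webrender::resource_cache::ResourceCache::pre_scene_building_update(..)
static bool webrender::render_backend::RenderBackend::process_api_msg(..)
static void webrender::render_backend::RenderBackend::run(struct webrender::profiler::BackendProfileCounters)
static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure, ()>(struct closure)
static void core::ops::function::FnOnce::call_once<closure, ()>(struct closure*)
static void alloc::boxed::{{impl}}::call_once<(), FnOnce<()>>()
""".splitlines()


def function_name(frame):
    """Drop the argument list, then the leading qualifiers and return type."""
    return frame.split("(", 1)[0].split()[-1]


def search_candidates(trace):
    """Return the frames worth trying in the proto signature filter, top first."""
    names = [function_name(frame) for frame in trace]
    return [name for name in names if not name.startswith(PLUMBING_PREFIXES)]


candidates = search_candidates(TRACE)
for name in candidates:
    print(name)
```

The first surviving frame is the reported signature itself; the frames after it are the ones to try when looking for morphs.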
Assuming you are unable to reproduce the crash, you still have some options on finding a first guess.
When the crash signature field in Bugzilla is populated, it will show a table summarizing the crash frequency for the listed builds and release channels. If this is a recently introduced crash, it may be clear from the table history which build the regression began with.
However, just because a crash was first observed in a particular build does not mean that it was the build that introduced the issue. Depending on the reproduction frequency and how actively used the previous builds were, it could easily have been introduced several days prior. Additionally, we may be missing relevant signatures, from either platform variations or previous morphs, which would reveal an earlier starting point.
You can go from a build ID to a pushlog by visiting https://hg.mozilla.org/mozilla-central/firefoxreleases. Click on the revision of the build ID and from there click on push id. This will provide a summary of all new commits in the given build.
Crashes usually have an acceptable backtrace. Use searchfox's blame integration to find out when relevant lines were added or modified. If the modifications coincide with the crash appearance, then you may be able to pinpoint a particular commit and author.
When you have a set of crash reports, it can be useful to see if this correlates to a particular configuration. This will help us categorize the bug and potentially aid in reproducing it.
| Field | Description |
|---|---|
| Platform Pretty Version | If the signature only happens for a particular OS/version, then it will be evident here. Don't forget about potential variants. |
| Platform version | Break out minor versions of Windows. |
| Build ID | Determine when it may have been introduced. Keep in mind your time period restriction in the search. See facet on. |
| Install time | Proxy to determine how widespread the crash is. Sometimes it is only a handful of installs that produce most of the reports. |
| Proto signature | If there are multiple crashes with the same signature, this may aid in determining the frequency and better search criteria for the crash you care about. |
| Field | Description |
|---|---|
| Adapter vendor ID | Vendor ID of the graphics card in use, e.g. 0x1002 = AMD/ATI, 0x8086 = Intel |
| Adapter device ID | Device ID of the graphics card in use; note that more popular devices may suggest false correlations |
| Adapter driver version | Driver version of the graphics card in use; note that more popular devices may suggest false correlations |
| CPU info | CPU's family, model and stepping |
| CPU microcode version | Microcode version |
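When scanning a set of reports, it can help to translate adapter vendor IDs into names and tally them. The sketch below does this for a hypothetical list of vendor IDs; the AMD/ATI and Intel values come from the table above, 0x10de is the PCI vendor ID for NVIDIA, and the sample data and helper name are made up for illustration.

```python
from collections import Counter

# PCI vendor IDs as they appear in the "Adapter vendor ID" field.
VENDORS = {
    "0x1002": "AMD/ATI",
    "0x8086": "Intel",
    "0x10de": "NVIDIA",
}


def vendor_breakdown(vendor_ids):
    """Count reports per GPU vendor, most frequent first."""
    names = (VENDORS.get(vid.lower(), "unknown") for vid in vendor_ids)
    return Counter(names).most_common()


# Hypothetical vendor IDs pulled from a handful of crash reports.
sample = ["0x8086", "0x8086", "0x10DE", "0x1002", "0x8086"]
breakdown = vendor_breakdown(sample)
print(breakdown)
```

A skew toward one vendor is only suggestive: as noted in the table, popular hardware dominates any sample, so compare against the overall population before drawing conclusions.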
| Field | Description |
|---|---|
| graphics critical error | sometimes extra logging we added is in here to debug this or related issues; additionally it can give you a sense of if the GPU process crashed before this (the present crash may be the result of a previous recovery or explain why the GPU process or acceleration got disabled) |
| App notes | this contains a bit of random information (explain various codes and modifiers +-?) |
| Moz crash reason | useful for assertions, sometimes they contain different bits of information, e.g. Rust assertions include the parameters |
| DeviceResetReason | the last reset reason code, if applicable; note that not all device resets are logged to the critical log |
| Field | Description |
|---|---|
| Address | |
These fields are restricted to those who requested special permissions. Andrew, Jeff M or anyone on the release management team (Marcia Knous, Liz Henry, Julien Cristau, RyanVM) should be able to help you out if you think it would be useful.
| Field | Description |
|---|---|
| User comments | Extra details the user supplied at the time of the crash. E.g. Crashed when I hovered over this part of the page. |
| Email | Occasionally provided, allows us to follow up about the crash. E.g. May be useful if particular users frequently encounter the crash and are able to find a regression window. |
| URL | Find potential websites where the crash can be reproduced. Note that this may just be correlated with how popular a website is rather than it being a good candidate for reproduction. Try a few different URLs and be mindful of what you share from this. URLs can sometimes contain personally identifiable data. |
If you perform a super search for a crash signature, under More Options, you are able to select properties to facet the search upon. Add "build id" to the list of facets, ensure that your time period is broad (e.g. 3-6 months), and submit the search. In addition to the default Crash Reports and Signatures facets, you now have a Build id facet. Clicking on that tab will show the frequency of crashes for each build, and you can sort the list by the build id. This will allow you to easily discover the first build where we began to receive crash reports.
Look at the result for bug 1565231. It is quite obvious based on the high volume that the signature was introduced in build 20190711095112, which happens to be the same build the regressing change was landed in.
This is particularly useful when the crash summary in Bugzilla is too new, with very recent builds, to discover when the crash was actually introduced.
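The same reasoning can be applied to facet data once it has been exported. The sketch below assumes the build id facet is available as a list of `term`/`count` entries, which is an assumption about the export shape, and uses hypothetical counts. It relies on build IDs being timestamps of the form YYYYMMDDHHMMSS, so the earliest build is simply the smallest ID.

```python
from datetime import datetime


def first_affected_build(build_facet, min_count=1):
    """Return (build_id, build_time) for the earliest build with enough reports."""
    builds = [str(entry["term"]) for entry in build_facet if entry["count"] >= min_count]
    if not builds:
        return None
    # Build IDs are fixed-width timestamps, so string order is chronological order.
    earliest = min(builds)
    return earliest, datetime.strptime(earliest, "%Y%m%d%H%M%S")


# Hypothetical facet entries; only build 20190711095112 comes from the text above.
facet = [
    {"term": "20190712214536", "count": 310},
    {"term": "20190711095112", "count": 452},
    {"term": "20190709100310", "count": 1},
]

build_id, built_at = first_affected_build(facet, min_count=10)
print(build_id, built_at.isoformat())
```

The `min_count` threshold is a judgment call: a single report from an older build may be a different bug sharing the signature, but it may also be the true starting point on a lightly used build.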
The Bugzilla tab on the report will show all bugs which have the same signature in its crash signatures field. This is useful to check in case there are duplicates which have already been triaged and/or investigated.
If the report is new, it may not have been processed yet (most likely you got there because the crash report is your own). If the report is older than 6 months, it may have been purged and be unrecoverable. Please keep this in mind if you are trying to determine a crash window from the first build ID in which it was reproduced.
Occasionally a user will file a report about a crash, but neglect to attach a crash report URL to the bug. In that case all we can do is ask. They can find their recent submitted and unsubmitted crash reports if they go to about:crashes.
Various fuzzing tools are used by our security teams to find crashes and assertions tripped by automated inputs. The fuzzing metabug contains a list of lists of bugs sorted by each tool. These will not usually have a crash signature associated with them, but it is prudent to check the crash reports for similar signatures if you suspect this may be happening in the wild.
Access to security bugs is restricted to those who have the appropriate permissions or are included on the cc list for the bug. If you have decided an unrestricted graphics bug should be flagged as a security issue, please check "Security-Sensitive Graphics Bug" in the Security section of the Bugzilla page.
The profiler collects traces to give us an idea where our execution time is being spent. The profiler documentation explains how to use the profiler web app after collecting a trace, as well as platform specific details for Android, where we have some special needs, and Linux, where we have additional options such as using the command line tool perf. When a user is encountering performance issues we are unable to reproduce with a similar platform/configuration, it can be useful to ask them to follow the instructions for the profiler and attach a trace to the bug report.
Specifically for graphics issues, there are particular threads we may care about and may want to request someone reproducing the issue to add them to the thread filter:
- GeckoMain (main thread, builds the display list, along with many other non-graphics tasks)
- Compositor (handles all of our IPC in the compositor process, along with many other graphics tasks)
- WebRender
  - WRSceneBuilder (scene building)
  - WRRenderBackend (frame building)
  - WRWorker (typically blob images)
- Non-WebRender
  - PaintThread (OMTP)
- Images
  - ImgDecoder (image decode pool)
  - ImageIO (image buffering thread)
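When asking a reporter to adjust the thread filter, it is convenient to hand them a ready-made string. The sketch below assumes the filter accepts a comma-separated list of thread names and groups the threads listed above by area; the group keys and the helper name are made up for this example.

```python
# Thread names from the list above, grouped by area (group keys are illustrative).
GRAPHICS_THREADS = {
    "general": ["GeckoMain", "Compositor"],
    "webrender": ["WRSceneBuilder", "WRRenderBackend", "WRWorker"],
    "non-webrender": ["PaintThread"],
    "images": ["ImgDecoder", "ImageIO"],
}


def thread_filter(*groups):
    """Join the threads of the requested groups into one filter string."""
    names = []
    for group in groups:
        for name in GRAPHICS_THREADS[group]:
            if name not in names:  # keep order, skip duplicates
                names.append(name)
    return ",".join(names)


webrender_filter = thread_filter("general", "webrender", "images")
print(webrender_filter)
```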
Our test automation contains performance tests and sometimes we regress them. Typically these are self triaged, as they are filed by the sheriffs with a particular commit identified as the source of the regression. We often just need to set the priority appropriately.
If the user indicates that Firefox is using too much memory, then having a copy of their memory report from about:memory may help. For the graphics team, the most important information will typically be in the compositor process, or images in the content process.
Note that one or two flickers which then stop may be caused by a GPU process crash, where the flicker comes from the crash and subsequent recovery not being entirely seamless. Alternatively, the GPU process crashes and we fall back to Basic, which does not suffer from the flickering. Reviewing about:crashes for new entries after reproducing should quickly confirm or deny this if suspected.
MOZ_LOG lines are littered throughout the codebase, and they output to the console when enabled. Some components produce more logging than others. To enable, set the environment variable:
MOZ_LOG="module1:5,module2:5,.."
5 represents the log level, and is the highest, but most of the time you can just turn it all on and sort through it later.
| Log Module | Description |
|---|---|
| imgRequest | Network cache |
| BMPDecoder | BMP decoder |
| JPEGDecoder, JPEGDecoderAccounting | JPEG decoder |
| PNGDecoder, PNGDecoderAccounting | PNG decoder |
| WebPDecoder | WebP decoder |
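As a small illustration of the MOZ_LOG format, the sketch below builds the variable's value from a list of module names taken from the table above and places it in a copy of the current environment. The helper names are made up for this example, and actually launching Firefox with that environment is left to the caller.

```python
import os


def moz_log_value(modules, level=5):
    """Format modules as the comma-separated module:level list MOZ_LOG expects."""
    return ",".join(f"{module}:{level}" for module in modules)


def env_with_logging(modules, level=5):
    """Return a copy of the current environment with MOZ_LOG set."""
    env = dict(os.environ)
    env["MOZ_LOG"] = moz_log_value(modules, level)
    return env


env = env_with_logging(["imgRequest", "PNGDecoder", "PNGDecoderAccounting"])
print(env["MOZ_LOG"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` together with the path to the Firefox binary under test.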
There is a detailed guide on debugging issues in WebRender on its wiki.
As of March 2019, we started including WebRender bugs as part of our GFX triage process. Bugs impacting platforms where we have shipped WR, or are planning to do so soon, are the most important.
To find out where we have enabled WebRender check out: www.arewewebrenderyet.com
For upcoming specific shipping targets view (specifically the quarterly breakdown): https://wiki.mozilla.org/Platform/GFX/ And if you think this looks out of date or is missing anything ping @jbonisteel.
We have per-release metabugs that we are using to keep on top of bugs that we would like to fix for each release. Those are gfx-82, gfx-83, etc. We will be doing a weekly review of those metabugs so they will be triaged further and action taken, if necessary. If you encounter a bug you think is potentially important but you aren't sure what to do about it, block it against the gfx-triage metabug and Jeff, Jessie and Kris will look at that on a weekly basis.
We have also created a metabug called wr-wild where we are tracking issues people are seeing in release but that we can't yet reproduce or determine what exactly is happening and we aren't yet able to try and target fixing them for a release. If you are doing triage, and aren't sure where to put something, block it against the metabug for the next release and it will be reviewed further.
The same guidelines for prioritization in terms of whether something should be a P1-P5 apply to WebRender. As in, if you mark a bug as P1 in gfx-82, that means it must get fixed for that release and would be considered a blocker.
We also have tracking bugs set up for some of the platforms we want to target and other important categories we want to track (such as performance). Please also add the appropriate tracking bug as you are doing your triage and please add more tracking bugs to this list if you create any:
- wr-intel
- wr-android
- wr-android-mvp
- wr-mac
- wr-linux
- wr-perf
- wr-snap
- wr-wild
- not-wr
- slow-frames
- webrender-site-issues
There is a separate tracking page for WebRender performance bugs that may be useful.
When we do not know the exact code that fixed an issue, we don't mark it as FIXED. In this case we use WORKSFORME to indicate the issue is no longer reproducible.
When a crash is not reproducible and can't be pinpointed as fixed, we want to keep our eye on it to ensure the frequency doesn't spike up. Therefore we add a blocking tag so the crash can be reviewed again at a later date. For example, a crash in WebRender will be tagged with wr-71 to ensure it gets reviewed before the FF 71 release.