Bug 1959435 - Document crash ping end-to-end lifecycle r=gsvelto

Differential Revision: https://phabricator.services.mozilla.com/D246756
Alex Franchuk
2025-05-06 15:50:54 +00:00
committed by afranchuk@mozilla.com
parent 1a39c1465a
commit 2944efa928
2 changed files with 109 additions and 0 deletions


@@ -0,0 +1,104 @@
====================
Crash Ping Lifecycle
====================
Crash pings and derived data go through a number of separate programs and
services. To get a better idea of how these components interact, a breakdown of
the lifecycle is presented here.
This description applies to Glean crash ping data.
Origin
======
When a crash occurs, Glean metrics are populated and a Glean crash ping is sent with the data. This
is ingested and made available in BigQuery through the usual Glean infrastructure.
Ping Definitions
----------------
* `Desktop crash ping <https://dictionary.telemetry.mozilla.org/apps/firefox_desktop/pings/crash>`_

  * `metrics definition
    <https://searchfox.org/mozilla-central/source/toolkit/components/crashes/metrics.yaml>`_
  * `ping definition
    <https://searchfox.org/mozilla-central/source/toolkit/components/crashes/pings.yaml>`_

* `Fenix crash ping <https://dictionary.telemetry.mozilla.org/apps/fenix/pings/crash>`_

  * `metrics definition
    <https://searchfox.org/mozilla-central/source/mobile/android/android-components/components/lib/crash/metrics.yaml>`_
  * `ping definition
    <https://searchfox.org/mozilla-central/source/mobile/android/android-components/components/lib/crash/pings.yaml>`_
BigQuery Tables
---------------
* Desktop view: ``firefox_desktop.crash``.
* Crashreporter client view: ``firefox_crashreporter.crash``. This uses the same metrics/ping definitions
  as desktop.
* Combined desktop/crashreporter client view: ``firefox_desktop.desktop_crashes``.
* Fenix view: ``fenix.crash``. This ping has a few different metrics, but is overall very similar to
  the desktop ping. Combining Fenix and desktop pings in a single query is therefore a little
  verbose, but most metrics exist in both with the same name.
**NOTE**: When querying the source data, always use the ``crash.app_channel``,
``crash.app_display_version``, and ``crash.app_build`` metrics rather than the similarly named fields of
the Glean ``client_info`` struct. The ``crash.*`` metrics hold the application information *at the time
of the crash*; moreover, the crash reporter client can't fully populate ``client_info``.
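As a rough illustration of the note above, the following sketch builds a query that groups desktop
crash pings by the crash-time channel, version, and build. The column paths (for example
``metrics.string.crash_app_channel``) assume the usual Glean convention of exposing a
``category.name`` metric as ``metrics.<type>.<category>_<name>`` and that these three metrics are
strings; the ``moz-fx-data-shared-prod`` project prefix on the view is also an assumption. Check
the table schema before relying on any of them. ``build_channel_query`` is a hypothetical helper,
not part of any Mozilla tooling.

```python
from datetime import date

# Assumed fully-qualified name of the desktop view.
DESKTOP_VIEW = "moz-fx-data-shared-prod.firefox_desktop.crash"


def build_channel_query(day: str, table: str = DESKTOP_VIEW) -> str:
    """Return SQL that counts crash pings per crash-time channel/version/build."""
    # Validate and normalize the date so it is safe to embed in the SQL text.
    day = date.fromisoformat(day).isoformat()
    # Assumed column paths: Glean `crash.app_channel` -> metrics.string.crash_app_channel.
    return f"""
SELECT
  metrics.string.crash_app_channel AS channel,
  metrics.string.crash_app_display_version AS version,
  metrics.string.crash_app_build AS build_id,
  COUNT(*) AS pings
FROM `{table}`
WHERE DATE(submission_timestamp) = '{day}'
GROUP BY channel, version, build_id
ORDER BY pings DESC
""".strip()


if __name__ == "__main__":
    print(build_channel_query("2025-05-01"))
```

The generated text can be pasted into the BigQuery console or passed to whichever client you
already use.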
Source
------
All crash ping metrics are set in bulk at the same time, and typically come directly from `crash annotations <https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml>`_:
* `Desktop <https://searchfox.org/mozilla-central/rev/b598575345077063c55b618e43ccaa6249505d02/toolkit/components/crashes/CrashManager.in.sys.mjs#787>`_
* `Crashreporter client <https://searchfox.org/mozilla-central/rev/b598575345077063c55b618e43ccaa6249505d02/toolkit/crashreporter/client/app/src/net/ping/glean.rs#11>`_
* `Fenix <https://searchfox.org/mozilla-central/rev/b598575345077063c55b618e43ccaa6249505d02/mobile/android/android-components/components/lib/crash/src/main/java/mozilla/components/lib/crash/service/GleanCrashReporterService.kt#312>`_
Post-Processing
===============
The `crash-ping-ingest <https://github.com/mozilla/crash-ping-ingest>`_ repository runs a daily
ingestion, scheduled with Taskcluster. Each run retrieves crash pings with submissions as recent as
the prior UTC day and, by default, ensures that indexed results for the past week are available (to
recover from outages, hiccups, and so on). The run starts at 2:00 UTC and takes 1-2 hours, so you
can expect data for the prior UTC day to be available around 4:00 UTC. The repository also supplies
a Taskcluster action to manually generate data for a given date, if necessary.
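The timing above can be worked through with a small sketch. It only encodes the 2:00 UTC start and
the upper end of the 1-2 hour estimate from this section; the function names are hypothetical and
the result is an approximation, not a guarantee made by the ingestion job.

```python
from datetime import date, datetime, time, timedelta, timezone

RUN_START = time(2, 0)             # ingestion starts at 02:00 UTC
MAX_DURATION = timedelta(hours=2)  # upper end of the 1-2 hour estimate


def expected_availability(submission_day: date) -> datetime:
    """Approximate UTC time at which data for `submission_day` should be queryable."""
    run_start = datetime.combine(
        submission_day + timedelta(days=1), RUN_START, tzinfo=timezone.utc
    )
    return run_start + MAX_DURATION


def latest_available_day(now: datetime) -> date:
    """Most recent UTC day whose data should already be processed at `now`.

    `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    prior_day = now.date() - timedelta(days=1)
    if now >= expected_availability(prior_day):
        return prior_day
    # Today's run has not finished yet, so fall back one more day.
    return prior_day - timedelta(days=1)
```

For example, pings submitted on 2025-05-05 would be expected around 04:00 UTC on 2025-05-06; a
query issued at 03:00 UTC that day should still treat 2025-05-04 as the latest complete day.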
Data Availability
-----------------
Data was backfilled to 2024-09-01, so you can expect ping data to be available for any date from
then onward. All nightly and beta pings are processed, while release pings are randomly sampled with
about 5000 pings per OS/process-type combination.
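To make the sampling rule concrete, here is a minimal in-memory sketch of per-group random
sampling. It is not the code used by crash-ping-ingest: the dictionary keys (``channel``, ``os``,
``process_type``) and the ``select_pings`` helper are hypothetical, and treating every non-release
channel like nightly and beta is an assumption.

```python
import random
from collections import defaultdict

SAMPLE_SIZE = 5000  # approximate per-group cap described above


def select_pings(pings, sample_size=SAMPLE_SIZE, seed=None):
    """Keep all non-release pings; sample release pings per (os, process_type)."""
    rng = random.Random(seed)
    kept = []
    release_groups = defaultdict(list)
    for ping in pings:
        if ping["channel"] == "release":
            release_groups[(ping["os"], ping["process_type"])].append(ping)
        else:
            # Assumption: anything that is not release is kept in full.
            kept.append(ping)
    for group in release_groups.values():
        # Groups smaller than the cap are kept whole.
        kept.extend(rng.sample(group, min(sample_size, len(group))))
    return kept
```

Because the cap applies per group, rare OS/process-type combinations stay fully represented while
very common ones are thinned out, which matters when comparing raw counts across groups.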
BigQuery
--------
The ingested output (including symbolicated stacks and crash signatures) is loaded into BigQuery in
the ``moz-fx-data-shared-prod.crash_ping_ingest_external.ingest_output`` table. It is partitioned on
``submission_timestamp`` to match the Glean views/tables, and it can be joined on ``document_id``
(and optionally ``submission_timestamp``) with the fenix/desktop views.
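A join along these lines might look like the sketch below, which builds the query text for a single
day. The ``ingest_output`` table name comes from this section; the project prefix on the desktop
view, the ``metrics.string.crash_app_channel`` column path, and the ``build_join_query`` helper are
assumptions made for illustration. Both sides are filtered on ``submission_timestamp`` so that
BigQuery can prune partitions.

```python
from datetime import date

INGEST_TABLE = "moz-fx-data-shared-prod.crash_ping_ingest_external.ingest_output"
# Assumed fully-qualified name of the desktop view.
DESKTOP_VIEW = "moz-fx-data-shared-prod.firefox_desktop.crash"


def build_join_query(day: str, ping_view: str = DESKTOP_VIEW) -> str:
    """Return SQL joining post-processed output with the source pings for one day."""
    # Validate and normalize the date so it is safe to embed in the SQL text.
    day = date.fromisoformat(day).isoformat()
    return f"""
SELECT
  ingest.*,
  ping.metrics.string.crash_app_channel AS ping_app_channel  -- assumed column path
FROM `{ping_view}` AS ping
JOIN `{INGEST_TABLE}` AS ingest
  ON ping.document_id = ingest.document_id
WHERE DATE(ping.submission_timestamp) = '{day}'
  AND DATE(ingest.submission_timestamp) = '{day}'
""".strip()


if __name__ == "__main__":
    print(build_join_query("2025-05-01"))
```

Passing a different ``ping_view`` (for instance the Fenix view) reuses the same join, provided the
selected metric exists there.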
What if post-processing has a bug?
----------------------------------
If there's a problem with the post-processed output, the post-processing bug can be fixed and the
data can be re-generated by running the ingestion for the day(s) affected. The upload script in
`crash-ping-ingest <https://github.com/mozilla/crash-ping-ingest/blob/main/upload.py>`_ will
*replace* the data for the uploading date automatically. To run the ingestion, you must navigate to
the taskcluster **task group** for the commits with the fixes (this is easily found by going to the
taskcluster CI page for the commit on GitHub) and run the action task for "Process Pings (Manual)".
There you can choose which dates to run.
Once the data in BigQuery has been fixed, you must also clear the Netlify ``ping-data`` blobs
corresponding to the affected dates. This can be done using the netlify-cli (you need to
authenticate with Netlify first).
Presentation
============
The `crash-pings <https://github.com/mozilla/crash-pings>`_ repository contains the code for the
website hosted on Netlify: https://crash-pings.mozilla.org. See the README for details about how it
is built and what technologies it uses. The site queries BigQuery and caches results, condensing
data for efficient loading in the browser.
Adding data to crash pings
==========================
#. Add crash annotations to the `definition file
   <https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml>`_ and
   populate the annotations with the generated APIs.
#. Define corresponding Glean metrics in the files listed in `Ping Definitions`_.
#. Update the code that populates the metrics (listed in `Source`_).


@@ -36,6 +36,10 @@ implementation is robust. The Glean `crash` ping can be found
See `bug 1784069 <https://bugzilla.mozilla.org/show_bug.cgi?id=1784069>`_ for details.
Lifecycle and Post-Processing
-----------------------------
The lifecycle of a crash ping can be viewed at :ref:`Crash Ping Lifecycle`.
Other Documents
===============
@@ -44,3 +48,4 @@ Other Documents
:maxdepth: 1
crash-events
crash-ping-lifecycle