On content process shutdown we send a content process ping to ensure we have
up-to-date data from the content process before it goes away. We now also need
to flush the batched telemetry accumulations to the parent so that they can be
included in that ping.
No attempt is made to synchronize access to IPCTimerFired. It is safe to
re-enter.
No attempt is made to cancel the timer as its firing is benign.
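As a rough conceptual sketch of the idea (written in JavaScript for brevity;
the real batching code is in the C++ telemetry core, and the names below are
illustrative rather than the actual functions), the shutdown flush simply runs
the same send routine the batch timer runs, which is why neither locking nor
cancelling the timer is needed:

    // Hypothetical illustration only.
    let gBatchedAccumulations = [];

    function sendToParent(batch) {
      // Stand-in for the real IPC send of accumulations to the parent.
    }

    function sendBatchedAccumulations() {
      // Sending an empty batch is a no-op, so re-entering here, or a late
      // timer firing after a flush, is benign.
      if (gBatchedAccumulations.length == 0) {
        return;
      }
      let batch = gBatchedAccumulations;
      gBatchedAccumulations = [];
      sendToParent(batch);
    }

    function ipcTimerFired() {
      sendBatchedAccumulations();
    }

    function flushOnContentProcessShutdown() {
      // Flush before the content process ping is assembled so the batched
      // accumulations are included in it; the timer is left running.
      sendBatchedAccumulations();
    }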
MozReview-Commit-ID: 1gjNH9IPhKf
Take the opportunity presented by the change to child telemetry accumulation
to bring the ping format closer to the ideas expressed in bug 1281795.
childPayloads still exists, but without histograms or keyedHistograms, which
are now at root.processes.content.{keyedH|h}istograms. This will require
coordinated changes in the aggregator and moztelemetry libraries.
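Roughly, the relevant part of the ping payload changes shape as sketched below
(structure only, heavily elided; not the full payload):

    // Before: each child payload carried its own histogram sections.
    {
      "childPayloads": [
        { "histograms": { /* ... */ }, "keyedHistograms": { /* ... */ } }
      ]
    }

    // After: content-process histograms live at the payload root, under
    // processes.content; childPayloads remains, minus those two sections.
    {
      "childPayloads": [ { /* no histograms / keyedHistograms */ } ],
      "processes": {
        "content": {
          "histograms": { /* ... */ },
          "keyedHistograms": { /* ... */ }
        }
      }
    }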
MozReview-Commit-ID: AqG2jmBBC2W
To simplify using child telemetry from the parent process, only allow child
telemetry payloads to be generated once per child process, on shutdown.
This will allow us to use the child telemetry's subsession information to leave
childPayloads the way it currently is.
We will need to update test_ChildHistograms.js, as it is the only consumer.
MozReview-Commit-ID: 2qSztg0QHV5
Most of this is fixing functions that in some cases return a value but can
also run to completion without returning anything. ESLint 2 catches this where
previous versions didn't. Unless there was an obvious better choice, I just
made these functions explicitly return undefined at the end, which is
effectively what already happens.
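For example (an illustrative case rather than a specific function from the
tree), ESLint 2's consistent-return checking flags the first form below; the
fix is the second:

    const cache = new Map();

    // Before: returns a value on one path, falls off the end on the other.
    function getCachedValue(key) {
      if (cache.has(key)) {
        return cache.get(key);
      }
      // implicitly returns undefined here
    }

    // After: the implicit undefined is made explicit, which is effectively
    // the same behaviour.
    function getCachedValueFixed(key) {
      if (cache.has(key)) {
        return cache.get(key);
      }
      return undefined;
    }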
MozReview-Commit-ID: KHYdAkRvhVr
We have some oddities in our jemalloc stats reporting.
- "heap-overhead-ratio" is a strange measurement: overhead / non-overhead,
expressed as a percentage. And it omits "bin_unused", which appears to be an
oversight.
- "heap-committed" also omits "bin_unused".
- There are some minor errors in memory report descriptions.
This patch fixes these and improves the heap reporting. It makes the following
reporting changes:
- "heap-allocated": Duplicated as "heap-committed/allocated". (We keep
"heap-allocated" because that's a special value used in the computation of
"heap-unclassified".)
- "heap-committed/overhead": Added; it's the same as the sum of the
"explicit/heap-overhead/*" values. Together with "heap-committed/allocated"
it shows clearly what fraction of the heap is overhead and what fraction is
useful.
- "heap-committed": Removed; now implicit as the "heap-committed/" node.
- "heap-overhead-ratio":
- Removed from memory reports; now shown as the percentage of the new
"heap-committed/overhead" node.
- Still available as a distinguished amount (because it's useful in
isolation) but renamed to heapOverheadFraction, and the telemetry ID is
renamed as MEMORY_HEAP_OVERHEAD_FRACTION.
- "heap-chunks": Removed; it's not that interesting, and can be manually
computed as "heap-mapped" / "heap-chunksize" if necessary.
What it does:
Adds a new function, TelemetrySession.getChildThreadHangs(), which returns a promise resolving to an array of threadHangStats [1], one per process.
Note that processes that spawn or die between when the function's promise is created and when it is resolved may be excluded from the final result.
How we do this:
1. Parent sends a MESSAGE_TELEMETRY_GET_CHILD_PAYLOAD message to each child, promising the results of these messages.
2. Child processes respond to parent with a MESSAGE_TELEMETRY_THREAD_HANGS, which contains BHR stats in the payload.
3. Parent combines all the child responses together and resolves the promise.
Plus a bunch of synchronization stuff and handling of edge cases since the number of child processes can change at any time.
Also, there is a 200ms timeout, since we can't handle every edge case; specifically, a child dying without responding after all the other child processes have responded.
Why we do this:
* We can technically get thread hang stats by retrieving Telemetry pings (see requestChildPayloads() in TelemetrySession for details), but this is very slow and can only be done for one process at a time.
* TelemetrySession is responsible for various Telemetry IPC-related tasks, and so is a natural place to expose this function (i.e., the function blends in well with the rest of the API).
* Statuser [2] uses this for quickly obtaining child process BHR stats. This allows us to get realtime hang monitoring for child processes.
[1]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/nsITelemetry.idl#146
[2]: https://github.com/chutten/statuser
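A usage sketch of the new API from chrome JS (the import pattern and result
handling here are illustrative; the exact shape of each threadHangStats entry
is defined in nsITelemetry.idl [1]):

    const { TelemetrySession } =
      Components.utils.import("resource://gre/modules/TelemetrySession.jsm", {});

    // Resolves to an array of threadHangStats, one per child process that
    // responded before the 200ms timeout.
    TelemetrySession.getChildThreadHangs().then(perProcessHangs => {
      for (let threadHangStats of perProcessHangs) {
        console.log("thread hang stats for one child process:",
                    threadHangStats);
      }
    });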
We now have all the necessary measurement APIs to get a full memory picture for
a running multi-process instance. However, there's no way to correlate one
particular RSS measurement on chrome with its USS measurements on content
processes.
So do that in TelemetrySession and report it.
Unique Set Size (USS) could be described as "the amount of memory you could
expect to reclaim if you killed this process".
Resident Set Size (RSS) is USS plus the memory shared with other processes.
Getting a complete picture of the memory use of a multi-process application is
impossible. A better guess than most is:
Parent Process' RSS + sum(Child Processes' USS)
Or, from Telemetry:
Parent MEMORY_RESIDENT + sum(Children's MEMORY_UNIQUE)
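A minimal sketch of that estimate (the function and parameter names are
illustrative, not Telemetry APIs):

    // Parent RSS plus the sum of each child's USS, all in bytes.
    function estimateTotalMemory(parentResidentBytes, childUniqueBytes) {
      return childUniqueBytes.reduce((sum, uss) => sum + uss,
                                     parentResidentBytes);
    }

    // e.g. a 500 MiB parent with children uniquely holding 200 MiB and
    // 150 MiB gives an estimate of 850 MiB.
    estimateTotalMemory(500 * 1024 * 1024,
                        [200 * 1024 * 1024, 150 * 1024 * 1024]);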