Bug 1712168 - migrate profiler mdn docs to in tree performance docs r=julienw


Differential Revision: https://phabricator.services.mozilla.com/D116154
This commit is contained in:
Kim Moir
2021-06-03 13:47:50 +00:00
parent 58e1cf918c
commit 6b0d1fc092
10 changed files with 304 additions and 196 deletions


@@ -1,185 +0,0 @@
# Call Tree
The Call Tree tells you which JavaScript functions the browser spent the
most time in. By analyzing its results, you can find bottlenecks in your
code - places where the browser is spending a disproportionately large
amount of time.
These bottlenecks are the places where any optimizations you can make
will have the biggest impact.
The Call Tree is a sampling profiler. It periodically samples the state
of the JavaScript engine and records the stack for the code executing at
the time. Statistically, the number of samples taken in which we were
executing a particular function corresponds to the amount of time the
browser spent executing it.
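The sampling idea can be sketched in a few lines of JavaScript. This is a toy model, not the profiler's actual implementation, and the stacks below are invented for illustration: each sample records the call stack, and the function executing at sample time (the leafmost frame) gets the credit.

```javascript
// Toy sampling tally (illustration only, not the profiler's real code).
// Each entry is one captured stack; the last element is the function
// that was executing when the sample was taken.
const samples = [
  ["sortAll", "sort", "bubbleSort"],
  ["sortAll", "sort", "bubbleSort"],
  ["sortAll", "sort", "quickSort"],
];

// Count how often each function was the executing (leafmost) function.
function leafCounts(samples) {
  const counts = new Map();
  for (const stack of samples) {
    const leaf = stack[stack.length - 1];
    counts.set(leaf, (counts.get(leaf) || 0) + 1);
  }
  return counts;
}

console.log(leafCounts(samples).get("bubbleSort")); // 2
```

With enough samples, these counts become a statistical estimate of where the time went: here two of three samples landed in `bubbleSort`, so it accounts for roughly two-thirds of the time.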
In this article, we'll use the output of a simple program as an
example. If you want the program itself, to experiment with and profile
on your own, you can find it
[here](https://github.com/mdn/performance-scenarios/blob/gh-pages/js-call-tree-1/).
You can find the specific profile we discuss
[here](https://github.com/mdn/performance-scenarios/blob/gh-pages/js-call-tree-1/profile/call-tree.json)
- just import it to the performance tool to follow along.
There's a short page describing the structure of this program
[here](sorting_algorithms_comparison.md).
Note that we use the same program - the same profile, in fact - in the
documentation page for the [Flame
Chart](https://developer.mozilla.org/en-US/docs/Tools/Performance/Flame_Chart).
The screenshot below shows the output of a program that compares three
sorting algorithms - bubble sort, selection sort, and quicksort. To do
this, it generates some arrays filled with random integers and sorts
them using each algorithm in turn.
We've [zoomed](https://developer.mozilla.org/en-US/docs/Tools/Performance/UI_Tour#zooming_in) into
the part of the recording that shows a long JavaScript marker:
![](img/perf-call-tree.png)
The Call Tree presents the results in a table. Each row represents a
function in which at least one sample was taken, and the rows are
ordered by the number of samples taken while in that function, highest
to lowest.
*Samples* is the number of samples that were taken when we were
executing this particular function, including its children (the other
functions called by this particular function).
*Total Time* is that number translated into milliseconds, based on the
total amount of time covered by the selected portion of the recording.
Since samples are taken at roughly one-millisecond intervals, these
numbers are usually close to the number of samples.
*Total Cost* is that number as a percentage of the total number of
samples in the selected portion of the recording.
*Self Time* is calculated as the time spent in that particular function,
excluding its children. This comes from the captured stacks where this
function is the leafmost function.
*Self Cost* is calculated from *Self Time* as a percentage of the total
number of samples in the selected portion of the recording.
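To make the *Samples* / *Self Time* distinction concrete, here is a toy JavaScript tally over invented stacks (not the profiler's real code): a function's total count includes every sample where it appears anywhere on the stack, while its self count includes only the samples where it is the leafmost frame.

```javascript
// Invented stacks for illustration.
const samples = [
  ["sort", "bubbleSort", "swap"],
  ["sort", "bubbleSort"],
  ["sort"],
];

// total: samples where the function appears anywhere on the stack.
// self:  samples where the function is the leafmost frame.
function tally(samples) {
  const total = new Map();
  const self = new Map();
  for (const stack of samples) {
    for (const fn of new Set(stack)) {
      total.set(fn, (total.get(fn) || 0) + 1);
    }
    const leaf = stack[stack.length - 1];
    self.set(leaf, (self.get(leaf) || 0) + 1);
  }
  return { total, self };
}

const { total, self } = tally(samples);
console.log(total.get("sort"), self.get("sort")); // 3 1
```

Here `sort` appears in all three stacks (total 3) but is the leaf in only one (self 1): most of its cost is really in its children.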
In the current version of the Call Tree, these are the most important
columns. Functions with a relatively high *Self Cost* are good
candidates for optimization, either because they take a long time to
run, or because they are called very often.
[The inverted call tree](#using_an_inverted_aka_bottom-up_call_tree) is
a good way to focus on these *Self Cost* values.
This screenshot tells us something we probably already knew: Bubble sort
is a very inefficient algorithm. We have about six times as many samples
in bubble sort as in selection sort, and 13 times as many as in
quicksort.
## Walking up the call tree
Next to each function name is a disclosure arrow: Click that, and you
can see the path back up the call tree, from the function in which the
sample was taken, to the root. For example, we can expand the entry for
`bubbleSort()`:
![](img/perf-call-tree-expanded-bubblesort.png)
So we can see the call graph is like this:
    sortAll()
        -> sort()
            -> bubbleSort()
Note also that *Self Cost* for `sort()` here is 1.45%, and note that
this is the same as for the separate entry for `sort()` later in the
list. This is telling us that some samples were taken in `sort()`
itself, rather than in the functions it calls.
Sometimes there's more than one path back from an entry to the top
level. Let's expand the entry for `swap()`:
![](img/perf-call-tree-expanded-sawp.png)
There were 253 samples taken inside `swap()`. But `swap()` was reached
by two different paths: both `bubbleSort()` and `selectionSort()` use
it. We can also see that 252 of the 253 samples in `swap()` were taken
in the `bubbleSort()` branch, and only one in the `selectionSort()`
branch.
This result means that bubble sort is even less efficient than we had
thought! It can shoulder the blame for another 252 samples, or almost
another 10% of the total cost.
With this kind of digging, we can figure out the whole call graph, with
associated sample count:
    sortAll()              // 8
        -> sort()          // 37
        -> bubbleSort()    // 1345
            -> swap()      // 252
        -> selectionSort() // 190
            -> swap()      // 1
        -> quickSort()     // 103
            -> partition() // 12
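These counts let us check the "almost 10%" claim for `swap()` arithmetically. This assumes the selection contains just these JavaScript samples plus the 679 *Gecko* samples discussed in the next section (an approximation; other platform rows may add a few more):

```javascript
// Sample counts from the call graph above; the names distinguishing the
// two swap() paths are invented for this calculation.
const jsSamples = {
  sortAll: 8, sort: 37, bubbleSort: 1345, swapViaBubbleSort: 252,
  selectionSort: 190, swapViaSelectionSort: 1, quickSort: 103, partition: 12,
};
const geckoSamples = 679; // from the Platform data section

const total =
  Object.values(jsSamples).reduce((a, b) => a + b, 0) + geckoSamples;
const swapShare = (252 / total) * 100;
console.log(total, swapShare.toFixed(1) + "%"); // 2627 9.6%
```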
## Platform data
You'll also see some rows labeled *Gecko*, *Input & Events*, and so on.
These represent internal browser calls.
This can be useful information too. If your site is making the browser
work hard, this might not show up as samples recorded in your code, but
it is still your problem.
In our example, there are 679 samples assigned to *Gecko* - the
second-largest group after `bubbleSort()`. Let's expand that:
![](img/perf-call-tree-expanded-gecko.png)
This result is telling us that 614 of those samples, or about 20% of the
total cost, are coming from our `sort()` call. If we look at the code
for `sort()`, it should be fairly obvious that the high platform data
cost is coming from repeated calls to `console.log()`:
```js
function sort(unsorted) {
console.log(bubbleSort(unsorted));
console.log(selectionSort(unsorted));
console.log(quickSort(unsorted));
}
```
It would certainly be worthwhile considering more efficient ways of
implementing this.
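One hypothetical alternative (a sketch, not part of the tutorial's code) is to keep `console.log()` out of the measured path and log once after the sorting work is done. The sort functions below are trivial stand-ins for the tutorial's implementations:

```javascript
// Stand-ins for the tutorial's bubbleSort/selectionSort/quickSort.
const bubbleSort = (a) => a.slice().sort((x, y) => x - y);
const selectionSort = bubbleSort;
const quickSort = bubbleSort;

// Hypothetical variant of sort(): do the work first, collect the
// results, and defer the expensive logging to a single call outside
// the hot path.
function sortQuietly(unsorted) {
  return [bubbleSort(unsorted), selectionSort(unsorted), quickSort(unsorted)];
}

const results = sortQuietly([3, 1, 2]);
console.log(results.length); // 3
```

This keeps the platform cost of logging from being attributed to the sorting code under measurement.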
One thing to be aware of here is that idle time is classified as
*Gecko*, so parts of your profile where your JavaScript isn't running
will contribute *Gecko* samples. These aren't relevant to the
performance of your site.
By default, the Call Tree doesn't split platform data out into separate
functions, because they add a great deal of noise, and the details are
not likely to be useful to people not working on Firefox. If you want to
see the details, check "Show Gecko Platform Data" in the
[Settings](https://developer.mozilla.org/en-US/docs/Tools/Performance/UI_Tour#toolbar).
## Using an inverted, aka Bottom-Up, Call Tree
An inverted call tree reverses the order of all stacks, putting the
leafmost function calls at the top. As a direct consequence, this view
emphasizes each function's *Self Time*, making it a very useful way to
find hot spots in your code.
To display this view, click the gear icon on the right-hand end of the
performance tab and select **Invert Call Tree**.
![](img/performance_menu_invert_call_tree.png)
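Conceptually, inverting the call tree just means reversing every captured stack before grouping, so that samples sharing the same leaf function are merged regardless of how they got there. A minimal JavaScript sketch with invented stacks:

```javascript
// Invented stacks; the real profiler does this internally.
const samples = [
  ["sortAll", "sort", "bubbleSort", "swap"],
  ["sortAll", "sort", "selectionSort", "swap"],
];

// Reverse each stack so the leafmost frame comes first.
const inverted = samples.map((stack) => [...stack].reverse());
console.log(inverted[0][0], inverted[1][0]); // swap swap
```

After inversion, both stacks start with `swap`, so its samples from the `bubbleSort()` and `selectionSort()` paths are grouped together at the top level.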


@@ -0,0 +1,50 @@
# dtrace
`dtrace` is a powerful Mac OS X kernel instrumentation system that can
be used to profile wakeups. This article provides a light introduction
to it.
:::
**Note**: The [power profiling
overview](power_profiling_overview.md) is
worth reading at this point if you haven't already. It may make parts
of this document easier to understand.
:::
## Invocation
`dtrace` must be invoked as the super-user. A good starting command for
profiling wakeups is the following.
```
sudo dtrace -n 'mach_kernel::wakeup { @[ustack()] = count(); }' -p $FIREFOX_PID > $OUTPUT_FILE
```
Let's break that down further.
- The `-n` option combined with `mach_kernel::wakeup` selects a
*probe point*. `mach_kernel` is the *module name* and `wakeup` is
the *probe name*. You can see a complete list of probes by running
`sudo dtrace -l`.
- The code between the braces is run when the probe point is hit. The
above code counts unique stack traces when wakeups occur; `ustack`
is short for "user stack", i.e. the stack of the userspace program
executing.
Run that command for a few seconds and then hit `Ctrl` + `C`
to interrupt it. `dtrace` will then print to the output file a number of
stack traces, along with a wakeup count for each one. The ordering of
the stack traces can be non-obvious, so look at them carefully.
Sometimes the stack trace has less information than one would like.
It's unclear how to improve upon this.
## See also
dtrace is *very* powerful, and you can learn more about it by consulting
the following resources:
- [The DTrace one-liner
tutorial](https://wiki.freebsd.org/DTrace/Tutorial) from FreeBSD.
- [DTrace tools](http://www.brendangregg.com/dtrace.html), by Brendan
Gregg.


@@ -27,6 +27,13 @@ explains how to use the Gecko profiler.
* [LogAlloc](https://searchfox.org/mozilla-central/source/memory/replace/logalloc/README) is a tool that dumps a log of memory allocations in Gecko. That log can then be replayed against Firefox's default memory allocator independently or through another replace-malloc library, allowing the testing of other allocators under the exact same workload.
* [See also the documentation on Leak-hunting strategies and tips.](leak_hunting_strategies_and_tips.md)
## Profiling and performance tools
* [Profiling with Instruments](profiling_with_instruments.md) How to use Apple's Instruments tool to profile Mozilla code.
* [Profiling with xperf](profiling_with_xperf.md) How to use Microsoft's Xperf tool to profile Mozilla code.
* [Profiling with Concurrency Visualizer](profiling_with_concurrency_visualizer.md) How to use Visual Studio's Concurrency Visualizer tool to profile Mozilla code.
* [Profiling with Zoom](profiling_with_zoom.md) Zoom is a profiler for Linux done by the people who made Shark.
* [Adding a new telemetry probe](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/start/adding-a-new-probe.html) Information on how to add a new measurement to the Telemetry performance-reporting system.
## Power Profiling


@@ -103,10 +103,7 @@ ordered by the size of the allocations they made:
![](../img/memory-tool-call-stack.png)
The structure of this view is very much like the structure of the [Call
Tree](call_tree.md), only it shows
allocations rather than processor samples. So, for example, the first
entry says that:
- 4,832,592 bytes, comprising 93% of the total heap usage, were
allocated in a function at line 35 of \"alloc.js\", **or in


@@ -252,7 +252,7 @@ the code as being responsible.
high-context measurements. This is useful because high CPU usage
typically causes high power consumption.
- Some tools can provide high-context wakeup measurements:
[dtrace](/en-US/docs/Mozilla/Performance/dtrace) (on Mac) and
[dtrace](dtrace.md) (on Mac) and
[perf](perf.md) (on Linux).
- Source-level instrumentation, such as [TimerFirings
logging](timerfirings_logging.md), can
@@ -295,7 +295,7 @@ power consumption.
tools profiler, the Gecko Profiler, or generic performance
profilers.
- For high wakeup counts, use
[dtrace](/en-US/docs/Mozilla/Performance/dtrace) or
[dtrace](dtrace.md) or
[perf](perf.md) or [TimerFirings logging](timerfirings_logging.md).
- On Mac workloads that use graphics, Activity Monitor's "Energy"
tab can tell you if the high-performance GPU is being used, which


@@ -0,0 +1,5 @@
# Profiling with Concurrency Visualizer
Concurrency Visualizer is an excellent alternative to xperf. In newer versions of Visual Studio, it is an add-on that needs to be downloaded.
Here are some scripts that can be used for manipulating profiles that have been exported to CSV: [https://github.com/jrmuizel/concurrency-visualizer-scripts](https://github.com/jrmuizel/concurrency-visualizer-scripts)


@@ -0,0 +1,54 @@
# Profiling with Instruments
Instruments can be used for memory profiling and for statistical
profiling.
## Official Apple documentation
- [Instruments User
Guide](https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/InstrumentsUserGuide/)
- [Instruments User
Reference](https://developer.apple.com/library/mac/documentation/AnalysisTools/Reference/Instruments_User_Reference/)
- [Instruments Help
Articles](https://developer.apple.com/library/mac/recipes/Instruments_help_articles/)
- [Instruments
Help](https://developer.apple.com/library/mac/recipes/instruments_help-collection/)
- [Performance
Overview](https://developer.apple.com/library/mac/documentation/Performance/Conceptual/PerformanceOverview/)
### Basic Usage
- Select "Time Profiler" from the "Choose a profiling template
  for:" dialog.
- In the top left, next to the record and pause button, there will be
  a "[machine name] > All Processes". Click "All Processes" and
  select "firefox" from the "Running Applications" section.
- Click the record button (red circle in top left).
- Wait for the amount of time that you want to profile.
- Click the stop button.
## Command line tools
There is
[instruments](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/instruments.1.html)
and
[iprofiler](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/iprofiler.1.html).
To monitor performance counters (cache misses, etc.), Instruments
provides a "Counters" instrument.
## Memory profiling
Instruments will record a call stack at each allocation point. The call
tree view can be quite helpful here; switch to it from the default
"Statistics" view. This `malloc` profiling is done using the
`malloc_logger` infrastructure (similar to `MallocStackLogging`).
Currently this means you need to build with jemalloc disabled
(`ac_add_options --disable-jemalloc`). You also need the fix from [Bug
719427](https://bugzilla.mozilla.org/show_bug.cgi?id=719427).
The `DTPerformanceSession` API can be used to control profiling from
applications, like the old CHUD API we used in Shark builds ([Bug
667036](https://bugzilla.mozilla.org/show_bug.cgi?id=667036)).
System Trace might be useful.


@@ -0,0 +1,180 @@
# Profiling with xperf
Xperf is part of the Microsoft Windows Performance Toolkit, and has
functionality similar to that of Shark, oprofile, and (for some things)
dtrace/Instruments. For stack walking, Windows Vista or higher is
required; I haven't tested it at all on XP.
This page applies to xperf version **4.8.7701 or newer**. To see your
xperf version, either run `xperf` on a command line with no
arguments, or start `xperfview` and look at Help -> About
Performance Analyzer. (Note that it's not the first version number in
the About window; that's the Windows version.)
If you have an older version, you will experience bugs, especially
around symbol loading for local builds.
### Installation
For all versions, the tools are part of the latest [Windows 7 SDK (SDK
Version
7.1)](http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&displaylang=en).
Use the web installer to install at least the "Win32 Development
Tools". Once the SDK installs, execute either `wpt_x86.msi` or
`wpt_x64.msi` in the `Redist/Windows Performance Toolkit` folder of the
SDK's install location (typically Program Files/Microsoft
SDKs/Windows/v7.1/Redist/Windows Performance Toolkit) to actually
install the Windows Performance Toolkit tools.
It might already be installed by the Windows SDK. Check if C:\\Program
Files\\Microsoft Windows Performance Toolkit already exists.
For 64-bit Windows 7 or Vista, you'll need to do a registry tweak and
then restart to enable stack walking:
`REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f`
### Symbol Server Setup
With the latest versions of the Windows Performance Toolkit, you can
modify the symbol path directly from within the program via the Trace
menu. Just make sure you set the symbol paths before enabling "Load
Symbols" and before opening a summary view. You can also modify the
`_NT_SYMBOL_PATH` and `_NT_SYMCACHE_PATH` environment variables to make
these changes permanent.
The standard symbol path that includes both Mozilla's and Microsoft's
symbol server configuration is as follows:
`_NT_SYMCACHE_PATH: C:\symbols`
`_NT_SYMBOL_PATH: srv*c:\symbols*http://msdl.microsoft.com/download/symbols;srv*c:\symbols*http://symbols.mozilla.org/firefox/`
To add symbols **from your own builds**, add
`C:\path\to\objdir\dist\bin` to `_NT_SYMBOL_PATH`. As with all Windows
paths, the symbol path uses semicolons (`;`) as separators.
Make sure you select the Trace -> Load Symbols menu option in the
Windows Performance Analyzer (xperfview).
There seems to be a bug in xperf and symbols; it is very sensitive to
when the symbol path is edited. If you change it within the program,
you'll have to close all summary tables and reopen them for it to pick
up the new symbol path data.
You'll have to agree to a EULA for the Microsoft symbols -- if you're
not prompted for this, then something isn't configured right in your
symbol path. (Again, make sure that the directories exist; if they
don't, it's a silent error.)
### Quick Start
All these tools will live, by default, in C:\\Program Files\\Microsoft
Windows Performance Toolkit. Either run these commands from there, or
add the directory to your path. You will need to use an elevated command
prompt to start or stop profiling.
Start recording data:
`xperf -on latency -stackwalk profile`
"Latency" is a special provider name that turns on a few predefined
kernel providers; run `xperf -providers k` to view a full list of
providers and groups. You can combine providers, e.g.,
`xperf -on DiagEasy+FILE_IO`. `-stackwalk profile` tells xperf to
capture a stack for each PROFILE event; you could also do
`-stackwalk profile+file_io` to capture a stack on each CPU profile
tick and each file I/O completion event.
Stop:
`xperf -d out.etl`
View:
`xperfview out.etl`
The MSDN
"[Quickstart](http://msdn.microsoft.com/en-us/library/ff190971%28v=VS.85%29.aspx)"
page goes over this in more detail, and also has good explanations of
how to use xperfview. I'm not going to repeat it here, because I'd be
using essentially the same screenshots, so go look there.
The "stack" view will give results similar to Shark.
### Heap Profiling
xperf has good tools for heap allocation profiling, but they have one
major limitation: you can't build with jemalloc and get heap events
generated. The stock Windows CRT allocator is horrible about
fragmentation, and causes memory usage to rise drastically even if only
a small fraction of that memory is in use. However, despite this,
it's a useful way to track allocations/deallocations.
#### Capturing Heap Data
The `-heap` option is used to set up heap tracing. Firefox generates
lots of events, so you may want to play with the
BufferSize/MinBuffers/MaxBuffers options as well to ensure that you
don't get dropped events. Also, when recording the stack, I've found
that a heap trace is often missing module information (I believe this is
a bug in xperf). It's possible to get around that by doing a
simultaneous capture of non-heap data.
To start a trace session, launching a new Firefox instance:
`xperf -on base`
`xperf -start heapsession -heap -PidNewProcess "./firefox.exe -P test -no-remote" -stackwalk HeapAlloc+HeapRealloc -BufferSize 512 -MinBuffers 128 -MaxBuffers 512`
To stop a session and merge the resulting files:
`xperf -stop heapsession -d heap.etl`
`xperf -d main.etl`
`xperf -merge main.etl heap.etl result.etl`
`result.etl` will contain your merged data; you can delete main.etl
and heap.etl. Note that it's possible to capture even more data for the
non-heap profile; for example, you might want to be able to correlate
heap events with performance data, so you can do
`xperf -on base -stackwalk profile`.
In the viewer, when summary data is viewed for heap events (Heap
Allocations Outstanding, etc. all lead to the same summary graphs), 3
types of allocations are listed -- AIFI, AIFO, AOFI. This is shorthand
for "Allocated Inside, Freed Inside", "Allocated Inside, Freed
Outside", "Allocated Outside, Freed Inside". These refer to the time
range that was selected for the summary graph; for example, something
that's in the AOFI category was allocated before the start of the
selected time range, but the free event happened inside it.
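The AIFI/AIFO/AOFI bucketing can be sketched as a small JavaScript function (invented data; this is only a model of what xperf computes internally):

```javascript
// Classify one allocation relative to the selected time range.
function classify(allocTime, freeTime, rangeStart, rangeEnd) {
  const allocIn = allocTime >= rangeStart && allocTime <= rangeEnd;
  const freeIn = freeTime >= rangeStart && freeTime <= rangeEnd;
  if (allocIn && freeIn) return "AIFI";  // Allocated Inside, Freed Inside
  if (allocIn && !freeIn) return "AIFO"; // Allocated Inside, Freed Outside
  if (!allocIn && freeIn) return "AOFI"; // Allocated Outside, Freed Inside
  return "outside"; // not shown in the summary for this range
}

console.log(classify(5, 15, 0, 10)); // AIFO
```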
### Tips
- In the summary views, the yellow bar can be dragged left and right
  to change the grouping -- for example, drag it to the left of the
  Module column to have grouping happen only by process (stuff that's
  to the left), so that you get symbols in order of weight, regardless
  of what module they're in.
- Dragging the columns around will change grouping in various ways;
experiment to get the data that you're looking for. Also experiment
with turning columns on and off; removing a column will allow data
to be aggregated without considering that column's contributions.
- Disabling all but one core will make the numbers add up to 100%.
  This can be done by running `msconfig` and going to Advanced
  Options from the "Boot" tab.
### Building Firefox
To get good data from a Firefox build, it is important to build with the
following options in your mozconfig:
`export CFLAGS="-Oy-"`
`export CXXFLAGS="-Oy-"`
This disables frame-pointer optimization, which lets xperf do a much
better job unwinding the stack. Traces can be captured fine without this
option (for example, from nightlies), but the stack information will not
be useful.
`ac_add_options --enable-debug-symbols`
This gives us symbols.
### For More Information
Microsoft's [documentation for xperf](http://msdn.microsoft.com/en-us/library/ff191077.aspx)
is pretty good; there is a lot of depth to this tool, and you should
look there for more details.


@@ -0,0 +1,5 @@
# Profiling with Zoom
Zoom is a Linux profiler very similar to Shark.
You can get the profiler from here: <http://www.rotateright.com/>


@@ -1,10 +1,5 @@
# Sorting algorithms comparison
This article describes a simple example program that we use in two of
the Performance guides: the guide to the [Call
Tree](call_tree.md) and the guide to the
[Flame Chart](https://developer.mozilla.org/en-US/docs/Tools/Performance/Flame_Chart).
This program compares the performance of three different sorting
algorithms: