Bug 1712168 - migrate profiler mdn docs to in tree performance docs r=julienw


Differential Revision: https://phabricator.services.mozilla.com/D116154
This commit is contained in:
Kim Moir
2021-06-03 13:47:50 +00:00
parent 58e1cf918c
commit 6b0d1fc092
10 changed files with 304 additions and 196 deletions


@@ -1,185 +0,0 @@
# Call Tree
The Call Tree tells you which JavaScript functions the browser spent the
most time in. By analyzing its results, you can find bottlenecks in your
code - places where the browser is spending a disproportionately large
amount of time.
These bottlenecks are the places where any optimizations you can make
will have the biggest impact.
The Call Tree is a sampling profiler. It periodically samples the state
of the JavaScript engine and records the stack for the code executing at
the time. Statistically, the number of samples taken in which we were
executing a particular function corresponds to the amount of time the
browser spent executing it.
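The sampling idea can be sketched in a few lines of JavaScript. This is a toy model, not the profiler's actual implementation, and the stacks below are invented for illustration: each sample records the call stack, and the function executing at sample time (the leafmost frame) gets the credit.

```javascript
// Toy sampling tally (illustration only, not the profiler's real code).
// Each entry is one captured stack; the last element is the function
// that was executing when the sample was taken.
const samples = [
  ["sortAll", "sort", "bubbleSort"],
  ["sortAll", "sort", "bubbleSort"],
  ["sortAll", "sort", "quickSort"],
];

// Count how often each function was the executing (leafmost) function.
function leafCounts(samples) {
  const counts = new Map();
  for (const stack of samples) {
    const leaf = stack[stack.length - 1];
    counts.set(leaf, (counts.get(leaf) || 0) + 1);
  }
  return counts;
}

console.log(leafCounts(samples).get("bubbleSort")); // 2
```

With enough samples, these counts become a statistical estimate of where the time went: here two of three samples landed in `bubbleSort`, so it accounts for roughly two-thirds of the time.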
In this article, we'll use the output of a simple program as an
example. If you want the program itself, to experiment with and profile
on your own, you can find it
[here](https://github.com/mdn/performance-scenarios/blob/gh-pages/js-call-tree-1/).
You can find the specific profile we discuss
[here](https://github.com/mdn/performance-scenarios/blob/gh-pages/js-call-tree-1/profile/call-tree.json)
- just import it to the performance tool to follow along.
There's a short page describing the structure of this program
[here](sorting_algorithms_comparison.md).
Note that we use the same program - the same profile, in fact - in the
documentation page for the [Flame
Chart](https://developer.mozilla.org/en-US/docs/Tools/Performance/Flame_Chart).
The screenshot below shows the output of a program that compares three
sorting algorithms - bubble sort, selection sort, and quicksort. To do
this, it generates some arrays filled with random integers and sorts
them using each algorithm in turn.
We've [zoomed](https://developer.mozilla.org/en-US/docs/Tools/Performance/UI_Tour#zooming_in) into
the part of the recording that shows a long JavaScript marker:
![](img/perf-call-tree.png)
The Call Tree presents the results in a table. Each row represents a
function in which at least one sample was taken, and the rows are
ordered by the number of samples taken while in that function, highest
to lowest.
*Samples* is the number of samples that were taken when we were
executing this particular function, including its children (the other
functions called by this particular function).
*Total Time* is that number translated into milliseconds, based on the
total amount of time covered by the selected portion of the recording.
Since samples are taken at roughly one-millisecond intervals, these
numbers are usually close to the number of samples.
*Total Cost* is that number as a percentage of the total number of
samples in the selected portion of the recording.
*Self Time* is calculated as the time spent in that particular function,
excluding its children. This comes from the captured stacks where this
function is the leafmost function.
*Self Cost* is calculated from *Self Time* as a percentage of the total
number of samples in the selected portion of the recording.
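To make the *Samples* / *Self Time* distinction concrete, here is a toy JavaScript tally over invented stacks (not the profiler's real code): a function's total count includes every sample where it appears anywhere on the stack, while its self count includes only the samples where it is the leafmost frame.

```javascript
// Invented stacks for illustration.
const samples = [
  ["sort", "bubbleSort", "swap"],
  ["sort", "bubbleSort"],
  ["sort"],
];

// total: samples where the function appears anywhere on the stack.
// self:  samples where the function is the leafmost frame.
function tally(samples) {
  const total = new Map();
  const self = new Map();
  for (const stack of samples) {
    for (const fn of new Set(stack)) {
      total.set(fn, (total.get(fn) || 0) + 1);
    }
    const leaf = stack[stack.length - 1];
    self.set(leaf, (self.get(leaf) || 0) + 1);
  }
  return { total, self };
}

const { total, self } = tally(samples);
console.log(total.get("sort"), self.get("sort")); // 3 1
```

Here `sort` appears in all three stacks (total 3) but is the leaf in only one (self 1): most of its cost is really in its children.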
In the current version of the Call Tree, these are the most important
columns. Functions with a relatively high *Self Cost* are good
candidates for optimization, either because they take a long time to
run, or because they are called very often.
[The inverted call tree](#using_an_inverted_aka_bottom-up_call_tree) is
a good way to focus on these *Self Cost* values.
This screenshot tells us something we probably already knew: Bubble sort
is a very inefficient algorithm. We have about six times as many samples
in bubble sort as in selection sort, and 13 times as many as in
quicksort.
## Walking up the call tree
Next to each function name is a disclosure arrow: Click that, and you
can see the path back up the call tree, from the function in which the
sample was taken, to the root. For example, we can expand the entry for
`bubbleSort()`:
![](img/perf-call-tree-expanded-bubblesort.png)
So we can see the call graph is like this:
    sortAll()
        -> sort()
            -> bubbleSort()
Note also that *Self Cost* for `sort()` here is 1.45%, and note that
this is the same as for the separate entry for `sort()` later in the
list. This is telling us that some samples were taken in `sort()`
itself, rather than in the functions it calls.
Sometimes there's more than one path back from an entry to the top
level. Let's expand the entry for `swap()`:
![](img/perf-call-tree-expanded-sawp.png)
There were 253 samples taken inside `swap()`. But `swap()` was reached
by two different paths: both `bubbleSort()` and `selectionSort()` use
it. We can also see that 252 of the 253 samples in `swap()` were taken
in the `bubbleSort()` branch, and only one in the `selectionSort()`
branch.
This result means that bubble sort is even less efficient than we had
thought! It can shoulder the blame for another 252 samples, or almost
another 10% of the total cost.
With this kind of digging, we can figure out the whole call graph, with
associated sample count:
    sortAll()              // 8
        -> sort()          // 37
        -> bubbleSort()    // 1345
            -> swap()      // 252
        -> selectionSort() // 190
            -> swap()      // 1
        -> quickSort()     // 103
            -> partition() // 12
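These counts let us check the "almost 10%" claim for `swap()` arithmetically. This assumes the selection contains just these JavaScript samples plus the 679 *Gecko* samples discussed in the next section (an approximation; other platform rows may add a few more):

```javascript
// Sample counts from the call graph above; the names distinguishing the
// two swap() paths are invented for this calculation.
const jsSamples = {
  sortAll: 8, sort: 37, bubbleSort: 1345, swapViaBubbleSort: 252,
  selectionSort: 190, swapViaSelectionSort: 1, quickSort: 103, partition: 12,
};
const geckoSamples = 679; // from the Platform data section

const total =
  Object.values(jsSamples).reduce((a, b) => a + b, 0) + geckoSamples;
const swapShare = (252 / total) * 100;
console.log(total, swapShare.toFixed(1) + "%"); // 2627 9.6%
```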
## Platform data
You'll also see some rows labeled *Gecko*, *Input & Events*, and so on.
These represent internal browser calls.
This can be useful information too. If your site is making the browser
work hard, this might not show up as samples recorded in your code, but
it is still your problem.
In our example, there are 679 samples assigned to *Gecko* - the
second-largest group after `bubbleSort()`. Let's expand that:
![](img/perf-call-tree-expanded-gecko.png)
This result is telling us that 614 of those samples, or about 20% of the
total cost, are coming from our `sort()` call. If we look at the code
for `sort()`, it should be fairly obvious that the high platform data
cost is coming from repeated calls to `console.log()`:
```js
function sort(unsorted) {
console.log(bubbleSort(unsorted));
console.log(selectionSort(unsorted));
console.log(quickSort(unsorted));
}
```
It would certainly be worthwhile considering more efficient ways of
implementing this.
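One hypothetical alternative (a sketch, not part of the tutorial's code) is to keep `console.log()` out of the measured path and log once after the sorting work is done. The sort functions below are trivial stand-ins for the tutorial's implementations:

```javascript
// Stand-ins for the tutorial's bubbleSort/selectionSort/quickSort.
const bubbleSort = (a) => a.slice().sort((x, y) => x - y);
const selectionSort = bubbleSort;
const quickSort = bubbleSort;

// Hypothetical variant of sort(): do the work first, collect the
// results, and defer the expensive logging to a single call outside
// the hot path.
function sortQuietly(unsorted) {
  return [bubbleSort(unsorted), selectionSort(unsorted), quickSort(unsorted)];
}

const results = sortQuietly([3, 1, 2]);
console.log(results.length); // 3
```

This keeps the platform cost of logging from being attributed to the sorting code under measurement.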
One thing to be aware of here is that idle time is classified as
*Gecko*, so parts of your profile where your JavaScript isn't running
will contribute *Gecko* samples. These aren't relevant to the
performance of your site.
By default, the Call Tree doesn't split platform data out into separate
functions, because they add a great deal of noise, and the details are
not likely to be useful to people not working on Firefox. If you want to
see the details, check "Show Gecko Platform Data" in the
[Settings](https://developer.mozilla.org/en-US/docs/Tools/Performance/UI_Tour#toolbar).
## Using an inverted, aka Bottom-Up, Call Tree
An inverted call tree reverses the order of all stacks, putting the
leafmost function calls at the top. As a direct consequence, this view
emphasizes each function's *Self Time*, making it a very useful way to
find hot spots in your code.
To display this view, click the gear icon on the right-hand end of the
performance tab and select **Invert Call Tree**.
![](img/performance_menu_invert_call_tree.png)
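Conceptually, inverting the call tree just means reversing every captured stack before grouping, so that samples sharing the same leaf function are merged regardless of how they got there. A minimal JavaScript sketch with invented stacks:

```javascript
// Invented stacks; the real profiler does this internally.
const samples = [
  ["sortAll", "sort", "bubbleSort", "swap"],
  ["sortAll", "sort", "selectionSort", "swap"],
];

// Reverse each stack so the leafmost frame comes first.
const inverted = samples.map((stack) => [...stack].reverse());
console.log(inverted[0][0], inverted[1][0]); // swap swap
```

After inversion, both stacks start with `swap`, so its samples from the `bubbleSort()` and `selectionSort()` paths are grouped together at the top level.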


@@ -0,0 +1,50 @@
# dtrace
`dtrace` is a powerful Mac OS X kernel instrumentation system that can
be used to profile wakeups. This article provides a light introduction
to it.
:::
**Note**: The [power profiling
overview](power_profiling_overview.md) is
worth reading at this point if you haven't already. It may make parts
of this document easier to understand.
:::
## Invocation
`dtrace` must be invoked as the super-user. A good starting command for
profiling wakeups is the following.
```
sudo dtrace -n 'mach_kernel::wakeup { @[ustack()] = count(); }' -p $FIREFOX_PID > $OUTPUT_FILE
```
Let's break that down further.
- The `-n` option combined with `mach_kernel::wakeup` selects a
*probe point*. `mach_kernel` is the *module name* and `wakeup` is
the *probe name*. You can see a complete list of probes by running
`sudo dtrace -l`.
- The code between the braces is run when the probe point is hit. The
above code counts unique stack traces when wakeups occur; `ustack`
is short for "user stack", i.e. the stack of the userspace program
executing.
Run that command for a few seconds and then hit `Ctrl` + `C`
to interrupt it. `dtrace` will then print to the output file a number of
stack traces, along with a wakeup count for each one. The ordering of
the stack traces can be non-obvious, so look at them carefully.
Sometimes the stack trace has less information than one would like.
It's unclear how to improve upon this.
## See also
dtrace is *very* powerful, and you can learn more about it by consulting
the following resources:
- [The DTrace one-liner
tutorial](https://wiki.freebsd.org/DTrace/Tutorial) from FreeBSD.
- [DTrace tools](http://www.brendangregg.com/dtrace.html), by Brendan
Gregg.


@@ -27,6 +27,13 @@ explains how to use the Gecko profiler.
* [LogAlloc](https://searchfox.org/mozilla-central/source/memory/replace/logalloc/README) is a tool that dumps a log of memory allocations in Gecko. That log can then be replayed against Firefox's default memory allocator independently or through another replace-malloc library, allowing the testing of other allocators under the exact same workload.
* [See also the documentation on Leak-hunting strategies and tips.](leak_hunting_strategies_and_tips.md)
## Profiling and performance tools
* [Profiling with Instruments](profiling_with_instruments.md) How to use Apple's Instruments tool to profile Mozilla code.
* [Profiling with xperf](profiling_with_xperf.md) How to use Microsoft's Xperf tool to profile Mozilla code.
* [Profiling with Concurrency Visualizer](profiling_with_concurrency_visualizer.md) How to use Visual Studio's Concurrency Visualizer tool to profile Mozilla code.
* [Profiling with Zoom](profiling_with_zoom.md) Zoom is a profiler for Linux done by the people who made Shark.
* [Adding a new telemetry probe](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/start/adding-a-new-probe.html) Information on how to add a new measurement to the Telemetry performance-reporting system.
## Power Profiling


@@ -103,10 +103,7 @@ ordered by the size of the allocations they made:
![](../img/memory-tool-call-stack.png)
The structure of this view is very much like the structure of the [Call
Tree](call_tree.md), only it shows
allocations rather than processor samples. So, for example, the first
entry says that:
- 4,832,592 bytes, comprising 93% of the total heap usage, were
allocated in a function at line 35 of \"alloc.js\", **or in


@@ -252,7 +252,7 @@ the code as being responsible.
high-context measurements. This is useful because high CPU usage
typically causes high power consumption.
- Some tools can provide high-context wakeup measurements:
[dtrace](/en-US/docs/Mozilla/Performance/dtrace) (on Mac) and
[dtrace](dtrace.md) (on Mac) and
[perf](perf.md) (on Linux).
- Source-level instrumentation, such as [TimerFirings
logging](timerfirings_logging.md), can
@@ -295,7 +295,7 @@ power consumption.
tools profiler, the Gecko Profiler, or generic performance
profilers.
- For high wakeup counts, use
[dtrace](/en-US/docs/Mozilla/Performance/dtrace) or
[dtrace](dtrace.md) or
[perf](perf.md) or [TimerFirings logging](timerfirings_logging.md).
- On Mac workloads that use graphics, Activity Monitor's "Energy"
tab can tell you if the high-performance GPU is being used, which


@@ -0,0 +1,5 @@
# Profiling with Concurrency Visualizer
Concurrency Visualizer is an excellent alternative to xperf. In newer versions of Visual Studio, it is an add-on that needs to be downloaded.
Here are some scripts that can be used for manipulating profiles that have been exported to CSV: [https://github.com/jrmuizel/concurrency-visualizer-scripts](https://github.com/jrmuizel/concurrency-visualizer-scripts)


@@ -0,0 +1,54 @@
# Profiling with Instruments
Instruments can be used for memory profiling and for statistical
profiling.
## Official Apple documentation
- [Instruments User
Guide](https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/InstrumentsUserGuide/)
- [Instruments User
Reference](https://developer.apple.com/library/mac/documentation/AnalysisTools/Reference/Instruments_User_Reference/)
- [Instruments Help
Articles](https://developer.apple.com/library/mac/recipes/Instruments_help_articles/)
- [Instruments
Help](https://developer.apple.com/library/mac/recipes/instruments_help-collection/)
- [Performance
Overview](https://developer.apple.com/library/mac/documentation/Performance/Conceptual/PerformanceOverview/)
### Basic Usage
- Select "Time Profiler" from the "Choose a profiling template
  for:" dialog.
- In the top left, next to the record and pause button, there will be
  a "[machine name] > All Processes". Click "All Processes" and
  select "firefox" from the "Running Applications" section.
- Click the record button (red circle in top left).
- Wait for the amount of time that you want to profile.
- Click the stop button.
## Command line tools
There is
[instruments](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/instruments.1.html)
and
[iprofiler](https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man1/iprofiler.1.html).
To monitor performance counters (cache misses, etc.), Instruments
provides a "Counters" instrument.
## Memory profiling
Instruments will record a call stack at each allocation point. The call
tree view can be quite helpful here; switch to it from the default
"Statistics" view. This `malloc` profiling is done using the
`malloc_logger` infrastructure (similar to `MallocStackLogging`).
Currently this means you need to build with jemalloc disabled
(`ac_add_options --disable-jemalloc`). You also need the fix from [Bug
719427](https://bugzilla.mozilla.org/show_bug.cgi?id=719427).
The `DTPerformanceSession` API can be used to control profiling from
applications, like the old CHUD API we used in Shark builds ([Bug
667036](https://bugzilla.mozilla.org/show_bug.cgi?id=667036)).
System Trace might be useful.


@@ -0,0 +1,180 @@
# Profiling with xperf
Xperf is part of the Microsoft Windows Performance Toolkit, and has
functionality similar to that of Shark, oprofile, and (for some things)
dtrace/Instruments. For stack walking, Windows Vista or higher is
required; I haven't tested it at all on XP.
This page applies to xperf version **4.8.7701 or newer**. To see your
xperf version, either run `xperf` on a command line with no
arguments, or start `xperfview` and look at Help -> About
Performance Analyzer. (Note that it's not the first version number in
the About window; that's the Windows version.)
If you have an older version, you will experience bugs, especially
around symbol loading for local builds.
### Installation
For all versions, the tools are part of the latest [Windows 7 SDK (SDK
Version
7.1)](http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&displaylang=en).
Use the web installer to install at least the "Win32 Development
Tools". Once the SDK installs, execute either `wpt_x86.msi` or
`wpt_x64.msi` in the `Redist/Windows Performance Toolkit` folder of the
SDK's install location (typically Program Files/Microsoft
SDKs/Windows/v7.1/Redist/Windows Performance Toolkit) to actually
install the Windows Performance Toolkit tools.
It might already be installed by the Windows SDK. Check if C:\\Program
Files\\Microsoft Windows Performance Toolkit already exists.
For 64-bit Windows 7 or Vista, you'll need to do a registry tweak and
then restart to enable stack walking:
`REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f`
### Symbol Server Setup
With the latest versions of the Windows Performance Toolkit, you can
modify the symbol path directly from within the program via the Trace
menu. Just make sure you set the symbol paths before enabling "Load
Symbols" and before opening a summary view. You can also modify the
`_NT_SYMBOL_PATH` and `_NT_SYMCACHE_PATH` environment variables to make
these changes permanent.
The standard symbol path that includes both Mozilla's and Microsoft's
symbol server configuration is as follows:
`_NT_SYMCACHE_PATH: C:\symbols`
`_NT_SYMBOL_PATH: srv*c:\symbols*http://msdl.microsoft.com/download/symbols;srv*c:\symbols*http://symbols.mozilla.org/firefox/`
To add symbols **from your own builds**, add
`C:\path\to\objdir\dist\bin` to `_NT_SYMBOL_PATH`. As with all Windows
paths, the symbol path uses semicolons (`;`) as separators.
Make sure you select the Trace -> Load Symbols menu option in the
Windows Performance Analyzer (xperfview).
There seems to be a bug in xperf and symbols; it is very sensitive to
when the symbol path is edited. If you change it within the program,
you'll have to close all summary tables and reopen them for it to pick
up the new symbol path data.
You'll have to agree to a EULA for the Microsoft symbols -- if you're
not prompted for this, then something isn't configured right in your
symbol path. (Again, make sure that the directories exist; if they
don't, it's a silent error.)
### Quick Start
All these tools will live, by default, in C:\\Program Files\\Microsoft
Windows Performance Toolkit. Either run these commands from there, or
add the directory to your path. You will need to use an elevated command
prompt to start or stop profiling.
Start recording data:
`xperf -on latency -stackwalk profile`
"Latency" is a special provider name that turns on a few predefined
kernel providers; run `xperf -providers k` to view a full list of
providers and groups. You can combine providers, e.g.,
`xperf -on DiagEasy+FILE_IO`. `-stackwalk profile` tells xperf to
capture a stack for each PROFILE event; you could also do
`-stackwalk profile+file_io` to capture a stack on each CPU profile
tick and each file I/O completion event.
Stop:
`xperf -d out.etl`
View:
`xperfview out.etl`
The MSDN
"[Quickstart](http://msdn.microsoft.com/en-us/library/ff190971%28v=VS.85%29.aspx)"
page goes over this in more detail, and also has good explanations of
how to use xperfview. I'm not going to repeat it here, because I'd be
using essentially the same screenshots, so go look there.
The "stack" view will give results similar to Shark.
### Heap Profiling
xperf has good tools for heap allocation profiling, but they have one
major limitation: you can't build with jemalloc and get heap events
generated. The stock Windows CRT allocator is horrible about
fragmentation, and causes memory usage to rise drastically even if only
a small fraction of that memory is in use. However, despite this,
it's a useful way to track allocations/deallocations.
#### Capturing Heap Data
The `-heap` option is used to set up heap tracing. Firefox generates
lots of events, so you may want to play with the
BufferSize/MinBuffers/MaxBuffers options as well to ensure that you
don't get dropped events. Also, when recording the stack, I've found
that a heap trace is often missing module information (I believe this is
a bug in xperf). It's possible to get around that by doing a
simultaneous capture of non-heap data.
To start a trace session, launching a new Firefox instance:
`xperf -on base`
`xperf -start heapsession -heap -PidNewProcess "./firefox.exe -P test -no-remote" -stackwalk HeapAlloc+HeapRealloc -BufferSize 512 -MinBuffers 128 -MaxBuffers 512`
To stop a session and merge the resulting files:
`xperf -stop heapsession -d heap.etl`
`xperf -d main.etl`
`xperf -merge main.etl heap.etl result.etl`
`result.etl` will contain your merged data; you can delete main.etl
and heap.etl. Note that it's possible to capture even more data for the
non-heap profile; for example, you might want to be able to correlate
heap events with performance data, so you can do
`xperf -on base -stackwalk profile`.
In the viewer, when summary data is viewed for heap events (Heap
Allocations Outstanding, etc. all lead to the same summary graphs), 3
types of allocations are listed -- AIFI, AIFO, AOFI. This is shorthand
for "Allocated Inside, Freed Inside", "Allocated Inside, Freed
Outside", "Allocated Outside, Freed Inside". These refer to the time
range that was selected for the summary graph; for example, something
that's in the AOFI category was allocated before the start of the
selected time range, but the free event happened inside it.
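The AIFI/AIFO/AOFI bucketing can be sketched as a small JavaScript function (invented data; this is only a model of what xperf computes internally):

```javascript
// Classify one allocation relative to the selected time range.
function classify(allocTime, freeTime, rangeStart, rangeEnd) {
  const allocIn = allocTime >= rangeStart && allocTime <= rangeEnd;
  const freeIn = freeTime >= rangeStart && freeTime <= rangeEnd;
  if (allocIn && freeIn) return "AIFI";  // Allocated Inside, Freed Inside
  if (allocIn && !freeIn) return "AIFO"; // Allocated Inside, Freed Outside
  if (!allocIn && freeIn) return "AOFI"; // Allocated Outside, Freed Inside
  return "outside"; // not shown in the summary for this range
}

console.log(classify(5, 15, 0, 10)); // AIFO
```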
### Tips
- In the summary views, the yellow bar can be dragged left and right
  to change the grouping -- for example, drag it to the left of the
  Module column to have grouping happen only by process (stuff that's
  to the left), so that you get symbols in order of weight, regardless
  of what module they're in.
- Dragging the columns around will change grouping in various ways;
experiment to get the data that you're looking for. Also experiment
with turning columns on and off; removing a column will allow data
to be aggregated without considering that column's contributions.
- Disabling all but one core will make the numbers add up to 100%.
  This can be done by running `msconfig` and going to Advanced
  Options from the "Boot" tab.
### Building Firefox
To get good data from a Firefox build, it is important to build with the
following options in your mozconfig:
`export CFLAGS="-Oy-"`
`export CXXFLAGS="-Oy-"`
This disables frame-pointer optimization, which lets xperf do a much
better job unwinding the stack. Traces can be captured fine without this
option (for example, from nightlies), but the stack information will not
be useful.
`ac_add_options --enable-debug-symbols`
This gives us symbols.
### For More Information
Microsoft's [documentation for xperf](http://msdn.microsoft.com/en-us/library/ff191077.aspx)
is pretty good; there is a lot of depth to this tool, and you should
look there for more details.


@@ -0,0 +1,5 @@
# Profiling with Zoom
Zoom is a Linux profiler very similar to Shark.
You can get the profiler from here: <http://www.rotateright.com/>


@@ -1,10 +1,5 @@
# Sorting algorithms comparison
This article describes a simple example program that we use in two of
the Performance guides: the guide to the [Call
Tree](call_tree.md) and the guide to the
[Flame Chart](https://developer.mozilla.org/en-US/docs/Tools/Performance/Flame_Chart).
This program compares the performance of three different sorting
algorithms: