The overall goal of this patch is to make the StartupCache accessible anywhere.
There are two main pieces to that equation:
1. Allowing it to be accessed off main thread, which means modifying the
mutex usage to ensure that all data accessed from non-main threads is
protected.
2. Allowing it to be accessed out of the chrome process, which means passing
a handle to a shared cache buffer down to child processes.
Number 1 is somewhat fiddly, but it's all generally straightforward work. I
hope that the comments and the code are sufficient to explain what's going on
there.
Number 2 has some decisions to be made:
- The first decision was to pass a handle to a frozen chunk of memory down to
all child processes, rather than passing a handle to an actual file. There
are two reasons for this: 1) since we want to compress the underlying file on
disk, giving that file to child processes would mean they have to decompress
it themselves, eating CPU time. 2) since they would have to decompress it
themselves, they would have to allocate the memory for the decompressed
buffers, meaning they cannot all simply share one big decompressed buffer.
- The drawback of this decision is that we have to load and decompress the
buffer up front, before we spawn any child processes. We attempt to
mitigate this by keeping track of all the entries that child processes
access, and only including those in the frozen decompressed shared buffer.
- We based our implementation of this approach on the shared preferences
implementation. Hopefully I got all of the pieces to fit together
correctly. They seem to work in local testing and on try, but I think
they need a set of experienced eyes looking carefully at them.
- Another decision was whether to send the handles to the buffers over IPC or
via command line. We went with the command line approach, because the startup
cache would need to be accessed very early on in order to ensure we do not
read from any omnijars, and we could not make that work via IPC.
- Unfortunately this means adding another hard-coded FD, similar to
kPrefMapFileDescriptor. It seems like at the very least we need to rope all
of these together into one place, but I think that should be filed as a
follow-up?
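
As a rough illustration of the mitigation described above (only entries that child processes actually asked for go into the frozen buffer), here is a minimal sketch. The names `SharedLayout` and `BuildSharedBuffer` are hypothetical, and a plain `std::vector` stands in for the real shared-memory region; this is not the code in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one decompressed cache entry.
using Entry = std::vector<uint8_t>;

// Hypothetical layout: one contiguous buffer plus an index into it.
struct SharedLayout {
  std::vector<uint8_t> buffer;  // stand-in for the frozen shared region
  std::map<std::string, std::pair<std::size_t, std::size_t>> index;  // key -> (offset, length)
};

// Copies only the entries that children requested into one buffer.
SharedLayout BuildSharedBuffer(const std::map<std::string, Entry>& all,
                               const std::set<std::string>& requestedByChildren) {
  SharedLayout out;
  for (const auto& kv : all) {
    if (requestedByChildren.count(kv.first) == 0) {
      continue;  // never requested by a child: leave it out of the shared copy
    }
    out.index[kv.first] = {out.buffer.size(), kv.second.size()};
    out.buffer.insert(out.buffer.end(), kv.second.begin(), kv.second.end());
  }
  assert(out.index.size() <= requestedByChildren.size());
  return out;
}
```

The point of the sketch is only that the up-front cost scales with what children use, not with the whole cache.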
Lastly, because this patch is a bit of a monster to review: first, thank you
for looking at it. Second, the reason we're invested in this is that we saw
a >10% improvement in cold startup times on reference hardware, with a
p-value less than 0.01. It's still not clear how reference hardware numbers
translate to numbers on release, and they certainly don't translate well to
Nightly numbers, but it's enough to convince me that it's worth some effort.
Depends on D78584
Differential Revision: https://phabricator.services.mozilla.com/D77635
We need to be able to init StartupCache before the Omnijar in order to cache
all of the Omnijar contents we access. This patch implements that.
Depends on D77632
Differential Revision: https://phabricator.services.mozilla.com/D77633
Prior to this patch, the StartupCache created its own mWriteThread, on which
it wrote to disk. That thread was initialized by MaybeSpawnWriteThread, which
was called at shutdown (to do the shutdown write if there was any reason to do
so) and from a timer that is re-initialized after every addition to the
startup cache, to run 60s after the last change to the cache.
It then joined that write thread on the main thread (in other words, blocked
the main thread until the off-main-thread write completed) when:
- xpcom-shutdown fired
- the StartupCache itself was destroyed
- someone called any of:
  * HasEntry
  * GetBuffer
  * PutBuffer
  * InvalidateCache
This patch removes the separate write thread, and instead dispatches a task to
the background task queue, indicating it can block. The task is started in
the same circumstances where we previously used to write (timer from the last
PutBuffer call, and shutdown if necessary).
To ensure the task cannot read the data it writes out (mTable) on the
background thread while that data changes on the main thread, we use a mutex.
The task locks the mutex before starting, and unlocks it when finished.
Enumerating the cases in which we used to block on joining the thread:
In terms of application shutdown, we expect the background task queue to
either finish the write task, or fail to run it if it hasn't started it yet.
In the FastStartup case, we check if a write was necessary; if so, we
attempt to gain the lock without waiting. If we're successful, the write has
not yet started, and we instead run the write on the main thread. Otherwise,
we retry gaining the lock, blocking this time, thus guaranteeing the
off-the-main-thread write completes.
The task keeps a reference to the startupcache object, so it cannot be
destroyed while the task is pending.
Because the write does not modify `mTable`, and neither does `HasEntry`,
we do not need to do anything there.
In the `GetBuffer` case, we do not modify the table unless we have to read
the entry off disk (memmapped into `mCacheData`). This can only happen if
`mCacheData.initialized()` returns true, and we specifically call
`mCacheData.reset()` before firing off the write task to avoid this.
`mCacheData` is only re-initialized if someone calls `LoadArchive()`,
which can only happen from `Init()` (which is guaranteed not to run
again because this is a singleton), or `InvalidateCache()`, where we lock
the mutex (see below). So this is safe - but we assert on the lock to try
and avoid people breaking this chain of assumptions in the future.
When `PutBuffer` is called, we try to lock the mutex - but if locking fails
(i.e., the background thread is writing), we simply fail to store the entry
in the startupcache. In practice, this should be rare - it'd happen if
new calls to PutBuffer happen while writing during shutdown (when really,
we don't care) or when it's been 60 seconds since the last PutBuffer so
we started writing the startupcache.
When InvalidateCache is called, we lock the mutex - we shouldn't try to
write while invalidating, or invalidate while writing. This may be slow,
but in practice nothing should call `InvalidateCache` except developer
restarts or the `-purgecaches` commandline flag, so it shouldn't
matter a great deal.
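
The locking rules above can be sketched roughly as follows. This is a simplified, hypothetical stand-in (`MiniCache`, using `std::mutex` rather than Mozilla's own mutex type) and not the real StartupCache code; it only shows the try-lock behaviour of `PutBuffer` and of the shutdown path.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for the cache.
class MiniCache {
 public:
  // Returns false (dropping the entry) if a write currently holds the lock.
  bool PutBuffer(const std::string& key, std::string value) {
    std::unique_lock<std::mutex> lock(mMutex, std::try_to_lock);
    if (!lock.owns_lock()) {
      return false;
    }
    mTable[key] = std::move(value);
    mDirty = true;
    return true;
  }

  // Body of the background task: hold the lock for the whole write.
  void WriteToDisk() {
    std::lock_guard<std::mutex> lock(mMutex);
    WriteLocked();
  }

  // Shutdown path: if the lock is free, no write is running, so write on
  // this thread; otherwise block until the in-flight write has finished.
  void EnsureShutdownWriteComplete() {
    std::unique_lock<std::mutex> lock(mMutex, std::try_to_lock);
    if (lock.owns_lock()) {
      WriteLocked();
      return;
    }
    lock.lock();  // blocks until the background write releases the mutex
  }

  int WriteCount() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mWrites;
  }

 private:
  // Stand-in for the real disk write; must be called with mMutex held.
  void WriteLocked() {
    if (mDirty) {
      ++mWrites;
      mDirty = false;
    }
  }

  mutable std::mutex mMutex;
  std::map<std::string, std::string> mTable;
  bool mDirty = false;
  int mWrites = 0;
};
```

Dropping an entry in `PutBuffer` is acceptable here for the reason given above: the caller can still work without the cache, and the collision window is small.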
Differential Revision: https://phabricator.services.mozilla.com/D70413
The exact circumstances of how this is showing up in the wild aren't
clear - there seem to be a couple of ways we can get here. However, it
all revolves around early shutdowns (i.e., from the select profile popup)
- before the StartupCache is ever initialized. In any case, the solution
shouldn't change based on the exact circumstances - if we don't have a
StartupCache, there's no need to write one. Also, don't bother lazy
initializing it if it doesn't exist yet.
Differential Revision: https://phabricator.services.mozilla.com/D63208
Reordering the mWrittenOnce check should be sufficient to eliminate
the data race; however, I made mWrittenOnce an atomic just to reduce
the fragility of this, since it is intended to be written and read
on multiple threads.
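
For illustration only, here is a minimal sketch of why an atomic flag makes a "written once" check robust across threads. The class `OnceWriter` and the method `TryBeginWrite` are hypothetical and are not taken from the patch; only the flag name mirrors the text above.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reduction of the pattern; only the flag handling is shown.
class OnceWriter {
 public:
  // Returns true only for the single caller that wins the right to write.
  bool TryBeginWrite() {
    // exchange() stores true and returns the previous value in one atomic
    // step, so two threads cannot both observe "false".
    return !mWrittenOnce.exchange(true);
  }

 private:
  std::atomic<bool> mWrittenOnce{false};
};
```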
Differential Revision: https://phabricator.services.mozilla.com/D62949
The initial thought for getting the StartupCache out of the shutdown
path was to simply not write it during shutdown, as it should write
60 seconds after startup, and the theory was that if the user shut
down within the first 60 seconds of use, they were likely updating or
something and it shouldn't matter. However, considering how many of
our users only ever open one tab, I think it's rather likely that
users are starting up Firefox to go to a web site, then closing it
when done with that website, and then maybe opening up a new instance
in order to go to a different website. Accordingly it still makes
sense to continue writing it. However, we may as well leverage a
background thread for this and get it kicked off earlier during
shutdown, so we don't sit there blocking in the destructor late
during shutdown.
Differential Revision: https://phabricator.services.mozilla.com/D62294
The inclusions were removed with the following very crude script and the
resulting breakage was fixed up by hand. The manual fixups either reverted
the changes made by the script, replaced a generic header with a more
specific one, or replaced a header with a forward declaration.
    find . -name "*.idl" | grep -v web-platform | grep -v third_party | while read path; do
      interfaces=$(grep "^\(class\|interface\).*:.*" "$path" | cut -d' ' -f2)
      if [ -n "$interfaces" ]; then
        if [[ "$interfaces" == *$'\n'* ]]; then
          regexp="\("
          for i in $interfaces; do regexp="$regexp$i\|"; done
          regexp="${regexp%%\\\|}\)"
        else
          regexp="$interfaces"
        fi
        interface=$(basename "$path")
        rg -l "#include.*${interface%%.idl}.h" . | while read path2; do
          hits=$(grep -v "#include.*${interface%%.idl}.h" "$path2" | grep -c "$regexp" )
          if [ $hits -eq 0 ]; then
            echo "Removing ${interface} from ${path2}"
            grep -v "#include.*${interface%%.idl}.h" "$path2" > "$path2".tmp
            mv -f "$path2".tmp "$path2"
          fi
        done
      fi
    done
Differential Revision: https://phabricator.services.mozilla.com/D55444
The first run loads more things into the StartupCache than are
used on the second and subsequent runs. This just ensures that if
the StartupCache diverges too far from its actual use that we will
rebuild it.
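
One way such a divergence check could look is sketched below. The function name, the threshold value, and the choice of "requested entries over stored entries" as the metric are all assumptions made for illustration; the patch may measure divergence differently.

```cpp
#include <cassert>
#include <cstddef>

// Assumed threshold, not taken from the patch: rebuild when fewer than
// half of the stored entries were requested during this run.
constexpr double kMinUsedFraction = 0.5;

// Hypothetical helper: decides whether the cache has drifted too far from
// what is actually being read.
bool ShouldRebuildCache(std::size_t storedEntries, std::size_t requestedEntries) {
  assert(requestedEntries <= storedEntries);
  if (storedEntries == 0) {
    return false;  // nothing stored, nothing to rebuild
  }
  const double used = static_cast<double>(requestedEntries) /
                      static_cast<double>(storedEntries);
  return used < kMinUsedFraction;
}
```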
Differential Revision: https://phabricator.services.mozilla.com/D34654
I am not aware of anything that depends on StartupCache being a
zip file, and since I want to use lz4 compression because inflate
is showing up quite a lot in profiles, it's simplest to just use
a custom format. This loosely mimics the ScriptPreloader code,
with a few differences:
- Obviously the contents of the cache are compressed. I used lz4
for this as I hit the same file size as deflate at a compression
level of 1, which is what the StartupCache was using previously,
while decompressing an order of magnitude faster. Seemed like
the most conservative change to make. I think it's worth
investigating what the impact of slower algorithms with higher
compression ratios would be, but for right now I settled on this.
We'd probably want to look at zstd next.
- I use streaming compression for this via lz4frame. This is not
strictly necessary, but has the benefit of not requiring as
much memory for large buffers, as well as giving us a built-in
checksum, rather than relying on the much slower CRC that we
were doing with the zip-based approach.
- I coded the serialization of the headers inline, since I had to
jump back to add the offset and compressed size, which would
make the nice Code(...) method for the ScriptPreloader stuff
rather more complex. Open to cleaner solutions, but moving it
out just felt like extra hoops for the reader to jump through
to understand without the benefit of being more concise.
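
The "jump back to add the offset and compressed size" step can be sketched as below. Everything here is an assumption for illustration: the byte layout, the helper names, and `FakeCompress`, which simply copies its input in place of a real compressor such as lz4frame. It is not the on-disk format used by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Placeholder for a real compressor; copies the input unchanged.
Bytes FakeCompress(const Bytes& in) { return in; }

// Appends a 32-bit value in little-endian order.
void PutU32(Bytes& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Overwrites a previously written 32-bit field at `pos`.
void PatchU32(Bytes& out, std::size_t pos, uint32_t v) {
  for (int i = 0; i < 4; ++i) out[pos + i] = static_cast<uint8_t>(v >> (8 * i));
}

// Reads a little-endian 32-bit field at `pos`.
uint32_t GetU32(const Bytes& in, std::size_t pos) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= static_cast<uint32_t>(in[pos + i]) << (8 * i);
  return v;
}

// Assumed layout: per entry [keyLen][key][offset][compressedSize], followed
// by all payloads. Offsets and sizes are unknown while the header is being
// written, so zeros are written first and patched afterwards.
Bytes Serialize(const std::vector<std::pair<std::string, Bytes>>& entries) {
  Bytes out;
  std::vector<std::size_t> patchAt;  // position of each entry's offset field
  for (const auto& e : entries) {
    PutU32(out, static_cast<uint32_t>(e.first.size()));
    out.insert(out.end(), e.first.begin(), e.first.end());
    patchAt.push_back(out.size());
    PutU32(out, 0);  // offset, filled in below
    PutU32(out, 0);  // compressed size, filled in below
  }
  for (std::size_t i = 0; i < entries.size(); ++i) {
    const Bytes packed = FakeCompress(entries[i].second);
    PatchU32(out, patchAt[i], static_cast<uint32_t>(out.size()));
    PatchU32(out, patchAt[i] + 4, static_cast<uint32_t>(packed.size()));
    out.insert(out.end(), packed.begin(), packed.end());
  }
  return out;
}
```

Keeping the header up front lets a reader locate any entry without scanning payloads, which is why the writer has to patch fields after the fact.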
Differential Revision: https://phabricator.services.mozilla.com/D34652
This will not behave exactly the same if we had previously written bad
data for the entry that would fail to decompress. I imagine this is rare
enough, and the consequences are not severe enough, that this should be
fine.
Differential Revision: https://phabricator.services.mozilla.com/D30643
In bug 1264235 we have some indication that observed bugs with the
startup cache might have been resolved, but we don't really know
until we collect data. Collecting these stats will give us the
ability to have more certainty that the startup cache is functioning
correctly in the wild.
Differential Revision: https://phabricator.services.mozilla.com/D19573
This is a best-effort attempt at ensuring that the adverse impact on comments
of reformatting the entire tree would be minimal. I've used a combination of
strategies, including disabling formatting, some manual formatting, and some
changes to formatting to work around some clang-format limitations.
Differential Revision: https://phabricator.services.mozilla.com/D13371
This was done by automatically replacing:
  s/mozilla::Move/std::move/
  s/ Move(/ std::move(/
  s/(Move(/(std::move(/
and removing the 'using mozilla::Move;' lines,
followed by a few manual fixups; see the bug for the split series.
MozReview-Commit-ID: Jxze3adipUh
Right now, NS_GENERIC_FACTORY_SINGLETON_CONSTRUCTOR expects singleton
constructors to return already-addrefed raw pointers, and while it accepts
constructors that return already_AddRefed, most existing ones don't do so.
Meanwhile, the convention elsewhere is that a raw pointer return value is
owned by the callee, and that the caller needs to addref it if it wants to
keep its own reference to it.
The difference in convention makes it easy to leak (I've definitely caused
more than one shutdown leak this way), so it would be better if we required
the singleton getters to return an explicit already_AddRefed, which would
behave the same for all callers.
This also cleans up several singleton constructors that left a dangling
pointer to their singletons when their initialization methods failed, because
they released their references without clearing their global raw pointers.
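
A minimal sketch of the two points above, using stand-in types: `Service` and `AlreadyAddRefedService` are hypothetical and are NOT the real XPCOM `already_AddRefed<T>` or any real service class. The getter hands the caller an explicit owning reference and clears the global pointer when initialization fails.

```cpp
#include <cassert>

// Minimal stand-in for a refcounted service; not a real XPCOM class.
class Service {
 public:
  explicit Service(bool initOk) : mInitOk(initOk) {}
  void AddRef() { ++mRefCnt; }
  void Release() {
    if (--mRefCnt == 0) delete this;
  }
  bool Init() const { return mInitOk; }

 private:
  ~Service() = default;
  int mRefCnt = 0;
  bool mInitOk;
};

// Stand-in for already_AddRefed<T>: carries exactly one reference that the
// receiver is expected to adopt (and eventually Release).
struct AlreadyAddRefedService {
  Service* mRawPtr;
};

static Service* gService = nullptr;  // the global holds its own reference

AlreadyAddRefedService GetSingleton(bool initOk) {
  if (!gService) {
    gService = new Service(initOk);
    gService->AddRef();  // reference owned by the global
    if (!gService->Init()) {
      gService->Release();
      gService = nullptr;  // clear the global so no dangling pointer remains
      return {nullptr};
    }
  }
  gService->AddRef();  // reference handed to the caller
  return {gService};
}
```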
MozReview-Commit-ID: 9peyG4pRYcr