Bug 1839299 - Use a finer-grained strategy to protect RtlLookupFunctionEntry against deadlocks. r=win-reviewers,rkraesig
On 64-bit Windows (x86_64, aarch64), stack walking relies on RtlLookupFunctionEntry to navigate from one frame to the next. This function acquires up to two ntdll internal locks when it is called. The profiler and the background hang monitor both need to walk the stacks of suspended threads, which can lead to deadlocks that we have so far avoided with stack walk suppressions: we guard some critical paths to mark them as suppressing stack walking, and we forbid stack walking while any thread is on such a path.

While stack walk suppression has removed most deadlock situations, some can remain because it is hard to detect and manually annotate every path that could lead to a deadlock. Another drawback is that stack walk suppression disables stack walking for much larger portions of code than necessary. For example, we disable stack walking for LdrLoadDll, so we cannot collect stacks while we are loading a DLL, even though the lock that could lead to a deadlock is only held during a very small portion of the whole time spent in LdrLoadDll.

This patch addresses these two issues by implementing a finer-grained strategy to avoid deadlocks. We acquire pointers to the internal ntdll locks through a single-stepped execution of RtlLookupFunctionEntry. This lets us try to acquire the locks non-blockingly, so that we can guarantee safe stack walking with no deadlock. If we fail to collect pointers to the locks, we fall back to using stack walk suppressions as before. This way we get the best of both worlds: if we are confident that the situation is under control, we use the new strategy and get better profiler accuracy and no deadlock; in case of doubt, we can still use the profiler thanks to stack walk suppressions.

Differential Revision: https://phabricator.services.mozilla.com/D223498
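To make the new strategy concrete, here is a minimal, hypothetical sketch of how an initialization routine could collect the two lock pointers and hand them to the new InitializeStackWalkLocks API. It assumes that RtlLookupFunctionEntry performs its shared acquisitions by calling the exported ntdll!RtlAcquireSRWLockShared (so the lock address is in the first argument register on entry), and RunUnderSingleStepping is a stand-in for the real single-stepping driver; this is not the actual initialization code from the patch.

// Hypothetical sketch only -- not the initialization code from this patch.
// Assumption: RtlLookupFunctionEntry takes the ntdll locks by calling the
// exported ntdll!RtlAcquireSRWLockShared, whose first argument (RCX on
// x86-64) is the SRWLOCK pointer.
#include <windows.h>

#include "mozilla/Array.h"
#include "mozilla/StackWalk_windows.h"

// Stand-in declaration for the real single-stepping driver (trap flag +
// exception handler): runs aCallbackToRun and invokes aOnStep(aState, ctx)
// between instructions until aOnStep returns false or the callback returns.
void RunUnderSingleStepping(void (*aCallbackToRun)(),
                            bool (*aOnStep)(void*, CONTEXT*), void* aState);

static void* gFoundLocks[2] = {};
static size_t gFoundLockCount = 0;
static void* gAcquireSRWLockShared = nullptr;

static bool OnStep(void*, CONTEXT* aContext) {
  // When execution enters RtlAcquireSRWLockShared, record the lock address.
  if (reinterpret_cast<void*>(aContext->Rip) == gAcquireSRWLockShared &&
      gFoundLockCount < 2) {
    void* lock = reinterpret_cast<void*>(aContext->Rcx);
    if (gFoundLockCount == 0 || gFoundLocks[0] != lock) {
      gFoundLocks[gFoundLockCount++] = lock;
    }
  }
  return true;  // keep single-stepping until the lookup returns
}

static void DoOneLookup() {
  DWORD64 imageBase = 0;
  ::RtlLookupFunctionEntry(reinterpret_cast<DWORD64>(&DoOneLookup), &imageBase,
                           nullptr);
}

bool TryInitializeStackWalkLocks() {
  gAcquireSRWLockShared = reinterpret_cast<void*>(::GetProcAddress(
      ::GetModuleHandleW(L"ntdll.dll"), "RtlAcquireSRWLockShared"));
  if (!gAcquireSRWLockShared) {
    return false;
  }
  RunUnderSingleStepping(&DoOneLookup, &OnStep, nullptr);
  if (gFoundLockCount != 2) {
    return false;  // keep relying on stack walk suppressions (strategy (2))
  }
  InitializeStackWalkLocks(
      mozilla::Array<void*, 2>{gFoundLocks[0], gFoundLocks[1]});
  return true;
}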
@@ -913,6 +913,12 @@ class MOZ_RAII PEHeaders final {
                       IMAGE_SCN_MEM_READ);
  }

  // There may be other data sections in the binary besides .data
  Maybe<Span<const uint8_t>> GetDataSectionInfo() const {
    return FindSection(".data", IMAGE_SCN_CNT_INITIALIZED_DATA |
                                    IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE);
  }

  static bool IsValid(PIMAGE_IMPORT_DESCRIPTOR aImpDesc) {
    return aImpDesc && aImpDesc->OriginalFirstThunk != 0;
  }

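One possible use of the new GetDataSectionInfo helper (a hedged sketch; the patch's actual callers are not shown in this excerpt) is to test whether a candidate pointer, such as a collected lock address, falls inside ntdll's .data section. IsInNtdllDataSection is a hypothetical helper name, and the usual nt::PEHeaders constructors from NativeNt.h are assumed:

// Hedged sketch: checks whether a pointer lies inside ntdll's .data section.
#include <windows.h>

#include <cstdint>

#include "mozilla/NativeNt.h"

bool IsInNtdllDataSection(const void* aPtr) {
  mozilla::nt::PEHeaders ntdllHeaders(::GetModuleHandleW(L"ntdll.dll"));
  if (!ntdllHeaders) {
    return false;
  }
  auto dataSection = ntdllHeaders.GetDataSectionInfo();
  if (dataSection.isNothing()) {
    return false;
  }
  const uint8_t* begin = dataSection->Elements();
  const uint8_t* end = begin + dataSection->Length();
  auto* ptr = reinterpret_cast<const uint8_t*>(aPtr);
  return begin <= ptr && ptr < end;
}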
@@ -6,6 +6,7 @@

/* API for getting a stack trace of the C/C++ stack on the current thread */

#include "mozilla/Array.h"
#include "mozilla/ArrayUtils.h"
#include "mozilla/Atomics.h"
#include "mozilla/Attributes.h"

@@ -123,10 +124,57 @@ class FrameSkipper {
CRITICAL_SECTION gDbgHelpCS;

# if defined(_M_AMD64) || defined(_M_ARM64)
// Because various Win64 APIs acquire function-table locks, we need a way of
// preventing stack walking while those APIs are being called. Otherwise, the
// stack walker may suspend a thread holding such a lock, and deadlock when the
// stack unwind code attempts to wait for that lock.
// We must use RtlLookupFunctionEntry to do stack walking on x86-64 and arm64,
// but internally this function does a blocking shared acquire of SRW locks
// that live in ntdll and are not exported. This is problematic when we want to
// suspend a thread and walk its stack, like we do in the profiler and the
// background hang reporter. If the suspended thread happens to hold one of the
// locks exclusively while suspended, then the stack walking thread will
// deadlock if it calls RtlLookupFunctionEntry.
//
// Note that we only care about deadlocks between the stack walking thread and
// the suspended thread. Any other deadlock scenario is considered out of
// scope, because they are unlikely to be our fault -- these other scenarios
// imply that some thread that we did not suspend is stuck holding one of the
// locks exclusively, and exclusive acquisition of these locks only happens for
// a brief time during Microsoft API calls (e.g. LdrLoadDll, LdrUnloadDll).
//
// We use one of two alternative strategies to gracefully fail to capture a
// stack instead of running into a deadlock:
//  (1) collect pointers to the ntdll internal locks at stack walk
//      initialization, then try to acquire them non-blockingly before
//      initiating any stack walk;
//  or (2) mark all code paths that can potentially end up doing an exclusive
//      acquisition of the locks as stack walk suppression paths, then check
//      if any thread is currently on a stack walk suppression path before
//      initiating any stack walk;
//
// Strategy (2) can only avoid all deadlocks under the easily violated
// assumption that we have correctly identified all existing paths that should
// be stack suppression paths. With strategy (2) we cannot collect stacks e.g.
// during the whole duration of a DLL load happening on any thread, so the
// profiling results are worse.
//
// Strategy (1) guarantees no deadlock. It also gives better profiling results
// because it is more fine-grained. Therefore we always prefer strategy (1),
// and we only use strategy (2) as a fallback.

// Strategy (1): Ntdll Internal Locks
//
// The external stack walk initialization code will feed us pointers to the
// ntdll internal locks. Once we have them, we no longer need to rely on
// strategy (2).
static Atomic<bool> sStackWalkLocksInitialized;
static Array<SRWLOCK*, 2> sStackWalkLocks;

MFBT_API
void InitializeStackWalkLocks(const Array<void*, 2>& aStackWalkLocks) {
  sStackWalkLocks[0] = reinterpret_cast<SRWLOCK*>(aStackWalkLocks[0]);
  sStackWalkLocks[1] = reinterpret_cast<SRWLOCK*>(aStackWalkLocks[1]);
  sStackWalkLocksInitialized = true;
}

// Strategy (2): Stack Walk Suppressions
//
// We're using an atomic counter rather than a critical section because we
// don't require mutual exclusion with the stack walker. If the stack walker

@@ -157,6 +205,24 @@ AutoSuppressStackWalking::~AutoSuppressStackWalking() {
  DesuppressStackWalking();
}

bool IsStackWalkingSafe() {
  // Use strategy (1), if initialized.
  if (sStackWalkLocksInitialized) {
    bool isSafe = false;
    if (::TryAcquireSRWLockShared(sStackWalkLocks[0])) {
      if (::TryAcquireSRWLockShared(sStackWalkLocks[1])) {
        isSafe = true;
        ::ReleaseSRWLockShared(sStackWalkLocks[1]);
      }
      ::ReleaseSRWLockShared(sStackWalkLocks[0]);
    }
    return isSafe;
  }

  // Otherwise, fall back to strategy (2).
  return sStackWalkSuppressions == 0;
}

static uint8_t* sJitCodeRegionStart;
static size_t sJitCodeRegionSize;
uint8_t* sMsMpegJitCodeRegionStart;

@@ -375,17 +441,18 @@ static void DoMozStackWalkThread(MozWalkStackCallback aCallback,
# endif

# if defined(_M_AMD64) || defined(_M_ARM64)
  // If there are any active suppressions, then at least one thread (we don't
  // know which) is holding a lock that can deadlock RtlVirtualUnwind. Since
  // that thread may be the one that we're trying to unwind, we can't proceed.
  // If at least one thread (we don't know which) may be holding a lock that
  // can deadlock RtlLookupFunctionEntry, we can't proceed because that thread
  // may be the one that we're trying to walk the stack of.
  //
  // But if there are no suppressions, then our target thread can't be holding
  // a lock, and it's safe to proceed. By virtue of being suspended, the target
  // thread can't acquire any new locks during the unwind process, so we only
  // need to do this check once. After that, sStackWalkSuppressions can be
  // changed by other threads while we're unwinding, and that's fine because
  // we can't deadlock with those threads.
  if (sStackWalkSuppressions) {
  // But if there is no such thread by this point, then our target thread can't
  // be holding a lock, so it's safe to proceed. By virtue of being suspended,
  // the target thread can't acquire any new locks during our stack walking, so
  // we only need to do this check once. Other threads may temporarily acquire
  // the locks while we're walking the stack, but that's mostly fine -- calling
  // RtlLookupFunctionEntry will make us wait for them to release the locks,
  // but at least we won't deadlock.
  if (!IsStackWalkingSafe()) {
    return;
  }

@@ -7,12 +7,30 @@
#ifndef mozilla_StackWalk_windows_h
#define mozilla_StackWalk_windows_h

#include "mozilla/Array.h"
#include "mozilla/Types.h"

#if defined(_M_AMD64) || defined(_M_ARM64)
/**
 * Allow stack walkers to work around the egregious win64 dynamic lookup table
 * list API by locking around SuspendThread to avoid deadlock.
 * This function enables strategy (1) for avoiding deadlocks between the stack
 * walking thread and the suspended thread. In aStackWalkLocks the caller must
 * provide pointers to the two ntdll-internal SRW locks acquired by
 * RtlLookupFunctionEntry. These locks are LdrpInvertedFunctionTableSRWLock and
 * RtlpDynamicFunctionTableLock -- we don't need to know which one is which.
 * Until the InitializeStackWalkLocks function is called, strategy (2) is used.
 *
 * See comment in StackWalk.cpp
 */
MFBT_API
void InitializeStackWalkLocks(const mozilla::Array<void*, 2>& aStackWalkLocks);

/**
 * As part of strategy (2) for avoiding deadlocks between the stack walking
 * thread and the suspended thread, we mark stack walk suppression paths by
 * putting them under the scope of an AutoSuppressStackWalking object. Any code
 * path that may do an exclusive acquire of LdrpInvertedFunctionTableSRWLock or
 * RtlpDynamicFunctionTableLock should be marked this way, to ensure that
 * strategy (2) can properly mitigate all deadlock scenarios.
 *
 * See comment in StackWalk.cpp
 */

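For reference, a minimal usage sketch of the strategy-(2) marker described above, using the RAII type declared in this header; MyCallThatMayTakeTheLocksExclusively is a hypothetical stand-in for a call such as LdrLoadDll that may exclusively acquire one of these locks:

// Hypothetical example of marking a stack walk suppression path.
void MyCallThatMayTakeTheLocksExclusively();  // stand-in, not a real API

void DoRiskyWork() {
  // While this RAII marker is alive, stack walking is treated as unsafe
  // under strategy (2).
  AutoSuppressStackWalking suppress;
  MyCallThatMayTakeTheLocksExclusively();
}  // suppression ends when `suppress` goes out of scope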
@@ -13,7 +13,7 @@
#include <windows.h>
#include <winternl.h>

#if defined(MOZ_DIAGNOSTIC_ASSERT_ENABLED) && defined(_M_X64)
#if defined(_M_AMD64)

namespace mozilla {

@@ -23,9 +23,9 @@ static bool sIsSingleStepping = false;

MFBT_API AutoOnSingleStepCallback::AutoOnSingleStepCallback(
    OnSingleStepCallback aOnSingleStepCallback, void* aState) {
  MOZ_DIAGNOSTIC_ASSERT(!sIsSingleStepping && !sOnSingleStepCallback &&
                            !sOnSingleStepCallbackState,
                        "Single-stepping is already active");
  MOZ_RELEASE_ASSERT(!sIsSingleStepping && !sOnSingleStepCallback &&
                         !sOnSingleStepCallbackState,
                     "Single-stepping is already active");

  sOnSingleStepCallback = std::move(aOnSingleStepCallback);
  sOnSingleStepCallbackState = aState;

@@ -42,7 +42,7 @@ MFBT_API AutoOnSingleStepCallback::~AutoOnSingleStepCallback() {
// a first single-step exception. It is then up to the exception handler to
// keep the trap flag enabled so that a new single step exception gets
// triggered with the following instruction.
MFBT_API MOZ_NEVER_INLINE __attribute__((naked)) void EnableTrapFlag() {
MFBT_API MOZ_NEVER_INLINE MOZ_NAKED void EnableTrapFlag() {
  asm volatile(
      "pushfq;"
      "orw $0x100,(%rsp);"

@@ -53,7 +53,7 @@ MFBT_API MOZ_NEVER_INLINE __attribute__((naked)) void EnableTrapFlag() {
// This function does not do anything special, but when we reach its address
// while single-stepping the exception handler will know that it is now time to
// leave the trap flag turned off.
MFBT_API MOZ_NEVER_INLINE __attribute__((naked)) void DisableTrapFlag() {
MFBT_API MOZ_NEVER_INLINE MOZ_NAKED void DisableTrapFlag() {
  asm volatile("retq;");
}

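To illustrate the trap-flag protocol these two functions rely on, here is a hedged sketch of the kind of logic a single-step exception handler can apply; the actual SingleStepExceptionHandler body is not part of this excerpt, and the sketch assumes x86-64, where bit 0x100 of EFLAGS is the trap flag and the DisableTrapFlag declaration from WindowsDiagnostics.h is visible:

// Illustrative sketch, not the real SingleStepExceptionHandler.
LONG ExampleSingleStepHandler(_EXCEPTION_POINTERS* aExceptionInfo) {
  if (aExceptionInfo->ExceptionRecord->ExceptionCode != EXCEPTION_SINGLE_STEP) {
    return EXCEPTION_CONTINUE_SEARCH;
  }
  CONTEXT* context = aExceptionInfo->ContextRecord;
  if (context->Rip == reinterpret_cast<DWORD64>(&mozilla::DisableTrapFlag)) {
    // We reached the sentinel function: leave the trap flag cleared so that
    // execution resumes normally from here on.
    return EXCEPTION_CONTINUE_EXECUTION;
  }
  // Give the registered callback a chance to inspect this step here, then
  // re-arm the trap flag so the next instruction also raises a single-step
  // exception (the CPU clears TF when the exception is delivered).
  context->EFlags |= 0x100;
  return EXCEPTION_CONTINUE_EXECUTION;
}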
@@ -78,4 +78,4 @@ MFBT_API LONG SingleStepExceptionHandler(_EXCEPTION_POINTERS* aExceptionInfo) {

} // namespace mozilla

#endif // MOZ_DIAGNOSTIC_ASSERT_ENABLED && _M_X64
#endif // _M_AMD64

@@ -86,9 +86,7 @@ struct WinErrorState {
  bool operator!=(WinErrorState const& that) const { return !operator==(that); }
};

// TODO This code does not have tests. Only use it on paths that are already
// known to crash. Add tests before using it in release builds.
#if defined(MOZ_DIAGNOSTIC_ASSERT_ENABLED) && defined(_M_X64)
#if defined(_M_AMD64)

using OnSingleStepCallback = std::function<bool(void*, CONTEXT*)>;

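As a concrete (hypothetical) example of this callback shape, a callback that merely counts executed instructions through the caller-provided state pointer and gives up after a fixed budget could look like this:

// Hypothetical OnSingleStepCallback: counts steps and stops after a budget.
static bool CountSteps(void* aState, CONTEXT* /* aContext */) {
  auto* stepCount = static_cast<size_t*>(aState);
  ++*stepCount;
  // Returning false ends single-stepping; the stepped code then continues to
  // run normally.
  return *stepCount < 10000;
}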
@@ -108,9 +106,6 @@ MFBT_API MOZ_NEVER_INLINE __attribute__((naked)) void EnableTrapFlag();
MFBT_API MOZ_NEVER_INLINE __attribute__((naked)) void DisableTrapFlag();
MFBT_API LONG SingleStepExceptionHandler(_EXCEPTION_POINTERS* aExceptionInfo);

// This block uses nt::PEHeaders and thus depends on NativeNt.h.
# if !defined(IMPL_MFBT)

// Run aCallbackToRun instruction by instruction, and between each instruction
// call aOnSingleStepCallback. Single-stepping ends when aOnSingleStepCallback
// returns false (in which case aCallbackToRun will continue to run

@@ -140,6 +135,9 @@ CollectSingleStepData(CallbackToRun aCallbackToRun,
  return WindowsDiagnosticsError::None;
}

// This block uses nt::PEHeaders and thus depends on NativeNt.h.
# if !defined(IMPL_MFBT)

template <int NMaxSteps, int NMaxErrorStates>
struct ModuleSingleStepData {
  uint32_t mStepsLog[NMaxSteps]{};

@@ -288,7 +286,7 @@ WindowsDiagnosticsError CollectModuleSingleStepData(

# endif // !IMPL_MFBT

#endif // MOZ_DIAGNOSTIC_ASSERT_ENABLED && _M_X64
#endif // _M_AMD64

} // namespace mozilla