Shimming d3d12.dll for fun and profit, part 2
This post is a follow-up to Shimming d3d12.dll for fun and profit. If you haven’t read it yet, I strongly recommend doing so, as this post builds directly on it. The companion article on how to actually use the tools is GPU profiling for WebGPU workloads on Windows with Chrome.
What broke and why
The short answer is Chrome. The long answer is also Chrome.
At some point Chrome hardened its DLL loading policies and stopped loading our custom d3d12.dll even when launched with --disable-gpu-sandbox --disable-gpu-watchdog. The original trick relied on Windows’ DLL search order: place our d3d12.dll next to chrome.exe and Windows would pick it up before the one in System32. Chrome decided it didn’t want that anymore, which is honestly fair enough from a security standpoint. It just means we need a different plan.
The goal is the same as before: intercept D3D12CreateDevice, wrap the device, catch ID3D12SharingContract::Present and call an actual swapchain present to trigger the profiler. We just need to get our code into Chrome’s process without relying on the DLL search order.
The new plan
The solution has two parts.
First, a launcher executable (webgpu_profiling_launcher.exe) that creates Chrome in a suspended state, injects our hook DLL into the process, and then resumes it. Chrome wakes up with our code already loaded, none the wiser.
Second, a hook DLL (d3d12_webgpu_hook.dll) that, instead of replacing d3d12.dll entirely, uses Microsoft Detours to patch the entry point of D3D12CreateDevice in memory at runtime. Same interception, different delivery mechanism. The wrapper classes from part 1 are reused wholesale, so all that work wasn’t for nothing!
There’s actually a third problem hiding in here that I’ll get to later, but let’s start with the injection.
Injecting the DLL
The classic way to inject a DLL into a process is CreateRemoteThread + LoadLibrary. The idea is straightforward: LoadLibraryA is a regular function in kernel32.dll that takes a single pointer argument, which is exactly the shape of a thread entry point, and CreateRemoteThread lets you start a thread in another process at an arbitrary address. In practice kernel32.dll is mapped at the same base address in every process of the same bitness for a given boot session (long-standing Windows behaviour rather than a documented guarantee), so we can grab LoadLibraryA’s address in our own process, use it as the thread entry point in the target process, and pass the DLL path as the thread argument. The remote thread calls LoadLibraryA on our DLL as if the target process itself were doing it.
FARPROC loadLibAddr = GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA");
HANDLE thread = CreateRemoteThread(process, NULL, 0,
    (LPTHREAD_START_ROUTINE)loadLibAddr, remoteMem, 0, NULL);
Before that we of course need to write the DLL path string into the target process’s address space (this is the remoteMem pointer above) using VirtualAllocEx and WriteProcessMemory, since the string lives in our process memory and the remote thread can’t read it from there. We then wait for the remote thread to finish with WaitForSingleObject before moving on.
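Put together, the steps above might look like the sketch below. This is not the launcher’s actual InjectDllIntoProcess (which takes a third parameter not covered here); InjectDllSketch and RemotePathSize are hypothetical names, and error handling is kept to the minimum needed to avoid leaks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

// Bytes to copy into the target: the path plus its terminating NUL.
// LoadLibraryA reads a NUL-terminated string, so the terminator must be copied too.
inline std::size_t RemotePathSize(const char* dllPath)
{
    assert(dllPath != nullptr);
    return std::strlen(dllPath) + 1;
}

#ifdef _WIN32
// Hypothetical helper: returns true if the remote LoadLibraryA call ran to
// completion and reported a non-zero result.
inline bool InjectDllSketch(HANDLE process, const char* dllPath)
{
    const SIZE_T size = RemotePathSize(dllPath);

    // 1. Allocate memory in the target and copy the path into it.
    void* remoteMem = VirtualAllocEx(process, nullptr, size,
                                     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!remoteMem) return false;

    bool ok = false;
    HMODULE kernel32 = GetModuleHandleA("kernel32.dll");
    FARPROC loadLibAddr = kernel32 ? GetProcAddress(kernel32, "LoadLibraryA") : nullptr;
    if (loadLibAddr && WriteProcessMemory(process, remoteMem, dllPath, size, nullptr))
    {
        // 2. Start a thread in the target whose entry point is LoadLibraryA
        //    and whose argument is the remote copy of the path.
        HANDLE thread = CreateRemoteThread(process, nullptr, 0,
            reinterpret_cast<LPTHREAD_START_ROUTINE>(loadLibAddr),
            remoteMem, 0, nullptr);
        if (thread)
        {
            // 3. Wait for it; the thread exit code is LoadLibraryA's return
            //    value (only the low 32 bits of it on 64-bit builds).
            WaitForSingleObject(thread, INFINITE);
            DWORD exitCode = 0;
            ok = GetExitCodeThread(thread, &exitCode) && exitCode != 0;
            CloseHandle(thread);
        }
    }
    VirtualFreeEx(process, remoteMem, 0, MEM_RELEASE);
    return ok;
}
#endif
```

Freeing the remote buffer only after the wait matters: releasing it earlier would pull the string out from under LoadLibraryA while it is still reading it.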
The launcher does all of this on a Chrome process it created with CREATE_SUSPENDED. Chrome is frozen at its very first instruction, we inject, and then we call ResumeThread to let it run. Our hook is active before Chrome has had a chance to do anything.
BOOL created = CreateProcessA(
    chromePath.c_str(), (LPSTR)cmdLine.c_str(),
    NULL, NULL, FALSE,
    CREATE_SUSPENDED, // <-- freeze Chrome before it starts
    NULL, NULL, &si, &pi);
InjectDllIntoProcess(pi.hProcess, hookDll.c_str(), true);
ResumeThread(pi.hThread);
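The snippet above leaves out the setup of si and pi and the handle cleanup. A fuller sketch of the same flow could look like this. BuildCommandLine and LaunchSuspendedAndInject are hypothetical names, and the injection step is passed in as a callable so that the sketch stays independent of the real InjectDllIntoProcess.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: by convention the first token of the command line is the
// executable path, quoted so that spaces (e.g. "Program Files") don't split it.
inline std::string BuildCommandLine(const std::string& exePath, const std::string& args)
{
    assert(!exePath.empty());
    std::string cmd = "\"" + exePath + "\"";
    if (!args.empty()) cmd += " " + args;
    return cmd;
}

#ifdef _WIN32
// Hypothetical wrapper: start the process frozen, run `inject` on it, then let it run.
// `inject` is any callable taking the process HANDLE and returning bool.
template <typename InjectFn>
bool LaunchSuspendedAndInject(const std::string& exePath, const std::string& args,
                              InjectFn inject)
{
    std::string cmdLine = BuildCommandLine(exePath, args);

    STARTUPINFOA si = {};
    si.cb = sizeof(si);              // the structure size must be filled in
    PROCESS_INFORMATION pi = {};

    if (!CreateProcessA(exePath.c_str(), &cmdLine[0], nullptr, nullptr, FALSE,
                        CREATE_SUSPENDED, nullptr, nullptr, &si, &pi))
        return false;

    const bool injected = inject(pi.hProcess);

    // Resume regardless of the outcome so a failed injection doesn't leave
    // a frozen process behind.
    ResumeThread(pi.hThread);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return injected;
}
#endif
```

Resuming even when injection fails is a design choice made for this sketch; terminating the child instead would be just as defensible for a profiling tool.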
Hooking with Detours
Now that our DLL is loaded into Chrome’s process, we can patch D3D12CreateDevice. Microsoft Detours makes this almost embarrassingly easy. You give it the real function pointer and your hook function, and it rewrites the first few bytes of the real function with a jump to yours, saving the original bytes in a trampoline so you can still call through.
static PFN_D3D12_CREATE_DEVICE Real_D3D12CreateDevice = nullptr;

static HRESULT WINAPI Hook_D3D12CreateDevice(
    IUnknown* pAdapter,
    D3D_FEATURE_LEVEL MinimumFeatureLevel,
    REFIID riid,
    void** ppDevice)
{
    HRESULT res = Real_D3D12CreateDevice(pAdapter, MinimumFeatureLevel, riid, ppDevice);
    if (res == S_OK && ppDevice && *ppDevice)
    {
        ID3D12Device_webgpu_shim* device = new ID3D12Device_webgpu_shim(reinterpret_cast<IUnknown*>(*ppDevice));
        *ppDevice = device;
    }
    return res;
}
Installing and removing the hook is wrapped in a Detours transaction, which handles the thread-safe patching:
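static LONG AttachHooks()
{
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    // DetourAttach rejects a null target, so only attach once Real_D3D12CreateDevice
    // has been resolved (i.e. d3d12.dll is actually loaded in this process).
    if (Real_D3D12CreateDevice)
        DetourAttach(&(PVOID&)Real_D3D12CreateDevice, Hook_D3D12CreateDevice);
    return DetourTransactionCommit();
}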
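If the call-through pattern is unfamiliar, it can be pictured without Detours at all. The sketch below is only an analogy: it swaps a plain function pointer where Detours patches machine code, and every name in it (Device, DeviceShim, g_real, g_entry, Attach, Detach) is a hypothetical stand-in rather than a D3D12 or Detours type.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real device and the wrapper class from part 1.
struct Device { int id; };
struct DeviceShim { Device* inner; };

using CreateFn = int (*)(void** out);

Device g_device{7};

// The "real" creation function: hands back the raw device.
inline int RealCreate(void** out) { *out = &g_device; return 0; }

// Plays the role of Real_D3D12CreateDevice: the way back to the original code.
CreateFn g_real = RealCreate;
// The address callers end up at. Detours redirects by patching code;
// in this analogy a pointer is swapped instead.
CreateFn g_entry = RealCreate;

// The hook: call through to the original, then wrap whatever it returned.
inline int HookCreate(void** out)
{
    int res = g_real(out);
    if (res == 0 && out && *out)
    {
        static DeviceShim shim{};               // one wrapper is enough for the sketch
        shim.inner = static_cast<Device*>(*out);
        *out = &shim;                           // the caller now holds the wrapper
    }
    return res;
}

inline void Attach() { g_entry = HookCreate; }
inline void Detach() { g_entry = RealCreate; }
```

Callers going through g_entry get the raw device before Attach() and the wrapper after it, while the hook itself can always reach the original through g_real. That is the same relationship Hook_D3D12CreateDevice has with Real_D3D12CreateDevice once the transaction commits.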
Once D3D12CreateDevice is hooked, the rest of the logic is identical to part 1. We wrap the returned device, intercept command queue creation to get at ID3D12SharingContract, and call swapChain_->Present(0, 0) from there. The profiler sees a present, catches the frame boundary, and the capture completes.
One small thing worth noting: in DllMain we call DetourRestoreAfterWith() before attaching the hooks. This is a Detours formality for DLLs loaded via DetourCreateProcessWithDll or injection: it restores the in-memory import table to a clean state before we start patching. Skipping it causes undefined behaviour. I skipped it the first time and spent a while confused about why things were crashing. Don’t be me.
case DLL_PROCESS_ATTACH:
    DetourRestoreAfterWith();
    DisableThreadLibraryCalls(hinstDLL);
    GetModuleFileNameA(hinstDLL, g_hookDllPath, MAX_PATH);
    AttachHooks();
    break;
The GPU child process problem
Here is the part I hinted at earlier. Chrome is not a single process. It has a main browser process and a separate GPU process (--type=gpu) that is actually responsible for the Direct3D work. When we inject into the main Chrome process and patch D3D12CreateDevice, we are patching it in the wrong process. The GPU process is spawned after Chrome starts and doesn’t inherit our hook.
The fix is to also hook CreateProcessW in the main Chrome process. Every time Chrome spawns a child process, our hook runs first. We inspect the command line and if we see --type=gpu, we know this is the GPU process. We force CREATE_SUSPENDED, inject ourselves into it the same way the launcher injected us into Chrome, and then resume it.
static BOOL WINAPI Hook_CreateProcessW(
    LPCWSTR lpApplicationName, LPWSTR lpCommandLine, ...)
{
    // lpCommandLine is allowed to be NULL, so check it before searching.
    bool isGpuProcess = lpCommandLine && wcsstr(lpCommandLine, L"--type=gpu") != nullptr;
    DWORD flags = dwCreationFlags;
    if (isGpuProcess) flags |= CREATE_SUSPENDED;
    BOOL result = Real_CreateProcessW(lpApplicationName, lpCommandLine, ..., flags, ...);
    if (result && isGpuProcess)
    {
        InjectDllIntoProcess(lpProcessInformation->hProcess, g_hookDllPath, false);
        // Only resume if the caller didn't ask for a suspended process itself.
        if (!(dwCreationFlags & CREATE_SUSPENDED))
            ResumeThread(lpProcessInformation->hThread);
    }
    return result;
}
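The command-line test is the one piece of this hook that is easy to get subtly wrong, so here it is pulled out into a standalone sketch. IsGpuProcessCommandLine and CheckGpuMatching are hypothetical names; the matching rule is the same plain substring search as in the hook.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: true if a child's command line carries --type=gpu.
// CreateProcessW accepts a NULL lpCommandLine (when only lpApplicationName is
// given), so the pointer is checked before it is searched.
inline bool IsGpuProcessCommandLine(const wchar_t* commandLine)
{
    return commandLine != nullptr &&
           std::wcsstr(commandLine, L"--type=gpu") != nullptr;
}

// Usage: a few cases showing what does and does not match.
inline void CheckGpuMatching()
{
    assert(IsGpuProcessCommandLine(L"chrome.exe --type=gpu --foo"));
    assert(!IsGpuProcessCommandLine(L"chrome.exe --type=renderer"));
    assert(!IsGpuProcessCommandLine(nullptr));
}
```

Because this is a substring search, any longer value that starts the same way (for example --type=gpu-process) matches too. That is convenient here, but it also means the test is looser than an exact comparison of the switch value.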
To know which DLL to inject, we store our own path at load time using GetModuleFileNameA(hinstDLL, g_hookDllPath, MAX_PATH) in DllMain. The hook DLL injects itself.
This propagation step is why the hook DLL has to attempt the D3D12CreateDevice hook gracefully, even when d3d12.dll isn’t loaded. In the main Chrome process, d3d12.dll probably isn’t loaded at all. That’s fine: the hook DLL just skips that hook there. In the GPU process, d3d12.dll will be there and the hook lands properly.
The Nsight problem
There is one more wrinkle I haven’t mentioned yet. I actually released this blog post and the code before dealing with it, oops. I got lazy, didn’t test Nsight and forgot that it has to launch the process itself. For some reason, having Nsight launch the launcher doesn’t work well, so I had to come up with another solution.
First of all, while investigating I realized that Nsight wouldn’t attach to Chrome at all anymore. It used to attach fine; the issue back then was only that it couldn’t capture properly. So I started Googling and found a thread that had already figured out the cause. (Funnily enough, that thread mentions my blog, so this closes the loop.) Nsight needs to run as administrator to be able to do a GPU trace. And when Chrome is launched by an elevated process, it notices and de-elevates itself: it spawns a new non-elevated Chrome process and the original one exits.
This is likely a Chrome security feature meant to prevent accidentally running the browser with admin rights. Reasonable! But it means Nsight loses track of the process it just started. The fix is one flag: --do-not-de-elevate. Passing it to Chrome suppresses the de-elevation step entirely. Chrome stays at the same privilege level as Nsight, so Nsight is able to keep its connection to the correct process.
Good, we’ve solved one problem. However, launching the launcher from Nsight with the added flag still doesn’t work. My best guess is that if Nsight isn’t the one creating Chrome’s process, it won’t attach to it. Fine, we’ll inject ourselves after the fact then.
The implementation of this after-the-fact injector is straightforward. Given a process name like chrome.exe, it uses CreateToolhelp32Snapshot to walk the list of running processes and collect every matching PID. It then calls OpenProcess on each one and runs the same InjectDllIntoProcess function the launcher uses: VirtualAllocEx, WriteProcessMemory, CreateRemoteThread on LoadLibraryA, then wait for it to finish. Same mechanism, just pointed at a process we didn’t create ourselves.
One thing worth noting: Chrome is running elevated at this point (Nsight launched it), and opening an elevated process with the access rights required for remote thread injection only works from a process that is elevated as well. The injector is therefore compiled with a requireAdministrator UAC manifest. It will always prompt for elevation.
One last thing: injecting into the main Chrome process gets the CreateProcessW hook in place, but if the tab with the WebGPU workload was already open, the GPU child process was spawned before we arrived. Reloading the tab kills and recreates the GPU process, which our hook then intercepts and injects into automatically.
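The enumeration part could be sketched as follows. SameProcessName, FindProcessIdsByName and OpenForInjection are hypothetical names, not the injector’s actual functions; the access rights passed to OpenProcess are the ones the CreateRemoteThread documentation lists as required on the process handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#include <tlhelp32.h>
#endif

// Case-insensitive comparison, since Windows file names are not case-sensitive.
inline bool SameProcessName(const std::wstring& a, const std::wstring& b)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::towlower(a[i]) != std::towlower(b[i])) return false;
    return true;
}

#ifdef _WIN32
// Walk a snapshot of all running processes and collect the PIDs whose
// executable name matches exeName (e.g. L"chrome.exe").
inline std::vector<DWORD> FindProcessIdsByName(const std::wstring& exeName)
{
    assert(!exeName.empty());
    std::vector<DWORD> pids;
    HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snapshot == INVALID_HANDLE_VALUE) return pids;

    PROCESSENTRY32W entry = {};
    entry.dwSize = sizeof(entry);    // must be set before Process32FirstW
    if (Process32FirstW(snapshot, &entry))
    {
        do
        {
            if (SameProcessName(entry.szExeFile, exeName))
                pids.push_back(entry.th32ProcessID);
        } while (Process32NextW(snapshot, &entry));
    }
    CloseHandle(snapshot);
    return pids;
}

// Open a process with the rights needed for VirtualAllocEx, WriteProcessMemory
// and CreateRemoteThread. Returns NULL on failure (for example, access denied).
inline HANDLE OpenForInjection(DWORD pid)
{
    return OpenProcess(PROCESS_CREATE_THREAD | PROCESS_QUERY_INFORMATION |
                       PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_VM_READ,
                       FALSE, pid);
}
#endif
```

A name match returns every chrome.exe process, not just the main browser process, which is consistent with injecting into each matching PID as described above.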
Closing thoughts
Honestly, the new approach feels more practical than the old one, even if it involves more moving parts. We’re no longer touching Chrome’s installation directory at all, which is less footgun-y. Detours is a proper library designed for exactly this use case, compared to rolling a full DLL replacement by hand. The wrapper classes from part 1 are unchanged; that code is doing the same job it always did, just getting delivered differently.
That said, it is still a massive hack. Whether Chrome will keep spawning its GPU process with --type=gpu in the command line forever, I have absolutely no idea. It could change any day. Enjoy it while it works!
If you’d like to see this kind of profiler support available natively in Chrome without any of this, please upvote this Chromium issue.