Collecting dumps; under the hood
Or how dotnet-dump collect works on Linux
If you are trying to diagnose a misbehaving application in production, one option is to use the dotnet-dump tool to get a memory dump of the application so you can study it offline. With a dump, you can get stack traces, look at all the threads and even dig through the entire heap.
But how does dotnet-dump actually work?
This post summarizes what I found while digging through what dotnet dump collect does under the hood.
The client
Let’s start by looking at the first result of a Google search for “dotnet-dump”: the official documentation for the dotnet-dump command. That points to the NuGet package page for dotnet-dump. And the NuGet package page has a link to the source repository.
Now we can poke around in the https://github.com/dotnet/diagnostics repository [1] to find the source code for the dotnet dump collect command.
There’s a src/Tools directory which contains a dotnet-dump directory. That seems like a likely candidate for the source code of this tool. The Program.cs file seems like a good place to start digging into the code.
There’s a CollectCommand method in that file, which matches the collect sub-command name in the dotnet dump CLI tool. The code looks roughly like this:
private static Command CollectCommand() =>
    new Command(name: "collect", description: "Capture dumps from a process")
    {
        // Handler
        CommandHandler.Create<...>(new Dumper().Collect),
        // Options
        ProcessIdOption(), OutputOption(), DiagnosticLoggingOption(), CrashReportOption(), TypeOption(), ProcessNameOption()
    };
There’s a bit of boilerplate involving how subcommands, arguments and options are handled through System.CommandLine. The actual work is done (using what System.CommandLine calls a Handler) by calling Dumper.Collect.
Let’s look into that. The Dumper.Collect() method looks something like this:
public partial class Dumper
{
    public int Collect(...)
    {
        // Lots of error handling
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            // ...
            Windows.CollectDump(...);
        }
        else
        {
            var client = new DiagnosticsClient(processId);
            // ...
            client.WriteDump(...);
        }
    }
}
That checks for the OS (Windows or Linux/macOS) and then calls OS-specific code for each case. For Linux, it calls into the DiagnosticsClient class to create a dump.
The (relevant) core logic of the DiagnosticsClient class is this:
public sealed class DiagnosticsClient
{
    public DiagnosticsClient(int processId) : this(new PidIpcEndpoint(processId))
    {
    }

    // ...

    public void WriteDump(...)
    {
        IpcMessage request = CreateWriteDumpMessage(...);
        IpcMessage response = IpcClient.SendMessage(...);
        // lots of error handling and fallback if IPC response
        // indicates failure
    }
}
This client is process-ID based (this will come in handy later). To write a dump, it creates a message and then sends it over some IPC (Inter-Process Communication) mechanism via the IpcClient. A valid response confirms that the dump was created.
What is this IPC mechanism? Who’s listening on the other side?
.NET diagnostics IPC
The interesting bits of IpcClient look roughly like this:
internal class IpcClient
{
    public static IpcMessage SendMessage(IpcEndpoint endpoint, IpcMessage message)
    {
        IpcResponse response = SendMessageGetContinuation(endpoint, message);
        return response.Message;
    }

    public static IpcResponse SendMessageGetContinuation(IpcEndpoint endpoint, IpcMessage message)
    {
        Stream stream = endpoint.Connect(...);
        Write(stream, message);
        IpcMessage response = Read(stream);
        return new IpcResponse(...);
    }

    public static void Write(Stream stream, IpcMessage message)
    {
        byte[] buffer = message.Serialize();
        stream.Write(buffer, ...);
    }
}
When we call SendMessage, it calls SendMessageGetContinuation to do the heavy work and then returns the response. SendMessageGetContinuation connects to an endpoint, uses some form of serialization to convert the request message to an array of bytes and then writes those bytes into a stream.
Let’s dig into these one by one:
- The endpoint is represented by the IpcEndpoint class. Remember how DiagnosticsClient had created a PidIpcEndpoint instance explicitly? The PidIpcEndpoint and related classes look roughly like this:

  internal abstract class IpcEndpoint
  {
      // ...
  }

  internal class PidIpcEndpoint : IpcEndpoint
  {
      public static string IpcRootPath { get; } = Path.GetTempPath();
      int _pid;

      public override Stream Connect(TimeSpan timeout)
      {
          string address = GetDefaultAddress();
          return IpcEndpointHelper.Connect(address, timeout);
      }

      private string GetDefaultAddress()
      {
          // ...
          TryGetDefaultAddress(_pid, out string transportName);
          return transportName;
      }

      private static bool TryGetDefaultAddress(int pid, out string defaultAddress)
      {
          defaultAddress = Directory.GetFiles(IpcRootPath, $"dotnet-diagnostic-{pid}-*-socket")
              .FirstOrDefault();
          return defaultAddress != null;
      }
  }

  internal class IpcEndpointHelper
  {
      public static Stream Connect(...)
      {
          var socket = new IpcUnixDomainSocket();
          socket.Connect(new IpcUnixDomainSocketEndpoint(...));
          return new ExposedSocketNetworkStream(socket);
      }
  }

  The entry point to this code is the IpcEndpoint.Connect method. To summarize, when PidIpcEndpoint.Connect is called, it looks for a file matching a specific file name pattern in /tmp. After finding the file, the code opens it as a Unix domain socket and uses that for sending (and receiving) data.

- The serialization mechanism:

  internal class IpcMessage
  {
      public byte[] Serialize()
      {
          byte[] serializedData;
          using (var stream = new MemoryStream())
          using (var writer = new BinaryWriter(stream))
          {
              writer.Write(...);
              writer.Flush();
              serializedData = stream.ToArray();
          }
          return serializedData;
      }
  }

  This uses simple BinaryWriter-based serialization.
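To make the endpoint lookup concrete, here is a minimal Python sketch of the same idea: glob the temp directory for the socket file of a given process ID, then connect to it as a Unix domain socket. The function names find_diagnostic_socket and connect_to_runtime are mine (hypothetical), not something from the diagnostics repository, and the sketch skips the timeout and retry handling that the real client has.

```python
import glob
import os
import socket
import tempfile


def find_diagnostic_socket(pid, root=None):
    """Return the first diagnostics socket path for `pid`, or None.

    Mirrors the file name pattern used by PidIpcEndpoint in the post:
    dotnet-diagnostic-{pid}-*-socket inside the temp directory.
    """
    root = root or tempfile.gettempdir()
    pattern = os.path.join(root, f"dotnet-diagnostic-{pid}-*-socket")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def connect_to_runtime(pid, timeout=5.0):
    """Open a Unix domain socket connection to the target process (hypothetical helper)."""
    path = find_diagnostic_socket(pid)
    if path is None:
        raise FileNotFoundError(f"no diagnostics socket found for pid {pid}")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except OSError:
        sock.close()
        raise
    return sock
```

Calling connect_to_runtime(16887) against a running .NET process would hand back a connected socket, which plays the role of the Stream in the C# code above.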
Okay, so now we know that we are using BinaryWriter-based serialization to send a message to a socket. Is this something ad hoc or part of an intentionally designed feature in .NET?
.NET diagnostic sockets
It turns out that this socket is an intentional part of the .NET runtime.
Whenever a .NET process starts, it creates a socket (file) at /tmp/dotnet-diagnostic-${pid}-${random}-socket:
$ ps aux | grep 1688[7]
omajid 16887 0.0 0.3 273609920 103020 pts/6 Sl+ 16:42 0:00 /home/omajid/local/dotnet/microsoft/7.0.101/dotnet --roll-forward major bin/Debug/net6.0/Pause.dll
$ ls -al /tmp/dotnet-diagnostic-16887*socket
srw-------. 1 omajid omajid 0 Dec 22 16:42 /tmp/dotnet-diagnostic-16887-1636599-socket
$ stat /tmp/dotnet-diagnostic-16887-1636599-socket
File: /tmp/dotnet-diagnostic-16887-1636599-socket
Size: 0 Blocks: 0 IO Block: 4096 socket
Device: 0,37 Inode: 259 Links: 1
Access: (0600/srw-------) Uid: ( 1000/ omajid) Gid: ( 1000/ omajid)
Context: unconfined_u:object_r:user_tmp_t:s0
Access: 2022-12-22 16:42:42.097080354 -0500
Modify: 2022-12-22 16:42:42.097080354 -0500
Change: 2022-12-22 16:42:42.097080354 -0500
Birth: 2022-12-22 16:42:42.097080354 -0500
A custom protocol, based on BinaryWriter serialization, is used to send messages across this socket. This is what all the code that we have seen so far has been doing.
The full IPC protocol is documented in ipc-protocol.md.
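As a rough illustration of what ends up on the wire, here is a small Python sketch of the message header as I read it from ipc-protocol.md: a 14-byte magic string, a 16-bit total size, one byte each for the command set and command ID, and a 16-bit reserved field, all little-endian. Treat the exact layout and the example command values as assumptions to check against that document; build_message and parse_header are hypothetical names.

```python
import struct

# Assumed layout, per my reading of ipc-protocol.md:
# magic (14 bytes), size (uint16), command set (uint8),
# command id (uint8), reserved (uint16), little-endian.
MAGIC = b"DOTNET_IPC_V1\x00"
HEADER = struct.Struct("<14sHBBH")


def build_message(command_set, command_id, payload=b""):
    """Prefix `payload` with a header whose size field covers header + payload."""
    size = HEADER.size + len(payload)
    if size > 0xFFFF:
        raise ValueError("message does not fit in a 16-bit size field")
    return HEADER.pack(MAGIC, size, command_set, command_id, 0) + payload


def parse_header(data):
    """Return (size, command_set, command_id) or raise if the magic is wrong."""
    magic, size, command_set, command_id, _reserved = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a diagnostics IPC message")
    return size, command_set, command_id
```

A dump request would then be something like build_message(0x01, 0x01, payload), where both 0x01 values and the payload contents are placeholders that need to come from the protocol document.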
We still have a remaining question: who is listening on the other side and how do they handle these messages? The IPC protocol gives a great hint:
.. IPC Protocol [is] used for communicating with the dotnet core runtime’s Diagnostics Server
What is this?
The .NET Runtime
Following the hint, let’s try and dig through the dotnet/runtime code. If you want to follow along, you can find the source code for the .NET runtime at https://github.com/dotnet/runtime/ .
We can start by searching the CoreCLR VM in the runtime for anything related to diagnostics:
$ find src/coreclr/vm -iname '*diagnostic*'
src/coreclr/vm/diagnosticserveradapter.h
That seems like a great starting point! It seems to defer everything to ds-server.c.
The initialization code of the diagnostics server is defined in ds_server_init. It looks roughly like this:
bool ds_server_init(void)
{
    // lots of initialization
    ep_rt_thread_create(server_thread, ...);
}

void server_thread()
{
    // ...
    while (!server_shutting_down)
    {
        DiagnosticsIpcMessage message;
        ds_ipc_message_init(&message);
        ds_ipc_message_initialize_stream(&message, stream);
        switch (ds_ipc_header_get_command(...))
        {
        case DS_SERVER_COMMANDSET_DUMP:
            ds_dump_protocol_helper_handle_ipc_message(&message, stream);
            break;
        }
    }
}
This thread runs forever, waiting for any diagnostics commands.
When a dump command is received, this calls ds_dump_protocol_helper_handle_ipc_message to handle the message and write the dump to an on-disk location:
https://github.com/dotnet/runtime/blob/e467a5f65a4fb6b0b703a5c1c22c519114e99845/src/native/eventpipe/ds-dump-protocol.c#L243
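The dispatch step in that server loop boils down to "read the header, look at the command set, hand the rest to the matching handler". Here is a toy Python version of that idea. It is not the runtime’s code: the header layout and the value 0x01 for the dump command set are assumptions based on my reading of the protocol document, and handle_dump and dispatch are hypothetical names.

```python
import struct

# Assumed header layout: magic, size, command set, command id, reserved.
HEADER = struct.Struct("<14sHBBH")
COMMANDSET_DUMP = 0x01  # assumed value for the dump command set


def handle_dump(command_id, payload):
    """Stand-in for the dump handler; the real one goes on to write a dump."""
    return f"dump command {command_id:#04x} with {len(payload)} payload bytes"


# Command set -> handler, playing the role of the switch statement.
HANDLERS = {COMMANDSET_DUMP: handle_dump}


def dispatch(message):
    """Route one serialized message to the handler for its command set."""
    _magic, size, command_set, command_id, _reserved = HEADER.unpack_from(message)
    payload = message[HEADER.size:size]
    handler = HANDLERS.get(command_set)
    if handler is None:
        return f"unknown command set {command_set:#04x}"
    return handler(command_id, payload)
```

The real server keeps reading messages like this in a loop for as long as the process is alive.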
That eventually leads to PAL_GenerateCoreDump. That looks roughly like this:
PAL_GenerateCoreDump(
    ...)
{
    // ...
    std::vector<const char*> argvCreateDump;
    char* program = nullptr;
    char* pidarg = nullptr;
    PROCBuildCreateDumpCommandLine(argvCreateDump, &program, &pidarg);
    PROCCreateCrashDump(argvCreateDump);
}

PROCBuildCreateDumpCommandLine(
    std::vector<const char*>& argv,
    ...)
{
    // ...
    const char* DumpGeneratorName = "createdump";
    argv.push_back(program);
    argv.push_back(pidarg);
    // ...
}
Hang on a second, this just runs the createdump command! This command is included with the .NET runtime, on your disk:
$ find /usr/lib64/dotnet/ -name createdump
/usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.2/createdump
/usr/lib64/dotnet/shared/Microsoft.NETCore.App/6.0.13/createdump
createdump is also included with self-contained applications. There’s some discussion on how it should be removed here.
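The hand-off in this section (build an argument list, then launch a separate program and wait for it) is easy to picture with a few lines of Python. This is only a sketch of the shape of it: the real runtime code is C++ and passes more arguments than the program path and the process ID, and build_createdump_argv and run_dump_generator are hypothetical names.

```python
import subprocess


def build_createdump_argv(program, pid):
    """Build the argument vector: the program path followed by the target pid.

    The real command line carries more options; only the two values
    visible in the excerpt above are modelled here.
    """
    return [program, str(pid)]


def run_dump_generator(argv):
    """Launch the dump generator as a child process and return its exit code."""
    completed = subprocess.run(argv, check=False)
    return completed.returncode
```

For example, run_dump_generator(build_createdump_argv("/path/to/createdump", 16887)) would start the tool against process 16887, assuming that path exists on your machine.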
What does createdump do?
We can start by searching the dotnet/runtime repository for any file that might look like it’s relevant to the createdump command. Searching for such files leads me to a promisingly named createdump/main.cpp file: https://github.com/dotnet/runtime/blob/f1bdd5a6182f43f3928b389b03f7bc26f826c8bc/src/coreclr/debug/createdump/main.cpp
Let’s start with the main function. It looks like this:
int __cdecl main(const int argc, const char* argv[])
{
    // lots of argument parsing
    if (CreateDump(dumpPathTemplate, pid, ...))
    {
        // success
    }
    // cleanup and exit
}
CreateDump is defined in createdumpunix.cpp for Linux and macOS: https://github.com/dotnet/runtime/blob/f1bdd5a6182f43f3928b389b03f7bc26f826c8bc/src/coreclr/debug/createdump/createdumpunix.cpp#L14
There’s a ton of code to dig through, and a lot of it goes down into Linux-specific detail. I might do a detailed walk-through in another post. But here are the important points:
- CreateDump calls CrashInfo::EnumerateAndSuspendThreads, which uses ptrace(2) to suspend all threads in the .NET application.
- CreateDump calls CrashInfo::GatherCrashInfo to collect data:
  - Get information from /proc/$PID/auxv (the auxiliary vector data).
  - Get information from /proc/$PID/maps about the memory regions.
  - Use the DAC (Data Access Component) of the runtime to find the managed modules.
  - Unwind all the threads.
- Use the DAC again to enumerate the managed memory regions.
- Write the dump out as an ELF file.
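To give a feel for the /proc/$PID/maps step in the list above, here is a small Python sketch that parses the lines of that file into memory regions. createdump does this in C++ and does a lot more with the result; the MemoryRegion type and function names below are mine (hypothetical), and only the standard maps line format (address range, permissions, offset, device, inode, optional path) is assumed.

```python
from typing import NamedTuple


class MemoryRegion(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str


def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line into a MemoryRegion."""
    # Fields: address range, perms, offset, device, inode, optional path.
    fields = line.split(None, 5)
    addresses, perms, offset = fields[0], fields[1], fields[2]
    start, end = (int(part, 16) for part in addresses.split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return MemoryRegion(start, end, perms, int(offset, 16), path)


def read_regions(pid="self"):
    """Read every mapped region of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as maps:
        return [parse_maps_line(line) for line in maps if line.strip()]
```

Calling read_regions() with no argument lists the regions of the Python process itself, which is a quick way to see what this data looks like.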
On a side note, the last point is particularly interesting. createdump writes out a regular ELF core file. This is a standard core file, similar to those produced by other tools like gcore. It’s in a format that’s readable by both dotnet dump analyze and native debugging tools like lldb and gdb. The core file can be used to debug applications using gdb/lldb, but they will need the unmanaged (or native) debug symbols. That’s not true for dotnet dump analyze which (surprise!) again makes use of the DAC to figure out the managed state of the application.
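If you want to convince yourself that a dump really is a plain ELF core file, the ELF header is enough to tell. Here is a minimal Python sketch that checks the magic bytes and the e_type field (ET_CORE is 4 in the ELF specification). The function names are hypothetical, and this only looks at the first 18 bytes, so it says nothing about whether the rest of the file is intact.

```python
import struct

ET_CORE = 4  # e_type value for core files in the ELF specification


def is_elf_core(header):
    """Check the first 18 bytes of a file: ELF magic plus e_type == ET_CORE."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return False
    # e_ident[5] (EI_DATA): 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type == ET_CORE


def file_is_elf_core(path):
    """Read just enough of `path` to classify it."""
    with open(path, "rb") as f:
        return is_elf_core(f.read(18))
```

Running file_is_elf_core on a dump produced by dotnet dump collect on Linux should return True, the same as it would for a core file written by gcore.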
When all that is done, we finally get the core file that we were looking for!
Summary
That was a lot to chew through. So let’s do a quick recap of what happens when we use dotnet dump collect.
- The dotnet dump tool parses the user’s command and figures out that the user wants to trigger a particular type of dump.
- dotnet dump creates a specially crafted message that it sends to the target .NET application over the .NET diagnostics socket.
- The .NET runtime receives the message over the socket and parses it.
- The runtime then runs createdump as a separate process, pointing createdump to the .NET application itself.
- createdump pauses the target .NET application and collects everything needed from the application by walking through the managed memory (with the help of the DAC) and the unmanaged memory.
- createdump writes out the dump to disk.
At the end of this, we finally have a file on disk that contains the application’s dump.
There are some interesting consequences that come up because of this approach:
- .NET runtimes provide a mechanism for other applications to request information from them. If this mechanism is turned off (e.g., by setting DOTNET_EnableDiagnostics=0), then tools like dotnet-dump become useless.
- Thanks to a single protocol, it’s possible for dotnet dump collect to work against any number of different .NET runtimes and versions. All runtime-specific detail is handled by the runtime’s built-in createdump command.
- Some folks have tried removing the createdump binary from their published applications to save on size. Removing createdump from those applications means tools like dotnet-dump aren’t fully functional against them, which can make those applications harder to diagnose.
- If you really need to, you can take advantage of the diagnostics protocol and write your own custom tools to talk to the .NET runtime.
[1] I have linked to the classes/files on GitHub, but it might be easier to clone the repo and look through it using your favourite tools.