Building FunctionTrace, a graphical Python profiler

Note: A version of this article was originally published by Mozilla Hacks.

I recently built a tool to help developers better understand what their Python code is doing. FunctionTrace is a non-sampled profiler for Python that runs on unmodified Python applications with very low (<5%) overhead. Importantly, it's integrated with the Firefox Profiler, which allows you to graphically interact with profiles, making it easier to spot patterns and make improvements to your codebase. In this post, I'll discuss why FunctionTrace was built, some technical details of its implementation, and how tools like this can target the Firefox Profiler as a powerful open-source visualization tool. To help follow along, you can also play with a small demo of it!

Technical debt as motivation

When working on complex projects with many people, codebases have a tendency to grow larger over time. Some languages have great support for dealing with this, such as Java with all of its IDE capabilities built up over decades, or Rust with its strong type system that makes refactoring a breeze. Codebases in other languages sometimes seem to become increasingly less maintainable as they grow. Particularly in older Python codebases (at least we're all on Python 3 now, right?), I've observed that it can be extremely difficult to make broad changes or refactor pieces of code you're not familiar with. In contrast, I've found that when I'm able to see what a program is doing and all of the interactions it has, I have a much easier time making correct changes. Often, I'll even find myself making improvements to pieces of the code that I'd never intended to touch, as inefficiencies become glaringly obvious when presented on my screen.

I wanted to be able to understand what the Python codebases I was working in were doing without needing to read through hundreds of files, but was unable to find any existing tools for Python that were satisfactory for this, and I mostly lost interest in building a tool myself due to the amount of UI work that would be necessary. However, when I stumbled across the Firefox Profiler, my hopes of quickly understanding a program's execution were reignited. The profiler provided all of the "hard" pieces - an intuitive open-source UI that could display stack charts, time-correlated log markers, a flame graph, and the stability that comes from ties to a major web browser. Any tool able to emit a properly-formatted JSON profile can reuse all of those graphical analysis features.

Design of FunctionTrace

Luckily, I already had a week of vacation scheduled for a few days after I discovered the Firefox Profiler, and knew another friend who was taking time off that week and was interested in building it with me.


We had several goals when starting to build FunctionTrace:

  1. Give the ability to see everything occurring in the program
  2. Be able to handle multithreaded/multiprocess applications
  3. Be low enough overhead that we could use it without a performance tradeoff

The first goal had a significant impact on the design, while the latter two mostly just added engineering complexity. From having used other tools like this, we both knew the frustration of not being able to see function calls that are too short - when you're sampling at 1ms but have important functions that run faster than that, you're missing significant pieces of what's occurring inside your program! As a result, we knew we'd need to be able to trace all function calls and could not use a sampling profiler. Additionally, I'd recently spent a lot of time in a codebase where Python functions would exec other Python code (frequently via an intermediary shell script), so we knew we'd want to be able to also trace descendant Python processes.

Initial Implementation

To support multiple processes and descendants, we settled on a client-server model. We'd instrument Python clients, which send trace data to a Rust server to aggregate and compress before generating a profile that could be consumed by the Firefox Profiler. We chose Rust for several reasons, including the strong type system, a desire for stable performance and predictable memory usage, and ease of prototyping and refactoring.

We prototyped the client as a Python module, invoked via python -m functiontrace. This allowed us to easily use Python's builtin tracing hooks to log what was executed. The initial implementation looked very similar to the following:

import sys
import time

def profile_func(frame, event, arg):
    if event in ("call", "return", "c_call", "c_return"):
        data = (event, time.time())
        # ... batch `data` and send it to the profile server ...

sys.setprofile(profile_func)

For the server, we listened on a Unix domain socket for client connections, then read data from each client and converted it into the Firefox Profiler's JSON format. While the Firefox Profiler supports various profile types, such as perf logs, we decided to emit directly to its internal format, since that requires less space and maintenance than adding a new supported format. Importantly, the Firefox Profiler maintains backwards compatibility for profile versions, meaning any profile we emit targeting the current format version will be automatically converted to the latest version when loaded in the future. Additionally, the profiler format references strings by integer ID, allowing significant space savings via deduplication (while being trivial to implement using indexmap).
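
The string-deduplication idea is simple enough to sketch in a few lines of Python. This is an illustrative model of the technique, not the actual functiontrace-server code (which is written in Rust on top of indexmap):

```python
class StringTable:
    """Interns strings and hands out stable integer IDs, mimicking the
    Firefox Profiler's string-table scheme."""

    def __init__(self):
        self._ids = {}      # string -> ID
        self._strings = []  # ID -> string

    def intern(self, s):
        # Return the existing ID if we've already stored this string.
        if s in self._ids:
            return self._ids[s]
        self._ids[s] = len(self._strings)
        self._strings.append(s)
        return self._ids[s]

# Repeated function names collapse to a single stored copy.
table = StringTable()
a = table.intern("main")    # new string -> ID 0
b = table.intern("helper")  # new string -> ID 1
c = table.intern("main")    # seen before -> ID 0 again
```

Profile entries then carry small integers instead of repeated strings, which matters when the same function name appears millions of times in a trace.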

This initial base generally worked - on every function call/return Python would call our hook, which would send a JSON message out over a socket for the server to convert into the proper format. However, it was incredibly slow - even after batching the socket calls we observed at least 8x overhead on some of our test programs! At this point, we dropped down to C using Python's C API instead and got down to 1.1x overhead on the same programs. After that, we were able to do another key optimization by replacing calls to time.time() with raw rdtsc operations. This reduced the performance overhead for function calls to one instruction and emitting 64 bits of data, rather than having a chain of Python calls and complex arithmetic in the critical path.
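
The cost of a pure-Python hook is easy to see for yourself. This micro-benchmark (illustrative only; the exact ratio varies by machine) installs a hook similar to our early prototype and times a call-heavy workload with and without it:

```python
import sys
import time

events = []

def noisy_hook(frame, event, arg):
    # Comparable to the early prototype: record every profiling event.
    events.append((event, time.time()))

def measure(workload, hook=None):
    """Time `workload` with an optional sys.setprofile() hook installed."""
    sys.setprofile(hook)
    start = time.perf_counter()
    workload()
    sys.setprofile(None)
    return time.perf_counter() - start

def workload():
    def f(x):
        return x + 1
    for i in range(100_000):
        f(i)

baseline = measure(workload)
traced = measure(workload, noisy_hook)
# Expect `traced` to be several times `baseline` on most machines.
```

Every call and return pays for a Python-level function invocation plus a time.time() call, which is exactly the overhead that moving the hook into C and using rdtsc eliminated.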

I've mentioned that we support tracing multiple threads and descendant processes. Since this was one of the more difficult pieces of the client, it's worth discussing some lower-level details.

Supporting multiple threads

We install a handler on all threads via threading.setprofile(). Registering through a handler like this has a useful property: we know Python is running and the GIL is currently held when our thread state is being set up, which lets us simplify some assumptions:

// This is installed as the setprofile() handler for new threads by
// threading.setprofile().  On its first execution, it initializes tracing for
// the thread, including creating the thread state, before replacing itself with
// the normal Fprofile_FunctionTrace handler.
static PyObject* Fprofile_ThreadFunctionTrace(..args..) {

    // Replace our setprofile() handler with the real one, then manually call
    // it to ensure this call is recorded.

When our Fprofile_ThreadFunctionTrace() hook is called, it allocates a struct ThreadState which contains information the thread will need to log events and communicate to the server. We then send an initial message to the profile server, notifying it that a new thread has started and giving it some initial information (time, PID, etc). After this initialization, we replace the hook with Fprofile_FunctionTrace() which does the actual tracing in the future.
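
The self-replacing handler trick translates naturally back into Python. Here's a simplified sketch of the same logic (illustrative names and a dict standing in for struct ThreadState; the real implementation lives in the C module):

```python
import sys
import threading

events = []
thread_states = []

def function_trace(frame, event, arg):
    # The "real" per-event handler (the C module's Fprofile_FunctionTrace).
    events.append(event)

def thread_function_trace(frame, event, arg):
    # First event on a new thread: Python is running and the GIL is held,
    # so it's safe to create per-thread state and notify the server here.
    thread_states.append(threading.current_thread().name)  # stand-in for ThreadState
    # Replace ourselves with the real handler, then call it manually so
    # this first event is recorded as well.
    sys.setprofile(function_trace)
    function_trace(frame, event, arg)

# Future threads start profiling via our one-shot setup handler.
threading.setprofile(thread_function_trace)

t = threading.Thread(target=lambda: sum(range(10)))
t.start()
t.join()
```

After the thread runs, the setup handler has fired exactly once for it, and all subsequent events went through the fast-path handler.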

Supporting descendant processes

When handling multiple processes, we assume that children are run via a python interpreter. Unfortunately, the children won't be invoked with -m functiontrace, so we won't know to trace them. To ensure that child processes are traced, on startup we modify the $PATH environment variable so that python points to an executable that knows to load functiontrace.

# Generate a temp directory to store our wrappers in.  We'll temporarily
# add this directory to our path.
tempdir = tempfile.mkdtemp(prefix="py-functiontrace")
os.environ["PATH"] = tempdir + os.pathsep + os.environ["PATH"]

# Generate wrappers for the various Python versions we support to ensure
# they're included in our PATH.
wrap_pythons = ["python", "python3", "python3.6", "python3.7", "python3.8"]
for python in wrap_pythons:
    with open(os.path.join(tempdir, python), "w") as f:
        # Write the wrapper script contents here, then mark it executable.
        os.chmod(f.name, 0o755)

Inside the wrappers, we simply need to call the real python interpreter with the additional argument of -m functiontrace. To wrap this support up, on startup we add an environment variable saying what socket we're using to communicate to the profile server. If a client initializes and sees this environment variable already set, it knows that it's a descendant process and to connect to the existing server instance, allowing us to correlate its trace with that of the original client.
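
The environment-variable handshake can be sketched as follows. The variable name and socket path here are hypothetical stand-ins, not functiontrace's actual identifiers:

```python
import os

# Hypothetical variable name; the real one is an implementation detail of
# functiontrace.
SOCKET_ENV = "FUNCTIONTRACE_SOCKET"

def connect_to_server(env):
    """Return the server socket path, spawning a server only when we're the
    first (root) process in the trace."""
    if SOCKET_ENV in env:
        # A parent already started a server: connect to it so our trace is
        # correlated with the original client's.
        return env[SOCKET_ENV]
    # We're the root process: spawn functiontrace-server here (elided) and
    # advertise its socket to any children we later exec.
    path = "/tmp/functiontrace.sock"  # illustrative path
    env[SOCKET_ENV] = path
    return path

# The root process creates the socket; a simulated descendant reuses it.
env = {}
root = connect_to_server(env)
child = connect_to_server(env)
```

Because the variable rides along in the environment across exec, arbitrarily deep chains of Python → shell → Python all report back to the same server.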

Current implementation

The overall implementation of FunctionTrace today shares many similarities with the above descriptions. At a high level, the client is traced via FunctionTrace when invoked as python -m functiontrace, which loads a Python module for some setup, then calls into our C module to install various tracing hooks. These hooks include the sys.setprofile hooks mentioned above, memory allocation hooks, and custom hooks on various "interesting" functions, like builtins.print or builtins.__import__. Additionally, we spawn a functiontrace-server instance, set up a socket for talking to it, and ensure that future threads and descendant processes will talk to the same server.

On every trace event, the Python client emits a small MessagePack record containing minimal event information and a timestamp to a thread-local memory buffer. When the buffer fills up (every 128KB), it is dumped to the server via a shared socket and the client continues to execute. The server listens asynchronously to each of the clients, quickly consuming their trace logs into a separate buffer to avoid blocking them. A thread corresponding to each client is then able to parse each trace event and convert it into the proper end format. Once all connected clients exit, the per-thread logs are aggregated into a full profile log and emitted to a file, where they can then be used with the Firefox Profiler.
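
The client-side buffering can be modeled in a few lines. This sketch uses struct as a stand-in for the MessagePack encoding and an in-memory list as a stand-in for the server socket; only the 128KB flush threshold comes from the text above:

```python
import struct

FLUSH_THRESHOLD = 128 * 1024  # flush every 128KB, as described above

class EventBuffer:
    """Thread-local buffer of trace records, flushed to the server whenever
    it fills up so the hot path rarely touches the socket."""

    def __init__(self, send):
        self._send = send  # e.g. socket.sendall on the shared socket
        self._buf = bytearray()

    def record(self, event_id, timestamp_ns):
        # A tiny fixed-size record; the real client emits MessagePack.
        self._buf += struct.pack("<BQ", event_id, timestamp_ns)
        if len(self._buf) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()

# Stand-in for the server socket: collect flushed chunks in a list.
chunks = []
buf = EventBuffer(chunks.append)
for i in range(20_000):
    buf.record(1, i)
buf.flush()  # drain whatever is left at exit
```

Twenty thousand 9-byte records cross the threshold once mid-stream, so the "socket" sees exactly two writes rather than twenty thousand.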

Lessons Learned

Having a Python C module gives significantly more power and performance, but comes at the cost of more code, documentation that's harder to find, and few easily accessible features. While C modules appear to be an under-utilized tool for writing high-performance Python modules (based on some FunctionTrace profiles I've seen), we'd recommend a balance: write most of the non-performance-critical code in Python, and drop into C for the inner loops and setup code where Python doesn't shine.

JSON encoding/decoding can be incredibly slow when the human-readable aspect isn't necessary. We switched to MessagePack for client-server communication and found it just as easy to work with while cutting some of our benchmark times in half!

Multithreading profiling support in Python is pretty hairy, so it's understandable why it doesn't seem to have been a key feature in previous mainstream Python profilers. It took several different approaches and many segfaults before we had a good understanding of how to operate around the GIL while maintaining high performance.

Please extend the profiler ecosystem!

This project wouldn't have existed without the Firefox Profiler, as it would've simply been too time consuming to create a complex frontend without even knowing if it would be beneficial. We hope to see other projects targeting the Firefox Profiler, either by adding native support for the Profiler format like FunctionTrace did, or by contributing support for their own formats. While FunctionTrace isn't entirely done yet, I hope seeing it or this blog post can help show other developers the Firefox Profiler's potential. This is a fantastic opportunity for some key development tooling to move beyond the command line and into a GUI that's far better suited for quickly extracting relevant information.
