Quickly optimizing Python code with FunctionTrace
I recently did a quick optimization pass over glyphsLib, using FunctionTrace to improve performance by ~30% in under half an hour. This is writeup of my approach, which should be mostly extensible to optimizing other codebases. Hopefully this writeup gives you a sense for how productive non-sampled profilers such as FunctionTrace can be when it comes to optimizing code, as well as exploring a new codebase.
NOTE: This article includes high-resolution images, and to get the most out of it you'll need to view them at their original size. As such, it may not be an enjoyable experience on mobile.
A few important things before we start:
- This exploration was driven by a bug report submitted for FunctionTrace where it OOMed on noto-amalgamated after generating too large of a profile. This led me to optimize for number of function calls rather than strictly performance. Luckily these tend to be highly correlated in Python.
- I'd never seen or heard of this library before loading it in FunctionTrace. As a result, I didn't try to make broader algorithmic changes, and I attempted to make small targeted changes that were likely to be correct rather than more significant refactors.
- This was a quick hack-and-slash optimization session. No measurements were taken for intermediate optimizations, and my benchmark consisted of running the before/after code once each on an unloaded CPU via time — nothing involved with this process was rigorous!
Backstory
FunctionTrace is a low-overhead non-sampled profiler for Python (which I've written about before) that uses the Firefox Profiler as a GUI. Earlier this month I received a bug report saying that it was crashing with an out-of-memory error when run on noto-amalgamated, a project that uses Noto. After reproducing the issue, it became obvious that FunctionTrace on this tool generated too much data for the current version of the Firefox Profiler to handle. While there are some interesting options for solving this in the future, I wanted to unblock the immediate use-case. The majority of the profiler information FunctionTrace emits comes from recording information on every function call entry and exit, so if we could quickly reduce the number of function calls we'd be able to profile proportionally more of the program's execution.
Diving In
We've been given the following code to reproduce the issue:
$ cd /tmp
$ git clone --depth 1 https://github.com/googlefonts/noto-source/
$ git clone https://github.com/madig/noto-amalgamated.git
$ cd noto-amalgamated && python3 -m venv venv && . venv/bin/activate && pip install glyphsLib==6.0.0b4
$ functiontrace amalgamate-noto.py
Looking at htop, we can quickly see FunctionTrace consuming >10GB of RAM. As the Firefox Profiler runs in a browser and therefore is limited to a 32-bit heap, we know that it won't be able to work with profiles larger than 4GB. To profile this program, we'll need to only look at a short chunk of its execution. We'll be conservative and start with 10s:
$ timeout 10s functiontrace amalgamate-noto.py
...
[FunctionTrace] Wrote profile data to /tmp/noto-amalgamated-plain/functiontrace.json (1021.3 MB)
That looks like it should fit quite nicely into a 32-bit heap! Loading this into the Firefox Profiler gives us the following view:
FunctionTrace 101
A FunctionTrace profile stores the execution history of a program over time. For this article, we'll only be using the Stack Chart view, which allows you to easily explore that execution history. Each rectangle in the view represents a function call that occurred during our program's execution, and the width specifies how long the function ran for.
The stack chart behaves like a tree, with functions higher up having called functions below it. For example, exec() is the uppermost function, and as it was called to load the program and begin it's execution, it's the parent of all other functions. We're typically not interested in functions in the middle (vertically) of the chart — functions near the top tend to have names that can inform us of what code is running, while functions at the bottom tend to be ripe for optimization.
If you'd like to better understand FunctionTrace, there's a live demo on the FunctionTrace website you can play around with. Otherwise, let's dive in!
Our initial view of the profile looks pretty repetitive. We can see that there are repeated sections of load() calls followed by to_ufos(). These probably cover different conceptual pieces of the program, so we can optimize load() and to_ufos() separately.
load()
We start by picking an instance of load() and looking at it, and quickly realize that it's also roughly composed of two repeating subsections:
This function is spending lots of time in various deepcopy() calls, as well as in _parse_nodes_dict(). It's probably hard to optimize deepcopy() usage without knowing more about the program, but let's look closer at _parse_nodes_dict():
There's some initialization, then series of repeated <lambda>, __init__, read, and append calls. This looks like a potential target, so we'll dive into the source code for this function, which is actually part of glyphsLib.
def _parse_nodes_dict(self, parser, d):
for x in d:
if parser.format_version == 3:
self.nodes.append(GSNode().read_v3(x))
else:
self.nodes.append(GSNode().read(x))
That explains the sequence of append() calls! A loop of append()s looks a lot like what extend() in the Python stdlib does, so let's quickly replace that:
def _parse_nodes_dict(self, parser, d):
if parser.format_version == 3:
self.nodes.extend(GSNode().read_v3(x) for x in d)
else:
self.nodes.extend(GSNode().read(x) for x in d)
Looking a little closer, we can also explain the <lambda> calls — self.nodes is a computed property:
nodes = property(
lambda self: PathNodesProxy(self),
lambda self, value: PathNodesProxy(self).setter(value),
)
This appears to be a common pattern in the glyphsLib codebase, and we'll get a few easy wins from removing it. In this case, we'd expect it to no longer show up in our profile since we're not loading self.nodes in a loop anymore, and after profiling again we can see that our assumption is correct.
GSNode.read()
Now we're spending our time in a single extend() call that's repeatedly running read(), which spends a fair amount of time calling into parse_float_or_int(). When we look at read()'s source, we can see that it must be doing something like parsing the input into (x, y) coordinates:
def read(self, line):
m = self._PLIST_VALUE_RE.match(line).groups()
self.position = Point(parse_float_or_int(m[0]), parse_float_or_int(m[1]))
self.type = m[2].lower()
self.smooth = bool(m[3])
if m[4] is not None and len(m[4]) > 0:
value = self._decode_dict_as_string(m[4])
parser = Parser()
self._userData = parser.parse(value)
return self
This looks too much like interesting logic for me to try to change most of this, but parse_float_or_int() stood out in the profile.
def parse_float_or_int(value_string):
v = float(value_string)
if v.is_integer():
return int(v)
return v
That is_integer() call looks expensive! Let's avoid it (and the float conversion) when possible by using exceptions instead. This will probably be a bit more expensive if the input is a float, but should be significantly cheaper if it is an int, as well as removing a function call.
def parse_float_or_int(value_string):
try:
return int(value_string)
except ValueError:
return float(value_string)
Looking back at GSNode.read(), we'd now expect it to be pretty efficient. But why does the profile show isinstance() calls?
@property
def position(self):
return self._position
@position.setter
def position(self, value):
if not isinstance(value, Point):
value = Point(value[0], value[1])
self._position = value
There are those pesky computed properties again. Since we know value will be a Point, we can write to the underlying self._position instead:
def read(self, line):
...
self._position = Point(parse_float_or_int(m[0]), parse_float_or_int(m[1]))
...
After applying these changes, the profile for load() looks much smoother:
to_ufos()
We now switch our attention to the other half of this program, to_ufos(). It appears to be spending most of its time in to_ufo_glyph().
The _ensureMasterLayers() function stands out as taking a long time, as well as repeatedly being called from to_ufo_glyph().
From the instances of <lambda> in the profile, we could guess that it's spending lots of time looking up computed properties, and looking at the code confirms that.
def _ensureMasterLayers(self):
# Ensure existence of master-linked layers (even for iteration, len() etc.)
# if accidentally deleted
if not self._owner.parent:
return
for master in self._owner.parent.masters:
if self._owner.parent.masters[master.id] is None:
newLayer = GSLayer()
newLayer.associatedMasterId = master.id
newLayer.layerId = master.id
self._owner._setupLayer(newLayer, master.id)
self.__setitem__(master.id, newLayer)
A quick skim makes me think self._owner.parent.masters doesn't need to be computed this often, so we can hoist it out of the loop, then generate a new profile.
def _ensureMasterLayers(self):
# Ensure existence of master-linked layers (even for iteration, len() etc.)
# if accidentally deleted
if not self._owner.parent:
return
masters = self._owner.parent.masters
for master in masters:
if masters[master.id] is None:
newLayer = GSLayer()
newLayer.associatedMasterId = master.id
newLayer.layerId = master.id
self._owner._setupLayer(newLayer, master.id)
self.__setitem__(master.id, newLayer)
_ensureMasterLayers() now looks much smaller compared to the other function calls, and is itself making fewer function calls.
We've already made a substantial improvement on how busy the profile looks, and we're running out of obvious next steps. It's time to benchmark!
Before: python3 amalgamate-noto.py 1001.63s user 8.12s system 99% cpu 16:50.58 total
After: python3 amalgamate-noto.py 875.36s user 7.48s system 99% cpu 14:42.88 total
Not bad, given that we only spent ~15 minutes, and changed 11 lines of code in a codebase we've never seen before!
Just a bit more
Normally I would call it quits here, but seeing a function like _ensureMasterLayers() being called so frequently (in every invocation of __next__()) gives me pause. Looking at the codebase, we can see that it is called at the start of many other functions for its class (GlyphLayerProxy). That's a pretty expensive way to ensure invariants which are likely already true!
It may not be safe to remove this function entirely, but there are likely workarounds. We could elide this check in release builds, or only run it after we know invariants may have been broken. Since the owner of the code could make an optimization like this, for now I'm happy to do a hacky version and remove _ensureMasterLayers from all of the functions but GlyphLayerProxy.append(). Re-running our benchmark, we see that our hunch paid off.
Before: python3 amalgamate-noto.py 1001.63s user 8.12s system 99% cpu 16:50.58 total
After: python3 amalgamate-noto.py 780.09s user 8.27s system 99% cpu 13:09.09 total
Some of our changes might not be production-worthy yet, but our quick pass with a non-sampled profiler has at least shown that there's >25% low-hanging fruit for the maintainer to pluck!