No battle plan survives contact with the enemy - triage revisited several years later
Several years ago, I wrote a blog post about the triage process we adopted at
System Era, the company I work for. No matter how well a plan has been crafted,
you only really measure it by applying it to the real world and seeing how well
it works.
So, after a hiatus that has been longer than I wanted, I am coming back to the
blog to give an update on where we are with that process. The last few years
have been a good test for it: we have been working on a new game, and the volume
of bugs going through triage has grown quite a bit.
So how did we do?
Let’s look at last year’s numbers for Starseeker, System Era’s next game, due
out sometime this year. As you can imagine, the year before release has seen a
lot of development work that came with a lot of bugs, all of which go through
and stress the triage process. In previous years we operated on a more
traditional triage process that required meetings to discuss how to prioritize,
assign and schedule bugs, and we had reached a point where I implemented the
five-minute rule. That was a response to what often occurs in triage: discussion
of the bug itself and how to fix it, or attempts to clarify within the design
discipline what they wanted.
I mention this because, for the sake of this discussion, I use a benchmark of
five minutes per bug for triage by meeting. Not all bugs lead to that kind of
discussion, but they do require that people read the bug, discuss who it goes
to, and come to an agreement on fix versions and priorities. On average, five
minutes per bug seems reasonable.
In 2025 the QA team, both our internal team and our outsourced test team, wrote
a little over 3,400 bugs. It was a wee bit more, but to keep the math simple we
will use that rounded number. Applying the five-minute-per-bug assumption, that
would mean that in 2025 we should have had 283.3 hours of triage meetings.
Assume that there are five people present in each meeting (a producer, a QA team
member, an engineer, a designer and an artist) and we can calculate the
person-hour cost of those triage meetings: roughly 1,417 person-hours. For
System Era, the fully loaded cost of an employee, wages and benefits included,
comes out to somewhere around $83-85 per hour. Those triage meetings would have
cost us somewhere around $117,000.
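For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. The inputs are the figures quoted in this post; the variable names are just illustrative, and the $83-85 range is treated as an hourly cost.

```python
# Back-of-the-envelope cost of triage-by-meeting, using the figures from this post.
BUGS_PER_YEAR = 3_400      # rounded 2025 bug count
MINUTES_PER_BUG = 5        # five-minute benchmark per bug
ATTENDEES = 5              # producer, QA, engineer, designer, artist
HOURLY_COST_LOW = 83       # fully loaded cost per person-hour, low estimate (USD)
HOURLY_COST_HIGH = 85      # fully loaded cost per person-hour, high estimate (USD)

# Total hours the meetings themselves would have run.
meeting_hours = BUGS_PER_YEAR * MINUTES_PER_BUG / 60

# Every attendee spends that time, so multiply by headcount.
person_hours = meeting_hours * ATTENDEES

# Cost at each end of the hourly range.
cost_low = person_hours * HOURLY_COST_LOW
cost_high = person_hours * HOURLY_COST_HIGH

print(f"Meeting hours: {meeting_hours:.1f}")
print(f"Person-hours:  {person_hours:.1f}")
print(f"Cost range:    ${cost_low:,.0f} to ${cost_high:,.0f}")
```

The low end of the hourly range gives the roughly $117k figure used in the rest of this post; the high end lands closer to $120k.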
During 2025 we had exactly 0 bug triage meetings.
Did we save $117k on meetings? No. What we did was spread that time and cost
throughout the day by using an asynchronous process that brought in only those
who needed to weigh in on the bug. Sometimes that did involve @ notifying people
outside of the triage group that would normally have decided who to assign the
bug to, how to prioritize it and what fix version to shoot for. The triage group
also sets some other data we use in our production process, like the Jira Epic
the bug falls under, which is how production tracks both tasks and bugs related
to specific work efforts.
In those targeted, asynchronous conversations we often did have lengthy
discussions about what was going wrong technically, or whether we needed to make
design decisions to help solve the issue. Those conversations would have gone
beyond our previous five-minute rule, but we could painlessly deal with them
through Slack.
The $117k figure for meeting time is something we looked at in retrospect, when
we reviewed the numbers from the previous year. At the time we were not feeling
the impact of having meetings, and we were able to process sometimes dozens of
bugs a day seemingly effortlessly and without disruption. Not only that, but it
was fast.
One of the disadvantages of waiting until you have enough bugs to call a triage
meeting is the delay between Perforce check-ins, testing, bug writing and
getting the results back to developers. Because of the nature of our build
process, combined with the rapid, QA tester-initiated triage process, we were
able to remove that lag. I have seen bugs go into triage and get assigned out in
a matter of minutes. When we ran into blocking issues, the process allowed us to
respond immediately without any special escalation process. We had all the tools
built into the process already.
In short, even when we had to scale the triage process it
worked well. No process is perfect though;
there were areas where we had some friction so let’s talk about that.
The use of Slack and notifying key people only when it’s time to triage can be
a double-edged sword. Slack
notifications can be disruptive to concentration; we did get some feedback on
that. The fixes, however, are easy. The first line of defense for that is how
developers set their notification settings.
Just because somebody hits you up on Slack, it does not mean that you
must, or even should, drop everything to respond. For some people this is a hard habit to break
and for them we suggested that they set Slack so that they can work uninterrupted
by pings and pop-ups. Secondly, we suggested
that developers involved in the triage process reserve time at the start and
end of the day to respond to the threads.
Our aspiration was to get bugs fully through triage within a one-business-day
SLA, so it wasn’t necessary for everyone to drop everything to get them done.
But that’s also where the next pain point comes in.
When dozens of bugs a day are going through triage, it is inevitable that some
won’t get fully resolved. Sometimes it’s because a tricky issue has started a
long conversation about what might be going wrong. Sometimes it’s because key
people are out or not responding. Other times it’s because the bug was sent to
the Slack channel late on a Friday afternoon and just got overlooked when people
returned to work on Monday. And sometimes it’s because the QA tester who
initiated the thread did not follow up on the message because they were busy
writing the next handful of bugs that needed triage and forgot to ensure the
process went all the way through to completion. It’s a mixed bag of all those
things.
But the reality is that the fix for all of that is pretty
easy, much easier than holding meetings at least.
The way the process works is that a bug is done with triage once all of the
metadata we as QA are looking for is set: an owner (or a backlog status with no
specific assignment), a priority, a fix version and the other metadata the
production team asked us to include to help them track the bug. At that point
the QA tester who started the triage thread is supposed to edit their original
post and strike through all of the text.
As the QA manager I would periodically scroll up through all of the posts in
our triage channel. Because they are threaded conversations, the only thing I
would see are the top-level messages. Anything with a strikethrough was
completed, and the ones without a strikethrough were the ones that needed to be
followed up on. The follow-up was a mix of reminding QA testers to strike
through the text and re-pinging @ notification groups or specific individuals.
It’s not my favorite way to spend my time, but it was a pretty minor task and a
lot less labor intensive than traditional meetings.
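That manual sweep boils down to filtering top-level posts for the ones that are not yet struck through. Here is a minimal sketch of the idea, assuming the posts have already been copied out of the channel as plain text and that a struck-through post is wrapped in tildes, as in Slack's `~text~` markup. The `TriagePost` type and the function names are hypothetical and do not come from any real tooling or from the Slack API.

```python
from dataclasses import dataclass


@dataclass
class TriagePost:
    """A top-level post in the triage channel (hypothetical structure)."""
    author: str
    text: str


def is_struck_through(text: str) -> bool:
    """Treat a post as done if its whole text is wrapped in tildes (assumed markup)."""
    stripped = text.strip()
    return len(stripped) > 2 and stripped.startswith("~") and stripped.endswith("~")


def needs_follow_up(posts: list[TriagePost]) -> list[TriagePost]:
    """Return the posts that have not been struck through yet."""
    return [post for post in posts if not is_struck_through(post.text)]


if __name__ == "__main__":
    sample = [
        TriagePost("tester_a", "~Crash when opening inventory, needs owner~"),
        TriagePost("tester_b", "Texture flicker on landing pad, needs fix version"),
    ]
    for post in needs_follow_up(sample):
        print(f"Follow up with {post.author}: {post.text}")
```

A sweep like this only finds the open threads; deciding whether to nudge the tester or re-ping a notification group is still a judgment call.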
The final issue wasn’t so much a problem as it was a work effort we knew going
in would require constant attention: keeping parity with how the production team
organized our various work efforts. That was mostly a matter of updating our
Slack @ notification groups and our Confluence documentation. We went through
large stretches of the year without having to do much of this; it only became an
issue when strike teams were adjusted due to changing needs.
System Era is a company of somewhere above 60 people, with a small number of
outsourced partners and contractors. Looking back at 2025, I was extremely happy
with how well the triage process dealt with a higher volume of bugs than we had
dealt with in previous years. Out of curiosity, I do spend some time thinking
about even higher volumes. There are games and companies of much larger scale
than we are. How would this scale to a Rockstar level of game production? I
don’t know for sure, but I think it could be done.
In the meantime, System Era is going to find out this year how the process
handles the final push to get the game shipped.
But so far, so good.