Monday, February 16, 2026

 No battle plan survives contact with the enemy - triage revisited several years later

Several years ago, I wrote a blog post about the triage process we adopted at System Era, the company I work for.  Any plan, no matter how well it’s been crafted, can only really be measured by applying it to the real world and seeing how well it works.  So, after a hiatus that has been longer than I wanted, I am coming back to the blog with an update on where we are with it.  The last few years have been a good test, as we have been working on a new game and the volume of bugs going through triage has grown quite a bit.

So how did we do?

Let’s look at last year’s numbers for Starseeker, System Era’s next game, due out sometime this year.  As you can imagine, the year before release has seen a lot of development work that came with a lot of bugs, all of which go through and stress the triage process.  In previous years we operated on a more traditional triage process that required meetings to discuss how to prioritize, assign and schedule bugs, and we had reached a point where I implemented the five-minute rule.  That was a response to what often occurs in triage: discussion of the bug itself and how to fix it, or clarification within the design discipline of what they wanted.  I mention this because for the sake of this discussion I use a benchmark of five minutes per bug for triage by meeting.  Not all bugs lead to that kind of discussion, but they all require that people read the bug, discuss who it goes to and come to an agreement on fix versions and priorities.  On average, five minutes per bug seems reasonable.

In 2025 the QA team, both our internal team and our outsourced test team, wrote a little over 3,400 bugs.  A wee bit more, but to keep the math simple we will use that rounded number.  Applying the five-minutes-per-bug assumption, that would mean that in 2025 we should have had 283.3 hours of triage meetings.  Assuming that there are five people present in each meeting (a producer, a QA team member, an engineer, a designer and an artist), that comes to roughly 1,417 person-hours, from which we can calculate the cost of those triage meetings.  For System Era, the fully loaded cost of an employee, wages and benefits included, comes out to somewhere around $83-85 per hour.  At the low end of that range, those triage meetings would have cost us somewhere around $117,000.
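For anyone who wants to plug in their own numbers, here is a minimal sketch of that arithmetic.  It assumes the $83-85 figure is a per-person, per-hour cost (which is what makes the total land near $117k); the constant and function names are made up for illustration.

```python
# Back-of-the-envelope cost of triage-by-meeting, using the figures quoted above.
BUGS_PER_YEAR = 3_400         # rounded 2025 bug count
MINUTES_PER_BUG = 5           # the five-minute benchmark
ATTENDEES = 5                 # producer, QA, engineer, designer, artist
HOURLY_COST_RANGE = (83, 85)  # assumption: fully loaded USD per person-hour


def meeting_hours(bugs: int, minutes_per_bug: float) -> float:
    """Clock hours spent in triage meetings."""
    return bugs * minutes_per_bug / 60


def meeting_cost(bugs: int, minutes_per_bug: float,
                 attendees: int, hourly_cost: float) -> float:
    """Person-hour cost of those meetings at a given hourly rate."""
    return meeting_hours(bugs, minutes_per_bug) * attendees * hourly_cost


hours = meeting_hours(BUGS_PER_YEAR, MINUTES_PER_BUG)
person_hours = hours * ATTENDEES
low, high = (
    meeting_cost(BUGS_PER_YEAR, MINUTES_PER_BUG, ATTENDEES, rate)
    for rate in HOURLY_COST_RANGE
)

print(f"{hours:.1f} meeting hours")        # 283.3 meeting hours
print(f"{person_hours:.1f} person-hours")  # 1416.7 person-hours
print(f"${low:,.0f} to ${high:,.0f}")      # $117,583 to $120,417
```

Changing the attendee count or the minutes per bug shows quickly how sensitive the total is to those two assumptions.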

During 2025 we had exactly 0 bug triage meetings. 

Did we save $117k on meetings?  No.  What we did was spread that time and cost throughout the day by using an asynchronous process that brought in only those who needed to weigh in on the bug.  Normally the triage group would decide who to assign the bug to, how to prioritize it and what fix version to shoot for, along with some other data we use in our production process, like the Jira Epic the bug falls under, which is how production tracks both tasks and bugs related to specific work efforts.  Sometimes that did involve @ notifying people outside of that group.

In those targeted, asynchronous conversations we often did have lengthy discussions about what was going wrong technically, or whether we needed to make design decisions to help solve the issue.  Those conversations would have gone beyond our previous five-minute rule, but we could painlessly deal with them through Slack.  The $117k figure for meeting time is something we only worked out in retrospect, when we looked back at the numbers from the previous year.  At the time we were not feeling the impact of having meetings, and we were able to process sometimes dozens of bugs a day seemingly effortlessly and without disruption.  Not only that, but it was fast.

One of the disadvantages of waiting until you have enough bugs to call a triage meeting is that there is a delay between Perforce check-ins, testing, bug writing and getting that back to developers.  Because of the nature of our build process combined with the rapid, QA tester-initiated triage process, we were able to remove the lag.  I have seen bugs go into triage and get assigned out in a matter of minutes.  When we ran into blocking issues, the process allowed us to respond immediately without any special escalation process.  We had all the tools built into the process already.

In short, even when we had to scale the triage process it worked well.  No process is perfect though; there were areas where we had some friction, so let’s talk about those.

The use of Slack and notifying key people only when it’s time to triage can be a double-edged sword.  Slack notifications can be disruptive to concentration; we did get some feedback on that.  The fixes, however, are easy.  The first line of defense is how developers configure their notification settings.  Just because somebody hits you up on Slack, it does not mean that you must, or even should, drop everything to respond.  For some people this is a hard habit to break, and for them we suggested that they set Slack so that they can work uninterrupted by pings and pop-ups.  Secondly, we suggested that developers involved in the triage process reserve time at the start and end of the day to respond to the threads.  Our aspiration was to get bugs fully through triage within a one-business-day SLA, so it wasn’t necessary for everyone to drop everything to get them done.

But that’s also where the next pain point comes in.

When dozens of bugs a day are going through triage it is inevitable that some won’t get fully resolved.  Sometimes it’s because it is a tricky issue that has started a long conversation about what might be going wrong.  Sometimes it is because key people are out or not responding.  Other times it’s because the bug was sent to the Slack channel late on a Friday afternoon and just got overlooked when people returned to work on Monday.  And sometimes it’s because the QA tester who initiated the thread was busy writing the next handful of bugs that needed triage and forgot to follow up and ensure the process went all the way through to completion.  It’s a mixed bag of all those things.

But the reality is that the fix for all of that is pretty easy, much easier than holding meetings at least.

The way the process works is that the QA tester who started the triage thread is supposed to edit their comment and strike through all of the text once all of the metadata we as QA were looking for is set: the bug has an owner (or is backlogged with no specific assignment), a priority, a fix version and the other metadata the production team asked us to include to help them track the bug.
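That "is everything set?" check can be written down as a small predicate.  This is only a sketch: the `Bug` class and its field names are hypothetical stand-ins rather than a real Jira schema, and the Epic stands in for whatever extra metadata a production team asks for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bug:
    # Hypothetical field names, illustrative only; not a real Jira schema.
    key: str
    assignee: Optional[str] = None
    backlogged: bool = False
    priority: Optional[str] = None
    fix_version: Optional[str] = None
    epic: Optional[str] = None


def missing_triage_fields(bug: Bug) -> list[str]:
    """Return the names of the triage fields that still need to be set."""
    missing = []
    # A bug needs an owner unless it was deliberately backlogged unassigned.
    if not bug.assignee and not bug.backlogged:
        missing.append("assignee")
    if not bug.priority:
        missing.append("priority")
    if not bug.fix_version:
        missing.append("fix_version")
    if not bug.epic:
        missing.append("epic")
    return missing


def is_triaged(bug: Bug) -> bool:
    """True when the triage thread can be struck through."""
    return not missing_triage_fields(bug)
```

Returning the list of missing fields, rather than just a yes or no, makes it easier to tell the right people what is still outstanding on a thread.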

As the QA manager I would periodically scroll up through all of the posts in our triage channel.  Because they are threaded conversations, the only thing I would see are the top-level messages.  Anything with a strikethrough was completed, and the ones without a strikethrough were the ones that needed to be followed up on.  The follow-up was a mix of reminding QA testers to strike through the text and re-pinging @ notification groups or specific individuals.  It’s not my favorite way to spend my time, but it was a pretty minor task and a lot less labor intensive than traditional meetings.
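I did that scan by eye, but the same check is easy to sketch in code.  The assumptions here are mine: the top-level messages are already available as plain dictionaries with a `text` key (how they are fetched from Slack is out of scope), a completed message is one whose text is wrapped in single tildes, which is how Slack’s mrkdwn marks strikethrough, and the bug keys are invented.

```python
def _wrapped(s: str) -> bool:
    """True if s is one strikethrough span: ~like this~ with no inner tildes."""
    return len(s) > 2 and s[0] == "~" and s[-1] == "~" and "~" not in s[1:-1]


def is_struck_through(text: str) -> bool:
    """True if the whole message, or every non-blank line of it, is struck."""
    stripped = text.strip()
    if _wrapped(stripped):
        return True
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    return bool(lines) and all(_wrapped(ln) for ln in lines)


def needs_follow_up(messages: list[dict]) -> list[dict]:
    """Return the top-level messages that are not fully struck through."""
    return [m for m in messages if not is_struck_through(m["text"])]


# Hypothetical channel contents for illustration.
channel = [
    {"ts": "1", "text": "~BUG-101 crash on load, needs owner and fix version~"},
    {"ts": "2", "text": "BUG-102 texture flicker, needs priority"},
    {"ts": "3", "text": "~BUG-103 audio~ still waiting on fix version"},
]

for message in needs_follow_up(channel):
    print(message["text"])
```

A partially struck message, like the third one, is deliberately treated as unfinished, since only a fully struck top-level post signals that triage is complete.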

The final issue wasn’t so much a problem as it was a work effort we knew, going into the process, was going to require constant attention: keeping parity with how the production team organized our various work efforts.  That was mostly a matter of updating our Slack @ notification groups and our Confluence documentation.  We went through large stretches of the year without having to do much of this; it only became an issue when strike teams were adjusted due to changing needs.

System Era is a company of somewhere above 60 people, with a small number of outsourced partners and contractors.  Looking back at 2025, I was extremely happy with how well the triage process dealt with a higher volume of bugs than we had dealt with in previous years.  Out of curiosity, I do spend some time thinking about even higher volumes.  There are games and companies of much larger scale than ours; how would this scale to a Rockstar level of game production?  I don’t know for sure, but I think it could be done.  In the meantime, System Era is going to find out this year how the process handles the final push to get the game shipped.

But so far, so good.
