Monday, February 16, 2026

 No battle plan survives contact with the enemy - triage revisited several years later

Several years ago, I wrote a blog post about the triage process we adopted at System Era, the company I work for.  Any plan, no matter how well it’s been crafted, can only really be measured by applying it to the real world and seeing how well it works.  So, after a hiatus that has been longer than I wanted, I am coming back to the blog to give an update on where we are with it.  The last few years have been a good test: we have been working on a new game, and the volume of bugs going through the process has grown quite a bit. 

So how did we do?

Let’s look at last year’s numbers for Starseeker, System Era’s next game, due out sometime this year.  As you can imagine, the year before release has seen a lot of development work, and with it a lot of bugs, all of which go through and stress the triage process.  In previous years, when we operated on a more traditional triage process that required meetings to discuss how to prioritize, assign and schedule bugs, we had reached a point where I implemented the five-minute rule.  That was a response to what often occurs in triage: discussion drifting into the bug itself, how to fix it, or clarifying within the design discipline what they wanted.  I mention this because, for the sake of this discussion, I use a benchmark of five minutes per bug for triage by meeting.  Not all bugs lead to that kind of discussion, but they all require that people read the bug, discuss who it goes to, and come to an agreement on fix versions and priorities.  On average, five minutes per bug seems reasonable.

In 2025 the QA team, both our internal team and our outsourced test team, wrote a little over 3,400 bugs.  A wee bit more, actually, but for purposes of keeping the math simple we will use that rounded number.  Applying a five-minutes-per-bug assumption, in 2025 we would have had 283.3 hours of triage meetings.  Assuming five people present in the meeting (a producer, a QA team member, an engineer, a designer and an artist), we can then calculate the person-hour cost of those triage meetings.  For System Era, the fully loaded cost of an employee for wages and benefits comes out to somewhere around $83 to $85 per hour.  Those triage meetings would have cost us somewhere around $117,000.
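The arithmetic above can be sketched as a quick back-of-the-envelope calculation; the figures are the post's own (3,400 bugs, the five-minute rule, five attendees, and the low end of the $83–85 fully loaded hourly estimate):

```python
# Back-of-the-envelope cost of triage-by-meeting, using the figures above.
BUGS_PER_YEAR = 3400      # rounded 2025 bug count
MINUTES_PER_BUG = 5       # the five-minute rule
ATTENDEES = 5             # producer, QA, engineer, designer, artist
HOURLY_COST = 83          # approx. fully loaded cost per person-hour (USD)

meeting_hours = BUGS_PER_YEAR * MINUTES_PER_BUG / 60
person_hours = meeting_hours * ATTENDEES
total_cost = person_hours * HOURLY_COST

print(f"{meeting_hours:.1f} meeting hours")  # 283.3
print(f"{person_hours:.1f} person-hours")    # 1416.7
print(f"${total_cost:,.0f}")                 # $117,583
```

At the top of the $83–85 range the total lands closer to $120k, so "somewhere around $117,000" is the conservative end of the estimate.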

During 2025 we had exactly 0 bug triage meetings. 

Did we save $117k on meetings?  No.  What we did was spread that time and cost throughout the day by using an asynchronous process that brought in only those who needed to weigh in on a given bug.  Sometimes that meant @ notifying people outside the group that would normally have decided who to assign it to, how to prioritize it and what fix version to shoot for, along with some other data we use in our production process, like the Jira Epic the bug falls under, which is how production tracks both tasks and bugs related to specific work efforts.

In those targeted, asynchronous conversations we often did have lengthy discussions about what was going wrong technically, or whether we needed to make design decisions to help solve the issue.  Those conversations would have blown past our previous five-minute rule, but we could deal with them painlessly through Slack.  The $117k figure for meeting time is something we only calculated in retrospect, when we looked at the previous year’s numbers; at the time we were not feeling the impact of having meetings and were able to process sometimes dozens of bugs a day, seemingly effortlessly and without disruption.  Not only that, but it was fast.

One of the disadvantages of waiting until you have enough bugs to call a triage meeting is the delay between Perforce check-ins, testing, bug writing and getting that feedback to developers.  Because of the nature of our build process, combined with the rapid, QA-tester-initiated triage process, we were able to remove that lag.  I have seen bugs go into triage and get assigned out in a matter of minutes.  When we ran into blocking issues the process allowed us to respond immediately; it didn’t require any special escalation process.  We had all the tools built into the process already.

In short, even when we had to scale the triage process it worked well.  No process is perfect though; there were areas where we had some friction so let’s talk about that.

The use of Slack and notifying key people only when it’s time to triage can be a double-edged sword.  Slack notifications can be disruptive to concentration, and we did get some feedback on that.  The fixes, however, are easy.  The first line of defense is how developers set their notification settings.  Just because somebody hits you up on Slack does not mean that you must, or even should, drop everything to respond.  For some people this is a hard habit to break, and for them we suggested setting Slack so that they can work uninterrupted by pings and pop-ups.  Secondly, we suggested that developers involved in the triage process reserve time at the start and end of the day to respond to the threads.  We had an aspiration to get our bugs fully through triage within a one-business-day SLA; it wasn’t necessary for everyone to drop everything to get them done.

But that’s also where the next pain point comes in.

When dozens of bugs a day are going through triage it is inevitable that some won’t get fully resolved.  Sometimes it’s a tricky issue that has started a long conversation about what might be going wrong.  Sometimes key people are out or not responding.  Other times the bug was sent to the Slack channel late on a Friday afternoon and simply got overlooked when people returned to work on Monday.  And sometimes the QA tester who initiated the thread hadn’t followed up because they were busy writing the next handful of bugs that needed triage and forgot to see the process all the way through to completion.  It’s a mixed bag of all those things.

But the reality is that the fix for all of that is pretty easy, much easier than holding meetings at least.

The way the process works, once all of the metadata we as QA were looking for is set (the bug has an owner or is backlogged with no specific assignment, plus a priority, a fix version and the other metadata the production team asked us to include to help them track the bug), the QA tester who started the triage thread is supposed to edit their comment and strike through all of the text.

As the QA manager I would periodically scroll up through all of the posts in our triage channel.  Because they are threaded conversations, the only thing I would see are the top-level messages.  Anything with a strikethrough was completed, and the ones without were the ones that needed follow-up.  That follow-up was a mixed bag of reminding QA testers to strike through the text or re-pinging @ notification groups or specific individuals.  It’s not my favorite way to spend my time, but it was a pretty minor task and a lot less labor intensive than traditional meetings.
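That sweep is mechanical enough that it could even be scripted.  A minimal sketch, assuming the channel's top-level messages have already been fetched (e.g. via Slack's `conversations.history` API) and that a completed post is one whose whole text is struck through (`~like this~` in Slack's mrkdwn); the dict shape here is a simplified stand-in, not Slack's full message schema:

```python
# Sketch: find triage threads still awaiting follow-up.
# A post counts as "done" once the QA tester has struck through its text.

def is_triaged(text: str) -> bool:
    """A post is considered complete when its text is struck through (~...~)."""
    stripped = text.strip()
    return len(stripped) >= 2 and stripped.startswith("~") and stripped.endswith("~")

def pending_threads(messages):
    """Return the top-level posts that still need follow-up."""
    return [m for m in messages if not is_triaged(m["text"])]

# Illustrative data; bug keys are made up.
posts = [
    {"ts": "1", "text": "~AST-101 Crash on save - triaged~"},
    {"ts": "2", "text": "AST-102 Terrain seams after load"},
]
print([m["ts"] for m in pending_threads(posts)])  # ['2']
```

In practice the manual scroll was cheap enough that we never needed to automate it, but the point stands: the channel itself is the triage dashboard.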

The final issue wasn’t so much a problem as a work effort we knew going in would require constant attention: keeping parity with how the production team organized our various work efforts.  That was mostly a matter of updating our Slack @ notification groups and our Confluence documentation.  We went through large stretches of the year without having to do much of this; it only became an issue when strike teams were adjusted due to changing needs. 

System Era is a company of somewhere over 60 people, with a small number of outsourced partners and contractors.  Looking back at 2025, I was extremely happy with how well the triage process dealt with a higher volume of bugs than we had seen in previous years.  Out of curiosity, I do spend some time thinking about even higher volumes.  There are games and companies of much larger scale than we are; how would this scale to a Rockstar level of game production?  I don’t know for sure, but I think it could be done.  In the meantime, System Era is going to find out this year how it manages the final push to get the game shipped. 

But so far, so good.

Tuesday, July 26, 2022

 Rethinking the Triage Process in the Age of Covid

    

    What our industry has gone through these last few years of Covid hardly requires explanation.  Once busy and bustling offices have been turned into ghost towns, and even now, as the worst of it is behind us, the return to office has been slow and often resisted by those who have found working from home a better way to be productive.  In that time, entire organizations have sprung up with distributed teams and no office space to go back to.  The writing is on the wall: distributed work forces are here to stay.

    For my company, System Era Softworks, publishers of the indie game Astroneer, that has been our story: we found ourselves working in a whole new fashion once we left the office to work from home, and our processes were not always the best solution for the new reality.  There is a proverb that has been with us since the ancient world: “necessity is the mother of invention”.  In response to the challenges of working from home we were forced by necessity to look for new solutions.  The one we found, and that I want to explain in this post, is one that we will keep whether or not we all return to the office.

    The specific challenge we faced was finding a better process for triaging bugs.  

    The solution we came up with offers some key benefits: it’s asynchronous, eliminates meetings, reduces management overhead, and empowers our QA Analysts to initiate the triage process once they have written a bug and to work directly with the relevant stakeholders, which reduces distraction for those who don’t need to be involved.

    Throughout my entire career as a QA professional, triaging bugs had followed a consistent format.  It was typically a semi-regular meeting of stakeholders.  We'd go through a list of bugs and for each make some decisions about what to do with it (i.e., what release we want to target a fix for, the sprint we’ll fix it in, who the bug will be assigned to and, of course, as the word triage implies, the relative importance of each bug).  Historically this has been done in a sit-down meeting with representatives from key disciplines, for example Production, Engineering and of course QA.  It was easy enough to turn those sit-down meetings into virtual ones once we started working from home, but as time went on it became clear that some additional challenges had presented themselves.    

    The most obvious challenge that prompted us to look at the big picture and search for a solution was scheduling; it had become harder in general.  Meeting times had become scarcer because, while our core hours remained unchanged, we found we had more meetings on the calendar as conversations that could once be had deskside turned into scheduled meetings.  Time slots with availability for everyone became more difficult to identify, which pushed meetings out further and slowed the delivery of bugs to Engineers, Designers and Artists.  Finally, when we did put another meeting on people's calendars, it contributed to their own issues with time management, balancing meeting time against their own IC work. 

    So, what was the answer, could we just get rid of the constant triage meetings?  As it turns out, we could.  

    We leaned into the communication tools we were already using, specifically Slack and made this an asynchronous process.  In full disclosure, we do still have traditional triage meetings for bugs that are found in the retail version of our game – those are discovered less frequently and they are generally minor issues unless Customer Service or the Community team tells us otherwise.  But for the majority of bugs we started triaging constantly and asynchronously.  As a result we were able to eliminate meetings for most of the bugs coming in.

    So, what was the next step?  We looked at who really needed to be involved in the triage process.

    System Era had evolved over the years from a small team where a single stakeholder could speak for their entire discipline to a larger organization where we had created multiple strike teams working on features across multiple releases each with their own discipline leads.  As the single stakeholder for each discipline no longer worked for us, the solution was to look at triage as a process that was targeted to each work effort. 

    As we looked at triage without meetings and a more defined list of stakeholders we realized that we were using a QA management process that no longer made sense.  Typically somebody in QA would oversee the list of bugs requiring triage and would schedule triage meetings as appropriate, but without meetings we no longer needed that.  Somebody had to start the triage process and the solution was obvious.  Each strike team has an embedded QA Analyst.  By empowering them to triage bugs directly with their strike teams we eliminated most of the management overhead (except for monitoring the bugs already in retail).  

    So now that we knew we didn't want to use meetings, had a reduced list of participants and had a smarter workflow for the bug pipeline we defined the process.  Let me describe how it works.  

    We created a Slack channel called #bugs_triage for the singular purpose of triaging bugs.  When a QA Analyst has finalized writing up a bug in Jira, they start a new thread in that channel with the bug title, a link to the Jira bug and a brief comment that includes any important information about the bug not obvious from the title alone, as well as the QA recommendation for priority.  Most importantly, that comment includes @ notifications for the relevant producer, designer, engineer or any other discipline lead, like an artist or audio engineer, that needs to be involved.

    Once posted, QA neither expects nor requires an immediate response; this is the asynchronous process at work.  Triage can wait until the people notified find a time that works best for them to check their Slack notifications and address the bug.  When they respond, they do it in a threaded reply.  That keeps the #bugs_triage channel manageable and readable, since it only shows top-level comments for each bug.  In those threaded conversations we talk about what we’d like to do with the bug, giving stakeholders a chance to ask for more information as well as making the key decisions about priority, target fix, sprint and assignee, the same way we would if we were all in a meeting together.

    Once consensus has been reached, the QA team member updates the bug in Jira and closes out the threaded conversation by editing the original comment and putting a strikethrough on its text.  Anyone looking into the channel can see which bugs have been triaged and which are still waiting for decisions.  In general, we treat these posts as having a 24-hour SLA.  Occasionally a QA team member needs to prompt somebody for follow-up, but on the whole we’ve had no issues taking care of bugs much faster than with the traditional method we were using.  Bugs have gone from published to accepted, assigned and all relevant metadata updated in as little as 15 minutes.  Even if it takes 24 hours, it is still faster than the results we got from periodic meetings.
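The shape of that initial post is simple enough to sketch.  This is purely illustrative; the function, field names, bug key and URL below are hypothetical stand-ins, not an actual internal tool, and the real posts are written by hand in Slack:

```python
# Sketch: composing the initial #bugs_triage post described above.

def triage_post(bug_key: str, title: str, url: str, note: str,
                priority: str, mentions: list) -> str:
    """Render a top-level triage message: title, Jira link, context,
    QA's recommended priority, and @ notifications for stakeholders."""
    tags = " ".join("@" + m for m in mentions)
    return (f"*{bug_key}: {title}*\n{url}\n{note}\n"
            f"QA recommended priority: {priority}\n{tags}")

print(triage_post(
    "AST-102", "Terrain seams after load",
    "https://example.atlassian.net/browse/AST-102",
    "Repros on all saves created before the latest build.",
    "High", ["ao_terrain"]))
```

Everything after that first message happens in the thread, so the channel itself stays a clean list of one post per bug.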

    The asynchronous nature of the process is its most important feature: we have minimized disruptions by being selective about who we notify and by not taking up a block of their time for a meeting.  A close runner-up in importance is the self-service nature of the process.  It is initiated by the QA Analysts working with each strike team, which has decentralized how QA as an organization manages the flow of bugs and reduced the work involved.

    No process is perfect the first time it’s implemented.  While the process is simple enough, and it took little time to teach our team members how it works, we have had a few minor pain points along the way.  Specifically, they turned out to be issues with the @ notifications we used in our bug triage posts.  The problem came from occasional confusion about who the relevant stakeholders were, and from adding too many people in an abundance of caution.  That worked against two of our stated goals: QA Analysts need absolute clarity on who to notify if they are to manage the process completely independently, and notifying too many people undermines the goal of reducing distractions.

    The mid-stream modification that addressed those issues was to go back and work with our partners to build a list for each strike team that we could document and give to the QA Analysts.  Doing that removed the need for judgment calls.  Again, we leaned on the tools we were already using and created special Slack user groups for each strike team.  We maintain a Confluence page that lists each strike team, the people responsible for triage and the name of the Slack user group.  This is easy to update every time we spin up a new strike team or shut one down when the feature is completed.  The user groups have an easy-to-remember format: ao_<striketeam name>, where ao stands for “area owners” and the strike team name matches the working name of the feature.  The membership of these groups is kept intentionally small.  Typically it will be the Producer assigned to the strike team and one or two other owners, in practice usually an Engineer and sometimes a Designer; that decision is made by the strike team.  When it becomes necessary to consult somebody else in making a triage decision, a member of the ao_ group will @ notify the right people.  This small adjustment has eliminated any doubt on the part of the QA Analyst when posting a bug and eliminated unnecessary notifications. 
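The naming convention is deliberately mechanical, which is what makes it easy for QA Analysts to remember.  A minimal sketch; the normalization rules here (lowercasing, spaces to underscores) are assumptions, since the post only specifies the `ao_` prefix and that the name matches the feature's working name:

```python
# Sketch of the ao_<striketeam> handle convention described above.
# "ao" stands for "area owners"; the suffix is the feature's working name.

def area_owner_group(strike_team: str) -> str:
    """Build the Slack user-group handle for a strike team."""
    return "ao_" + strike_team.strip().lower().replace(" ", "_")

print(area_owner_group("Terrain Tools"))  # ao_terrain_tools
```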

    A few final thoughts as I close this post out.  I think a good process can make game development better.  A process that serves a need is a good process.  But we need to remember that very important point: process serves us, and as our needs change the process should change.  It's important to constantly rethink how you do things to ensure that you have the best process possible.  I hope that, if nothing else, this post gives my industry colleagues something to consider about how they perform bug triage.  What I've described here might be an ideal solution you have not considered.  It may also motivate you to rethink what you are currently doing for bug triage and help you craft a better solution tailored to your needs.  Unless you occasionally rethink how things are done, you are serving process...not the other way around.
