Tuesday, July 26, 2022

 Rethinking the Triage Process in the Age of Covid

    

    It doesn’t take much explanation to describe what our industry has gone through during these last few years of Covid.  Once busy, bustling offices turned into ghost towns, and even now that the worst is behind us, the return to the office has been slow and often resisted by those who have found working from home to be a more productive way to work.  In that time, entire organizations have sprung up with distributed teams and no office space to go back to.  The writing is on the wall: distributed workforces are here to stay.

    For my company, System Era Softworks, publisher of the indie game Astroneer, that has been our story.  We found ourselves working in a whole new fashion once we left the office to work from home, and our processes were not always the best fit for the new reality we found ourselves in.  There is a proverb that has been with us since the ancient world: “necessity is the mother of invention”.  In response to the challenges of working from home, we were forced by necessity to look for new solutions.  The one I want to explain in this post is one that we will keep whether or not we all return to the office.

    The specific challenge we faced was finding a better process for triaging bugs.  

    The solution we came up with offers some key benefits: it’s asynchronous, it eliminates meetings, it reduces management overhead, and it empowers our QA Analysts to initiate the triage process as soon as they have written a bug and to work directly with the relevant stakeholders, which reduces distraction for those who don’t need to be involved in the process.

    Throughout my entire career as a QA professional, bug triage has followed a consistent format.  It was typically a semi-regular meeting of stakeholders.  We’d go through a list of bugs and, for each one, make some decisions about what to do with it (i.e., which release we want to target for a fix, which sprint we’ll fix it in, who the bug will be assigned to and, as the word triage implies, what the relative importance of each bug is).  Historically this has been done in a sit-down meeting with representatives from key disciplines, for example Production, Engineering and of course QA.  It was easy enough to turn those sit-down meetings into virtual ones once we started working from home, but as time went on it became clear that some additional challenges had presented themselves.

    The most obvious challenge, and the one that prompted us to look at the big picture and search for a solution, was scheduling, which had become harder in general.  Meeting times had become scarcer because, while our core hours remained unchanged, we had more meetings on the calendar: conversations that used to happen deskside were turning into scheduled meetings.  Time slots when everyone was available became more difficult to find, which pushed triage meetings out further and slowed the delivery of bugs to Engineers, Designers and Artists.  Finally, every additional meeting we put on people’s calendars added to their own time management problems, as they tried to balance meeting time with their own IC work.

    So, what was the answer?  Could we just get rid of the constant triage meetings?  As it turns out, we could.

    We leaned into the communication tools we were already using, specifically Slack, and made this an asynchronous process.  In full disclosure, we do still have traditional triage meetings for bugs that are found in the retail version of our game; those are discovered less frequently and are generally minor issues unless Customer Service or the Community team tells us otherwise.  But for the majority of bugs, we started triaging continuously and asynchronously.  As a result, we were able to eliminate meetings for most of the bugs coming in.

    So, what was the next step?  We looked at who really needed to be involved in the triage process.

    System Era had evolved over the years from a small team, where a single stakeholder could speak for their entire discipline, to a larger organization with multiple strike teams working on features across multiple releases, each with their own discipline leads.  Since having a single stakeholder per discipline no longer worked for us, the solution was to treat triage as a process targeted to each work effort.

    As we looked at triage without meetings and with a more defined list of stakeholders, we realized that we were using a QA management process that no longer made sense.  Typically somebody in QA would oversee the list of bugs requiring triage and would schedule triage meetings as appropriate, but without meetings we no longer needed that.  Somebody still had to start the triage process, and the solution was obvious: each strike team has an embedded QA Analyst.  By empowering them to triage bugs directly with their strike teams, we eliminated most of the management overhead (except for monitoring the bugs already in retail).

    So now that we knew we didn’t want to use meetings, had a reduced list of participants and had a smarter workflow for the bug pipeline, we defined the process.  Let me describe how it works.

    We created a Slack channel called #bugs_triage for the singular purpose of triaging bugs.  When a QA Analyst has finished writing up a bug in Jira, they start a new thread in that channel with the bug title, a link to the Jira bug and a brief comment that includes any important information not obvious from the title alone, as well as the QA recommendation for priority.  Most importantly, that comment includes @ notifications for the relevant producer, designer, engineer or any other discipline lead, such as an artist or audio engineer, who needs to be involved.  Once the bug is posted, QA neither expects nor requires an immediate response; this is the asynchronous process at work.  Triage can wait until the people notified find a time that works for them to check their Slack notifications and address the bug.

    When they respond, they do so in a threaded reply.  That keeps the #bugs_triage channel manageable and readable, since it only shows the top-level comment for each bug.  In those threaded conversations we talk about what we’d like to do with the bug.  Stakeholders get a chance to ask for more information, and we make the key decisions about priority, target fix, sprint and assignee in the same way we would if we were all in a meeting together.  Once consensus has been reached, the QA team member updates the bug in Jira and closes out the threaded conversation by editing the original post and putting a strikethrough on its text.  Anyone looking at the channel can see which bugs have been triaged and which are still waiting for decisions.

    In general, we treat these posts as having a 24 hour SLA.  Occasionally a QA team member needs to prompt somebody for follow up, but we’ve had no issues taking care of bugs much faster than we did with the traditional method.  Bugs have gone from published to accepted and assigned, with all relevant metadata updated, in as little as 15 minutes.  Even when it takes 24 hours, it is still faster than the results we got from periodic meetings.
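
    For readers who like to see a workflow written down as data, here is a minimal Python sketch of the triage post described above.  It is not our tooling (we do all of this by hand in Slack); the class name, field names and the example values are hypothetical, and the tilde strikethrough simply mirrors Slack-style formatting applied line by line.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

SLA = timedelta(hours=24)  # the informal 24 hour SLA described in the post


@dataclass
class TriagePost:
    """One top-level message in #bugs_triage (hypothetical model, not real tooling)."""
    title: str
    jira_url: str
    comment: str            # anything not obvious from the title alone
    qa_priority: str        # QA's recommended priority
    mentions: List[str]     # who to @ notify; handles here are made up
    posted_at: datetime
    triaged: bool = False

    def render(self) -> str:
        """Build the post text: title, Jira link, comment with priority, mentions."""
        lines = [
            self.title,
            self.jira_url,
            f"{self.comment} QA recommended priority: {self.qa_priority}",
            " ".join(self.mentions),
        ]
        if self.triaged:
            # Assumption: Slack-style strikethrough wraps text in tildes.
            lines = [f"~{line}~" for line in lines]
        return "\n".join(lines)

    def close(self) -> None:
        """Mark the post as triaged once consensus is reached and Jira is updated."""
        self.triaged = True


def needs_follow_up(posts: List[TriagePost], now: datetime) -> List[TriagePost]:
    """Return untriaged posts older than the SLA, i.e. ones QA may want to nudge."""
    return [p for p in posts if not p.triaged and now - p.posted_at > SLA]
```

    The point of the sketch is that the whole state of triage lives in the top-level post: struck through means done, anything else is still open, and anything open for more than a day is a candidate for a follow-up prompt.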

    The asynchronous nature of the process is its most important feature: we have minimized disruptions to people by being selective about who we notify and by not taking up a block of their time for a meeting.  A close runner-up in importance is the self-service nature of the process.  It is initiated by the QA Analysts working with each strike team, which has decentralized how QA as an organization manages the flow of bugs and has reduced the work involved.

    No process is perfect the first time it’s implemented.  While the process is simple enough and it took little time to teach our team members how it works, we have had a few minor pain points along the way.  Specifically, they turned out to be issues with the @ notifications we used in our bug triage posts.  There was occasional confusion about who the relevant stakeholders were, and QA Analysts sometimes added too many people out of an abundance of caution.  Both worked against our stated goals: the QA Analysts needed absolute clarity on who to notify in order to manage the process completely independently, and notifying too many people undercut our aim of reducing distractions.

    The mid-stream modification that addressed those issues was to go back and work out with our partners a list for each strike team that we could document and give to the QA Analysts.  Doing that removed the need for judgement calls.  Again, we leaned on the tools we were already using and created special Slack user groups for each strike team.  We maintain a Confluence page that lists each strike team, the people responsible for triage and the name of the Slack user group.  This is easy to update every time we spin up a new strike team or shut one down when the feature is completed.  Those user groups have an easy to remember format: ao_<striketeam name>, where ao stands for area owners and the strike team name matches the working name of the feature.  The membership of these groups is kept intentionally small.  Typically, it will be the Producer assigned to the strike team and one or two other owners; in practice that is usually an Engineer and sometimes a Designer.  That decision is made by the strike team.  When it becomes necessary to consult somebody else in deciding how to triage a bug, a member of that ao_ group will @ notify the right people.  This small adjustment has eliminated any doubt on the part of the QA Analyst when posting a bug and eliminated unnecessary notifications.
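
    To make the naming convention concrete, here is a small Python sketch of the ao_<striketeam name> lookup.  It is only an illustration: our real list lives on a Confluence page, the class and function names are hypothetical, and the way the feature name is turned into a handle (lowercased, with spaces and punctuation replaced by underscores) is an assumption rather than a rule we have written down.

```python
import re
from dataclasses import dataclass
from typing import List

MAX_OWNERS = 3  # the Producer plus one or two other owners


def ao_handle(feature_name: str) -> str:
    """Build an ao_<striketeam name> handle from a feature's working name.

    Assumption: the name is lowercased and non-alphanumeric runs become "_".
    """
    slug = re.sub(r"[^a-z0-9]+", "_", feature_name.lower()).strip("_")
    if not slug:
        raise ValueError("feature name must contain letters or digits")
    return f"ao_{slug}"


@dataclass
class StrikeTeam:
    """One row of the strike team list (hypothetical model)."""
    feature_name: str
    owners: List[str]  # area owners responsible for triage

    def __post_init__(self) -> None:
        # Keep membership intentionally small, as described above.
        if not 1 <= len(self.owners) <= MAX_OWNERS:
            raise ValueError(f"expected 1 to {MAX_OWNERS} owners, got {len(self.owners)}")

    @property
    def handle(self) -> str:
        return ao_handle(self.feature_name)


def mention_for(teams: List[StrikeTeam], feature_name: str) -> str:
    """Return the single @ mention a QA Analyst adds for a feature's bugs."""
    for team in teams:
        if team.feature_name == feature_name:
            return "@" + team.handle
    raise KeyError(f"no strike team documented for {feature_name!r}")
```

    With a made-up team such as StrikeTeam("Example Feature", ["producer", "engineer"]), mention_for returns "@ao_example_feature", so the QA Analyst has exactly one thing to type and no judgement call to make.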

    A few final thoughts as I close this post out.  I think a good process can make game development better.  A process that serves a need is a good process.  But we need to remember one very important point: process serves us, and as our needs change the process should change.  It’s important to constantly rethink how you do things to ensure that you have the best process possible.  I hope that, if nothing else, this post gives my industry colleagues something to consider about how you perform bug triage.  What I’ve described here might be an ideal solution that you have not considered.  It may also motivate you to rethink what you are currently doing for bug triage and help you craft a better solution tailored to your needs.  Unless you occasionally rethink how things are done, you are serving process...not the other way around.

