Triage an incident in a production system
In the software development world, incidents are inevitable. It's not a matter of if, but when. And when they do occur, how you respond can mean the difference between a minor inconvenience and a full-blown disaster. That's where triage comes in. Triage is a process borrowed from the medical field, where it is used to sort and categorize patients based on the severity of their condition. In the software world, triage is used to quickly assess and categorize incidents so that the appropriate response can be taken.
The word "triage" comes from the French verb "trier," which means to sort or select.
The first step in avoiding a triage situation is to ensure that your software is built on a solid foundation. This is where the Twelve-Factor App comes in. The Twelve-Factor App is a methodology of twelve practices for building software-as-a-service applications. It covers topics such as configuration, dependencies, logs, and scaling out through processes, and following these practices can reduce the likelihood of incidents occurring in the first place.
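As one illustration, the Twelve-Factor App recommends storing configuration in the environment rather than in code. The sketch below shows one way that could look in Python. The variable names (DATABASE_URL, LOG_LEVEL, REQUEST_TIMEOUT_SECONDS) and the Settings class are hypothetical, chosen only for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical application settings, used only for illustration."""
    database_url: str
    log_level: str
    request_timeout_seconds: float


def load_settings(environ=None):
    """Build Settings from environment variables (names are assumptions)."""
    env = os.environ if environ is None else environ
    # Fail fast at startup if a required value is missing, instead of
    # failing later in the middle of handling a request.
    if "DATABASE_URL" not in env:
        raise RuntimeError("DATABASE_URL must be set")
    return Settings(
        database_url=env["DATABASE_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )
```

Accepting an optional mapping instead of always reading os.environ keeps the function easy to test, and a missing required value surfaces at deploy time rather than as a production incident.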
But even with the best intentions, incidents can still occur. When they do, it's important to have a plan in place. The first step in triage is to gather as much information as possible. This includes things like error messages, logs, and user reports. Observability tools like New Relic, Elasticsearch, Grafana, and Prometheus can be incredibly helpful in this regard, as they provide real-time insights into your application's performance and can help identify potential issues before they become incidents.
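When a dedicated tool is not at hand, even a small script can help with the information-gathering step. The following is a minimal sketch that counts error messages per service from raw log lines. The log line format, the service names, and the sample messages are all assumptions made for this example; real logs will need a different pattern.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <service>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def summarize_errors(lines):
    """Count ERROR and CRITICAL lines per (service, message) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["level"] in {"ERROR", "CRITICAL"}:
            counts[(match["service"], match["message"])] += 1
    # Most frequent errors first, so the noisiest failure is at the top.
    return counts.most_common()


# Hypothetical sample data for illustration only.
sample = [
    "2024-01-01T10:00:00Z INFO checkout: request completed",
    "2024-01-01T10:00:01Z ERROR checkout: payment gateway timeout",
    "2024-01-01T10:00:02Z ERROR checkout: payment gateway timeout",
    "2024-01-01T10:00:03Z ERROR search: index unavailable",
]

for (service, message), count in summarize_errors(sample):
    print(f"{count}x {service}: {message}")
```

Grouping by service and message gives a quick view of which failure dominates, which is often enough to decide where to look first.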
Once you have all the information you need, it's time to categorize the incident. Just as medical triage sorts patients by the severity of their condition, software triage sorts incidents by the impact they have on the end user. Is it a minor inconvenience, or is the application completely unusable? How many users are affected? Categorizing the incident in this way helps determine the appropriate response.
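The categorization step can be sketched as a small function that maps user impact to a severity level. The three levels, the input parameters, and the 25% threshold below are illustrative assumptions, not a standard; every team should define its own scale.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical three-level severity scale."""
    SEV1 = "critical: the application is unusable"
    SEV2 = "major: many users affected or no workaround"
    SEV3 = "minor: an inconvenience with a workaround"


def classify(app_usable, affected_user_fraction, workaround_exists):
    """Map user impact to a severity level (thresholds are assumptions)."""
    if not 0.0 <= affected_user_fraction <= 1.0:
        raise ValueError("affected_user_fraction must be between 0 and 1")
    if not app_usable:
        return Severity.SEV1
    # Illustrative threshold: a quarter of users or more counts as major.
    if affected_user_fraction >= 0.25 or not workaround_exists:
        return Severity.SEV2
    return Severity.SEV3


print(classify(app_usable=True, affected_user_fraction=0.02, workaround_exists=True).name)
```

Writing the rules down as code, or even as a simple table, removes guesswork during an incident: two responders looking at the same facts should arrive at the same severity.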
From there, it's a matter of following your incident response plan. This plan should outline the steps that need to be taken based on the severity of the incident. For minor incidents, this might be something as simple as restarting a service. For more severe incidents, it might involve rolling back to a previous version of the application or even engaging with third-party vendors to resolve the issue.
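An incident response plan like the one described above can be kept as plain data, so that responders look up the steps instead of improvising them. This is a minimal sketch: the severity names and the exact wording of each step are hypothetical, and the steps simply mirror the examples mentioned in the text.

```python
# Hypothetical response plan: severity name -> ordered list of steps.
RESPONSE_PLAN = {
    "minor": [
        "Restart the affected service",
        "Confirm recovery on dashboards",
    ],
    "major": [
        "Roll back to the previous version of the application",
        "Confirm recovery on dashboards",
    ],
    "critical": [
        "Roll back to the previous version of the application",
        "Engage third-party vendors if the fault is on their side",
        "Confirm recovery on dashboards",
    ],
}


def next_steps(severity):
    """Return the ordered response steps for a severity level."""
    try:
        return RESPONSE_PLAN[severity]
    except KeyError:
        known = ", ".join(sorted(RESPONSE_PLAN))
        raise ValueError(f"unknown severity {severity!r}; expected one of: {known}") from None


for number, step in enumerate(next_steps("minor"), start=1):
    print(f"{number}. {step}")
```

Rejecting an unknown severity with a clear error is deliberate: a typo should not silently return an empty plan in the middle of an incident.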
In conclusion, triage is an essential part of incident response in the software development world. By following best practices like the Twelve-Factor App and using observability tools like New Relic, Elasticsearch, Grafana, and Prometheus, you can help prevent incidents from occurring in the first place. But when incidents do occur, having a plan in place and following a structured triage process can help minimize the impact on your end users and get your application back up and running as quickly as possible.