Outages and downtime are inevitable. When they happen, it’s important to triage, investigate, and fix the problem while also providing timely status updates to customers and our fellow PollEvians. Juggling all these responsibilities can be chaotic, which is why we believe it’s important to have standard practices and a checklist to follow.

Incident commander (required)

The incident commander is primarily responsible for communication and coordination. They should be the single source of truth for Customer Service, Customer Success Managers, and any other PollEvians who need updates about the incident.

Engineers investigating the incident should also report their findings to the incident commander. If those responsibilities don’t fully occupy the incident commander’s time, or if the issue is low priority, they may also take part in the investigation.

The incident commander is responsible for following our downtime checklist to guide the team through these steps:

  1. Triage
  2. Coordinate
  3. Mitigate
  4. Resolve
  5. Follow-up
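
As a rough illustration, the five steps above could be tracked in incident tooling as an ordered enumeration. This is only a sketch: the `Step` and `next_step` names are hypothetical and are not part of our actual checklist or tooling.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """The checklist steps, in the order the incident commander follows them."""
    TRIAGE = 1
    COORDINATE = 2
    MITIGATE = 3
    RESOLVE = 4
    FOLLOW_UP = 5


def next_step(current: Step) -> Optional[Step]:
    """Return the step that follows `current`, or None after the last step."""
    if current is Step.FOLLOW_UP:
        return None
    return Step(current + 1)
```

Keeping the order in one place makes it harder to skip a step, such as follow-up, once the immediate pressure is off.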

Data Integrity

Treat every incident involving data integrity as a top-priority bug, as if the site were down, and manage it through the same process as a critical bug. In a data integrity event, appoint an incident commander immediately. There needs to be a single point of contact to communicate with our internal teams and customers before we run any operations that modify customer data.

In the event of a data integrity issue, it is important to identify data that needs to be restored, as well as the underlying bug. It is up to the incident commander to determine if the bug needs to be fixed immediately or if it can wait until after user data is restored.

Triage

The primary goal of triage is to determine what is wrong and how bad it is. Typically, the alert sent to the on-call engineer will give you a good indication of what part of the system is experiencing problems. Determining the severity is much more subjective, but we do have some guiding principles.

High severity

An incident is classified as high severity if some percentage of requests to our servers are failing, if response times have risen significantly, or if a user-facing feature is failing. Basically, if it’s an ongoing problem that will affect customers until it’s mitigated, it’s high severity. High-severity issues require a more aggressive response and may involve many people.
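
The high-severity criteria can be sketched as a simple check. This is a minimal illustration, not our actual alerting logic: the guidelines give no numbers, so both thresholds below are assumptions, and the `IncidentSignals` fields and `is_high_severity` function are hypothetical names.

```python
from dataclasses import dataclass

# Assumed thresholds. The guidelines say only "some percentage" of failing
# requests and "significantly" higher response times, so tune these per system.
ERROR_RATE_THRESHOLD = 0.01      # 1% of requests failing
LATENCY_RATIO_THRESHOLD = 2.0    # response time at least 2x the baseline


@dataclass
class IncidentSignals:
    error_rate: float                  # fraction of requests failing, 0.0 to 1.0
    latency_ms: float                  # current typical response time
    baseline_latency_ms: float         # normal response time for comparison
    user_facing_feature_failing: bool  # is a user-facing feature broken?


def is_high_severity(signals: IncidentSignals) -> bool:
    """Return True if any one of the three high-severity criteria is met."""
    requests_failing = signals.error_rate >= ERROR_RATE_THRESHOLD
    latency_degraded = (
        signals.latency_ms >= LATENCY_RATIO_THRESHOLD * signals.baseline_latency_ms
    )
    return requests_failing or latency_degraded or signals.user_facing_feature_failing
```

A check like this only covers the measurable part of the decision. Judging severity remains subjective, so the result is a starting point for whoever is triaging rather than a verdict.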

Low severity