Application Server

July 25, 2023 Incident: Assets falsely reporting offline

Postmortem

At 3:12 pm EST, a standard certificate renewal triggered an unexpected restart of the service that controls the Online/Offline status of our assets. Although we have mechanisms in place to ensure that planned restarts of the service are graceful, the service did not recover automatically from this unexpected restart. It was then overloaded with requests, which caused a number of false offline alerts to be sent. To resolve the issue, our engineers increased the throughput of the service and deleted the backed-up requests so that the service could return to standard load.

In the near term, the team is optimizing our existing Online/Offline service to avoid the risk of extended downtime should another unexpected restart occur. We will also be optimizing our alert muting process to minimize the impact of any similar incident.

In the long term, our engineering team is actively developing an architectural update to replace our existing Online/Offline service with a more reliable and scalable solution.

Resolved
Opened

This issue was opened retrospectively.