Agent

Issue Being Investigated: False Reporting of Agents Offline

Postmortem

On Sunday, January 21 at 8:21 AM PT, a standard certificate renewal triggered a restart of our service that controls the Online/Offline status for our assets. While we have mechanisms in place to ensure purposeful restarts of the service are graceful, redistributing agent connections across our servers takes several days to complete.

By Monday, January 22 at 7:50 AM PT, as traffic increased with the majority of our agent fleet coming back online after the weekend, our Online/Offline service began dropping a large number of connections, which led to agents being falsely reported as offline. The problem occurred not only because connections were still redistributing across the servers, but also because the proxy server settings were not tuned to handle the influx of traffic.
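
As an illustration only: this postmortem does not describe how agents reconnect after a dropped connection. One widely used way to keep a large fleet from reconnecting all at once is exponential backoff with random jitter. The Python sketch below is hypothetical; the names `reconnect_delay` and `simulate_reconnects` and the `base` and `cap` values are assumptions for the example and are not taken from the Syncro agent or the Online/Offline service.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a randomized wait in seconds before reconnect attempt `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the actual
    wait is drawn uniformly between 0 and that bound ("full jitter").
    `base` and `cap` are illustrative values, not real agent settings.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return rng.uniform(0.0, ceiling)


def simulate_reconnects(num_agents, attempt, seed=0):
    """Return one reconnect delay per simulated agent, using a seeded generator."""
    rng = random.Random(seed)
    return [reconnect_delay(attempt, rng=rng) for _ in range(num_agents)]


if __name__ == "__main__":
    delays = simulate_reconnects(num_agents=1000, attempt=5)
    print(f"earliest reconnect after {min(delays):.2f}s, latest after {max(delays):.2f}s")
```

With jitter, agents that lose their connection at the same moment retry at different times, so the servers see a spread of reconnects rather than a single spike.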

In response to the dropped connections, we temporarily disabled offline alerting in the app, adjusted our proxy server settings, and manually restarted each of our servers to force the redistribution of connections. By 11:21 AM PT on January 22, these changes appeared to have resolved the issue: our agent fleet held presence connections as expected, so we called the all clear on the incident and re-enabled agent offline notifications.

On Tuesday, January 23 at 8:20 AM PT, as our morning traffic once again increased and despite our changes the day prior, our Online/Offline service was still unable to support the influx of traffic and agent connections began to drop. Agent offline reporting was again disabled at 8:38 AM PT and the team continued work to optimize our server load configurations. After implementing these optimizations, we delayed reactivation of agent offline notifications until 9:00 AM PT on Wednesday, January 24, to confirm our system could reliably manage the morning traffic.

As an immediate response to this incident, our team is prioritizing additional optimizations of our Online/Offline system. This work is being done in a test environment that simulates our current and projected traffic, so that we can gracefully handle planned or unplanned restarts of the service in both the short and long term as our agent fleet continues to grow. Our goal is a reliable presence system for our partners as they grow their businesses in the years to come. This work is currently in progress, and we are committed to investing whatever resources are necessary to reduce the likelihood of future incidents.

Resolved

RESOLVED: The team has confirmed that agent online connections have remained stable since our latest updates and is calling the all clear.

We have re-enabled the offline feature in the app, so any offline device reporting will now resume as expected.

We will share retrospective reviews on this incident and action items we will be taking to avoid incidents like this in the future.

This issue is now resolved, but if you have any further questions or concerns relating to this issue, please reach out to our support team at help@syncromsp.com.

Monitoring

UPDATE: We will re-enable offline alerts at 9:00 AM PST / 12:00 PM EST, so any offline device reporting will resume as expected. We are continuing to monitor to confirm that the issue is fully resolved.

Monitoring

Stability is holding. This is a top-priority issue. We have validated the system as stable thus far, and we will continue to observe and test accordingly. Offline Agent Alerts are still temporarily disabled, but we will return them to their normal state shortly. Status site updates on this issue will continue until it is resolved.

Updated

UPDATE: Stability remains improved. Due to the nature of the issue, we will continue to observe until tomorrow morning in order to fully validate the system as stable. Offline Agent Alerts will remain temporarily disabled until this result has been confirmed. Status site updates on this issue will resume tomorrow morning, 1/24.

Updated

UPDATE: Improvements to stability have been observed post-deployment. The team will continue to monitor results.

Updated

UPDATE: The deployment has completed and the team is monitoring for stability. Offline Agent Alerts remain temporarily disabled until stability is observed.

Updated

UPDATE: A deployment to address Offline Agent Alerts is currently rolling out. We expect this to take about 30 minutes and will provide another update once it has completed.

Updated

UPDATE: The team is preparing a deployment for release that will address false Offline Agent Alert notifications. Thank you for your patience. More updates to come soon.

Updated

UPDATE: The team continues to evaluate a long-term solution and address the backend systems. Changes will be made shortly.

Updated

UPDATE: Changes have been implemented and improvements are being seen at this time. The team is evaluating long-term solutions. Offline Agent Alerts continue to be temporarily disabled until the system is deemed stable.

Updated

UPDATE: In order to mitigate false notifications, we are temporarily disabling Offline Agent Alerts until the system has been deemed stable.

Monitoring

After continuing to monitor since yesterday, we are seeing a recurrence of false Agent Offline notifications. We are currently working to resolve the issue and will provide further updates as they become available. Thank you for your patience.

Resolved

The team has confirmed that agent online connections have remained stable since our latest updates and is calling the all clear.

We have re-enabled offline reporting in the app, so any offline device reporting will now resume as expected.

Within the next 48 hours, we will be sharing a retrospective on this incident and action items we will be taking to avoid incidents like this in the future.

This issue is now resolved, but if you have any further questions or concerns relating to this issue, please reach out to our support team at help@syncromsp.com.

Problem Identified

We are seeing improvement and stability with agents maintaining online connections since our latest update. The team is continuing to monitor and offline alerting will remain deactivated until we call the all clear.

Updated

UPDATE: Our team is continuing to work on a solution to address agents falsely reporting offline. We’ve rolled out an additional update and are continuing to monitor agent online presence as the update takes effect.

The agent offline reporting system will remain deactivated as our team continues to work on this ongoing issue.

Investigating

UPDATE: Our team is continuing to work on a solution to address agents falsely reporting offline. The agent offline reporting system will remain deactivated as our team continues to work on this ongoing issue.

Investigating

UPDATE: We are continuing to investigate agents falsely reporting offline at this time. Offline alerting will remain deactivated as we continue to investigate the issue.

Investigating

Our engineers are deploying an update that may mitigate the false offline reports as we continue to investigate this issue.

Investigating

UPDATE: We are still investigating false offline alerts for agents. At this time, we are pausing offline alerting until we deem the system stable.

Investigating

We are currently investigating false reporting of Agents being offline. We apologize for any inconvenience this may have caused and appreciate your patience while we work to fix the problem. We will provide further updates as soon as possible. Thank you for your understanding.