Issue Summary

From 10:23 AM to 11:18 AM UTC, the Peach API did not respond to database-backed API requests. Basic API functionality remained available. The root cause of the outage was an automated database failover that the API application did not handle gracefully. The issue affected several PeachWorks consumer-facing sites, including go.peachworks.com, developer.peachworks.com, and apps.peachworks.com.

Timeline (UTC)

10:23 AM UTC - Database failover started automatically
10:24 AM UTC - Database failover completed
10:36 AM UTC - PeachWorks team notified of the API outage by an outage alert for a secondary service
11:17 AM UTC - Peach API service restart initiated
11:18 AM UTC - 100% of API services operational
11:20 AM UTC - Consumer sites operational

Root Cause

At 10:23 AM UTC, Amazon RDS initiated an automatic failover due to a low-level node failure. This failover caused the Peach API to drop all active database connections.

When the backup Amazon RDS instance came online at 10:24 AM UTC, the Peach API did not reconnect to the database. This severely limited the API functionality.
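
The failure mode above, a client that keeps using a dead connection after a failover, can be illustrated with a minimal sketch. This is not the Peach API's actual code: the `ReconnectingClient` class, the `connect` factory, and the retry and delay values are all hypothetical names and numbers chosen for illustration. The idea is to discard the connection on failure and open a fresh one before retrying.

```python
import time


class ReconnectingClient:
    """Hypothetical wrapper that reopens the database connection after a failure."""

    def __init__(self, connect, retries=3, delay=0.5, errors=(ConnectionError,)):
        self._connect = connect    # assumed factory returning an object with execute()
        self._retries = retries    # extra attempts after the first failure
        self._delay = delay        # base delay in seconds, doubled on each attempt
        self._errors = errors      # exception types treated as connection loss
        self._conn = None

    def query(self, sql):
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return self._conn.execute(sql)
            except self._errors as exc:
                # Drop the dead connection so the next attempt opens a fresh one.
                self._conn = None
                last_error = exc
                if attempt < self._retries:
                    time.sleep(self._delay * (2 ** attempt))
        raise last_error
```

With a wrapper of this kind, a query issued after the backup instance comes online would succeed on a retry instead of failing until the service is restarted.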

Resolution and Recovery

At 10:44 AM UTC, the monitoring system alerted PeachWorks engineers, who immediately began diagnosing the issue. The diagnosis took longer than expected because the Peach API reported itself healthy despite being unable to connect to the database.

After deeper log analysis, the database connection issue was identified and the API services were restarted at 11:17 AM UTC. Full service was restored shortly afterward.

Corrective and Preventative Measures

An internal review was conducted, and the following actions are being taken to prevent further issues of this nature and to shorten the time needed to diagnose future incidents:

Make the Peach API health checks more comprehensive so that they take all core dependencies into account.
Fix the underlying database reconnection process for the Peach API so that it recovers from an automatic database failover or any other connection loss.
Create additional alerts to notify the team of:
- Automatic database failovers
- Low database connection counts
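
As an illustration of the first measure, a dependency-aware health check might look like the sketch below. It is an assumption about the general approach, not the Peach API's implementation: the `check_health` function and the probe names are hypothetical. Each core dependency supplies a probe, and the service reports healthy only if every probe succeeds.

```python
def check_health(dependencies):
    """Run each dependency probe; report healthy only if all of them succeed.

    `dependencies` maps a name to a zero-argument callable (a hypothetical
    probe, e.g. one that runs a trivial query) that raises on failure.
    """
    results = {}
    for name, probe in dependencies.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"healthy": healthy, "dependencies": results}
```

Under this approach, a lost database connection would surface directly in the health report, for example as `{"healthy": False, "dependencies": {"database": "failed: ..."}}`, rather than being hidden behind a healthy status.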

Posted Oct 12, 2015 - 17:00 EDT

Resolved

System is operating normally. Post Mortem to follow today.

Posted Oct 12, 2015 - 08:03 EDT

Monitoring

Application behavior has returned to normal. We will continue to monitor.

Posted Oct 12, 2015 - 07:26 EDT

Investigating

We are currently investigating this issue.

Posted Oct 12, 2015 - 07:03 EDT

This incident affected: Beyond One (API, Apps, POS Hub API, Developer Portal).