Last month, GoSpotCheck experienced incidents of significant performance degradation which spanned several hours on the mornings of March 4-7. Additionally, on March 14, April 1, and April 2, the platform experienced performance issues that were less widespread, but disruptive nonetheless to a number of customers.
We’d like to reiterate our sincere apologies for the impact these incidents have had on your business. We do not take for granted the trust you place in GoSpotCheck to empower your teams to do great work, and we understand that performance issues like this erode that trust.
Though we cannot undo the impact these incidents had on your operations, we would like to share with you an overview of what went wrong, what we learned, and what we’re doing to prevent this type of incident from happening again.
When the GoSpotCheck Mobile App is connected to the internet, it communicates periodically with our servers to provide the user with the freshest data that is relevant to them. This communication is accomplished through a number of requests that the mobile client makes of our servers. Given the growth in the number of GoSpotCheck Mobile App monthly active users over the last several years, our engineering team has been highly focused on building out the technical infrastructure to support this ever-increasing number of mobile requests.
The week of March 4th saw an unexpected increase in the number of requests the Mobile Apps were making of our servers. The number of requests overwhelmed our servers and caused traffic to and from them to stutter and, in some cases, fail. As a result, many admin users experienced an inability (or a delay in their ability) to load or interact with pages in the Web Dashboard, and many mobile users experienced long load times or errors when attempting to refresh the data in the app, despite having adequate internet connectivity.
Our subsequent investigation identified two major contributors to the spike in requests. First, a defect in the GoSpotCheck iOS Mobile Application was causing it to send five nearly identical requests instead of just one, often at the exact same time. This defect alone multiplied the request volume reaching our servers to 5x the true number of unique requests during these incidents. The problem was then compounded by “retry” logic built into each mobile app, which automatically attempted to resubmit every one of these unnecessary duplicate requests 5-8 times in succession. In combination, these two issues multiplied each other, sending a flood of duplicate requests to our servers at once. This overwhelmed the servers and rendered them inaccessible for many users while each request was processed.
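To make the scale of this amplification concrete, the arithmetic can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes the worst case in which every duplicate is retried the full number of times and each retry is sent in addition to the original attempt.

```python
def total_requests(unique_requests: int, duplicates_per_request: int, retries_per_request: int) -> int:
    """Worst-case number of requests sent for a given number of intended requests."""
    # The defect turned each intended request into several near-identical ones.
    sent_initially = unique_requests * duplicates_per_request
    # Assumption: every one of those is then resubmitted `retries_per_request` times.
    return sent_initially * (1 + retries_per_request)


# One intended request, five duplicates, 5 to 8 retries each.
low = total_requests(1, 5, 5)
high = total_requests(1, 5, 8)
print(low, high)  # prints "30 45"
```

Under these assumptions, a single intended request could become 30 to 45 requests. The growth is multiplicative: the duplicate factor and the retry factor compound each other.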
GoSpotCheck has taken a multi-faceted approach to remediate these performance issues:
We’ve disabled some of the retry logic that caused the automatic request resubmissions, and calibrated the constraints on the remaining retry logic to be more discerning.
We've deployed a bug fix for the iOS Mobile App (v4.14.11) that corrects the number of requests the app makes at any one time and prevents it from generating identical requests.
We have also worked to increase the capacity of our backend systems to better handle increased levels of traffic and requests. This includes, but is not limited to:
Provisioning additional backend databases to spread out traffic for high-volume customers
Scaling up the size and capacity of application resources in the platform
Reducing “hot spots” in the platform by tuning or refactoring inefficient application and database code
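For readers curious what more discerning retry logic and duplicate-request prevention can look like on a client, here is a minimal sketch. It is not the code shipped in the GoSpotCheck apps: `TransientError`, `retry_with_backoff`, `InFlightRequests`, and all default values are hypothetical names and numbers chosen for illustration. It shows two common techniques: bounded retries with capped, randomized ("jittered") exponential backoff, and skipping a request when an identical one is already in flight.

```python
import random
import time
from typing import Callable, Hashable, Set, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type marking failures that are worth retrying."""


def retry_with_backoff(
    send: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, retrying transient failures with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error instead of retrying forever.
            # Double the delay ceiling on each attempt, then wait a random time
            # below it so that many clients do not all retry at the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


class InFlightRequests:
    """Tracks request keys so an identical request is not sent twice at once."""

    def __init__(self) -> None:
        self._keys: Set[Hashable] = set()

    def try_begin(self, key: Hashable) -> bool:
        """Return True if the caller may send; False if a duplicate is in flight."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

    def finish(self, key: Hashable) -> None:
        """Mark the request as complete so a later identical request is allowed."""
        self._keys.discard(key)
```

A caller would check `try_begin` with a key that identifies the request (for example, endpoint plus parameters), send it through `retry_with_backoff` only if that returns True, and call `finish` once a response or final error arrives. This single-threaded sketch omits locking, which a real multi-threaded client would need.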
GoSpotCheck Engineering continues to work diligently to improve stability and increase the capacity of the platform. These efforts include:
Further optimization of data access patterns in backend databases
Continued refactoring and optimization of known application hotspots that struggle under load
Expanded use of caching logic to reduce the overall load on the backend databases
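As an illustration of how caching reduces database load, the sketch below keeps a recently loaded value in memory for a short time so that repeated requests for the same data do not each reach the database. It is a generic example, not a description of GoSpotCheck's implementation: the `TTLCache` class, its method names, and the 30-second lifetime are assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Minimal time-to-live cache: values expire `ttl_seconds` after loading."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps each key to (expiry time, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, load: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh; otherwise call `load` and cache it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # Fresh hit: the backend is not touched.
        value = load()  # Miss or expired: one trip to the backend.
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical usage: `load_missions` stands in for a database query.
cache = TTLCache(ttl_seconds=30)

def load_missions() -> list:
    return ["mission-a", "mission-b"]

missions = cache.get_or_load("missions:team-1", load_missions)
```

The trade-off is freshness: a longer lifetime removes more database reads but means clients may see data that is up to that many seconds old.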