Site Reliability Engineer’s Guide to Black Friday

It’s gotten to the point where Black Friday reliability prep has to start on…well, Black Friday. This year, 32% of US consumers said they planned to start their holiday shopping between July and October. Plus, Black Friday isn’t the only day eCommerce businesses have to worry about: now we have Cyber Monday, Travel Tuesday, and what feels like thousands of Prime Days from Amazon. This means robust site reliability plans are more crucial than ever and must be thought about months in advance. So, in this blog I will guide your team through the endless holiday season and provide tips and tricks that can help improve your site’s reliability.

What is Black Friday Reliability and Why is it Important?

As holiday shopping shifts online, eCommerce companies must diligently prepare for the increase in site traffic and cyber threats that comes with these major purchasing events. However, handling increased demand is not enough if your site performance declines. Over 50% of consumers will leave an eCommerce site if it takes longer than 6 seconds to load, and they expect wait times to be much shorter. This highlights the importance of Black Friday reliability plans that not only handle an influx of traffic but also protect the user experience.

What Can Go Wrong on Black Friday?

Even with strong site reliability practices, unexpected events may still arise. For instance, if one of your competitors’ sites goes down, consumers may turn to you, unexpectedly spiking traffic and potentially crashing your site. Here are some of the incidents that on-call teams must watch out for this holiday season:

Traffic Overload and Downtime – If traffic spikes higher than expected and overwhelms the load balancers, it can lead to crashes or slowdowns that can directly impact sales on your highest sales days.

Scalability Issues – Even with auto-scaling systems in place, things can go wrong. With the sudden spikes that occur during these events, your system may scale too late, causing crashes, or by too much, causing unnecessary costs.

Checkout Failures – The influx of payment transactions may stress the APIs that connect your site to payment processors. If those APIs cannot handle the volume, the result is failed purchases and abandoned carts.

Alert Fatigue – If your team doesn’t set effective alert thresholds, on-call SREs may be flooded with unactionable notifications. The resulting alert fatigue exhausts teams and leads to delayed responses or ignored alerts during critical incidents.

Cyber Threats – Cybercriminals know that sites are more stressed and vulnerable during times of high traffic, so they are more likely to attack and impact your services, employees, and customers.
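To make the alert fatigue point above more concrete, here is a minimal sketch of one common way to cut down on unactionable pages: only page when a metric stays above its threshold for several consecutive samples, and page once per sustained breach. The class name, the 5% threshold, and the sample values are all hypothetical, and a real monitoring tool would usually offer this as a built-in setting rather than custom code.

```python
from dataclasses import dataclass, field


@dataclass
class SustainedThresholdAlert:
    """Page only when a metric stays above a threshold for N consecutive samples."""
    threshold: float        # hypothetical value, e.g. 0.05 for a 5% error rate
    required_breaches: int  # consecutive breaching samples needed before paging
    _streak: int = field(default=0, init=False)
    _firing: bool = field(default=False, init=False)

    def observe(self, value: float) -> bool:
        """Record one sample; return True only at the moment a page should go out."""
        if value <= self.threshold:
            # Metric recovered: reset so the next sustained breach pages again.
            self._streak = 0
            self._firing = False
            return False
        self._streak += 1
        if self._streak >= self.required_breaches and not self._firing:
            self._firing = True
            return True
        return False


# Hypothetical error-rate samples: one brief blip, then a sustained breach.
alert = SustainedThresholdAlert(threshold=0.05, required_breaches=3)
samples = [0.02, 0.09, 0.03, 0.06, 0.07, 0.08, 0.09, 0.01]
pages = [alert.observe(s) for s in samples]
print(pages.count(True), "page(s) sent for", len(samples), "samples")
```

In this sketch the short blip is ignored and the sustained breach pages the on-call engineer once, instead of once per bad sample.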

Best Practices for Black Friday Reliability Prep

To help your SRE team avoid prolonged downtimes and ensure optimal site performance, I have compiled some of the best practices for Black Friday Reliability prep:

Create Reliable On-Call Rotations – When scheduling your team for their on-call rotations during the holiday season, it is important to keep their well-being in mind. Many people have busy home lives this time of year and cannot be expected to be available 24/7. So, creating equitable on-call rotations with strong escalation policies that ensure everyone is able to enjoy some time off can significantly improve productivity and incident response.
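As a rough illustration of an equitable rotation with an escalation path, the sketch below assigns a primary and a secondary engineer to each shift in round-robin order, so the load is spread evenly and every shift has a backup. The function name, engineer names, and start date are hypothetical, and a real scheduling tool would also need to account for time zones, time off, and swaps.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, shifts, shift_days=1):
    """Assign a primary and a secondary (escalation) engineer to each shift, round-robin."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary plus escalation")
    schedule = []
    for i in range(shifts):
        schedule.append({
            "start": start + timedelta(days=i * shift_days),
            "primary": engineers[i % len(engineers)],
            # The next person in line backs up the primary if a page goes unanswered.
            "secondary": engineers[(i + 1) % len(engineers)],
        })
    return schedule


# Hypothetical four-person team covering four one-day shifts.
team = ["Ana", "Ben", "Chloe", "Dev"]
schedule = build_rotation(team, start=date(2024, 11, 29), shifts=4)
for shift in schedule:
    print(shift["start"], shift["primary"], "-> escalates to", shift["secondary"])
```

With four engineers and four shifts, each person is primary exactly once and secondary exactly once.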

Prepare Your Load Balancers – As mentioned, auto-scaling may be insufficient in the case of sudden traffic spikes, so teams should set aside extra capacity on their load balancers before Black Friday rather than waiting for their auto-scaling system to attempt to adapt in real time. This helps the site absorb sudden changes in traffic without waiting on a scale-up.
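One way to reason about how much capacity to set aside is a simple headroom calculation: take the expected peak, add a safety margin, and divide by what a single instance can serve. The sketch below assumes hypothetical numbers (12,000 requests per second at peak, 400 per instance, 25% headroom); your own figures should come from load tests and past holiday traffic.

```python
import math


def instances_to_prewarm(expected_peak_rps: float,
                         rps_per_instance: float,
                         headroom: float = 0.25) -> int:
    """Instances needed to serve the expected peak plus a safety margin."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    target_rps = expected_peak_rps * (1 + headroom)
    return math.ceil(target_rps / rps_per_instance)


# Hypothetical inputs; replace with measured values.
needed = instances_to_prewarm(expected_peak_rps=12_000, rps_per_instance=400)
currently_running = 20
extra_to_provision = max(0, needed - currently_running)
print(f"need {needed} instances, provision {extra_to_provision} more before the event")
```

Rounding up is deliberate here: a partially needed instance still has to exist before the spike arrives.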

Conduct Regular Reviews – Before the season starts, it is essential for teams to test how well their site can handle the influx of traffic, for example through load testing and chaos engineering. They must also evaluate and document their findings from these reviews and from past holiday seasons so that they can make fixes and be prepared when the event comes back around.

Deliver Status Updates via Slack or Teams – Through Black Friday and Cyber Monday weekend, your team should have a dedicated Slack or Teams channel where the on-call engineers deliver updates on site performance to ensure that everyone is on the same page. Additionally, many teams incorporate incident alerting solutions, like OnPage, into these channels to mobilize critical response teams during IT events and gain a single view of all incidents in one place.

Proactive Monitoring – Monitoring key metrics, such as HTTP error rate and traffic anomalies, becomes even more crucial during high-stakes events, as it gives immediate visibility into when end users are experiencing errors and detects unusual changes in site traffic.

Optimize Monitoring and Observability – SRE teams must optimize their monitoring tools by integrating them with alerting solutions. Many monitoring tools deliver alerts through channels that are easy to miss, like email. By leveraging OnPage, teams gain access to high-priority mobile alerts that bypass the silent switch and ensure swift mobilization of on-call teams.
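To illustrate the two metrics named under Proactive Monitoring above, here is a minimal sketch that computes an HTTP error rate per interval and flags traffic that deviates sharply from a rolling baseline. The class name, window size, and z-score threshold are assumptions chosen for illustration; production monitoring tools typically provide far more robust anomaly detection.

```python
from collections import deque
from statistics import mean, pstdev


class TrafficMonitor:
    """Track HTTP error rate and flag unusual request volume over a rolling window."""

    def __init__(self, window: int = 5, z_threshold: float = 3.0):
        self.window = window
        self.z_threshold = z_threshold
        self.history = deque(maxlen=window)  # request counts of recent intervals

    @staticmethod
    def error_rate(requests: int, errors_5xx: int) -> float:
        """Share of requests in an interval that returned a server error."""
        return errors_5xx / requests if requests else 0.0

    def is_traffic_anomaly(self, requests: int) -> bool:
        """Compare this interval with the rolling baseline, then record it."""
        anomalous = False
        if len(self.history) == self.window:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            if spread > 0:
                anomalous = abs(requests - baseline) / spread > self.z_threshold
        self.history.append(requests)
        return anomalous


# Hypothetical per-minute request counts ending in a sudden spike.
monitor = TrafficMonitor(window=5, z_threshold=3.0)
counts = [1000, 1020, 980, 1010, 990, 1005, 2500]
flags = [monitor.is_traffic_anomaly(c) for c in counts]
spike_error_rate = TrafficMonitor.error_rate(requests=2500, errors_5xx=125)
print(flags, f"error rate during spike: {spike_error_rate:.1%}")
```

Only the final interval is flagged, because the first five fill the baseline window and the sixth sits well within normal variation.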

Conclusion

The holiday shopping season is starting earlier and earlier as new deal days emerge and consumers increasingly turn to eCommerce. This challenges SREs to deliver seamless experiences throughout the season while tackling traffic spikes and cyber threats. Hopefully, the best practices I outlined in this blog will help your team prepare for the holiday seasons to come!

Published by

Zoe Collins

Tags: Black Friday, black friday outage, Cyber Monday, eCommerce, Holiday Shopping, incident management, IT engineers, monitoring and observability, on-call management, on-call rotations, Prime Day, site reliability engineering

6 months ago
