Holiday Readiness, Part I: Best Practices for Maintaining Peak Performance
The last two holiday seasons have been unlike any in years past, setting up a different approach to “holiday readiness” in 2022. Throughout the pandemic (and beyond), retailers became increasingly dependent on their digital channels as the dominant, or sometimes only, source of revenue.
Adding to the digital pressure, customers’ expectations have increased while their patience has decreased, and opportunities for competitive shopping are more abundant than ever. So while the rules haven’t necessarily changed for this holiday season, the stakes for your retail business have never been higher.
When any day could be a holiday
Today, customers anticipate the holiday season earlier and expect it to last longer. Because peak periods now span weeks or months rather than days, we can no longer approach holiday readiness as preparation for a single large event, with special handling, special processes, 24/7 war rooms, and other one-time activities.
We must create a sustainable operational cadence that allows our people, process, and technology to respond with little or no warning to a series of peak events without causing undue impact to customers or business operations.
The extended customer experience funnel
Additionally, the pieces of our application that need to be ready for peak events have changed. Traditionally, the focus has been on the “search to checkout” funnel; as long as a customer could find and pay for the item they wanted, the experience was a success.
Now that customers are increasingly picky about price, inventory, shipping costs, and timelines, the funnel has expanded. We must also consider capabilities like “buy online, pick up in-store,” shipping and tracking information, fulfillment with third-party logistics partners, and so on.
If a customer purchases an item, but their mobile app won’t let them check in to pick it up when they arrive in the store parking lot, then the overall experience is a failure. It’s time to think beyond the checkout page.
Clearly define — and anticipate — your business requirements
The most important step in planning for peak events is to clearly identify your business requirements. Having these requirements written down and agreed upon ahead of time will fuel easy decision-making and streamline the necessary response in times of uncertainty.
Requirements should be stated in business terms — e.g., X guest checkouts per second, Y logins within the first five minutes, or Z store locator lookups at midnight on Thanksgiving — and not in technical terms: number of servers, amount of storage, or available bandwidth.
You should clearly understand the priority of capabilities and, in the event of a crisis, what can be allowed to fail. For example, “Online checkout can be unavailable for up to 15 minutes as long as the customer service call queue has a wait time of less than 3 minutes.” Or perhaps, “Real-time inventory checking can be disabled if the checkout rate exceeds 1,000 per second.”
Of course, it is possible that your business is simply unable or unwilling to make these trade-offs; regardless, the conversation should be explored so that these kinds of decisions can be made before the event, rather than under pressure.
These requirements should drive your technical architecture, capacity planning, and incident handling procedures.
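As an illustration, a written requirements document can even be machine-readable, so that capacity planning and incident automation consume the same source of truth. Here is a minimal sketch in TypeScript; the capability names, targets, and degradation rules are hypothetical placeholders standing in for your own.

```typescript
// A hypothetical, machine-readable capture of peak-event business
// requirements: all capability names, targets, and rules below are
// placeholders, not recommendations.

interface BusinessRequirement {
  capability: string;        // stated as a business capability, not a server count
  target: string;            // the agreed-upon business-level target
  priority: number;          // 1 = most critical; higher numbers may fail first
  degradationRule?: string;  // pre-agreed trade-off, if any
}

const peakRequirements: BusinessRequirement[] = [
  {
    capability: "guest-checkout",
    target: "1,000 guest checkouts per second sustained",
    priority: 1,
    degradationRule:
      "Real-time inventory checking may be disabled above 1,000 checkouts/sec",
  },
  {
    capability: "login",
    target: "Y logins within the first five minutes of the event",
    priority: 2,
  },
  {
    capability: "store-locator",
    target: "Z lookups at midnight on Thanksgiving",
    priority: 3,
    degradationRule: "May serve cached results during the event",
  },
];

// During an incident, sort by priority to decide what to protect first.
const protectionOrder = [...peakRequirements].sort((a, b) => a.priority - b.priority);
console.log(protectionOrder.map((r) => r.capability).join(" > "));
```

Because it is plain data, the same document can drive dashboards, alert thresholds, and runbooks without being reinterpreted under pressure.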
How to maximize observability
Now that you clearly know your business requirements, you need to ensure that you are gathering and tracking all the signals that will indicate whether your systems are meeting those needs. Observability can be broken down into three primary categories: monitoring, alerting, and logging.
Monitoring typically encompasses the time series data we usually collect: data for all the normal operations within the application (logins, checkouts, search queries, and so forth), along with the telemetry of our underlying infrastructure (CPU utilization, bandwidth, storage, and database queries). But beyond these volumetric indicators, don’t forget to track the qualitative experience of the customer. How is your application performing? There is typically an inverse relationship between load and performance: as the volume of use goes up, performance degrades. Use load testing tools before the event to establish where in your usage curve you start to see performance degradations.
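As a minimal sketch of that pre-event exercise (a dedicated tool like CloudTest is the real answer), the script below ramps concurrency against a single endpoint and reports average latency at each step; the target URL and step sizes are placeholders.

```typescript
// Minimal load-ramp sketch (Node 18+, global fetch): issue batches of
// concurrent requests at increasing concurrency and report average
// latency, to locate the knee where performance starts to degrade.
// The target URL and step sizes are placeholders.

const TARGET = "https://staging.example.com/api/search?q=gift";

async function timedRequest(): Promise<number> {
  const start = performance.now();
  await fetch(TARGET);
  return performance.now() - start;
}

async function rampTest(): Promise<void> {
  for (const concurrency of [10, 50, 100, 200, 400]) {
    const latencies = await Promise.all(
      Array.from({ length: concurrency }, () => timedRequest())
    );
    const avgMs = latencies.reduce((a, b) => a + b, 0) / latencies.length;
    console.log(`concurrency=${concurrency} avgLatencyMs=${avgMs.toFixed(1)}`);
  }
}

rampTest().catch(console.error);
```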
During your event, use synthetic monitoring to detect when you start to deviate from the baseline performance; synthetic transactions can be used to tightly control variability that could distract from a true system issue. Finally, leverage real user monitoring to understand what your customers are actually seeing, and in the event of an incident, to be able to quantify any negative impacts to the customer experience. Akamai tools for monitoring include Event Center, Event Viewer, Reporting, mPulse, CloudTest, Test Center, and Web Security Analytics.
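To make the synthetic approach concrete, here is a toy check assuming a latency baseline established before the event; the URL, baseline, and tolerance are placeholders, and a real deployment would use a purpose-built synthetic monitoring product.

```typescript
// Toy synthetic check (Node 18+): run the same transaction on a fixed
// schedule and flag deviations from a pre-established baseline. The
// URL, baseline, and tolerance below are placeholders.

const CHECK_URL = "https://www.example.com/api/health";
const BASELINE_MS = 250; // measured before the event under normal load
const TOLERANCE = 1.5;   // flag if latency exceeds 1.5x baseline

async function syntheticCheck(): Promise<void> {
  const start = performance.now();
  const res = await fetch(`${CHECK_URL}?cb=${Date.now()}`); // bust caches
  const elapsedMs = performance.now() - start;

  if (!res.ok || elapsedMs > BASELINE_MS * TOLERANCE) {
    console.warn(`DEVIATION status=${res.status} latencyMs=${elapsedMs.toFixed(0)}`);
  } else {
    console.log(`ok latencyMs=${elapsedMs.toFixed(0)}`);
  }
}

// Running the identical transaction every minute keeps variability
// tightly controlled, so a deviation points to a true system issue.
setInterval(() => syntheticCheck().catch(console.error), 60_000);
```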
Alerting simply means creating thresholds for your monitoring data to indicate when action is needed. Ideally, you can leverage adaptive thresholds, which detect variations from the norm, rather than static thresholds. However, recognize that peak events are, by their very nature, variations from the norm, so you must still have well-defined absolute thresholds for the boundaries of safe operation for your application.
Since you have already clearly defined your business requirements, you should also be able to clearly lay out the actions associated with any particular alert. Additionally, alerts should be tied to an escalation process so that nothing is ever missed. Akamai tools for alerting include Control Center alerts, mPulse alerts, and Web Security Analytics alerts.
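One possible shape for such an alerter, sketched under the assumption that a rolling mean and standard deviation stand in for a true adaptive baseline, with an absolute limit guarding the boundary of safe operation:

```typescript
// Sketch of pairing an adaptive threshold (deviation from a rolling
// baseline) with an absolute boundary of safe operation. The window
// size, sigma count, and limits are illustrative.

class MetricAlerter {
  private window: number[] = [];

  constructor(
    private readonly windowSize: number,  // samples in the rolling window
    private readonly sigmas: number,      // allowed standard deviations
    private readonly absoluteMax: number  // hard limit of safe operation
  ) {}

  record(value: number): string | null {
    // Peak events are, by nature, deviations from the norm, so the
    // absolute boundary applies regardless of the adaptive baseline.
    if (value > this.absoluteMax) return "CRITICAL: absolute limit exceeded";

    let alert: string | null = null;
    if (this.window.length >= this.windowSize) {
      const mean = this.window.reduce((a, b) => a + b, 0) / this.window.length;
      const variance =
        this.window.reduce((a, b) => a + (b - mean) ** 2, 0) / this.window.length;
      const std = Math.sqrt(variance);
      if (std > 0 && Math.abs(value - mean) > this.sigmas * std) {
        alert = "WARNING: deviation from rolling baseline";
      }
    }
    this.window.push(value);
    if (this.window.length > this.windowSize) this.window.shift();
    return alert;
  }
}

// e.g., checkout latency in ms: adapt to the norm, but never tolerate > 2s.
const checkoutLatencyAlerter = new MetricAlerter(60, 3, 2000);
```

Whatever fires here should map to the pre-agreed action and escalation path, not to a judgment call made at 2 a.m.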
Logging is often overlooked, but it is a critical piece of peak event management. If your monitoring detects something out of the ordinary, and your alerting wakes up the entire team, how will they know what needs to be done? It pays to ensure that you are logging the right things ahead of time to better anticipate what could go wrong and what to debug first when issues arise — and at what sampling rate (e.g., do you really need 100% sampling, or could you do with less?).
Ideally, you will place logs into a unified dataset that can be queried in one place, rather than having to look at multiple systems. Finally, recognize that as your traffic grows during the peak event, so will your logging volume (you might want to revisit that 100% sampling decision). Ensure that your logging pipelines can scale, or are pre-scaled, to meet or exceed the expected volume as defined by your business requirements. Akamai tools for logging include DataStream 2, SIEM Integration, and Edge Diagnostics.
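A minimal sketch of sampled, structured logging, assuming JSON lines feeding a unified dataset; the 10% rate is a placeholder you would revisit as traffic grows:

```typescript
// Sketch of a sampled, structured logger: routine events are kept at a
// configurable rate, errors are always kept, and output is JSON lines
// suitable for a unified dataset. The 10% rate is a placeholder.

interface LogEvent {
  level: "info" | "warn" | "error";
  message: string;
  [key: string]: unknown;
}

function makeSampledLogger(sampleRate: number) {
  return (event: LogEvent): void => {
    if (event.level !== "error" && Math.random() >= sampleRate) return;
    console.log(JSON.stringify({ ts: new Date().toISOString(), ...event }));
  };
}

const log = makeSampledLogger(0.1); // revisit this rate as traffic grows
log({ level: "info", message: "checkout completed", orderId: "hypothetical-123" });
log({ level: "error", message: "payment gateway timeout" }); // never sampled away
```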
It’s important to remember that you need to track not only the metrics that give visibility into your successes, but also the indicators that will lead to early detection of a failure before it becomes catastrophic.
Embrace graceful degradation to avoid compounding failures
In today’s world, it’s no longer sufficient to think of your applications and services as simply “up” or “down.” Modern applications should be available all the time — even through peak events or surges in traffic.
Defining what “available” means to your organization necessitates going back to your business requirements: How can you design your application to handle multiple smaller failures without them resulting in a catastrophic one, degrading “gracefully” to limit the impact of any particular failure and allowing time for your incident response team (or automated actions) to adapt and recover?
With graceful degradation, you can keep the most critical functions available to your most critical users at the most critical times, while choosing what can be sacrificed to ensure that availability. Some approaches to graceful degradation include load shedding, time shifting, reduction of quality, and increasing capacity.
Load shedding is the process of choosing not to respond to all requests once the system has reached a critical limit (ideally, as identified by your monitoring and alerting). Think of a breaker in your home’s electrical system or the spillway on a dam. The art here is in deciding which requests to drop, and when. It may simply be enough to enforce a per-user or global rate limit against a particular endpoint, denying all transactions above that limit. Leveraging a virtual waiting room can provide a variety of methods for choosing whom to serve and whom to deny (or make wait), while also ensuring fairness.
Additionally, using token-based API rate limiting allows for limiting transactions over time, such as a certain amount per month. Akamai tools for load shedding include Visitor Prioritization Cloudlet, WAF Rate Controls, and API Gateway.
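As an illustration of the underlying mechanics (not the behavior of any particular Akamai product), here is a minimal in-process token bucket; the limits shown are placeholders:

```typescript
// Minimal in-process token bucket: tokens refill at a fixed rate, and a
// request is shed when no token is available. A production limiter would
// track buckets per user or per API key, typically at the edge.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,    // maximum burst size
    private readonly refillPerSec: number // sustained rate
  ) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const elapsedSec = (Date.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = Date.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // serve the request
    }
    return false;  // shed: deny, queue, or divert to a waiting room
  }
}

// A global checkout limit: 1,000/sec sustained, bursting to 1,500.
const checkoutLimiter = new TokenBucket(1500, 1000);
if (!checkoutLimiter.tryConsume()) {
  // e.g., respond 429, or redirect the visitor to a virtual waiting room
}
```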
Time shifting is a way to defer the processing of transactions until a later time when the system has caught up. In a peak event, you may decide that real-time inventory reservation can be deferred. In this scenario, you would prevent your inventory functions from being overwhelmed, but in turn, you may end up selling more items than you actually have to deliver. Or, you may decide to cache inventory checks for 30 seconds instead of updating after every transaction. Or perhaps you are sending order confirmation emails to a message queue first, rather than triggering them immediately and overwhelming that piece of your application. Akamai tools for time shifting include caching and EdgeKV.
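Here is a rough sketch of both tactics, using hypothetical helper names: a 30-second TTL cache in front of inventory lookups, and a queue that drains order confirmation emails at a fixed, sustainable pace.

```typescript
// Sketch of two time-shifting tactics with hypothetical helpers: a
// 30-second TTL cache in front of inventory lookups, and a queue that
// drains order-confirmation emails at a fixed, sustainable rate.

const INVENTORY_TTL_MS = 30_000;
const inventoryCache = new Map<string, { count: number; expires: number }>();

async function fetchInventoryFromOrigin(sku: string): Promise<number> {
  return 42; // placeholder for the real (expensive) inventory lookup
}

async function checkInventory(sku: string): Promise<number> {
  const cached = inventoryCache.get(sku);
  if (cached && cached.expires > Date.now()) return cached.count; // stale but fast
  const count = await fetchInventoryFromOrigin(sku);
  inventoryCache.set(sku, { count, expires: Date.now() + INVENTORY_TTL_MS });
  return count;
}

// Defer confirmations: enqueue immediately, send later at a fixed rate.
const emailQueue: string[] = [];
function queueConfirmationEmail(orderId: string): void {
  emailQueue.push(orderId);
}
setInterval(() => {
  const orderId = emailQueue.shift();
  if (orderId) console.log(`sending confirmation for ${orderId}`); // placeholder send
}, 100); // ~10 emails/sec, regardless of how spiky checkouts are
```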
Quality reduction is essentially choosing which pieces of your application will work, which you might shut off, and which might just work slowly. This approach is similar to load shedding and time shifting, except you are explicitly deciding to disable functionality rather than change how it operates. For example, during a peak event, you may disable a section on a product page that shows similar listings, or items that other customers bought. This reduced experience limits the load on your application but doesn’t prevent the critical path for users to transact during your event. Akamai tools for quality reduction include Visitor Prioritization Cloudlet and EdgeWorkers.
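A minimal feature-flag sketch of that idea, with hypothetical flag names and thresholds that would come from your written business requirements:

```typescript
// Feature-flag sketch of quality reduction: when load crosses thresholds
// (which would come from your written business requirements), optional
// page sections are switched off while the purchase path stays intact.
// Flag names and thresholds are hypothetical.

const featureFlags = {
  similarListings: true,
  customersAlsoBought: true,
  personalizedBanner: true,
};

function applyQualityReduction(checkoutsPerSec: number): void {
  if (checkoutsPerSec > 800) {
    featureFlags.similarListings = false;
    featureFlags.customersAlsoBought = false;
  }
  if (checkoutsPerSec > 950) {
    featureFlags.personalizedBanner = false;
  }
}

function renderProductPage(): string {
  const sections = ["product-details", "price", "add-to-cart"]; // critical path
  if (featureFlags.similarListings) sections.push("similar-listings");
  if (featureFlags.customersAlsoBought) sections.push("customers-also-bought");
  if (featureFlags.personalizedBanner) sections.push("personalized-banner");
  return sections.join("\n");
}
```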
Increasing capacity is the most obvious choice when dealing with increased load; however, even the most mature autoscaling components may not be able to react quickly enough to peak event traffic. By leveraging one of the first three approaches, you may be able to buy time for your scaling to react and then return to normal operations. With effective monitoring and clear business requirements, these kinds of actions and reactions can likely be completely automated. Akamai tools for increasing capacity include Linode and EdgeWorkers.
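A compressed sketch of that automation loop, with a placeholder standing in for your provider's scaling API:

```typescript
// Compressed sketch of the automation loop: degrade immediately to buy
// time, request capacity in the background, and restore once headroom
// returns. The scaling call is a placeholder for your provider's API.

type Mode = "normal" | "degraded";
let mode: Mode = "normal";

async function requestMoreCapacity(): Promise<void> {
  // placeholder: e.g., raise an instance count via your cloud provider's API
}

function onLoadSample(checkoutsPerSec: number, capacityHeadroom: number): void {
  if (mode === "normal" && checkoutsPerSec > 1000) {
    mode = "degraded";          // load shedding / quality reduction takes over
    void requestMoreCapacity(); // let autoscaling catch up in the background
  } else if (mode === "degraded" && capacityHeadroom > 0.3) {
    mode = "normal";            // restore full functionality
  }
}
```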
Graceful degradation is all about planning for expected failures — not the big, catastrophic, disaster recovery events, but the small things that happen each day.
Wrapping up
We’ve discussed ways to identify the most critical functions of your application, observe their behavior, and respond to any small failures while delaying, disabling, or deferring less critical pieces of the workflow. During a peak event, the impact of small failures is quickly exacerbated by load and is felt most acutely by your customers, worsening their experience with your site.
In Part II of our holiday readiness series, we’ll dive deeply into application security best practices and focus on recommendations for detecting and mitigating a variety of abuse scenarios just in time for the holiday season – and beyond.
To talk more about best practices for achieving peak performance, reach out to us.
Additional resources
Retail Dive webinar: Are You Prepared to Thrive Online This Holiday Season… and Beyond?
RH-ISAC blog: We Blocked Big Bots…and our Data Doesn’t Lie