
Design Principles: Designing to avoid disruption

Created:  25 Sep 2016
Updated:  25 Sep 2016
When high-value services rely on digital delivery it becomes essential that they are always available. For the credibility of the service and the users’ convenience, the acceptable percentage of ‘down time’ is effectively zero.

1. Implement denial of service protections as far upstream as possible

Denial of service protections work best when they can leverage economies of scale for both bandwidth and computational power. Once an attack has reached your infrastructure and service, your defence is limited by your available bandwidth and how much computation you can deploy rapidly. For this reason, you should try to obtain denial of service protections at provider level.

You should also be aware of the different classes of denial of service (DoS) attack, and of what kind of protection a specific defence service will buy you. For example, a Content Delivery Network may help defend against denial of service on static content, but is less capable of defending dynamic content.

2. Limit unauthenticated user access

Most DoS attacks originate from unauthenticated users. Therefore, limit the access that unauthenticated users have to critical systems. This will reduce the attack surface available to them.

Be aware of the user journey to your service, which will have a "critical path". If this route is blocked - for example by the static site which links to the service being down - then your service is as good as unavailable.

Be especially cautious about unauthenticated requests for resources which generate a high computational load. For example, if you use federated identity services for login, you will require expensive cryptographic operations to validate the encrypted identity assertions you receive.

For cases such as this, you must ensure that your resources are able to scale to cope with high load.
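One common way to limit what unauthenticated users can consume is to rate limit them before any expensive work is done. The following is a minimal sketch of a token bucket limiter, not a definitive implementation. The class names, the use of a client key such as a source IP address, and the capacity and rate values are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class UnauthenticatedLimiter:
    """Hypothetical helper: one bucket per client key (e.g. a source IP)."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self._capacity = capacity
        self._rate = rate
        self._clock = clock
        self._buckets = {}

    def allow(self, client_key, cost=1.0):
        bucket = self._buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._rate, self._clock)
            self._buckets[client_key] = bucket
        # Charge a higher `cost` for expensive operations such as
        # validating a signed identity assertion.
        return bucket.allow(cost)
```

A request handler would call `limiter.allow(source_ip)` before starting the cryptographic validation and reject the request if it returns `False`. In this sketch the dictionary of buckets grows without bound, which is itself a resource an attacker could exhaust, so a real deployment would need to expire idle entries.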

3. Cope with exceptionally high load

Most online services experience occasional periods of extreme load as the result of service peaks, media interest or attacks. Design your service with this in mind:

  • Include the ability to disable non-essential components. Certain non-essential functionality - such as search - places a high load on servers. Identify these components and be able to disable them in order to help keep the core platform available.
  • Design your service to support automatic scaling. If the service is cloud hosted, make use of the platform's ability to scale out nodes, or consider using Platform as a Service (PaaS) functions which scale seamlessly. Be sure that every component of your architecture can scale to avoid the formation of bottlenecks.
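The ability to disable non-essential components can be sketched as a simple load-shedding switch. This is an illustrative example only: the `Component` and `LoadShedder` names, the 0.8 threshold and the idea of reporting load as a fraction of capacity are assumptions, not part of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    essential: bool


class LoadShedder:
    """Turns off non-essential components once load passes a threshold."""

    def __init__(self, components, shed_threshold=0.8):
        self._components = {c.name: c for c in components}
        self._shed_threshold = shed_threshold
        self._load = 0.0

    def report_load(self, load):
        """Record current utilisation as a fraction between 0.0 and 1.0."""
        self._load = load

    def is_enabled(self, name):
        component = self._components[name]
        if component.essential:
            # Essential components stay on regardless of load.
            return True
        return self._load < self._shed_threshold


shedder = LoadShedder(
    [Component("checkout", essential=True), Component("search", essential=False)]
)
shedder.report_load(0.95)
print(shedder.is_enabled("search"))    # False: search is shed under high load
print(shedder.is_enabled("checkout"))  # True: the core function stays available
```

The design choice here is that the decision about what counts as essential is made in advance, at design time, rather than during an incident.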

4. Identify bottlenecks, test for high load and denial of service conditions

Identify bottlenecks in your service. Examples include low capacity, legacy business logic, or an essential microservice which calls a third party service. Ensure that you have a plan in place to handle these bottlenecks during periods of high load.

Add specific tests for abnormally high load and for denial of service to your overall testing strategy. For instance, you could simulate some denial of service attacks by purposefully terminating certain microservices or infrastructure elements in your pre-production environments. There are openly available tools to help you test for high load or for failure, such as Netflix's Chaos Monkey. It's important to test how you respond to failure conditions, as well as to understand what those failure conditions could be.
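The idea of purposefully terminating instances can be illustrated with a toy failure-injection test. This sketch does not use Chaos Monkey or any real infrastructure API. `ServicePool` is a hypothetical in-memory stand-in for a set of redundant instances, and the function names are invented for this example.

```python
import random


class ServicePool:
    """Toy stand-in for a set of redundant service instances."""

    def __init__(self, instance_names):
        self._healthy = set(instance_names)

    def terminate(self, name):
        self._healthy.discard(name)

    def healthy_instances(self):
        return sorted(self._healthy)

    def handle_request(self):
        if not self._healthy:
            raise RuntimeError("no healthy instances")
        return "ok"


def terminate_random_instance(pool, rng):
    """Terminate one healthy instance at random; return its name or None."""
    candidates = pool.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    pool.terminate(victim)
    return victim


def survives_failures(pool, failures, rng):
    """Terminate `failures` instances one by one, checking after each
    termination that the pool can still serve a request."""
    for _ in range(failures):
        terminate_random_instance(pool, rng)
        try:
            pool.handle_request()
        except RuntimeError:
            return False
    return True
```

In a pre-production environment the same shape of test would terminate real instances and issue real requests. The value lies in asserting on the outcome, so that a loss of redundancy is caught by the test suite rather than during an incident.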

5. Identify where availability depends on a third party and plan for the failure of that third party

Most modern online services will place some reliance on third party services. Hosting and authentication are common examples. Ensure you understand the availability characteristics of these third party services and the impact on your service should they fail at a time of high load.

Have a plan for how you will minimise disruption if such an event occurs. For example, if your authentication provider becomes unavailable, how will you treat customers arriving at your service?
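One possible plan for a failing authentication provider is to stop calling it for a short period and show users a clear holding message instead. The sketch below uses a simple circuit breaker to do this. The class and function names, the thresholds, and the choice of `ConnectionError` to signal an outage are assumptions for illustration; `authenticate` stands in for whatever call your provider's client actually exposes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the dependency is currently treated as unavailable."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down has elapsed: allow a trial call through.
            self._opened_at = None
        try:
            result = func(*args, **kwargs)
        except ConnectionError:
            # Only connection failures count towards opening the circuit.
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def authenticate_or_degrade(breaker, authenticate, credentials):
    """Return the authenticated user, or a degraded response if the
    (hypothetical) provider is unreachable."""
    try:
        user = breaker.call(authenticate, credentials)
    except (CircuitOpenError, ConnectionError):
        return {
            "status": "degraded",
            "message": "Sign-in is temporarily unavailable. Please try again shortly.",
        }
    return {"status": "authenticated", "user": user}
```

While the circuit is open the provider is not called at all, which avoids tying up your own resources on requests that are likely to fail during a period of high load.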

6. Log enough to perform root-cause analysis

Sometimes services fail. When this happens you should have enough data logged at infrastructure and application levels to identify the cause of failure. It is important that you are able to quickly identify whether the failure was due to an attack or a bug.
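At the application level, logging each request in a structured form with a correlation identifier makes it easier to reconstruct what happened afterwards. The following sketch uses Python's standard `logging` module to emit one JSON object per line. The field names (`request_id`, `source_ip`) and the `handle_request` wrapper are assumptions for illustration, not a prescribed schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # These fields are supplied by callers through `extra=...`.
            "request_id": getattr(record, "request_id", None),
            "source_ip": getattr(record, "source_ip", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream, name="service"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, source_ip, work):
    """Run `work` and log the outcome with a per-request identifier."""
    context = {"request_id": str(uuid.uuid4()), "source_ip": source_ip}
    try:
        result = work()
    except Exception:
        # Records the full traceback alongside the request context.
        logger.exception("request failed", extra=context)
        raise
    logger.info("request completed", extra=context)
    return result
```

Recording the source address alongside the exception is one way to help distinguish a burst of failures caused by hostile traffic from failures caused by a bug that affects ordinary users.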
