Pattern: Safely Importing Data

Created:  16 Jul 2018
Updated:  16 Jul 2018
An architecture pattern for safely importing data into a system from an external source.


Computer systems rarely exist in isolation - they often need to interact with the outside world to be useful. So it's likely you'll need the ability to import information to your organisation's systems, without also importing malicious code. 

Unfortunately, the paths you use to import legitimate data and software also offer attackers a route by which to get malware into your systems. This is true for all means of import, including network connections and removable media. And, since any external system could be compromised, they should all be considered as posing a risk to your systems.

This guidance identifies a set of technical controls which can be used to manage the risks associated with importing data over a network. It is particularly relevant for systems where integrity or confidentiality are paramount, such as those which handle sensitive or personal data, classified information, valuable transactions, or those which operate industrial control systems. 


We have split this guidance into the following sections:

  1. Defending attacks over the network
  2. Avoiding the import of malicious content
  3. Recommended pattern for data import
  4. Monitoring for attempted breaches


1. Defending attacks over the network

Data is normally brought into a system using a network connection and appropriate transport protocol. This could be either a generic protocol (such as SFTP or SMTP) or a system-specific API.

Unfortunately, both network connections and transport protocols could be vulnerable to attack. A successful compromise of either could give an attacker the ability to gain control of the destination servers or network appliances. 

Nature of the attack

A network-based attack may target any of the media or host layers of the OSI stack.

Media layers (physical, data link, network)

Attacks include:

  • Crafting a malformed Ethernet frame to exploit a vulnerability in the Ethernet driver of the destination device
  • Crafting a malformed IPv4 or IPv6 header which exploits a vulnerability in the IP stack of the destination device, firmware or operating system

Host layers (transport, session, presentation, application)

Attacks include:

  • Crafting a malformed protocol header which exploits a vulnerability in a protocol handling library (such as one for TCP, UDP or SIP)
  • Crafting a message to exploit a vulnerability in the transport compression library (e.g. ZIP or GZIP), if compression is in use
  • Creating malformed messages to exploit a vulnerability in transport layer encryption (e.g. TLS), if transport encryption is in use
  • Attacking the applications or services handling application layer protocols (such as HTTP, SMTP or SFTP services)

Defensive techniques

The following controls can be used in combination to reduce the risk of a successful network-based attack:

  • Rapid patching of software and firmware in network and compute infrastructure, including both operating system and application software. Patching reduces the risk of compromise from known vulnerabilities that the vendor or community has patched, but it does not address the risks posed by an attacker with knowledge of vulnerabilities which aren't in the public domain
  • Uni-directional flow control, for example by using a data diode. This ensures that data can only flow one way through the channel. Flow control does not stop a vulnerability being exploited within the destination system, but it can make it difficult for an attacker to perform command and control, export data, or simply learn more about the sensitive services being protected. A more capable attacker may look to use alternative export paths to exfiltrate data
  • Use of a simple transfer protocol with a protocol break. A protocol break will terminate the network connection, and the application protocol. The payload will then be passed via a simplified protocol to a receiving process, which re-builds the connection and passes the data on. A well engineered protocol break will make a protocol based attack against the destination system much more difficult

Protocol breaks are normally used in conjunction with flow control enforcement, as depicted below:

Figure 1: Protocol breaks are normally used in conjunction with flow control enforcement

If properly implemented, a protocol break and flow control, used in combination, can significantly reduce the ability of an attacker to compromise a system with a network-based attack. We recommend that appropriate testing be performed on any deployments to gain confidence that the protocol break and flow control mechanisms work as designed. 
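
As an illustration of the 'simplified protocol' idea, the following is a minimal sketch of a framing format that a protocol break sender and receiver could share. The frame layout (a magic value, a length and a CRC32 checksum), the names and the size limit are all assumptions made for this example, not part of the pattern itself. The CRC only detects accidental corruption; it does not provide protection against deliberate tampering.

```python
import struct
import zlib

# Hypothetical frame layout: 4-byte magic, 4-byte big-endian payload length,
# 4-byte CRC32 of the payload, then the payload itself.
MAGIC = b"PBRK"
HEADER = struct.Struct(">4sII")
MAX_PAYLOAD = 1024 * 1024  # illustrative upper bound on payload size


def encode_frame(payload: bytes) -> bytes:
    """Sender side: wrap a payload already extracted from the original protocol."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large")
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def decode_frame(frame: bytes) -> bytes:
    """Receiver side: accept only frames that match the layout exactly."""
    if len(frame) < HEADER.size:
        raise ValueError("frame too short")
    magic, length, checksum = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if magic != MAGIC:
        raise ValueError("bad magic value")
    if length > MAX_PAYLOAD or length != len(payload):
        raise ValueError("bad length")
    if zlib.crc32(payload) != checksum:
        raise ValueError("bad checksum")
    return payload
```

Because the receiver only ever parses this fixed 12-byte header, none of the original network or application protocol reaches it, which is the property a protocol break is intended to provide.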


2. Avoiding the import of malicious content

One of the main ways attackers attempt to compromise a computer system is to try and hide malicious code within content, such as a file or data object, which is then processed on the target system.

The malicious content is designed to result in execution of the attacker's code, which can either be contained within the content or downloaded as part of the attack.  

Nature of the attack

There are a number of different ways an attacker could compromise a destination system with malicious content.

Attacks include:

  • Sending malformed compressed content with the aim of exploiting a vulnerability in decompression code 
  • Sending malformed encrypted content with the aim of exploiting a vulnerability in decryption code 
  • Sending syntactically malformed content which exploits a vulnerability in a parser used by the destination system, such as a JSON or XML parser  
  • Sending semantically malformed content which exploits a vulnerability in the way the destination system processes the data
  • Sending content which exploits a logical error in the destination system, enabling an attacker to perform a function they should not be able to perform
  • Embedding active code (e.g. scripts or macros) into content that allows it (such as PDFs or Word docs), with the aim of executing attack code on the destination system

More complex data formats are harder for developers to write and test parsers for. It is this complexity that often leads to developers accidentally introducing vulnerabilities into their software.

By way of example, consider office productivity documents, such as PDF or Microsoft Word formats. These can be made up of thousands of interrelated data objects, each with different 'type' definitions. This level of complexity provides a rich attack surface on which to search for vulnerabilities.

Defensive techniques

The following controls can be used in combination to reduce the risk of malicious content successfully compromising a destination system:

  • Rapid patching of all services and applications that are used to open or interpret content, including any library dependencies, middleware and underlying operating systems. Patching reduces the risk of compromise from known vulnerabilities, but it does not address the risks posed by an attacker with knowledge of vulnerabilities which aren't in the public domain
  • Robust engineering and testing of components which handle or process content from external sources, to identify and address as many vulnerabilities as possible during development
  • Syntactic and semantic verification of content to ensure its validity before it reaches the system which will interpret it. Verification components should be robustly implemented and placed after protocol break and diode components. Syntactic verification should ensure the structure and syntax of the object are correct (e.g. that the content is valid XML or JSON which conforms to a specified schema). Semantic verification should ensure that the meaning is valid in the context of the operation or business process being performed. Verification components should ensure all potentially active content has been removed
  • Transforming complex file formats into simple ones. For complex file formats, building a robust verification engine is not likely to be feasible since the likelihood of vulnerabilities existing in any verification functions is also high. In these circumstances transformation can be designed with the aim of neutering any malicious code present in the content. However, the transformation engine processes the complex data format, so could itself be vulnerable to attack. Transformation should, therefore, be performed before the protocol break sender component. Transformation may also remove unwanted content, such as active content (for example macros or scripts). Once verified, you may need to re-build an object in its original format so that it matches what's expected by the destination system
  • Non-persistence and sand-boxing of rendering applications. Non-persistence and sand-boxing can limit the impact of any compromise to a specific session or period of time, making it hard for an attacker to gain persistence within a network, or reducing the value of successfully executing code. Note that non-persistence and sand-boxing controls need careful design to ensure they achieve their desired aim
  • Prevent the running of active code on destination systems. For example, by disabling macros (see our guidance on Macro Security in Microsoft Office)
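
To make the distinction between syntactic and semantic verification concrete, here is a minimal sketch for a simple JSON record. The field names, the allowed currencies and the amount limit are hypothetical business rules invented for this example; a real verification engine would take them from the business process it protects.

```python
import json

# Hypothetical rules for an imaginary payment record.
EXPECTED_FIELDS = {"reference": str, "amount_pence": int, "currency": str}
ALLOWED_CURRENCIES = {"GBP", "EUR", "USD"}
MAX_AMOUNT_PENCE = 1_000_000


def verify_syntax(raw: bytes) -> dict:
    """Syntactic check: well-formed JSON with exactly the expected fields and types."""
    try:
        record = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("not valid UTF-8 JSON") from exc
    if not isinstance(record, dict) or set(record) != set(EXPECTED_FIELDS):
        raise ValueError("unexpected structure")
    for name, expected_type in EXPECTED_FIELDS.items():
        # type() rather than isinstance(), so booleans are not accepted as integers
        if type(record[name]) is not expected_type:
            raise ValueError(f"wrong type for {name}")
    return record


def verify_semantics(record: dict) -> dict:
    """Semantic check: the values make sense for the business process."""
    reference = record["reference"]
    if not (reference.isascii() and reference.isalnum() and len(reference) <= 20):
        raise ValueError("invalid reference")
    if not 0 < record["amount_pence"] <= MAX_AMOUNT_PENCE:
        raise ValueError("amount out of range")
    if record["currency"] not in ALLOWED_CURRENCIES:
        raise ValueError("currency not allowed")
    return record
```

A record is passed on only if both checks succeed, for example `verify_semantics(verify_syntax(raw))`. Anything that does not match is rejected rather than repaired.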

Dealing with nested content

Care should be taken when formats have embedded content of another format. Nested content should be un-packed, transformed if required, and verified.

In order to prevent vulnerabilities related to recursion, limits should be placed on the amount of nesting and recursion allowed.
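
One way to apply such a limit, sketched here for JSON text, is to measure the nesting depth with a simple scan before the content is handed to a full parser. The function names and the limit of 16 are assumptions chosen for illustration.

```python
import json

MAX_NESTING = 16  # illustrative limit; choose one that suits your data


def max_bracket_depth(text: str) -> int:
    """Return the deepest level of [ ] and { } nesting, ignoring brackets inside strings."""
    depth = deepest = 0
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest


def parse_with_nesting_limit(text: str, limit: int = MAX_NESTING):
    """Reject deeply nested content before the real parser ever sees it."""
    if max_bracket_depth(text) > limit:
        raise ValueError("content is nested too deeply")
    return json.loads(text)
```

The scan is a single loop with no recursion of its own, so the check cannot itself be exhausted by deeply nested input.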

For systems which support processing of multiple types of content, there is the possibility of one format being presented as another in a bid to trick your verification engines. To prevent this, content format should be verified robustly and consistently at each step of a system architecture.
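
As a first, inexpensive step towards this, a component can check that content begins with the signature of the format it claims to be. The sketch below is illustrative only: the set of supported formats and the function name are assumptions, and a matching signature does not show that the rest of the content is valid, so full verification is still required.

```python
# Leading byte signatures for a few common formats.
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
}


def check_declared_format(content: bytes, declared: str) -> None:
    """Raise ValueError unless the content starts with the declared format's signature."""
    signature = SIGNATURES.get(declared)
    if signature is None:
        raise ValueError("format not supported by this gateway")
    if not content.startswith(signature):
        raise ValueError("content does not match its declared format")
```

Running the same check at each step, rather than trusting a label attached earlier in the flow, helps to ensure every component agrees about what it is handling.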


3. Recommended pattern for data import

The controls described in the previous sections, such as transformation, verification, and flow control, can be combined to create a pattern for data import. Our recommended approach to this is depicted below: 

Figure 2: Recommended pattern for data import

Ordering of the components

The components in the pattern are ordered deliberately to provide optimum protection. Notably, transformation is performed prior to the data passing through the protocol break and flow control, with verification as the final step.

We consider the Flow Control to mark the boundary between the less trusted 'low side' of the gateway and the more trusted 'high side'. Transformation should be a low side activity because it is inherently risky to parse and process untrusted content. We assume that the transformation engine could be compromised, but we aim to detect and mitigate compromise when it occurs.

Even if an attacker gained total control of the transformation engine, there would still be several more hurdles for them to overcome in order to have an impact on the destination system.

A well designed import gateway will have the verification engine performing a relatively simple job in comparison to the transformation engine - it ensures the data provided by the transformation engine is syntactically and semantically as expected.

Unlike the transformation engine, which potentially has to be able to handle a wide variety of formats from multiple sources, the verification engine may only need to verify content in just one simple format. Given you have control of the format that data is transformed into, you can configure the verification engine to strictly ensure that data received is as expected.
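
The sketch below shows how controlling the intermediate format keeps the verifier small. The line-based format, the field names and the limits are all hypothetical, invented for this example. The transformation function stands in for a much more complex low side component; the verifier accepts only an exact match and nothing else.

```python
import re

# Hypothetical intermediate format: one record per line, "ID,QUANTITY", where ID
# is 1-12 upper-case letters or digits and QUANTITY is 1-6 digits.
LINE = re.compile(r"[A-Z0-9]{1,12},[0-9]{1,6}")
MAX_LINES = 1000  # illustrative limit


def transform(records: list) -> bytes:
    """Low side: flatten records from a richer source format into the simple one."""
    lines = [f"{str(r['id']).upper()},{int(r['quantity'])}" for r in records]
    return ("\n".join(lines) + "\n").encode("ascii")


def verify(data: bytes) -> list:
    """High side: accept only data that matches the intermediate format exactly."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError as exc:
        raise ValueError("non-ASCII content") from exc
    if not text.endswith("\n"):
        raise ValueError("missing final newline")
    lines = text[:-1].split("\n")
    if len(lines) > MAX_LINES:
        raise ValueError("too many records")
    verified = []
    for line in lines:
        if LINE.fullmatch(line) is None:
            raise ValueError("malformed record")
        ident, quantity = line.split(",")
        verified.append((ident, int(quantity)))
    return verified
```

Note that the verifier does not trust the transformation step. If `transform` emitted a negative quantity, for instance, `verify` would reject the resulting line.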

After successful verification, the data can be passed on to the destination, potentially being transformed back into the original (or a different) format for the recipient.

Removing components

Not all components in the pattern are always needed. For example, transformation may not be needed where the content format is simple enough to be verified directly. Equally, the level of verification required can be traded against the level of confidence you have in the robustness of the destination system, and the impact of compromise of that system.

Important architectural considerations

The following security considerations relate to the architecture as a whole:

  • The approach to management and administration should not undermine the security of the gateway. Specifically, the low side components should be managed separately from those on the high side, such that a compromise of any low side component could not result in a bypass of the gateway
  • Non-persistence and sand-boxing should be used to limit the impact of compromise. Any of the components that process data, such as the transformation engine, verification engine and destination system may benefit from one or both of these techniques. They will not stop a component from being compromised, but can limit the impact to a specific process, and make it hard for an attacker to maintain an ongoing presence
  • Steps should be in series. It is important that network design ensures the steps in the pattern cannot be bypassed, as this would remove the security provided by the end-to-end solution

4. Monitoring for attempted breaches 

Protective monitoring could play a part in every step of the gateway. However, the most important functions to monitor for attempted breaches are the verification engine and the destination system.

Monitoring the verification engine and the destination system

The verification engine is the key security enforcing component within the gateway, so should be closely monitored. For import flows that use the transformation technique, if the transformation engine is working correctly there should never be a verification failure. If there is a verification failure, this suggests the transformation component has malfunctioned or been compromised. Consequently, all errors in the verification engine should be raised as alerts for an operator to act on. 
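
A minimal sketch of this 'every failure is an alert' behaviour is shown below. The logger name and the `verify` and `forward` callables are placeholders assumed for the example; in practice the alert would be routed to whatever monitoring platform is in use.

```python
import logging

alerts = logging.getLogger("gateway.verification")  # hypothetical logger name


def verify_and_forward(item: bytes, verify, forward) -> bool:
    """Forward an item only if verification succeeds; raise an alert on any failure."""
    try:
        verified = verify(item)
    except Exception as exc:
        # Any failure, expected or not, may indicate that an upstream component
        # has malfunctioned or been compromised, so it is never silently dropped.
        # Only the exception type is logged, to keep untrusted content out of the logs.
        alerts.error("verification failure: %s", type(exc).__name__)
        return False
    forward(verified)
    return True
```

Catching every exception type here is deliberate: an unexpected crash inside the verifier is at least as interesting to an operator as a clean rejection.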

The destination system in our pattern is the prize that an attacker is trying to gain access to. It should be monitored for signs of compromise, paying special attention to any components and processes which handle content received from an external source. Depending on the integrity requirements of the destination system, it may be advisable to separate components which process external content from other content, allowing monitoring systems to be tuned to pay more attention to riskier activities. 

Monitoring the remaining components

For higher risk deployments there is value in applying protective monitoring to the remaining components of the gateway. Below are some tips on what to monitor for in each:

External network connection

For import gateways which only accept connections from specific source systems, strict rules can be applied to restrict the connection to allowed sources and alert on attempted connections from other sources.
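
The logic of such a rule can be sketched as follows. The address ranges are reserved documentation ranges used as placeholders, and the logger name is an assumption. In a real deployment this would normally be enforced by a firewall or similar network control rather than in application code.

```python
import ipaddress
import logging

log = logging.getLogger("gateway.external")  # hypothetical logger name

# Placeholder allowed sources, taken from the reserved documentation ranges.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("2001:db8::/64"),
]


def is_allowed_source(address: str) -> bool:
    """Return True for an allowed source; otherwise log an alert and return False."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        log.warning("connection attempt from unparseable address %r", address)
        return False
    if any(ip in network for network in ALLOWED_SOURCES):
        return True
    log.warning("connection attempt from unexpected source %s", ip)
    return False
```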

For gateways that accept connections from many different sources, a 'known good' approach to approving acceptable sources may not be possible, and a 'known bad' approach may be the only option.

Transformation engine

Since the transformation engine must process complex and unfiltered information, it's good practice to assume it will be a relatively easy target to compromise. Whilst our pattern is designed to ensure that does not lead to a compromise of the verification engine or destination system, we do wish to know when the transformation engine is compromised and in need of remediation.

Monitoring of the transformation engine for compromise could include monitoring of attempts to establish outbound network connections or system changes. It could also look for uncharacteristic behaviour on the underlying server, virtual machine, or container that is providing the transformation functionality. Uncharacteristic behaviour could include the crashing of processes involved in processing external data or calls to libraries or binaries which are not normally accessed.

Protocol break and flow control

If an optical flow control device is in use, the protocol break receiver should monitor for any communication drop-out and check the integrity of its connection with the flow control device. Any other type of error in the receiver component may indicate a compromise of the sender, so should be immediately reported and acted on.

Internal network

In a closed or isolated network, where communication flows are predictable and well understood, it should be possible to implement a 'known good' approach to network monitoring where unusual network communications result in alerts being raised and acted upon. If this is not possible for the whole internal network then particular focus can be applied to unexpected connection attempts to or from the verification engine.

Combining logs from the various components 

To simplify monitoring of the gateway and gain the best possible understanding of the operational security of the gateway, logs or alerts from the various components can be combined and correlated into an appropriate analysis platform.

To avoid the potential for any of these monitoring flows to become a bypass for components in the gateway, we recommend that logs or alerts from components be validated for correctness using the techniques in this guidance before being analysed - otherwise an attacker could generate malformed logs in an attempt to breach the destination system.
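
As a sketch of what validating a log record could look like, the example below accepts only lines in one strict, simple format. The format itself, the component names and the length limit are assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical log line format: "<UTC timestamp> <component> <level> <message>",
# with the message limited to 1-200 printable ASCII characters.
LOG_LINE = re.compile(
    r"(?P<time>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z) "
    r"(?P<component>transform|sender|receiver|verify) "
    r"(?P<level>INFO|WARN|ERROR) "
    r"(?P<message>[ -~]{1,200})"
)


def parse_log_line(line: str) -> dict:
    """Return the fields of a well-formed log line, or raise ValueError."""
    match = LOG_LINE.fullmatch(line)
    if match is None:
        raise ValueError("malformed log line")
    record = match.groupdict()
    # Semantic check: the timestamp must be a real date and time.
    datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%SZ")
    return record
```

Because the message field excludes control characters, a compromised component cannot use an embedded newline to forge additional log entries.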

Logs or alerts from the external, 'low side' components (in Figure 2: the transformation engine(s), external network and protocol break sender) should pass through an appropriate protocol break and flow control before being passed into the destination system for monitoring.

For further information on log collection, please see 'Introduction to logging for security purposes'.



This pattern has been developed in the field. It's not a guarantee of safety but if implemented well it can provide a strong level of defence against attack, and can be used in a variety of different systems.

As always with introducing security controls, it is important to understand the end-to-end system, not just in terms of technology but also people and processes. The processes of transformation and verification can cause a modified user experience and so care should be taken to test how well the system works for users before rolling it out. 
