Hopefully you've arrived here having read the other blogs in the series. If not, you're welcome to back up a bit and start with part one. I'll wait for you.
Connecting with the past
First off, I want to talk about networking as it used to be. So let's wind back the clock 12 years to when my job title in a different organisation was ‘network engineer’. Back then, what mattered to me was making sure that packets of data got from A to B in the shortest possible time. And clouds, they were just those fluffy things that produced rain.
Our switches were Layer 2 devices, our routers were Layer 3 devices, and I could tell you the exact path a packet would take through the network, because it was ours. In those days, we would take it as a personal slight when the laws of physics made our round trips take more than a couple of milliseconds.
Security? That was someone else’s problem, and a mystical art compartmentalised in a small team elsewhere. All we knew was, being forced to route our heavily optimised network streams through remote firewall clusters added latency to our precious traffic, causing our database admins to cry.
Today, things are very different. Clouds have become digital, and in doing so have radically changed how we think about network topologies. What was once a 48U rack full of firewalls and routers is now two simple lines in a JSON template, and the only physical networking of consequence is whatever you need to provide the fastest Internet line possible for your clients.
The walled garden
A few years back, CESG (as we were then known) published an approach to the design of networks for remote working. This pattern was boldly named a 'Walled Garden for Remote Access' and if you squint a bit, you can see the parallels in protecting your petunias from the neighbourhood cats using bricks and mortar.
Years later, the Walled Garden still provides a sensible and logical approach to connecting trusted networks to less trusted ones in a classic on-premises setup. But it doesn’t go far enough to describe exactly how this would work in a true cloud environment. Especially in this brave new world where routers, switches, and firewalls all look like the same thing.
We can, however, dig up some key principles from the Walled Garden, and bring them with us:
- We have services and systems which must remain inaccessible to unauthorised users.
- We must secure the in-transit communication of our clients and remote sites across untrusted networks (hello Internet!).
- We need a trusted location in which to authenticate and verify our clients.
- We need to constrain what authenticated users can access to the minimum required.
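The last two principles boil down to deny-by-default, least-privilege access decisions. As a minimal sketch (the roles and services here are illustrative, not our actual configuration), that looks something like:

```python
# Illustrative only: map each authenticated role to the ONLY services
# it may reach. Anything not explicitly listed is denied.
PERMITTED_SERVICES = {
    "remote-worker": {"email", "intranet"},
    "admin": {"email", "intranet", "monitoring"},
}

def is_access_permitted(role: str, service: str) -> bool:
    """Deny by default: allow only if the service is explicitly listed."""
    return service in PERMITTED_SERVICES.get(role, set())
```

An unknown role, or a service missing from the allow list, simply falls through to a denial; there is no 'allow everything else' branch to get wrong.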
The NCSC approach
We were fortunate enough to start building our IT system with - from a technology perspective - a mostly blank canvas. Having made the decision to be cloud-first, we took a good look at our mix of modern and legacy applications and systems.
Some of these applications were using true PaaS or SaaS but, like many organisations, we also had some applications that needed to be deployed on IaaS. When it came to building the network, this left us with three options.
- Make exclusive use of the PaaS network features provided by our cloud vendor.
- Make exclusive use of virtual appliances provided by third party vendors.
- Use some combination of the two.
(Spoiler alert, we ended up with option 3.)
Within the cloud
Just because we're operating in a cloud environment doesn’t mean we should suddenly stop caring about data flows within our infrastructure.
When deploying our infrastructure, we made use of the tiered delegation model developed by Microsoft. The cloud is an awesome place to build this sort of model, because we can hide away infrastructure that we really care about behind security groups, making sure that things like PKI and directory services are enforced properly.
No need for those previously mentioned stacks of expensive firewall clusters. If, for example, a privileged user modifies a security group (or the state of a VM changes unexpectedly), we can immediately generate an event. This could, for example, restore the original state of the security group, or send an alert to a set of people.
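The shape of that event-driven remediation can be sketched as follows. This is not our actual tooling (the group IDs, rules, and handler are all hypothetical), just an illustration of comparing a reported change against a recorded baseline, reverting drift, and raising an alert:

```python
# Hypothetical baseline: the rules each security group SHOULD have.
BASELINE = {
    "sg-tier0": {"allow": [("10.0.0.0/24", 636)]},  # e.g. LDAPS only
}

alerts = []  # stand-in for a real alerting channel

def handle_change_event(group_id: str, reported_rules: dict) -> dict:
    """Compare the reported state against the baseline; revert on drift."""
    expected = BASELINE[group_id]
    if reported_rules != expected:
        alerts.append(f"{group_id} modified unexpectedly; reverting")
        return expected          # restore the original state
    return reported_rules        # no drift, nothing to do
```

In a real deployment the event would arrive from the platform's audit/monitoring service and the revert would be an API call, but the decision logic is this simple.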
We separated out our services, permitting those in the same ‘tier’ of importance to reside together. Those things that are categorised as T0 have much more stringent controls and permitted network flows than other tiers. We even went as far as hosting our T0 services in a different account entirely. This meant we could more easily control potentially nefarious things such as downloading the virtual disk of a Domain Controller.
Remote access VPNs
We needed a VPN infrastructure which would play nicely with the PRIME spec capabilities of our client devices. (The approach is detailed in our End User Device Guidance, but basically involves IKEv2 with ECC certificates).
Initially, it looked like we could get most of the way with PaaS features (for example the features offered as part of AWS’s VPN service or Azure’s VPN Gateway service). But, after a large amount of testing and 'road-warrioring', we found that these services didn’t yet provide what we needed in a client-to-site capacity.
For example, one feature we needed was the ability to tweak IKE timeouts and Dead Peer Detection values based on client type, and this proved tricky to do. So, we chose to deploy VPN gateway appliances to deal with client-to-site traffic.
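To make that concrete, here's a hedged sketch of the kind of per-client tuning we wanted. The client types and timer values below are illustrative, not our production settings:

```python
# Illustrative defaults and per-client-type overrides for IKE tuning.
DEFAULTS = {"ike_lifetime_s": 86400, "dpd_interval_s": 30, "dpd_retries": 5}

CLIENT_OVERRIDES = {
    # Mobile clients roam between networks, so detect dead peers faster.
    "mobile": {"dpd_interval_s": 10, "dpd_retries": 3},
    # Desktops on stable links can tolerate longer intervals.
    "desktop": {"dpd_interval_s": 60},
}

def ike_params(client_type: str) -> dict:
    """Return IKE/DPD parameters for a client type, falling back to defaults."""
    params = dict(DEFAULTS)
    params.update(CLIENT_OVERRIDES.get(client_type, {}))
    return params
```

The managed PaaS offerings we tested exposed a single, gateway-wide set of timers, which is exactly why this per-client flexibility pushed us towards appliances.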
Our choice of product was based on a combination of requirements, both technical and practical. We had technical specifications that needed to be met, but we also needed something our in-house support teams were familiar with. In the end we selected Cisco CSRs, the cloud-based equivalent of the Foundation Grade-approved Cisco ASR, which we have lots of experience with.
I'd urge anyone facing this problem to make an independent decision based on your own needs, and also to keep an eye on developments in the PaaS-space. We'll also revisit the decision we made from time-to-time to ensure it remains the right choice for us.
The CSRs are nested in the outermost ‘tier’ of our infrastructure, and are highly available across multiple regions of our cloud provider. They function as you would hope: authenticating clients, validating certificate chains, performing health attestation checks, and terminating connections inside our more trusted infrastructure for onward routing.
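The admission flow those checks describe can be sketched in a few lines. The helper below is a hypothetical simplification (the real gateways do all of this with IKEv2 and certificate machinery), but it shows the ordering: a client is only routed onward once every check passes:

```python
# Simplified, illustrative admission decision for a connecting client.
def admit_client(cert_chain_valid: bool, health_attested: bool,
                 authenticated: bool) -> str:
    if not authenticated:
        return "reject: authentication failed"
    if not cert_chain_valid:
        return "reject: untrusted certificate chain"
    if not health_attested:
        return "reject: device failed health attestation"
    return "admit: route onward into trusted infrastructure"
```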
The Virtual Machines they run on are totally defined in code, and the whole system can be cloned in a programmatic fashion, allowing us to recreate the entire setup with a few mouse clicks, or a quick command to an API. This is particularly useful because we can maintain an identical reference environment where we can test changes, patches and feature upgrades, without the need to have it running 24/7, consuming compute cycles.
As noted above, standard platform services hadn't quite provided what we needed for our client-to-site VPN. But for true site-to-site connectivity, we found them to be perfectly adequate.
Although we still need to make sure that our traffic in transit is properly protected using PRIME and our remote endpoints are properly authenticated, the simple fact that remote sites don’t move around made a massive difference.
We did not see the need for many of the heavy, value-added features you get with third-party appliances.
Using the platform to terminate these connections, we also get scalability and high availability for free. Each remote site is keyed in a traditional fashion: the remote device is configured as per the PRIME spec, using standard methods, with a valid EC certificate assigned from our internal PKI infrastructure.
We have Virtual Private Gateways (VPGs) defined with a list of permitted remote sites that can connect. And from there we just describe all the remote subnets, endpoint IP addresses, etc. To us, the VPG is a VPN concentrator, load balancer, router and firewall all in one. Configuration information is defined as a few lines of JSON, and checked into a GitHub repository to track changes.
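As a flavour of what 'a few lines of JSON' means here, the sketch below builds a site description programmatically. The field names are illustrative, not the cloud provider's actual schema:

```python
import json

def vpg_site_config(site_name, endpoint_ip, remote_subnets):
    """Describe one permitted remote site as a small JSON document."""
    return json.dumps({
        "site": site_name,
        "endpoint": endpoint_ip,           # remote device's public IP
        "remoteSubnets": remote_subnets,   # networks reachable via this site
        "authentication": "certificate",   # EC cert from internal PKI
    }, indent=2)
```

Because the whole definition is plain text, checking it into version control gives a full change history of the network for free.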
The future will be different
It's been 5 months since we deployed, and already we can see areas to tweak, upgrade and generally version up. But, of course, all of this is specific to our organisation. For us, this system works. But, just like everyone else, we had to find the correct balance between risk, cost and business needs. This system was born from that decision-making process.
Operating in this way makes network administration much more like developing software than running physical infrastructure. Changing out that 48U rack of firewalls, switches and routers is suddenly indescribably cheaper, and we get the latest awesome features too. Oh, and to top it all off, we've cheered up those database admins no end!
As always, suggestions and comments are welcome in the section below.