
AWS re:Invent 2018: Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small – ARC337

November 27th, 2018

Colm MacCarthaigh – Sr Principal Engineer, EC2 Networking, AWS

One of the undeniable aspects of AWS is its scale. We can think of this scale from two perspectives. From a customer perspective, AWS offers so many services in so many regions that you can build some amazing global applications at scale on top of the AWS cloud. The other perspective is from the AWS side, as a huge cloud operator at scale. The AWS cloud shouldn’t be seen as just a (long) list of separate services tied together, but should rather be looked at from the bottom up as a massive distributed system. I sometimes explain the AWS cloud as a distributed operating system to help people understand how tightly bound the services are on a common scalable platform. The AWS “OS” has many things like networking, storage, compute and security services, just as Windows, Linux or macOS does, but massively more distributed. AWS CTO Werner Vogels is one of the world’s distributed systems experts. The AWS cloud is a system designed for scale.

This session was a rare opportunity to hear from AWS itself how it runs its cloud. It was all about the AWS control plane and how it is the key to the success of the system. Distributing configurations and customer settings, launching instances, or responding to surges in load are all things that this stable and scalable control plane needs to do.

Colm delved into some of the designs behind, and operational lessons learned from, running some of the most reliable systems at AWS. Control planes are often a bigger design challenge than the data planes they support.

How do we build simple and stable control systems?

  • Diverse creative minds working in a fearless environment
  • Systemic reviews and mechanisms to share lessons
  • Use well-worn patterns where possible and focus invention where it is needed
  • Testing, testing, testing, testing

Make trade-offs in order

  1. Security
  2. Durability
  3. Availability
  4. Speed

Control Theory

Colm went into an explanation of academic control theory, distilled down for us mere mortals. It was independently discovered and formalised in the early 20th century. Colm says it’s one of the most under-appreciated branches of science, and one that is incredibly relevant to distributed systems.

What makes a stable control system?

  • you need some kind of measurement process; for auto scaling this could be CPU utilisation above 70%
  • some kind of central controller, which sees that you are not in the desired state and decides to act
  • an actuator, which is what actually does things, such as launching new EC2 instances

The system does all of this in a loop: measuring, deciding and actuating.
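The measure/decide/actuate loop can be sketched as a toy autoscaler. This is a minimal illustration of the loop's shape, not how AWS implements it; the metric source, 70% threshold and scaling action are hypothetical stand-ins:

```python
import random

def measure():
    # Hypothetical metric source: average fleet CPU utilisation (0-100).
    return random.uniform(40, 90)

def decide(cpu, target=70):
    # Central controller: compare observed state against desired state.
    return "scale_out" if cpu > target else "steady"

def actuate(decision, fleet_size):
    # Actuator: in real life this would launch new EC2 instances.
    return fleet_size + 1 if decision == "scale_out" else fleet_size

def control_loop(iterations, fleet_size=2):
    # The whole system: measure, decide, actuate, repeat.
    for _ in range(iterations):
        fleet_size = actuate(decide(measure()), fleet_size)
    return fleet_size
```

A real controller would also scale back in and bound how fast it acts; the sketch only shows the closed loop.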

Colm’s point was that the only kind of stable control system is proportional–integral–derivative (PID): one that reacts to the size of the error, its accumulation over time, and its rate of change.
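A textbook discrete PID controller looks something like the sketch below; the gains are illustrative, not tuned for any real system:

```python
class PID:
    """Minimal discrete PID controller sketch (illustrative gains only)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # desired value of the measurement
        self.integral = 0.0           # accumulated error over time
        self.prev_error = None        # last error, for the derivative term

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # Control output combines the proportional, integral and derivative terms.
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

With only the proportional term the controller reacts to the current error; the integral term corrects persistent offsets, and the derivative term damps oscillation.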

10 patterns for controlling the cloud

  1. Checksum all the things, multiple layers…everywhere
  2. Cryptographic Authentication – you don’t want your control plane compromised. Be able to revoke and rotate every credential, prevent human access to production credentials, and never allow a non-production control plane to talk to the production data plane.
  3. Cells – divide the control plane horizontally into regions, AZs and cells. Shells – compartmentalise the control plane so the data plane is insulated from control plane failures. Poison Tasters – check up front that a change is safe.
  4. Asynchronous Coupling – synchronous systems are too strongly coupled; a downstream dependency can impact the upstream callers.
  5. Closed Feedback Loops, you need to measure.
  6. Small pushes and large pulls – you don’t want big systems connecting to small fleets; rather, small fleets should connect to big fleets and pull.
  7. Avoiding cold starts and cold caches – caches are bi-modal systems: super fast when they have entries, and slow when they are empty. A thundering herd hitting a cold cache can prevent it from ever getting warm; retry storms need to be moderated by throttles.
  8. Throttles – needed with rate-limits to moderate problem requesters and to dampen fluctuating systems. You need to work carefully to ensure throttling does not impact the end-customer experience.
  9. Deltas – what happens when we have too much state to push around, as with S3 metadata? It is more efficient to compute deltas and distribute patches. Perhaps add a version column to your table.
  10. Modality and Constant Work – what if a lot of things change at the same time? You don’t want to build up backlogs and queues, introducing lag. Minimise the number of possible states. Systems that change performance in response to workload or data patterns can be fragile. Using a relational database for a control plane is a very bad idea, in fact pretty much banned at AWS; with non-relational DBs, always do full scans. Maybe push a full file every 10 seconds, whether it changed or not; that needs no queue or any other reconciliation. It’s very reliable and robust, and you don’t need to worry about deltas.
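The throttles in pattern 8 are often built as token buckets. The sketch below is a minimal, illustrative version; the rate and capacity are made-up parameters, not anything AWS-specific:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle sketch (illustrative parameters)."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second (steady rate cap)
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Giving each requester its own bucket absorbs bursts up to `capacity` while capping the long-run rate, which is how a throttle can single out problem requesters.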
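The constant-work idea in pattern 10 can be sketched as a loop that publishes the full state on a fixed cadence, whether or not anything changed. The `publish` callable and the interval are stand-ins; in practice this might be a file pushed to S3 every 10 seconds:

```python
import time

def constant_work_loop(get_state, publish, interval=10.0, iterations=None):
    """Push the FULL state on a fixed cadence, changed or not.

    `publish` is an injected callable (e.g. write a file, PUT an object)
    so this sketch stays self-contained. With iterations=None it runs
    forever, like a real control-plane publisher would.
    """
    count = 0
    while iterations is None or count < iterations:
        # No deltas, no queue, no reconciliation: always ship everything,
        # so the system does the same amount of work in every state.
        publish(get_state())
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
    return count
```

Because consumers always receive the complete state, a lost or corrupted push is healed by the very next cycle.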


Key takeaways
  1. Closing loops is critical, measure the progress
  2. Loose asynchronous coupling helps
  3. Think about the modalities of the system
  4. Lessons baked into API Gateway and Lambda

An interesting session, well presented. It makes you think about what kind of controls you need to put into your systems, and it also highlights using native AWS services, where all of this has already been thought of!
