
AWS re:Invent 2018: Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small – ARC337

November 27th, 2018

Colm MacCarthaigh – Sr Principal Engineer, EC2 Networking, AWS

One of the undeniable aspects of AWS is its scale. We can think of this scale from two perspectives. From a customer perspective, AWS offers so many services in so many regions that you can build some amazing global applications at scale on top of the AWS cloud. The other perspective is AWS's own, as a huge cloud operator at scale. The AWS cloud shouldn't be seen as just a (long) list of separate services tied together; it should rather be looked at from the bottom up as one massive distributed system. I sometimes describe the AWS cloud as a distributed operating system to help people understand how tightly bound the services are on a common scalable platform. The AWS "OS" has many things like networking, storage, compute and security services, just as Windows, Linux or Mac do, but massively more distributed. AWS CTO Werner Vogels is one of the world's distributed systems experts. The AWS cloud is a system designed for scale.

This session was a rare opportunity to hear from AWS itself how it runs its cloud. It was all about the AWS control plane and how it is the key to the success of the system. Distributing configurations and customer settings, launching instances, or responding to surges in load are all things that this stable and scalable control plane needs to do.

Colm delved into some of the designs and shared operational lessons learned from running some of the most reliable systems at AWS. Control planes are often a bigger design challenge than the data planes they support.

How do we build simple and stable control systems?

  • Diverse creative minds working in a fearless environment
  • Systemic reviews and mechanisms to share lessons
  • Use well-worn patterns where possible and focus invention where it is needed
  • Testing, testing, testing, testing

Make trade-offs in order

  1. Security
  2. Durability
  3. Availability
  4. Speed

Control Theory

Colm went into an explanation of academic control theory distilled down for us mere mortals. It was independently discovered and formalised in the early 20th century. Colm says it's one of the most under-appreciated branches of science, and one that is incredibly relevant to distributed systems.

What makes a stable control system?

  • some kind of measurement process – for auto scaling this could be average CPU above 70%
  • some kind of central controller that sees you are not in the desired state and decides to act
  • an actuator, which is what actually does things – for example, launching new EC2 instances

The controller does all of this in a loop: measuring, deciding and actuating.
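As a minimal sketch (not AWS's actual implementation), the loop might look like this in Python. `get_average_cpu` and `launch_instance` are hypothetical placeholders standing in for the measurement process and the actuator:

```python
import time

TARGET_CPU = 70.0     # desired state: keep average CPU at or below 70%
CHECK_INTERVAL = 60   # seconds between control-loop iterations

def control_loop(get_average_cpu, launch_instance):
    """Measure, decide, actuate -- forever."""
    while True:
        cpu = get_average_cpu()       # measure: observe the current state
        if cpu > TARGET_CPU:          # decide: compare against the desired state
            launch_instance()         # actuate: change the world
        time.sleep(CHECK_INTERVAL)    # then go around the loop again
```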

Colm noted that the only kind of stable control system is a proportional–integral–derivative (PID) controller.
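For intuition: a PID controller computes its output from the present error (proportional), the accumulated past error (integral) and the rate of change of the error (derivative). A minimal, illustrative Python sketch follows; the gains and setpoint are made-up values for illustration, not anything AWS uses:

```python
class PIDController:
    """Minimal PID controller: output combines the current error (P),
    the accumulated error (I) and the rate of change of error (D)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Illustrative use: steer a fleet toward 70% average CPU.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=70.0)
output = pid.update(measurement=85.0, dt=60.0)
# Output is negative here (CPU above target); its sign and magnitude
# drive the actuator, e.g. how much capacity to add or remove.
```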

10 patterns for controlling the cloud

  1. Checksum all the things – multiple layers…everywhere.
  2. Cryptographic authentication – you don't want your control plane compromised. Be able to revoke and rotate every credential, prevent human access to production credentials, and never allow a non-production control plane to talk to the production data plane.
  3. Cells, shells and poison tasters – cells: divide the control plane horizontally into regions, AZs and cells. Shells: compartmentalise the control plane so the data plane is insulated from control plane failures. Poison tasters: check up front that a change is safe.
  4. Asynchronous coupling – synchronous systems are too strongly coupled; a downstream dependency can impact the upstream callers.
  5. Closed feedback loops – you need to measure.
  6. Small pushes and large pulls – you don't want big systems connecting to small fleets; rather, small fleets connecting to big fleets.
  7. Avoiding cold starts and cold caches – caches are bi-modal systems: super fast when they have entries, slow when they are empty. A thundering herd hitting a cold cache can prevent it from ever getting warm; retry storms need to be moderated by throttles.
  8. Throttles – needed with rate limits to moderate problem requesters and to dampen fluctuating systems. You need to work carefully to ensure throttling does not impact the end-customer experience.
  9. Deltas – what happens when there is too much state to push around, say like S3 metadata? It is more efficient to compute deltas and distribute patches. Perhaps add a version column to your table.
  10. Modality and constant work – what if a lot of things change at the same time? You don't want to build up backlogs and queues, introducing lag. Minimise the number of possible states. Systems that change performance in response to workload or data patterns can be fragile. Using a relational database for a control plane is a very bad idea, in fact it's pretty much banned at AWS; with non-relational DBs, always do full scans. Maybe push a file every 10 seconds, whether it changed or not: there is no queue or other reconciliation, no deltas to worry about, and it's very reliable and robust (see the sketch after this list).
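To make the constant-work idea from point 10 concrete, here is a minimal Python sketch. The file path, interval and helper functions are hypothetical; the point is that the publisher writes the full state on every tick, changed or not, so the system does the same amount of work whether nothing changed or everything did:

```python
import json
import os
import time

CONFIG_PATH = "/var/run/controller/config.json"  # hypothetical location
PUSH_INTERVAL = 10                               # seconds, as in the talk

def push_full_state(read_desired_state):
    """Publisher: write the complete state every interval, changed or not.
    The work done per tick is constant, so behaviour never shifts under load."""
    while True:
        state = read_desired_state()             # always the full state, never a delta
        tmp = CONFIG_PATH + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CONFIG_PATH)             # atomic swap: readers never see a partial file
        time.sleep(PUSH_INTERVAL)

def apply_full_state(apply):
    """Consumer: read and apply the whole file every interval.
    No queue, no backlog, no reconciliation of deltas."""
    while True:
        with open(CONFIG_PATH) as f:
            apply(json.load(f))
        time.sleep(PUSH_INTERVAL)
```

Because both sides do the same work on every cycle, a burst of changes cannot create a backlog: the next tick simply carries the newest full state.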

Summary

  1. Closing loops is critical: measure the progress
  2. Loose asynchronous coupling helps
  3. Think about the modalities of the system
  4. These lessons are baked into services such as API Gateway and Lambda

An interesting session, well presented. It makes you think about what kind of controls you need to put into your own systems, and it also highlights the value of using native AWS services where all of this has already been thought through!
