AWS re:Invent 2017: How AWS Runs Our Weekly Operations Meetings

David Lubell and Kevin Miller from AWS

I was really looking forward to this session, as this is the very first time AWS has pulled back the curtain on how it actually runs its own operations.

I was there two hours in advance to guarantee a place, as they only had a room for 60!

David started off by mentioning that AWS has held a weekly operational meeting for more than 10 years. It runs for 2 hours every week. The meeting reviews the performance of services with the idea of nipping issues in the bud, and it is also forward looking, identifying new best practices.

David wanted to share some lessons learned from what he termed “the trenches” with one of the largest services in the world.

In every meeting, leaders for every AWS service together with more than 100 engineers deep dive into their operations. The reason for having so many people is to have immediate ownership of things and be able to more quickly respond across the whole organisation.

David went through the ways they’ve developed to run such a large meeting effectively. It’s not just about the tech they’re looking at but also about how you run a meeting that reviews it well, and how you feed the metrics you see back to the teams to stop issues recurring. AWS understandably has a very high bar for operational performance.

In the meeting, they share successes, look at organisation projects, review operational events, do service metrics reviews and cover other updates and announcements. It’s not just about metrics; it is an ops information sharing forum, which sends the right message that operations really matters. It helps to build a community, spread technical best practices and provide an accountable audit mechanism. They try to avoid spending too much time on things specific to each team.

Each team has a standard set of metrics (responsiveness etc.) and also metrics for what’s special about its service. Customer experience is always important, so they measure, for example, how long it takes an EC2 instance to get fully up and running, as well as how long the console takes to display updates.

The Wheel

There is a problem though: as the services have grown, they have too many teams to review.

One way they solve this is using “The Wheel”, which is the tool they use to show real-time metrics. Basically they can pick any service and very easily deep dive into how it’s performing.

They basically spin a wheel (it used to be made of cardboard, now of course it’s online) and the team selected gets to present their metrics. They try to give each team 15 minutes in a meeting, so that’s 6-7 teams every week alongside the other agenda items.

They moved on to version 3 of “The Wheel” earlier this year, adding weighting to the landing so that each team gets a chance to present over a number of weeks.
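The session didn’t show how the weighting works, so the following is only a minimal sketch of one way a weighted wheel could behave. The `Wheel` class, its `spin` method and the "weight grows each week a team is not picked" rule are my own assumptions for illustration, not AWS’s implementation.

```python
import random


class Wheel:
    """Hypothetical weighted wheel: teams not picked recently become more likely."""

    def __init__(self, teams, rng=None):
        # Every team starts with the same weight (assumption).
        self.weights = {team: 1 for team in teams}
        self.rng = rng or random.Random()

    def spin(self):
        teams = list(self.weights)
        # Pick one team with probability proportional to its current weight.
        chosen = self.rng.choices(
            teams, weights=[self.weights[t] for t in teams], k=1
        )[0]
        # Teams that were not picked gain weight for the next spin,
        # while the chosen team drops back to the starting weight.
        for team in teams:
            self.weights[team] += 1
        self.weights[chosen] = 1
        return chosen


if __name__ == "__main__":
    wheel = Wheel(["ec2", "s3", "lambda", "dynamodb"])
    for week in range(1, 5):
        print(f"Week {week}: {wheel.spin()} presents")
```

Resetting the chosen team to the starting weight rather than zero means it can still land again the following week, just with lower odds than a team that has been waiting.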

They plan to open source the fourth version next week.

Once the wheel lands and the team to present is identified, they head to the weekly dashboards, which may look at API error rates for the service. They then look at whether they are monitoring enough and what lessons were learned. This may then spawn ideas for how to scale a service automatically rather than needing scripted or manual changes. Many of the recommended actions are put into the Well-Architected Framework for customers to use too. They try to automate most of the remediation, so if a fix needs to be rolled out across a number of services (for example a JVM tmp file setting change), it needs to be simple to automate and deploy across the whole fleet.

The question always asked for remediation is: “How can we be more proactive to avoid things like this?”

There’s accountability, so teams are required to commit to dates for reporting back.

Tips for running a similar meeting yourself: get buy-in, stay regular, iterate and customise.

When does it run? In the middle of the week rather than beginning or end.

Granularity of dashboards depends on the metric: some go down to 1 minute, but most are at 5 minutes.

They suggested teams use CloudWatch dashboards, and they’re getting their own teams to transition to CloudWatch natively.

AWS has a stringent change management process. Approvers and reviewers are required, along with rollback steps. They’ve obviously built a lot of tools to automate deployments, and the more automated a team is, the less hands-on change management approval is needed.
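To make the granularity point concrete, here is a rough sketch of bucketing request records into 5-minute (or 1-minute) periods and computing an API error rate per period. The `error_rates` function, the `(timestamp, status_code)` record shape and the choice to count only 5xx responses as errors are all assumptions of mine; nothing here was shown in the session.

```python
from collections import defaultdict


def error_rates(requests, period_seconds=300):
    """Return {period_start: error_rate} for (epoch_seconds, status_code) records.

    Hypothetical helper: a 5xx status counts as an error (assumption).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp, status in requests:
        # Round the timestamp down to the start of its period.
        bucket = int(timestamp) // period_seconds * period_seconds
        totals[bucket] += 1
        if status >= 500:
            errors[bucket] += 1
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    sample = [(0, 200), (10, 500), (299, 200), (300, 200), (601, 503)]
    print(error_rates(sample))                      # 5-minute periods
    print(error_rates(sample, period_seconds=60))   # 1-minute periods
```

The same records give a smoother or spikier picture depending on `period_seconds`, which is presumably why the finer 1-minute view is reserved for the metrics that need it.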

Generally a good session, though it was more on the organisational side than on the actual tech they use to monitor everything across services.

WoodITWork.com

AWS re:Invent 2017: How AWS Runs Our Weekly Operations Meetings - ENT346