When communicating with others in my organization, I sometimes make assumptions about what they already know. I assume they know something I consider basic, so I may not bother to share important information. However, it often turns out that what one person considers common knowledge may be completely new to the next person. And vice versa, of course.
That’s one reason I like to challenge development teams at New Relic to generate lists of 10 things they assume all of their co-workers know about what they do. I’m always surprised that the answers include many things I didn’t know.
For example, I recently asked some of my colleagues about the best practices experienced engineers take for granted when making changes. Here are the top 10 responses. Several were news to me—how many did you know?
1. Advertise your changes
Don’t do anything in a bubble. Instead, make sure there’s a log of all changes that have been made. This could be a simple text document, comments in a Slack channel, or even notations on a calendar. However it’s documented, it’s your responsibility to make sure internal team members who may be affected are aware that the change is happening.
If the change could affect customers, for example, make sure your support team (if your organization has one) is alerted. You may even want to notify your customers. A month’s notice is typical, but for things like breaking API changes or deprecation of a core product, you might want to give notice much earlier.
2. Don’t make changes when things are already broken
When there’s instability in a system, it may seem innocuous to log in to the system and look at reports. If that’s part of your troubleshooting process, it’s fine, but even seemingly small changes can have broad implications, especially if the system is under increased load. Fix what’s broken first, then make your changes.
3. Don’t risk making things worse
In the middle of an incident, your impulse will be to work fast to drive resolution. But it is exactly at these times when process is most important. Don’t skip the peer review of your change, and don’t act randomly. If the first few remediations you try don’t work as expected, take time to stop and think about what’s going on before plunging ahead.
4. Consider the hard choices during a recovery
Sometimes we’re afraid to make changes that might impact our customers. But if you need to disable functionality for one customer in order to restore it for the rest of your customers, it’s probably the right call. Here’s the rule of thumb: if taking the action (disabling the one customer) would make the severity of the new state lower than that of the current state (total impact to one customer is less severe than partial impact to all customers), then it’s probably the right action.
5. Tag team during incidents
All engineers get tired. The Amazon S3 incident from earlier this year lasted four and a half hours, for example, and that doesn’t include the time it took your team to clean up after Amazon had recovered. If you’ve been working on an incident for more than a few hours without a break, it’s time to bring in someone else to add new energy and fresh ideas. If the incident impacts multiple systems, include someone with a broader perspective, like a systems architect.
6. Automate your changes
Engineers think about automating toil—writing scripts that reduce arduous multi-step manual changes. We don’t often think about the benefits we gain from automating fairly simple tasks. Rebooting a host may be a single command, but you can automate a check to ensure you’re not rebooting the last host in a cluster, for instance. Even one-line commands should sometimes be scripted if there are important flags you don’t want to forget.
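As a minimal sketch of that idea, a reboot wrapper might refuse to act unless at least one other host in the cluster stays in service. The inventory lookup, cluster names, and host names below are hypothetical placeholders; in practice you’d query your own source of truth.

```python
#!/usr/bin/env python3
"""Sketch: wrap a one-line change in an automated safety check.

The inventory here is a hypothetical stand-in; substitute your own
source of truth for which hosts are in service.
"""
import subprocess

# Stand-in inventory: cluster name -> hosts currently in service.
CLUSTER_INVENTORY = {
    "cache-east": ["cache-east-1", "cache-east-2", "cache-east-3"],
}

def hosts_in_service(cluster):
    """Return the hosts still serving traffic in a cluster."""
    return CLUSTER_INVENTORY.get(cluster, [])

def safe_reboot(host, cluster, dry_run=True):
    """Reboot host only if at least one other host stays in service."""
    remaining = [h for h in hosts_in_service(cluster) if h != host]
    if not remaining:
        raise RuntimeError(
            "Refusing to reboot %s: it is the last host in %s" % (host, cluster)
        )
    cmd = ["ssh", host, "sudo", "reboot"]
    if dry_run:
        # Print the command instead of running it, so the script is
        # safe to exercise outside production.
        print("Would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    safe_reboot("cache-east-1", "cache-east")  # dry run: prints the command
```

Even this trivial wrapper captures the check you’d otherwise have to remember to do by hand, and it gives you a natural place to add more guards later.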
Automating changes has other benefits, too: it makes it easier to plan work in advance and to test that work in a safe environment. Even more important than safely testing changes, automation ensures that the steps you validated in testing are the exact same steps you take during the real production event. And be sure to consider the differences between your test environment and your production environment. For instance, staging environments tend to have different performance characteristics than production environments, due to differences such as load and volume of data.
7. Validate your input
Lots of scripts and applications run based on an API key configured in an environment variable of the person running the tool. This is great because you don’t keep your secrets in the source code, but it can be problematic if you don’t anticipate the access level of the next person running your tool. If your scripts affect a large number of resources in your development environment, for example, make sure they can’t run against your production environment. You could do this by validating the account number the script is run against, or by splitting the work into two separate steps: first gather the list of affected resources, then execute the action against that reviewed list.
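Both of those guards can be sketched in a few lines. Everything here is a hypothetical placeholder for illustration: the ACCOUNT_ID environment variable, the allow-list, and the resource-listing and delete calls would be replaced by your real API.

```python
#!/usr/bin/env python3
"""Sketch: validate the target account and split gather from execute.

The environment-variable name, allow-list, and resource calls are
hypothetical placeholders for this example.
"""
import os
import sys

# Accounts this tool is allowed to touch -- an assumption for the sketch.
ALLOWED_ACCOUNTS = {"dev-12345", "staging-67890"}

def require_safe_account():
    """Refuse to run unless the configured account is on the allow-list."""
    account = os.environ.get("ACCOUNT_ID", "")
    if account not in ALLOWED_ACCOUNTS:
        sys.exit("Refusing to run against account %r" % account)
    return account

def gather(account):
    """Step 1: list the resources that would be affected (placeholder)."""
    return ["%s-resource-%d" % (account, i) for i in range(3)]

def execute(resources):
    """Step 2: act on a previously reviewed list, never a fresh query."""
    for resource in resources:
        print("Deleting", resource)  # placeholder for the real action

if __name__ == "__main__":
    account = require_safe_account()
    plan = gather(account)
    print("Planned changes:", plan)  # review before calling execute(plan)
```

Because the destructive step only ever consumes the list produced and reviewed in the first step, a surprise in the environment shows up as a suspicious plan, not as deleted production resources.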
8. Rely on a co-pilot
No matter how much you script your changes, chances are you’re still typing some text into something. If you’re rebooting a network device, make sure someone is looking over your shoulder to ensure you’re rebooting the right one. Talk out loud about your intentions so that the co-pilot can follow along.
Notably, your co-pilot doesn’t have to be more experienced than you to be effective. Sometimes, less experienced engineers don’t make the same assumptions that more experienced ones do, and may ask useful questions that challenge your assumptions.
9. Make sure someone’s around if things go wrong
4:30 p.m. on a Friday during the summer is not a good time to upgrade the kernel on all your Linux machines. Many of your co-workers have probably left early, and they’re not going to be happy to have to log in to the VPN on Friday night. When you select a time for your changes, keep in mind the current load on your system, what most of your customers may be doing at that time, and who on your team is going to be around to help if things break.
10. Identify high-risk items that people don’t know much about
We’ve all had to make changes to obscure legacy systems, or been responsible for a system we don’t work with enough to deeply understand. Gauge how much preparation work to put in by the risk of your changes. If you’re having trouble with that first step, ask yourself, “Do I even know enough about the system to assess the risk of my changes?” It pays to take a moment to identify what you don’t know and think about how the changes might affect the entire system.
We’re all new to this job at one time or another, and fast-paced changes in the industry mean we all have lots to learn. With the continued blurring of the lines between dev and ops, for example, engineers are increasingly required to deploy and operate their own services. More automation means there is less hands-on maintenance through which engineers can learn the underlying services. Even experienced engineers face new situations, and it’s difficult to be sure exactly what other folks on your team know.
With that in mind, I hope this list provides some new information, or at least reminds you to pay attention to what you already know. Change management is a big deal, and it never hurts to go the extra mile to make sure you don’t accidentally cause problems or make existing ones worse.