In this episode, Tammy Bryant, Principal Site Reliability Engineer at Gremlin, talks about specializing in chaos engineering and incident management. Connect with her further on her website, tammybutow.com.
Should you find a burning need to share your thoughts or rants about the show, please spray them at devrel@newrelic.com. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.
Jonan Scheffler: Hello and welcome back to Observy McObservface, the observability podcast we let the internet name and got exactly what we deserve. My name is Jonan. I'm on the developer relations team here at New Relic, and I will be back every week with a new guest and the latest in observability news and trends. If you have an idea for a topic you'd like to hear me cover on this show, or perhaps a guest you would like to hear from, maybe you would like to appear as a guest yourself, please reach out. My email address is jonan@newrelic.com. You can also find me on Twitter as @thejonanshow. We are here to give the people what they want. This is the people's observability podcast. Thank you so much for joining us. Enjoy the show.
I am joined today by my guest, Tammy Bryant. How are you doing today, Tammy?
Tammy Bryant: I'm great, thanks. Thanks for having me.
Jonan: Thank you for coming. It's really nice to finally meet you. We had to bump this around a couple of times—it's hard when you live in Florida and everything is beautiful. It's difficult to commit to just sitting in a room with a microphone [laughs], you do not want to do that. So I want to introduce our listeners to you. And maybe you could tell us a little bit about who you are and how you ended up where you are in the industry today.
Tammy: Yeah, sure thing. My name's Tammy and I work at Gremlin as a principal site reliability engineer. I've been here for three years. I get to do a lot of chaos engineering, which is really fun and awesome. I love learning about systems. I'm really curious, eager to learn, and also very ambitious. I always want to understand how I can fix things, make things better. And, prior to this, I worked at Dropbox as the SRE manager for databases and block storage. So very large scale: 500 million users, 500 petabytes of data. And also at the National Australia Bank for six years, doing a lot of cool stuff on internet banking, foreign exchange, trading, mortgage broking, pretty much all the key critical systems. I got to work on all of those. And I was also at DigitalOcean, which was very cool, to run the cloud for others and make sure that it was up and running and working really well for customers. So I've gotten to do tons of fun stuff. Yeah, and I live in Florida, which I love. I just got two stand-up paddleboards. I love that a lot and just love exploring the outdoors when I'm not on the computer.
Jonan: So I want to talk a little bit about your recent experience. Dropbox is also fascinating to me, and DigitalOcean—I use all of these products. I haven't had as much of a chance to play with Gremlin in my actual work; Dropbox, of course, I have. And DigitalOcean we use every month for my meetup. I spun up the most expensive server available to me for at least two hours to set up our Jitsi server. We have our own video-streaming thing. I always feel a little bad for DigitalOcean when I spin it back down again; I feel like I should give them more money. They're a great company.
Tammy: Yeah, they're great.
Jonan: So tell me a little bit about when you joined Gremlin—you've been there for three years now. I assume you were motivated by the very wise career choice to investigate chaos engineering, which is a fascinating discipline. Is that close? Is that how you ended up there?
Tammy: Yeah, definitely. So back in, I would say, 2015, I was doing a lot of chaos engineering work at Dropbox, and I just tweeted about it one day. At that time, I'd never ever given any conference talks, and I'd been working in industry for 10 years, or more than 10 years, I guess. And I sat down and thought, “What could I share? What could I talk about after all these years of pain, getting paged all the time and on-call and just things happening?” And then I got asked, “Hey, you know, you're doing this cool chaos engineering stuff at Dropbox. Would you like to come and speak on my conference track?” And that was actually the founder of Gremlin, Kolton Andrus. So Kolton, before he founded Gremlin, he was working at Netflix and then Amazon. So I did accept his invite and that was my first talk in the USA.
Jonan: I was gonna say, Netflix is the source of chaos engineering. Surely it's a thing people practiced before then, but is that pretty accurate, that it was popularized there?
Tammy: Yeah, definitely. Bruce Wong created the chaos engineering team. He's the first person to establish that practice and that team, at Netflix. And it was invented by his team and really pioneered by their efforts. But at the time a lot of tech companies and startups just weren't doing this sort of work. The reason I knew about it and was doing disaster recovery specifically was because of working in banks for so long; you have to do it to meet compliance regulations. You have to fail over your whole data center, and you have to do it every quarter. You have to fail over all services, you need to have redundancy in place, and the regulators will come and check it and confirm that it works. So I love that they were pioneering this effort in the technology space, in the startup world, trying to encourage people to not just hope that things will be OK—because hope is not a strategy, as they say. I really think it was created by them. The fun thing was when I was working at NAB, National Australia Bank, we actually used Chaos Monkey. So that was really cool. We were like, “Oh, this is cool. Let's give it a go,” because we were moving to AWS at the time.
Jonan: So Chaos Monkey is a library. It's the project that is the codification of this chaos engineering discipline that arose at Netflix, right? Chaos Monkey was produced out of Netflix’s open source efforts.
Tammy: Yes, exactly. And the idea there is that it'll randomly shut down different EC2 hosts. So that's how it works: it'll just pick one at random, based on a time window that you can set, and then it'll shut it down. So I think that's pretty cool. It means that you can't have folks hard-coding servers or services; they have to think about failure, because these servers are just going to disappear, so better get ready for it. [laughs]
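(For readers who want to see the gist in code, here's a minimal sketch, not Chaos Monkey's actual implementation, of picking a random opted-in EC2 instance and terminating it with boto3. The chaos-opt-in tag and the region are made up for illustration.)

```python
import random
import boto3

# Hypothetical illustration: terminate one random instance that has opted in to chaos experiments.
ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

resp = ec2.describe_instances(
    Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},  # made-up opt-in tag
    ]
)
candidates = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

if candidates:
    victim = random.choice(candidates)
    print(f"Chaos time: terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```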
Jonan: It's really easy to do when you're writing code too: you end up with something statically inserted in your code that was unexpected, and then that IP address disappears, or this server name, this host goes down, and the system breaks in pretty unexpected and unpredictable ways that are difficult to fix on the fly. I think the value in chaos engineering is of course that reliability, which is so important, but the real value in my opinion is that it forces you to think about those things upfront,
Tammy: Yeah.
Jonan: Design your systems that way.
Tammy: Yeah. I love it too. I was always a really big fan of being able to do rollbacks. That's very big when you work in banking; they do try to make it so you can roll back any system and so you can do failover—but it's been many, many years of that happening and a lot of culture around that. That's just something that is very normal there. But I think in a lot of other types of companies, you don't have the ability to do that. And then I started to think about it too—I don't like the idea of patching things in prod, but I actually have worked on incidents where it was impossible to roll back because you had nothing to roll back to, because you're relying on, for example, AWS and their infrastructure and their platform. So I'll give you an example from when I was working at Dropbox. I was the incident manager on call for all of Dropbox during the great S3 outage of 2017. And we'd migrated mostly off S3 at the time, but we hadn't migrated one service: thumbnails. People were thinking that their data wasn't there, but it was just that the thumbnail previews weren't working because we hadn't finished the migration off S3. So people were sending in support tickets, calling up Dropbox, being like, “Where's my data gone? What's happening?” So then we were like, “Dang! This incident is going to run for a while.” The S3 outage ended up running for five hours, but we were like, “We're going to have to fix this right now; otherwise, customers are going to be really angry until AWS fixes this issue.” So we actually had to do that migration from AWS to our own on-prem infrastructure during the incident. It was like, “We've got to patch in prod because there's no other choice.” And we got out of the incident in an hour and a half, but that was a stressful incident, definitely.
Jonan: Yeah. And it's not even that it was broken—I mean, it was broken—but it's giving your customers this impression that caused a huge swell of support tickets and overwhelmed your team at a critical time.
Tammy: Exactly, exactly. You don't want your customers to have that bad experience. And we were thinking from our monitoring, from our observability—what we had at the time—everything's working, we don't have any data issues. So I had to dig in further. I went and logged into my own Dropbox account to understand what they were saying and to try and reproduce it myself. It was super easy to reproduce: it was just X's on the thumbnails. And we were like, “Oh, OK. We need to page this team, get them involved, and actually figure out how we can migrate off it.” So it's a very interesting example for me, where there was no other option but to fix that issue during the incident or wait for AWS to fix it, which took five hours. That was just way too long. You don't want to have to wait that long. You know, I've learned a lot over the years and I've definitely changed my mind about how you need to do things. Working on incidents, I've learned you just gotta be flexible, think on your feet, stay calm, loop in the right people, have a strategy, work as a team. That's the most important thing. [chuckles]
Jonan: Yeah. I see a lot of these incidents go down, though I'm less often involved in them. I mean, when you're doing DevRel, most of the things you ship, if my waffle robot stops working, nobody's really upset except for me. I worked at Heroku, and Heroku has a policy about engineering—you own what you ship. And I would receive pages for products that we'd put out for conferences I wasn't attending and things like that. It was definitely on me to keep those things alive. Carrying the virtual pager for the first time in my career was enlightening. I'd always been part of a support process, when I was at New Relic before that and at LivingSocial, but not in the same way, where you're on and you've got to stay calm and work with the rest of the team and keep going until it is done. There is no second line of defense against these failures. This point that you were talking about, where you are responding: very often, that is the result of a page, right? And I'm curious to know how that works in the world of chaos engineering especially, because you've got presumably some function there that prevents people from freaking out that the system is dying when it's just the Monkey again.
Tammy: Yeah. There's a lot of different tips around getting to the point where you can safely practice chaos engineering in production; that's often a long journey for people. I would say for some companies, it takes them several years to get there—maybe two, three, even eight years, something like that. It can take a long time. Usually folks start by running their chaos engineering attacks in staging; that would be much more common. So they're deploying their code and then they're running game days. That's where you get a few people together to actually inject some failure, maybe packet loss or latency, understand how it impacts the system, learn from it, and make it better: your system, your application, or the cloud infrastructure that you're using. And then another thing folks do is integrate it into the CI/CD tool chain. So as I deploy code, I'm going to trigger this pipeline to run, which is going to automatically run these specific chaos engineering attacks, and then we'll see what the results are. And they tie that into their monitoring and alerting to see if anything fires that wouldn't usually fire, so that's pretty cool too. But as you move to doing it in production, that's something folks are really careful about how they do it. I've done a lot of chaos engineering in production. I do it all the time at Gremlin—it’s super common, obviously, because we do it all the time, every day. And even before that, at Dropbox, I would run weekly chaos engineering attacks on production for the database machines, because you've got tens of thousands of databases and only four people on call, and a lot of automation. I've always worked on teams where I owned everything and I was on call for it at the same time. I haven't worked at a company that wasn't like that. I've done the level where you're also, on top of that, the incident manager on call—so lucky you, you get to be on two on-call rotations: all the services you own and the entire company's rotation. But that's been a really interesting experience for me because frequently injecting failure in production, that's when you really learn and you become confident. The first time you do it, it's scary. You've got all your dashboards out. You've got all your logs out, your tools to be able to stop any issues that occur. But you know, obviously, you've planned really hard before you do your first chaos engineering attack in production, so you're confident that it won't impact the customer. And actually, for me, I've never had chaos engineering cause an incident; that hasn't happened to me. So it's more about learning and being able to fix issues. And I like it most for mean time to detection, being able to improve things, or gathering evidence so that I can get all the teams to fix issues. You know how we always say, “It's never the network”? Well, I've definitely used chaos engineering to prove that it is the network. The network engineering team is throttling me and my database traffic, and you need to change that setting, because I'm being throttled and that's why everything's so slow for my backups right now. So that's [chuckle] an interesting thing that you can do there.
Jonan: You can blame it on the network in the end. I feel like the network is so often the problem for me, but much like when I was starting out in software, I would find a bug in whatever code I had written that I couldn't figure out and I would blame it on the language occasionally. I would be like, “I think I just found a bug in Ruby or JavaScript or Python, I'm going to go and comment on the GitHub repo.” Fortunately my coworkers stepped in and were like, “All right, let's think carefully about whether or not it's inside the house.” So the part where you are using chaos engineering in production, you describe it as like a game day scenario. Am I wrong to have the impression that some people just leave this running all the time out there, that there's the Chaos Monkey out in the production system, and at any moment you have to prepare for a world where your servers disappear?
Tammy: Yeah. So we do that at Gremlin all the time. That happens every day. A lot of our customers do that too. We also did it at Dropbox, where it was called chaos days, and every Wednesday we would take down random different servers. So yeah, Chaos Monkey is basically a shutdown of a specific host, which is what we did there. But over the last few years, I'd say it's gotten a lot more advanced. You're not just seeing folks shutting down hosts. They're also shutting down pods and containers on Kubernetes, shutting down whole Kubernetes deployments and services. And then also doing other things like black-hole attacks. So a black hole is a really cool attack that was recently created and really pioneered by the team at Gremlin and some other companies who really wanted a non-destructive way to make hosts, pods, or regions unavailable. The idea of a black hole attack is that, when you run it, for a period of time you're just saying you cannot reach these servers. They don't exist, they're now invisible, or you can't reach these pods; they're no longer available for you to reach. And when it does that, it's really cool, right? Because you can run an attack for 30 seconds, 60 seconds, and then just turn it back on. So you can say, “I'm going to make them unavailable for 30 seconds; now they're back.” And you can imagine that that's way less destructive compared to a shutdown like a Chaos Monkey-style one, because if you do that, it's like, “OK, I'm going to shut down all my servers in this whole region, that's going to take time. Then I'm going to bring them all back up.” At some companies, that's going to take hours to bring up all that infrastructure; it will take forever to turn all those services back on and make sure they're all working well. So it's just a nicer and less destructive way to do it. So I'm excited about that. And you know, seeing these different types of attacks that have been created: with Gremlin, we have 11 different attacks built in, but there's a lot more that we have on our backlog to build. And then the other interesting thing is not just running singular attacks; we also have a feature called a scenario, so you can chain together multiple different attack types. So for example, I want to reproduce an incident that happened to me in the past: first, I'm going to kill this process. Then I'm going to shut down this server, and that's just two, right? But that's actually more similar to what actually happens in the real world. We might have a process die and then a server goes away. Do we even know that that happened? Or maybe let's look at spiking the memory or spiking the CPU, spiking the I/O, filling up the disk, but making other things happen alongside it too. So I really like that as well. It's much more like the real world.
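(To give a feel for what a black-hole-style experiment boils down to, here's a minimal sketch, not Gremlin's implementation, that drops outbound traffic to one dependency for 30 seconds with iptables and then restores it. The target IP and duration are made up, and you would only run something like this on a host you own, with a plan and a rollback.)

```python
import subprocess
import time

TARGET_IP = "10.0.0.42"  # hypothetical dependency to make unreachable
DURATION_SECONDS = 30    # short, bounded blast radius

drop_rule = ["iptables", "-I", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"]
remove_rule = ["iptables", "-D", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"]

try:
    subprocess.run(drop_rule, check=True)    # start dropping packets to the target
    time.sleep(DURATION_SECONDS)             # watch dashboards and alerts while it's "gone"
finally:
    subprocess.run(remove_rule, check=True)  # always restore connectivity, even on error
```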
Jonan: That's fascinating to me, this process of testing chaos engineering. I assume it also exposes a lot of failures that you have around paging and alerting. Obviously, if the Gremlin’s out there wrecking everything and no one hears it, right, that proverbial tree falling in the forest that doesn't make a sound, that's a bad scene, right? So this is a good way to validate that alerting and monitoring, yeah?
Tammy: Yes, exactly. That's totally right. So there's a few things that I recommend doing, as real-world as possible. What I like to do is see, if this issue occurs (and it's good to do it on staging first), if this service is unavailable or can't be reached, do you actually see a spike in your monitoring and your dashboards? Do you get an alert for that? Do you even know that an issue is occurring? A lot of the time, I think, people can't see that things are happening, that there are problems in their systems. It's just totally invisible. And how do you actually make sure that your monitoring and your alerting, especially your alerting, works, if you never ever test it out and validate it? So I've seen a lot of interesting things there. That's definitely a great place to start. A simple thing too that I like to do is think about it from, “All right, what happens if this person's service isn't available? They have a primary on-call.” But then also, obviously, you want to make sure there's a secondary on-call, and even just test out that flow: first I page the primary on-call; if they don't acknowledge the incident, then it goes to the secondary. How long does it take? Does it take five minutes or an hour? What if the secondary doesn't get paged? Does it then go to the whole team? Does someone acknowledge it? Because that can really wreck your mean time to detection and then also your mean time to resolution, and you can have a one-hour outage just because nobody answered their phone for 45 minutes.
Jonan: Yeah. That would be me. That would be me not answering the phone. Like, “Gosh, I hope someone else is on this page.”
Tammy: You know, talking from real experiences. Right? [laughs]
Jonan: Yeah.
Tammy: Gosh! I've seen things. [laughs].
Jonan: [laughs] All the incidents I have seen. These terms, mean time to detection and mean time to resolution, I've heard, and I don't know in what context, so I apologize, but I feel like these terms, or maybe those methods, have fallen out of favor in some parts of the community. Do you know anything about that? Have you heard that people do not like to use these terms?
Tammy: Yeah. It's funny, when I moved to America and I worked at DigitalOcean, that was a big shift—coming from banking for six years, I used to wear a suit to work. Like you would never imagine it.
Jonan: Wow.
Tammy: I'm serious. I had a suit that I used to wear every day with stockings and nice shoes and stuff and I'd sit at my desk and write code and I was like, “Why do I need to wear a suit to write code?” But whatever, I guess I have to do it and maybe go to one meeting once a week or something.
Jonan: [laughs]
Tammy: Then I moved to DigitalOcean. I was like, “Wow, cool startup.” And then everyone was telling me, “Oh, you're an enterprise engineer.” I was like, “What's that? Wow. We're like different types of people? There's enterprise engineers, if you worked at an enterprise company?” I had no idea. I was like, “I didn't know I was an enterprise engineer.” But I think like when you work there, you have to report on metrics and numbers because of compliance and regulation. So I honestly come at it from that. I'm like, “Well, I need to be able to report specific metrics to the government, to talk about uptime of systems and talk about how long it takes us to recover from the customer's perspective.” If internet banking is down or if ATMs are down and people can't access their money for a certain number of minutes, you have to report that. You can't hide it.
Jonan: Mmh.
Tammy: Yeah. That's just where I come at it from. I think a lot of the time, folks will want different metrics or different types of things to focus on. And I think that's great. Keep innovating. There are just some industries where you're working with the government and regulators, and you can't really get them to change how they measure things. And I always think about it more from the customer's perspective too. My personal goal is that I can resolve any incident in five minutes. So I want to be able to resolve anything in five minutes. Mean time to detection has to be a minute or less. So how do I detect any incident occurring in a minute and get it resolved? I don't want to work on incidents that last three days. I just don't want to do that. I've done it before. I don't want to do it anymore. [chuckles]
Jonan: Three days is a lot.
Tammy: Yeah. I've seen incidents last for days. I've even uncovered incidents that had been there for years that people didn't know were happening. You know, that's a real thing that I've seen in systems. It's like, “Wow, this is a really big issue, we need to get this fixed.” And then we get it fixed that day, but it had been there for years.
Jonan: Wow.
Tammy: Yeah. So there's a lot of stuff there. I'm not one to debate different types of things like that, because I'm all, “Let's get in there and just get stuff fixed. Let's make things better. Let's just do that.”
Jonan: You just take a more pragmatic approach, I guess, coming from banking.
Tammy: Yeah [laughs].
Jonan: Yeah. That makes sense to me. So I just want to clarify that I'm understanding: with the terms that are falling out of favor, it's not so much about what we call those things, it's more, “Hey, these are the wrong things to measure, or we're focused on the wrong parts of what is causing the incident and how it's being resolved,” right?
Tammy: I don't know. I haven't heard the debates actually of what other folks are looking at measuring. I have a few things that I've measured over the last few years. So the first one was always SLAs and that's because they are service level agreements, and if you work in a bank, you get a fine if you don't meet the SLA.
Jonan: Wow.
Tammy: Yeah, you lose money. So there's real money involved. Oh, this is another crazy story, actually. So when I had my first high-severity incident, when I was working at the National Australia Bank, I took down mortgage broking. That was my first incident within a few weeks. And the CTO came to my desk and he's like, “Yeah, you just got yourself a fine because mortgage broking was down.” So what happened was somebody had complained to the ombudsman. So then the fine actually goes on your name and they put it on the person who was running that system and was on-call for that system. So in a database, there is my name logged against that system. And this is before I knew anything about blameless postmortems or whatever and I'm like, “Wow, my name is going to be in that system forever. That's crazy.” But that's just how it goes over there, and having the lead of technology coming to your desk to tell you that you now have a fine, and it cost the bank this much money because the system was down for that long. There's just a lot of things that happen that folks don't know about behind the scenes. But, I think I'm just much more thinking about it from that perspective. What is happening for the customer? The fact that someone complained, the fact that there is a fine program when systems aren't running. When I was running mortgage broking, I always used to think someone is trying to process their mortgage because they're trying to buy their dream home. I want to help them get their dream home. If they can't process their mortgage, then someone else is going to get that dream home. And they're not going to get to live where they want to live. And that's my fault. I feel bad about that. I have deep empathy for the actual customers and what they're trying to do. So I definitely come at it from that perspective too. And, you know, I'll just visualize them sitting there with the kids in their truck.
Jonan: [laughs]
Tammy: Their moving truck, and they're like, “That Tammy—now I can't move to my dream house because her system was down.”
Jonan: This is maybe a little bit too much responsibility to take for your work, but I appreciate the approach. That's so intense, that the government will fine a bank that has a downtime incident and that it will be attributed to an individual engineer who was responsible.
Tammy: Yeah. Because they need to be able to put a name onto the fine, onto the incident, and it's all part of regulation and compliance.
Jonan: Wow.
Tammy: That's just how the systems work. I think a lot of folks don't know about it if they don't work in the enterprise space. There are new compliance regulations coming out all the time. Right now a really big one is called open banking. That actually has to be released pretty soon. And the other interesting thing, as an engineer working on these systems, is that you have to get these new compliance features or functionality out fast. So the government will come to you and say, “Along with all of the other work you're doing, you need to do this new work that we decided on by this date.” And it has to be done; otherwise you probably get fined or something like that. So with open banking, you have to make it so that any customer can see all of their bank accounts, their credit cards, all of that information in a really simple way via an API. So that means you have to build all of that out and make it available. And every bank has to do it by a specific date. I think it's early next year. So it's a big project right now for the banking industry. And obviously you need to make it reliable. So as a reliability engineer, you want to make sure that the data is accurate too. So, to me, it's a really interesting project, because I think that's great from a consumer perspective, being able...
Jonan: Oh brilliant...
Tammy: To get access to that data across all banks at the same time. I love the concept. It's definitely a fast-paced world. It's not just your timelines, it's also the additional timelines from the government that get placed on you and the regulators. So you just have to be always getting out there and making things happen.
Jonan: I'm trying to imagine a world where such a program is implemented in the United States, this open banking. I feel like the government is unlikely to just decree to banks that they need to do anything. The banks, I think, have far too much control in the United States, including in the government. But this process where I have to give someone my username and password so that their Selenium can spin up and log into my bank's site, there's a team of 10 people running a startup, you know, in an alley off Market Street...
Tammy: Yeah.
Jonan: That now have my username and password...
Tammy: Yeah, no compliance, no regulations. That's a big thing that I thought about when I started to work in tech. I was like, “Wow, it's so different,” because I'd only ever come from banking. So yeah, I like the idea of just going, “Hey, let's try and make our systems really good.” I also worked as a security engineer for a bit at the bank. I did that as part of my role to understand that a lot more. And it was really cool to do security engineering for internet banking, but obviously I care about people's privacy, I care about people not getting their money stolen. I just think that's really important. That's a big responsibility and you need to do a good job at that. We also have a really cool thing in Australia called the Australian Computer Society. And when you join, you sign a number of things that you're going to hold yourself to. And those are things like caring about users, data privacy, and stuff like that. So I really like that. I think that's pretty cool. But Australia is very different. I haven't lived in the US for that long, I'm still learning about it. But yeah, [chuckles], there is a lot.
Jonan: It’s just like you value security and not stealing. This is...
Tammy: [Laughs]
Jonan: Straight-up un-American when it comes to banking. I'm sorry, bankers. I'm just teasing you. The part here with the decree though, this document that you signed, that's fascinating to me because this is a common source of debate in software communities, where people occasionally dislike being called software engineers, or other people dislike them being called engineers, because of countries like Canada, where being an engineer means that you are part of a professional organization. You get a little steel ring that is designed to commemorate a bridge that collapsed and killed a bunch of people, and it's a reminder that this is a serious job, that you are actually taking on a lot of responsibility. I think that maybe is a little more so Canada, but I appreciate the sentiment.
Tammy: Yeah, it's really similar in Australia. That is what it's like there too: you sign to say this is a serious job; you have consumer data that you're looking after in a lot of cases, and you need to make sure that you take that seriously. I think it's great. Especially for young engineers that are studying: they'll come and visit the universities and talk to them about the importance of the responsibility that you have as you get access to these systems. So I think that's really cool too. And for me, obviously that really inspired me and motivated me to want to keep these systems up and running. I'm like, “If someone's relying on my system, I want to make it available for them.” I care about that. I want them to be able to get value from this. And I feel like if other engineers are doing that too, and there are a lot of SREs now, a lot of folks that care about reliability, I think that's great. That's what folks are motivated by. And the other thing is, it's hard work. It's complicated. You need to know a lot of different things. You need to understand networking and Linux and infrastructure and cloud. There is so much stuff—databases, messaging systems, whatever it is—so many different things. And you have to make sure that everything runs pretty well all the time. It's not going to run 100 percent of the time, but hopefully you have some good failover, so you can keep it up and running for people. That's what motivates me and excites me. And the other thing that I think about a lot is, how do we just make the internet more reliable? Because I also like the idea of making the internet accessible, and that means making it fast. So not just up and running, but making sure that people can access it wherever they are, even if they have really crummy internet. In Australia, the internet is so bad that you can't do a Zoom call well. You know, it's really, really laggy and slow. The latency's horrible. And so I was really inspired to get into the world of performance engineering and making things run better with limited resources, because of coming from Australia and being so far away. So I think that's really cool and it's still a big area to innovate in.
Jonan: Absolutely. It's going to be a very interesting world suddenly, when everyone has access to broadband internet, wherever they are.
Tammy: Yeah.
Jonan: I'm excited to see it. The discussion we were having earlier reminded me that I haven't yet asked you for your definition of observability. It's a thing that I like to do with all of my guests, because the answers are somewhat nuanced and different. I know that there are a lot of people in the industry right now trying to come up with a concrete definition, but they all approximately align to one thing, in my opinion, which is not just knowing that a thing is broken, but being able to understand at a glance why, this visibility across all of your systems. But I'm sure that you have a more well-formed answer than my simple version. So please, how would you define observability?
Tammy: I'm not sure if my answer would be well-defined. For me, it's something I'm still trying to explore and understand as well. I definitely think that Liz Fong-Jones is doing amazing work in the space of observability. I really like her work, coming from Google and just building out a lot of different interesting practices. I guess my dream scenario would be that you could observe what a user is actually doing and how they use your system, and be able to see into the system at any point. And the reason I say that is it's kind of like X-ray vision or something like that—just being able to actually observe everything end to end. And why is that important? Because I've worked on systems where, for example, a database proxy had no monitoring, no observability, no nothing—you just couldn't see. Traffic went into the proxy and it came out; who knows what went on in the middle? That's tough to know. So I don't really like that. I like the idea of being able to actually look at things and see them all the way through, all the time, because then you can understand exactly where the problem is. I love that I can then go, “Boom, there's the issue.” Let's pick that out. Let's fix that. And now we're all good. But without that, it's hard. The bits that come in between are especially hard to instrument in a lot of cases. I think my biggest frustration with this kind of system is when I show up and there's just a circle or a box on the screen that's either green or red, and that's supposed to be the information I have to act on. And I'm like, “Well, it's red. We should try turning it off and on again. And then we should delete it and re-create it. And then we should give up forever, because we can never do this again.” Right? I just don't like that. You want to know, “Why is there an issue with the system?”
Jonan: Exactly.
Tammy: Something that I thought was really cool when I was at Dropbox was having metadata for queries. So it was super easy to see, “Where is this query coming from?” Is the SQL query from the API, the web app (dub-dub-dub), or the desktop client? That was one tag. The other thing was, “What are they doing? Are they listing out the items in their account?”
Jonan: Oh.
Tammy: Yeah, that was super cool. So you could then say, “Oh, there are all these issues with queries from the API where people are doing this specific action,” and trace it back up. You know, that's a handy thing when you're debugging. But that's just one little thing. There's so much more that we can do to make things better, but, you know, times are hard. You have to be kind of like a detective, and it depends on how good your tools are.
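(As an illustration of the idea, not Dropbox's actual tooling, here's a tiny sketch of tagging SQL queries with a comment that records where they came from and what the user was doing, so a slow query can be traced back to its source. The tag names and query are made up.)

```python
def tag_query(sql: str, source: str, action: str) -> str:
    """Prepend a metadata comment so a query seen on the database side
    can be traced back to the client and user action that issued it."""
    return f"/* source={source} action={action} */ {sql}"

# Hypothetical usage: the API service listing items in an account.
query = tag_query(
    "SELECT id, name FROM items WHERE account_id = %s",
    source="api",          # could also be "www" or "desktop-client"
    action="list_items",   # made-up action name
)
print(query)
```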
Jonan: And ultimately you've got to be pragmatic about these things, right? If you go too deep down that rabbit hole, it is an infinite pit of fun for a lot of engineers. It's very interesting to me to track what individual methods were called and how long they took and how many objects were created in the system as a result, and all of this stuff. But then you become overwhelmed by your data. It becomes very important to be able to get all of your data into one place where you are able to use it effectively and get visibility.
Tammy: And the interesting thing there too is thinking about performance engineers: they get to spend a lot of time focusing on this, getting some wins, being able to share their results, and making a big impact. So I think that's really cool, having a role with folks dedicated to improving performance. I think that's great. You see that mostly only at big companies. It is important if you want to be able to make some great wins, because how do you get those wins if you don't dedicate the time to it? It's a bit tough. You can make some small improvements, but that's why there are folks who are specifically amazing at database performance engineering. You need to know a lot to be good at that.
Jonan: Yeah.
Tammy: There are new versions and new tools coming out all the time. So it takes time to perfect your craft and get really good at it. But I think it's a great thing to do. I'm a little bit surprised that it's not a more popular role, because certainly we have a great understanding of how performance affects dollars, and that tends to be what creates these positions and these teams. You can tie a report that an executive sees to a team of positions existing, and then the team comes to exist. Right?
Jonan: Yeah. It's really interesting. Like you see performance engineering teams being really large at enterprise companies, and there'll be like tons of performance engineers and those folks are super specialized. Maybe they've been doing it for 20-plus years. I mean, I've never worked with a performance engineer at a startup yet. [chuckles] I think startups are a little bit scrappier when it comes to that.
Tammy: [laughs]
Jonan: They're very fortunate to have you, I think with your background and your emphasis on security.
Tammy: [laughs]
Jonan: Yeah. We're glad that you're out there. We're almost at time. Before you go, I would love to have you back in a year or sooner—but when you come back, I want to be able to accuse you of having been wrong about something. So I'm asking you to make a prediction. That's a really mean way to set this up. I'm actually not going to do that to you, I promise. I just want to know what you think might happen—what interesting technologies do you see emerging today? What do you think is going to be more popular or less popular over the next year?
Tammy: I'll say a fun one. Everyone will have Linux on the desktop. [laughs]
Jonan: Yes. It's finally here. We have been waiting.
Tammy: Linux on the desktop for everyone. Every company will give you a Linux computer when you join up.
Jonan: You know, I really believe in the dream still. I think that we can get there, probably next year. I think...
Tammy: Yeah. I just have several computers, myself. Like I have like a Linux desktop in my bedroom, I have a Mac desktop, I have like two Linux laptops. I just got a whole bunch of stuff, but I'm a super-nerd, so all my friends just stare, like, “Oh my God, Tammy.”
Jonan: I love it, actually. And I'm going to count that as your prediction, that it will be the year of the Linux desktop. But I want to hear about your home lab then, since you mentioned having one.
Tammy: I don't know if I'd call it a lab or just technology everywhere. I just really love computers. I've always loved computers since I was little. So it's actually super funny, too. When I was studying at university, they sent out this email and said, “Hey, would you like to have a part-time job working for this company that's building websites and stuff?” I'm like, “Yeah, sure.” So I went there and I was telling them about my skills, and they're like, “Wow, you're a hardware person,” because I was telling them that I had built computers since I was 12 years old. I'm telling them all about my cooling and my fans, all the sweet stuff that I made. And they were like, “Wow, this is pretty crazy, we haven't met anyone like you in ages.” But the reason that I would build my own computers when I was little was, honestly, it was cool to understand how things worked, and it was so much more fun. And I would build these sweet computers with pink lights and see-through mice and really cool stuff. So now it's just a lot of experimenting with different things and seeing what I like the most—but I'd say I'm less fancy these days. I want to be able to experiment. That's probably my main reason for having different types of technology. And my other thing that I love to do—one of my personal favorite hobbies—is to set myself challenges to break different software as fast as possible. So I'll be like, “I'm going to try this new piece of software and I'm going to try and break it.” And my record for breaking things is within a minute—that's the fastest, but the longest it took me to figure out how to break something was three weeks. And I was still like, “Oh my goodness, why haven't I broken it yet? Usually the average is one or two hours until I can find some type of reliability vulnerability.” But yeah, that took me a long time. Finally got there.
Jonan: That's amazing. Walk me through these projects. I'm trying to imagine what that's like—you're talking about some new framework or something like Kubernetes that you're trying to break. What are you trying to break?
Tammy: Yeah, well, it could be any different type of thing. Sometimes it might be database software. I've done work with MongoDB, with CockroachDB, lots of different types of technology. The reason I got into that specifically was also when I was working at Dropbox—one of the engineers on my team would constantly find issues with MySQL and then log them with the community and get them fixed, often with a fix as well. And a lot of them were around reliability and durability issues, so it made me think back to my time as a security engineer—you have people doing bug bounties all the time, where it's like, “I found a bug in your software and you get money for that,” and it's a cool thing you can do. But it's usually not about reliability or durability; it's about being able to get hacked or have data stolen. So that's why I got into it there. So usually what I'll do is I'll find some type of issue. I've found issues with Amazon’s EKS, stuff like that. And then I'll share that, and they'll write back to me and be like, “Yeah, cool, thanks for letting us know about these; if you have anything else, just share it with us too.” So it's just building pathways between companies to be able to say, “Hey, here's something I identified, and here's an idea for how to fix it. Let's do that.” So yeah, I like that idea of sharing—it’s just fun.
Jonan: I really appreciate your work there. I wish that there were reliability bounties, let's get on it.
Tammy: Wouldn't it be so cool?
Jonan: Everybody just send Tammy a check and we'll just call that the bounty.
Tammy: [laughs]
Jonan: Well, thank you again so much for joining us. Do you have any parting words for someone who's listening a little earlier on in their career and aspires to be in your position someday?
Tammy: Yeah. I get to meet young people all the time. I actually just did a talk yesterday at CMU for students there, and it was so fun to meet everyone. And one of the questions I asked them was, “Who's an adrenaline junkie? Anyone into extreme sports, anyone excited about being on-call, working on critical systems, working on incidents, building systems and then keeping them up and running?” Half the room said, “Yeah!”, which I thought was super cool. But I said to the other half of the room, “You know, even if you're not excited about it—or if you're scared about it or if you're nervous about it, because it's a big responsibility—my biggest tip is to just be curious, learn about systems, take your time, find great mentors, and you're going to have a great, awesome career if you're just starting out.” I love working in technology, I find it so cool. There are lots of different industries out there, but I'm happy that I picked this one—and yeah, I'm just going to be here doing it forever. So that's cool. [chuckles] I hope they will too; maybe we'll work on a team together. [chuckles]
Jonan: I look forward to seeing people come up into this industry. I think it's going to be a very interesting decade.
Tammy: Yes. I think so, too, for sure.
Jonan: Well, again, thank you so much for coming. I really appreciate your time and I especially look forward to having you back in about a year with our new Linux desktops. [Both laugh]
Tammy: I love it.
Jonan: All right. Have a wonderful day. Thank you, Tammy.
Tammy: Thank you so much.
Thank you so much for joining us for another episode of Observy McObservface. This podcast is available on Spotify and iTunes and wherever fine podcasts are sold. Please remember to subscribe so you don't miss an episode. If you have an idea for a topic or a guest you would like to hear on the show, please reach out to me. My email address is jonan@newrelic.com. You can also find me on Twitter as @thejonanshow. The show notes for today's episode, along with many other lovely nerdy things, are available on developer.newrelic.com. Stop by and check it out. Thank you so much. Have a great day.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.