Observy McObservface Episode 3 - Firefighting and DevOps: The Platypus Incident

This week we were joined by Bobby Tables of FireHydrant.io, a world-renowned purveyor of artisanal, handcrafted incident response services that are, in fact, the opposite of that: perfectly replicable and entirely automated.

Bobby talks us through some of the motivation behind the path he and his co-founders walked (sprinted?) when starting their company and the market circumstances that have created a demand for automated incident response.

We also touch on the distinction between DevOps and SRE, the pros and cons of microservices, and one particularly memorable incident involving a platypus and a fire hydrant.

Should you find a burning need to share your thoughts or rants about the show, please spray them at devrel@newrelic.com. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease.

Enjoy the show!

Jonan: I'm joined today by my guest, Bobby Tables. How are you today, Bobby?

Bobby Tables: I am doing super well, man. How are you?

Jonan: I'm doing great. I'm excited to talk to you about some FireHydrant.io stuff, and specifically incident response. I'm really curious about incident response because I'm not a DevOps professional, and I never really have been. Also, before we get into that, I want to hear about you. Tell me who you are and how you ended up here.

Bobby Tables: My name is Bobby Tables. People call me that after the xkcd comic in which a mom names her child a SQL injection attack. My real name is Robert Ross. Before starting FireHydrant with two of my friends (we were all in the DevOps space), I worked at Namely HR, where I was building a lot of developer tools. I was on the SRE team. Before Namely, I was at DigitalOcean and before that, I was at Thunderbolt Labs doing a lot of really cool things with that consultancy. So a lot of developer tools in my life, and it has been a really fun time building them for myself and for others.

Jonan: I remember Thunderbolt. How long ago was Thunderbolt?

Bobby Tables: I think I was at Thunderbolt Labs, I want to say, six years ago.

Jonan: You worked with Bryan Lyles, maybe?

Bobby Tables: I did.

Jonan: I remember Thunderbolt Labs because you had great stickers. You've always been on point with the branding. At Thunderbolt, you had really good branding and you also have lovely branding at FireHydrant. In fact, I remember an episode recently where there was a Twitter exchange that ended up with quite an interesting logo on your homepage. I wonder if you want to share a story about that.

Bobby Tables: So, we somehow ended up engaging with Corey Quinn on Twitter, and we posted to Corey from our corporate FireHydrant account. We said, “If you draw a logo for FireHydrant, we'll make it our logo for the rest of the day on our website.” And there was a little bit of exchange. He said, “Don't say things if you won’t actually do it.” And we said, “We're serious.” And Corey recently posted the exchange that happened in their Slack org, and it basically said, “I need you to draw the duckbill logo peeing on a fire hydrant.” So we get this logo back, I want to say, 30 minutes later, something like that. And I went into our WordPress and uploaded the logo and...

Jonan: You had a duckbill platypus peeing on a fire hydrant...

Bobby Tables: We did.

Jonan: ... as your logo for the rest of the day?

Bobby Tables: We did, yeah.

Jonan: It's amazing.

Bobby Tables: It got a good laugh. It was pretty funny. And we hold that moment near and dear to our hearts and in the company.

Jonan: So you're drinking an old-fashioned?

Bobby Tables: I am.

Jonan: I'm having some scotch. This is my Operation Shamrock glass that I got at New Relic back in the day, when they went and opened an office in Dublin, which I think was their first office overseas. They called it Operation Shamrock.

Bobby Tables: That's great.

Jonan: So the company that you've started now, FireHydrant, how long ago did you start up?

Bobby Tables: The first commit for FireHydrant was in September of 2017. And FireHydrant started as a video series where I wanted to record building an application from scratch. It was a Ruby on Rails application. So I recorded every second of FireHydrant being built for the first 40 hours. And then I had a friend say, "Hey, what you're building is way more valuable." So FireHydrant was a project to basically help with incident response. That was always its intention—to be that, but then I just kind of stopped recording after 40 hours, and then I had a couple of friends join in the fun. And then in December of 2018 is when we raised a seed, and we started the project full-time as a team. So here we are—we have some great clients, and we’re helping incident response.

Jonan: How many people are you up to now? You're growing quickly.

Bobby Tables: We've grown pretty well. We have 16 people now.

Jonan: Wow! And so it was about a year and a half ago?

Bobby Tables: A little over a year and a half ago.

Jonan: That's awesome. What is it like to raise funding for an idea you have? I have a lot of friends who do these startup things. It seems to be kind of a path for a software engineer of your experience level. I just don't understand the process. Do you just announce on Twitter, "Hey, someone give me $1 million." Where do you go?

Bobby Tables: A couple of investors in New York City reached out to us, and I feel bad even telling this story, because there's so many founders that say they had to talk to 50 investors before they were able to get a term sheet. I really feel so lucky being able to say that we had a great idea, we had some really enthusiastic investors, and honestly, that's how we raised it.

We were able to tap into our network that had raised a little bit of capital in the past as well, and we had some advice there. We were guided through the whole process, read a lot of books—a lot of books—we definitely tried to read as much as we could before signing anything. But we're really lucky. We have some of the best investors we probably could have ever asked for. So Work-Bench was our first investor, and then our Series A, which closed earlier this year, is Menlo Ventures, which is a really well-established firm. They have investments in Uber, Roku, Warby Parker. So we feel really lucky and we're solving some really cool problems—we just keep the vision moving us forward.

Jonan: So I looked at some of your blog posts. What I think your product does is that when you have an incident at your company, FireHydrant.io will be initiated by someone or even automatically initiated when some threshold metric is reached, and it manages the entire incident lifecycle. So suddenly, people start to get the right Slack messages to let the right people know that systems are down and what steps to take next. It presents to them playbooks and resources to walk through, fix the problem; try rebooting this, try rebooting that, set up this new instance of this application. And then in the end, it wraps it all up into a tidy report that you can then use for your retrospective analysis of the incident. How'd I do?

Bobby Tables: That was great. Do you want a marketing role?

Jonan: I've been at New Relic now for three weeks.

Bobby Tables: You just joined New Relic. That's probably not a question yeah... That was a great description of FireHydrant. So we think of FireHydrant—as it really is baked into the name—the tool that helps you put out a fire. So with a lot of the alerting tools that are out there, they kind of receive a signal and they'll wake you up. And that's really what these alerting tools are kind of built to do, and they do it very well. One of the pains that I felt and my team felt was that once I get woken up, what now? So FireHydrant was kind of born with that idea in mind. You have a smoke detector, your smoke detector is going to wake you up when it smells smoke, but it's not going to help you at that point. It's just going to get you out of the house to call someone else to fix the problem for you.

Bobby Tables: When you think about a fire hydrant on the street, it's not directly responsible for putting out the fire. Firefighters are responsible for putting out the fire, and the fire hydrant is just an essential tool. So we help you organize the right people as fast as possible. We'll create a Slack channel for your incident. We'll create a Zoom bridge. We can even do things like post a run book for the services that are impacted: Here's how you do a rollback, here's how you send a USR2 signal to reload config. You can store all of that, and FireHydrant makes it really, really easily accessible.

If something's broken, New Relic is telling you, "Hey, your APM is way above threshold." It can cause a sense of panic, and you might forget your process. Engineers, they really just want to do the right thing. No engineer is ever going to do anything in an incident response process that's going to intentionally make something worse. But what happens is that they might go into cognitive tunneling where they forget to create a Slack room or update Statuspage.io or update our status page product. That's common.

So FireHydrant was kind of built around the idea of how we could make it so you have the same process every single time, and we'll do it for you in a few seconds so you can do what you're really good at as an engineer, which is solve the problem.
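
To make the "same process every single time" idea concrete, here is a minimal Python sketch of an incident kickoff expressed as a fixed, ordered list of steps. It is not FireHydrant's implementation or API: the `Incident` class, the step functions, and the `RUNBOOKS` store are all hypothetical, and each step only records what it would do rather than calling Slack, Zoom, or a status page.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    name: str
    services: list[str]
    log: list[str] = field(default_factory=list)


# Hypothetical runbook store: service name -> runbook text.
RUNBOOKS = {
    "payments": "Rollback: redeploy the previous release. Reload config: send USR2.",
}


def create_chat_channel(incident: Incident) -> None:
    incident.log.append(f"chat channel #incident-{incident.name} created")


def create_video_bridge(incident: Incident) -> None:
    incident.log.append("video bridge created")


def post_runbooks(incident: Incident) -> None:
    # Post one runbook per impacted service, or say plainly that none exists.
    for service in incident.services:
        runbook = RUNBOOKS.get(service, "no runbook on file")
        incident.log.append(f"runbook for {service}: {runbook}")


def update_status_page(incident: Incident) -> None:
    incident.log.append("status page set to 'investigating'")


# The same ordered steps run for every incident, so nothing depends on memory.
KICKOFF_STEPS = [
    create_chat_channel,
    create_video_bridge,
    post_runbooks,
    update_status_page,
]


def kick_off(incident: Incident) -> Incident:
    for step in KICKOFF_STEPS:
        step(incident)
    return incident


if __name__ == "__main__":
    for line in kick_off(Incident("db-latency", ["payments"])).log:
        print(line)
```

Keeping the steps in one list is the design choice that matters here: the responder triggers `kick_off` once, and the checklist is carried out the same way whether or not anyone is panicking.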

Jonan: Which makes it a brilliant product, I will reiterate. I have been in the position many times where I get a PagerDuty alert for some application that is technically in my sphere of responsibility but that I haven't worked with very often, or maybe I'm new to the company. Even if I am one of the more experienced engineers on my team, when I get that alert, I go into an application that I haven't coded on in a few months. I don't remember all of the things. I just start poking about at random. It's so valuable to have these run books, which are, from my perspective, a relatively recent innovation. I'm sure that there were people doing this 10 years ago, but I feel like when I first came into the industry about that time, it wasn't a common practice, certainly.

And since then, we've gone all in on this DevOps perspective, a term that arose from trying to bridge the gap between developers and IT or systems folk that, really, are the same thing: We're all working toward the same goals, and we're all writing code, and we're all trying to build things in ways that are replicable using tools like FireHydrant. So what is the distinction in your mind?

Bobby Tables: Yeah, that question is one that I've had many discussions about over beers at many bars. If you just start typing “DevOps” into Google, one thing you'll see pop up is “DevOps versus SRE.” And I think it's important to make a distinction here first. So DevOps has a lot of different ideas. It's certain practices around CI/CD. It's certain practices around building tooling for maybe rolling back deploys or alerting, and the processes around that. It's just that. It's just a bunch of ideas. And that's where SRE kind of comes into play, and it's a framework of those ideas. So think about it in terms of an object-oriented language; Google has a really great presentation about this.

SRE implements DevOps. SRE inherits from DevOps. And that's a nice way to kind of separate the idea of DevOps versus SRE, I think.
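
Read literally, "SRE implements DevOps" can be sketched in Python as an abstract base class and one concrete subclass. The method names and return strings below are illustrative assumptions, not taken from the Google presentation Bobby mentions.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """DevOps as a set of ideas: it says what should happen, not how."""

    @abstractmethod
    def ship_changes(self) -> str:
        ...

    @abstractmethod
    def respond_to_failure(self) -> str:
        ...


class SRE(DevOps):
    """SRE as one concrete way of carrying out those ideas."""

    def ship_changes(self) -> str:
        return "automated CI/CD pipeline with rollbacks"

    def respond_to_failure(self) -> str:
        return "on-call rotation, runbooks, and a retrospective"


if __name__ == "__main__":
    team = SRE()
    print(team.ship_changes())
    print(team.respond_to_failure())
```

The abstract class cannot be instantiated on its own, which mirrors the point being made: the ideas need some implementation before anyone can act on them.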

Jonan: So, site reliability engineering is an implementation of DevOps principles—DevOps is a collection of ideas, and SRE is the actual implementation?

Bobby Tables: Yes.

Jonan: So from outside of that world, I started seeing people use the term SRE about the time I started hearing about DevOps—but presumably, those two didn't happen in tandem, did they? So there was a foundational book about DevOps. There was a book that maybe had “unicorn” in the title?

Bobby Tables: Yeah. The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data, by Gene Kim came out more recently. I think The Phoenix Project by Gene Kim was the one that talked a little bit more directly—in a fictional sense—about DevOps. It’s a really good book. It talks about almost a scary realistic world of DevOps and kind of moving toward it from a non-DevOps world. I haven't read the new one that you're mentioning, The Unicorn Project.

Jonan: I was misremembering because I started going to DevOps meetups. Just in the last year or so, I really took an interest in this kind of stuff. It's fascinating to me to look at systems architecture, the way that I would build an application that I'm designing components for. I'm still kind of a fan of microservices now, the bitty, tiny microservices. There is a threshold where you reach and...

Bobby Tables: Nano services?

Jonan: Nano services, yeah. I think people over-complicated the message there, but I'm still a fan of having apps that approximately have a responsibility. This is where the users live. And architecting a system that way. I use CRC cards sometimes, and I'll do these mind maps when I'm getting ready to set up an application. I feel like systems architecture was very similar, but today, the pieces on the playing field, the number of Legos you have available to you when you're designing a system is huge.

If you look at that CNCF page that describes the applications that are under the Cloud Native Computing Foundation, I think there must be 50 logos on that page. And I feel like it's just seeing this explosive growth. If you use the Kubernetes ecosystem, it's just a barometer for the growth of this type of cloud architecture, I guess. It's exploding. And by next year, we may have twice as many.

So you've got all of these Legos that you're trying to keep track of and to me—from the outside, because it's not my full-time job, as it is yours—it just looks so hard to keep up with. I would rather try to keep up with JavaScript's ecosystem.

Bobby Tables: That's such a good way to compare that. It's interesting that you said Legos, because I talk about this a lot—where if you go to Legoland and you stand very far away, it looks like these sculptures that they're making with Legos have a curvature to them. But if you get close enough, it's still jagged lines. If you put your hand on it, it's still pretty uncomfortable. You still wouldn't want to step on that sculpture.

With these Legos, you're able to make these extremely elaborate systems with all these pieces now. But the problem is you're just making it an extremely complex system. And if you don't have a framework or mindset to really manage that system—if you're not really setting up your team for success to manage that system—you're almost doomed.

I think a lot of the Kubernetes architecture and all of the other projects around the CNCF; service meshes are becoming a huge thing, right? It just adds another layer of complexity, and how do you even graph how a request gets to a process anymore? It's becoming insane as compared to 16 years ago, where I have Apache and I have a little PHP thing listening on port 5000, whatever it was. It's just so different.

Jonan: Back in the day, I used to have to sew together a request myself using a request ID. I would find the app at the front where the user logged in and get the request ID, if we had a request ID and I hoped we did, so I'm not correlating timestamps across microservices. And I'm still searching logs for that request ID and sewing it. And that's all gone now. And we just keep building abstractions that make everything easier.

Bobby Tables: And I can speak from experience with FireHydrant. You almost have to build these things from the start now. You kind of have to have this architecture that is going to have a request ID at the load balancing layer that's going to propagate all the way down into pub/sub. Diagnosing these problems is becoming very, very hard. And that's why service-oriented architecture, while it is good for a lot of companies, it's probably not a good idea to really start with.

We started with a monolith. I came from a microservice architecture for my last two companies, and we said, "Nope, we're going to do a monolith," because if something breaks, guess what? You can only break in one place. So, we're going to go there.
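
A rough Python sketch of the request-ID propagation described here: the ID is attached once at the edge, copied into every log line, and carried inside the pub/sub message so a background worker logs under the same ID. The `X-Request-ID` header name is a common convention assumed for illustration, and the service names, the in-memory `queue`, and the log format are all hypothetical.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # common convention, assumed here


def at_load_balancer(headers: dict[str, str]) -> dict[str, str]:
    """Attach a request ID at the edge unless the caller already sent one."""
    out = dict(headers)
    out.setdefault(REQUEST_ID_HEADER, str(uuid.uuid4()))
    return out


def log_line(service: str, request_id: str, message: str) -> str:
    return f"service={service} request_id={request_id} msg={message!r}"


def handle_web_request(headers: dict[str, str], logs: list[str], queue: list[dict]) -> None:
    request_id = headers[REQUEST_ID_HEADER]
    logs.append(log_line("web", request_id, "checkout started"))
    # Carry the ID inside the message so the consumer can keep using it.
    queue.append({"request_id": request_id, "event": "payment.requested"})


def handle_queue_message(message: dict, logs: list[str]) -> None:
    logs.append(log_line("payments-worker", message["request_id"], "charging card"))


def trace(logs: list[str], request_id: str) -> list[str]:
    """Every log line for one request, across services, without timestamp guessing."""
    return [line for line in logs if f"request_id={request_id} " in line]


if __name__ == "__main__":
    logs, queue = [], []
    headers = at_load_balancer({})
    handle_web_request(headers, logs, queue)
    handle_queue_message(queue.pop(0), logs)
    for line in trace(logs, headers[REQUEST_ID_HEADER]):
        print(line)
```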

Jonan: Service-oriented architecture is designed to address pain. And if you don't have the pain yet, then you don't know correctly which pieces to extract.

Bobby Tables: Right. And I think that one of the interesting things about architecture that's not talked about enough is that it should follow a lot of the same principles used in good, object-oriented programming. We have SOLID principles. We have domain-driven design, single responsibility principle. You talk about open-closed, dependency inversion. But we don't talk about those same concepts in architecture, which is totally realistic to do. Why can't we have a system that is open-closed where we can add a service that extends the functionality of the system? Why can't we do a Liskov substitution where, because we have such a well-defined API for this service, we need to make it a little faster so we swap it with a Go app, but because it has the same API signature, consumers don't care?

We don't talk about architecture and design enough. I think we just throw services at the problem and we redesign the endpoint differently every single time, and we end up with these spaghetti monsters. Uber is even talking about this right now. They have a new blog post about that. I think it was some crazy stat, one service to three engineers, and Uber has thousands, thousands, of engineers. Could you imagine doing that?
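
The substitution Bobby describes can be sketched in Python with a `typing.Protocol` standing in for a service's published API. In a real system the two implementations would be separate network services (the second one perhaps rewritten in Go); here they are plain classes so the sketch stays runnable, and `PricingService`, its `quote` method, and the price table are hypothetical.

```python
from typing import Protocol


class PricingService(Protocol):
    """The published API contract that consumers depend on (hypothetical)."""

    def quote(self, sku: str, quantity: int) -> int:
        """Return the total price in cents."""
        ...


PRICES_CENTS = {"widget": 250, "gadget": 1000}


class OriginalPricing:
    """The existing service."""

    def quote(self, sku: str, quantity: int) -> int:
        return PRICES_CENTS[sku] * quantity


class FasterPricing:
    """Stands in for the rewritten service: same signature, plus a cache."""

    def __init__(self) -> None:
        self._cache: dict[tuple[str, int], int] = {}

    def quote(self, sku: str, quantity: int) -> int:
        key = (sku, quantity)
        if key not in self._cache:
            self._cache[key] = PRICES_CENTS[sku] * quantity
        return self._cache[key]


def checkout(pricing: PricingService, cart: dict[str, int]) -> int:
    """A consumer that only knows the contract, not which service is behind it."""
    return sum(pricing.quote(sku, qty) for sku, qty in cart.items())


if __name__ == "__main__":
    cart = {"widget": 2, "gadget": 1}
    print(checkout(OriginalPricing(), cart))
    print(checkout(FasterPricing(), cart))
```

Because `checkout` depends only on the contract, either implementation can be passed in without changing the consumer, which is the architectural version of Liskov substitution.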

Jonan: I cannot imagine what it's like to be on an engineering team inside that company. I think it has to do with the very, very rapid growth that they experienced. They exploded overnight. They're producing a lot of interesting technology. They have one of the more popular backends for Prometheus. So Prometheus is an application that polls your applications. So rather than me, when I process a payment for customer Bobby Tables, rather than me calling out to some metrics endpoint and saying, "Hey, Bobby just paid us $10," I just kind of write that locally to some temporary store, and Prometheus comes along every minute or so and polls an endpoint on my payments application. It gets all of the data of all of the metrics that I've recorded, and it stores them. But Prometheus is not designed as a project to hold that data long-term. It takes it and puts it into a backend store. And Uber M3, I think, is the name of it. Does that sound right?

Bobby Tables: Yeah. M3 does sound right.
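
The pull model Jonan describes can be sketched in a few lines of standard-library Python: the application only increments local counters, and a collector reads them on its own schedule. This is a simplified stand-in, not the Prometheus client library or its exact exposition format; `LocalMetrics`, `scrape`, and the metric names are hypothetical, and the "endpoint" is a plain function call rather than an HTTP request.

```python
from collections import defaultdict


class LocalMetrics:
    """In-process counter store; the application only ever writes here."""

    def __init__(self) -> None:
        self._counters: dict[str, float] = defaultdict(float)

    def inc(self, name: str, amount: float = 1.0) -> None:
        self._counters[name] += amount

    def render(self) -> str:
        """What a metrics endpoint might return: one 'name value' pair per line."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._counters.items())
        )


def scrape(metrics_endpoint) -> dict[str, float]:
    """The collector's side: pull the current values whenever it chooses to."""
    samples = {}
    for line in metrics_endpoint().splitlines():
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


if __name__ == "__main__":
    payments = LocalMetrics()
    payments.inc("payments_total")
    payments.inc("payment_dollars_total", 10)  # "Bobby just paid us $10"
    print(scrape(payments.render))
```

The payment code never calls out to a metrics service, so a slow or unavailable collector does not slow down the payment path.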

Jonan: That's one of the backends for this. So then Prometheus will use Uber's M3 open source project to store that data longer term. Those kinds of technologies—if you look at the number of time series backends that are available for Prometheus, there's 100 right now. The products that I know of that do the thing you're doing, there's one unified product for handling incident response. So my actual question is not, "Hey, name all your competitors," but why is that not more at the forefront of people's minds? Because it's a thing everyone deals with, right? Every SRE I've ever worked with had the run books and the systems, but it was all hand-rolled. They all had customized artisanal processes. And you're handing a company a whole process in a box. Why isn't that more common?

Bobby Tables: That's a great question. I think that FireHydrant can exist today because we've standardized so many other things. Yes, things have gotten more complex, and Kubernetes has started to definitely dominate the way that a lot of modern application architectures are built. I think every major Fortune 100 right now said that they're experimenting with Kubernetes. I remember some article about that, some insane stat. And I'm talking about that because I think it's the same way that FireHydrant can exist today.

We started standardizing on so many things in the last 10 to 15 years. Containers exist because we were able to standardize on a format. Docker created this standardized, portable format. Without that standardization, what would Kubernetes do? Because all these abstractions with really well-defined interfaces started to exist, Kubernetes was able to come in and bundle those things together.

FireHydrant's the same thing. We had DevOps, and then SRE started to become prominent in the space. I think a lot of organizations are thinking about reliability in a more structured sense. And then we also had people standardize on a chat tool. A lot of the world has moved to Slack.

Jonan: I agree with you 1,000%. I know that you are a fellow Rubyist, so I want to take a minute to appreciate what value comes to a community by having an opinion. I think Rails did an amazing thing in getting a lot of programmers to swim in the same direction at the same time. There is definitely something to be said for standardization, the OpenTelemetry change that has come recently where we now have one way to report metrics. I worked with New Relic's proprietary format back in the day. I would much rather have a metrics format that everyone can use, that we can all interoperate with, right? I feel like we're reaching a point with the DevOps ecosystem and the SRE ecosystem, that we are now learning the value of standardizing how we build these things and how we respond to these things.

So if I were to ask you to make some predictions for the future, what's it going to be like in your space in a year? So in a year, I can call you up to be on the podcast and we'll tell you how wrong you were.

Bobby Tables: Yeah. I'd love that, set a reminder. I have two predictions that I'm actually not making up on the spot. I've been thinking about this a lot. One of them is that we're going to hit a point where service-oriented architectures are becoming incredibly cumbersome and super complex to the point where we're going to need to get away from Google Sheets as the de facto way of listing the services we're running and who owns them. So I think that we're actually going to see a lot of tools come into play for cataloging the services we're running in production. Because there are a lot of challenges that come with running multiple services that are not just around engineering operations. It's a question of, “How do you make this compliant?”

For SOC compliance, you have to log when you deploy a lot of the time. You have to have a deploy log. And when you have 100 services deploying multiple times a day, where does that information go? What is the service? And then people are going multi-region, multi-cloud, even—how do we know where the services are running? That's a challenge that needs to be solved. A lot of companies just have YAML and a GitHub repository to represent that catalog, and that's not going to be sufficient as the year progresses.

Jonan: I think it's particularly relevant to what you do in incident response. If there's an incident, what I want to know right away is what things have changed in the last 24 hours, or one hour, or 10 minutes. What things in the system changed? Because I know enough about computers to know that if you don't poke the bear, it just keeps sleeping. Don't change anything ever, right? So you've got to find those changes. And I think service discovery and those kinds of things are really important to doing that. So your prediction for the next year is that we will have some popular player in the ecosystem or some project that does that service discovery well. And, in an opinionated way, that starts to dominate as the standard?

Bobby Tables: Yeah. Startups and venture capital—as annoying as it might be— are really good indicators of what's happening. If you look at the last six months, a lot of service catalog startups have been coming out where it is their sole responsibility to list the services you're running and the changes that they've recently had. So, it's happening. The challenge is becoming, “How do we define even what a service is?” We're going to start seeing that challenge. And I'm hoping that the CNCF or some governing body creates a definition of what a service is. Is a load balancer that we don't run a service? If you ask me, that is a service, because your customers don't care that it's on Amazon's ELB, but it's a service that you run.
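
As a small illustration of what such a catalog could answer during an incident, the Python sketch below keeps a deploy log and returns what changed inside a time window. The `Deploy` record and `recent_changes` function are hypothetical; a real catalog would also track owners, regions, and other kinds of change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    """One entry in a hypothetical deploy log."""
    service: str
    version: str
    deployed_at: datetime


def recent_changes(deploy_log: list[Deploy], now: datetime, window: timedelta) -> list[Deploy]:
    """Deploys that happened inside the window, newest first."""
    cutoff = now - window
    hits = [d for d in deploy_log if cutoff <= d.deployed_at <= now]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


if __name__ == "__main__":
    now = datetime(2020, 6, 1, 12, 0)
    deploy_log = [
        Deploy("payments", "v41", now - timedelta(hours=30)),
        Deploy("search", "v7", now - timedelta(minutes=50)),
        Deploy("payments", "v42", now - timedelta(minutes=5)),
    ]
    # "What changed in the last hour?"
    for deploy in recent_changes(deploy_log, now, timedelta(hours=1)):
        print(deploy.service, deploy.version, deploy.deployed_at.isoformat())
```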

I think that we're going to need a well-defined version of what a service is and have a spec for it, an actual RFC of what a service is. I think what's going to come with that—or hope so—we need a generalized labeling standard. The labels are all over the place. Every major cloud provider, every major metrics consumer—New Relic being one of them—every single one of these services has a way to tag and annotate things. But there is no definition that says what the keys should be and what the value should be for a service. So I really hope that happens. We're going to have someone define a standard set of labels. You could call it “CNCF Labeling Project,” for all I care, that says the name of the service, the purpose of the service, the component that it powers, the team that owns it, and other key value pairs along those lines.

And if you think about the value that provides—if I say I have this key value system that I can then add to my ELB in Amazon, because Amazon supports tags, and then I have my deployment reference in Kubernetes that has annotations, and I have a metric that comes out and that has a key value pair—if I have a way to slice down all of those keys across the entire request cycle from ELB to deployment to metric, that enables a crazy amount of observability that we don't have today. It's cobbled together. So I think that we're going to see a service catalog come out. And I think to make that a reality, we're going to have to see a labeling standard come out. That will actually be a really interesting turning point for observability and linking that to the services that we've run in a standardized way.
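
Here is a minimal Python sketch of what a shared label set would enable. No such standard exists in the conversation beyond the four keys Bobby lists, so `REQUIRED_LABELS`, the resource records, and both helper functions are assumptions made for illustration.

```python
# The four keys mentioned in the conversation, treated as a hypothetical standard.
REQUIRED_LABELS = ("service", "purpose", "component", "team")


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Which required keys a resource has not been tagged with."""
    return [key for key in REQUIRED_LABELS if key not in labels]


def select(resources: list[dict], **wanted: str) -> list[dict]:
    """Every resource, of any kind, whose labels match all the given pairs."""
    return [
        r for r in resources
        if all(r["labels"].get(k) == v for k, v in wanted.items())
    ]


if __name__ == "__main__":
    payments = {
        "service": "payments",
        "purpose": "charge cards",
        "component": "checkout",
        "team": "billing",
    }
    # Hypothetical inventory spanning a load balancer, a deployment, and a metric.
    resources = [
        {"kind": "load-balancer", "name": "payments-elb", "labels": payments},
        {"kind": "k8s-deployment", "name": "payments-api", "labels": payments},
        {"kind": "metric", "name": "payments_total", "labels": payments},
        {"kind": "k8s-deployment", "name": "search-api", "labels": {"service": "search"}},
    ]
    for resource in select(resources, service="payments"):
        print(resource["kind"], resource["name"])
    print(missing_labels(resources[3]["labels"]))
```

With the same keys on every layer, one query follows a service from load balancer to deployment to metric, and `missing_labels` shows where the tagging is incomplete.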

Jonan: I think that's a wise prediction, and I think it’s the most likely to happen so far in the brief history of Observy McObservface. This now being our third episode, that's quite an honor. We've had many predictions already, but you win the prediction game for now.

Bobby Tables: Maybe I'll write it now to prove a point and say, "Yes, it happened."

Jonan: That should be a blog post. While we're on the subject, you made a blog post about how to make an old-fashioned, is that right?

Bobby Tables: I did.

Jonan: And was there fire in it? Did you use... you burned your...

Bobby Tables: I did, from the back of an orange. I forget the name of the technique, but you can slice off what's called a “coin” on an orange and apply some flames to it as you squeeze and extract some of the oils from the orange and that kind of inserts it into the drink. So that gives it a nice orangy tone.

Jonan: I appreciate this very much. I feel like the old-fashioned has become a developer drink somehow, but everyone has an opinion about that. I know software developers who craft their own bitters from scratch. I'm from Portland, and we just do that kind of stuff up here, but still very into the artisanal handcrafted. But I don't think people should be into our artisanal handcrafted incident response, and I'm really glad your company exists. I will repeat my prediction for the episode, which is that you're going to get bought out in a year and a half maximum, and you're going to stop taking my calls. But I really appreciate you taking this one and being on this podcast with me. It's been a real pleasure having you, Bobby. Thank you so much for joining us.

Bobby Tables: No, the pleasure is mine. Thanks for reaching out, and I'm always happy to talk to the community. And the real goal for us is to just make developers happier with our tool, so if we're able to keep doing that forever, that's a win for me.

Jonan: I am all about developer happiness. That's what I want to do with the rest of my life. Tell us where we can find you on the internet?

Bobby Tables: I don't say too much, but you could find me on Twitter, @bobbytables. I'm also on github.com/bobbytables, happy to chat with anyone, if anything I said sounded interesting.

Jonan: Awesome. Thank you so much. I'll see you again soon.

Bobby Tables: Thank you.

Jonan Scheffler

Jonan Scheffler is a former Director of Developer Relations at New Relic. Jonan spends most of his time staring into tiny boxes and pushing buttons. He likes Ruby, Go, machine learning and playing with robots.

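The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. For questions and support related to this blog post, please join us exclusively at the Explorers Hub (discuss.newrelic.com). This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.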
