Is It Better To Be At Amazon's Mercy Or Your Own?
Show notes:
Links:
Loom
Telestream
Recut
Lovesac
Comfy Sacks
Flipper
Full transcript:
Ben:
You know how we had that recent episode with John Nunemaker about Flipper and feature flags and that sort of thing.
Starr:
Oh, a podcast episode.
Ben:
Yeah. Yeah.
Starr:
I thought you meant a dramatic episode.
Josh:
It's just another episode with John.
Starr:
Oh my God. That guy.
Josh:
That was awesome. Yeah. That was a good conversation.
Ben:
We talked in that conversation about using Flipper at Honeybadger, because we've been using Rollout for our feature flags. If you didn't listen to that episode, you might not know what a feature flag is: it's a branch in your code that conditionally runs some feature. You can limit who it's rolled out to, so you don't have to deploy a new thing to all your customers at the same time. You can test it live.
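A minimal sketch of the kind of flag-guarded branch described here, using Flipper's Ruby API; the flag name, actor, and search methods are hypothetical, not Honeybadger's actual code.

```ruby
require "flipper"

# Hypothetical flag and actor; the actor (a project, a user, etc.) just
# needs to respond to #flipper_id so Flipper can identify it.
if Flipper.enabled?(:new_search_backend, project)
  search_with_new_backend(query)   # new code path, only where the flag is on
else
  search_with_old_backend(query)   # existing behavior for everyone else
end

# Roll the feature out gradually:
Flipper.enable_actor(:new_search_backend, project)            # one project
Flipper.enable_percentage_of_actors(:new_search_backend, 10)  # 10% of actors
Flipper.enable(:new_search_backend)                           # everyone
```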
Josh:
I'm not sure if we actually explained it in that episode.
Ben:
Maybe we did, maybe we didn't.
Josh:
This will be good background.
Starr:
I wasn't there. I'm usually the driving force behind backing up and explaining things.
Josh:
Yeah, Starr is good. Always, yeah, you've been pretty good about that. Yeah.
Ben:
Yeah. I went ahead and did that. I put Flipper in Honeybadger and tested a new feature. We are switching from Postgres to DynamoDB for our notice storage. That's every occurrence of every error. It's a lot of data, and we cut over a few weeks ago to reading from that data in Dynamo, because now it's fully populated with the past month's data and it's being updated. We're basically writing this to two places, and now it's time to read from the new place.
Ben:
I tested that with Flipper, and I'm so glad that I used Flipper for that feature because it saved my bacon this week. I deployed the reading from Dynamo. Oh, actually, we've been doing the reading for a while; what I deployed this week was not writing to Postgres anymore, so stopping the dual writes. I put that behind a feature flag and turned it on just for my projects. I'm so glad I did, because I found a bug that really, really would have caused issues for all of our customers if I had deployed that just willy nilly. Yay for feature flags. Yay for Flipper. Go use it. It's a great thing.
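A rough sketch of how a flag-gated dual-write cutover like this might look; the store classes, flag name, and my_project are made up for illustration.

```ruby
# Dual-write phase: every notice goes to both stores.
def store_notice(project, notice)
  DynamoNoticeStore.write(notice)

  # Keep writing to Postgres everywhere except projects where the flag is on,
  # so a bug in the new path affects one project instead of every customer.
  unless Flipper.enabled?(:skip_postgres_notice_writes, project)
    PostgresNoticeStore.write(notice)
  end
end

# Turn the cutover on for a single internal project first:
Flipper.enable_actor(:skip_postgres_notice_writes, my_project)
```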
Starr:
That's awesome.
Josh:
It's willy nilly. Is that a Ruby joke?
Starr:
How much money do you think that was worth, avoiding that mistake? How much would you pay to do that? A thousand dollars? $10,000?
Ben:
Yeah, it's got to be more than a thousand dollars, for sure.
Starr:
Okay. We're trying to help John with his pricing here.
Ben:
Yeah, totally.
Starr:
I'm sure that Flipper costs a lot less than a thousand dollars. It does.
Ben:
It's worth every penny.
Starr:
Oh, look at that. Real product placement. We're growing up. Look at this podcast we're doing. We just slid that right in.
Ben:
Yeah. In other infrastructure news, I got to say that having your primary search cluster die is not a fun experience, especially when it happens at 4:30 in the morning.
Josh:
Yeah.
Ben:
But I will say this: props to Amazon, because we host our Elasticsearch cluster with Amazon. Yay for not having to figure out how to be an expert at running Elasticsearch myself and having to repair things when they went sideways. Also, the tech support was great. They zeroed in on what the issue was. It's our fault apparently, or kind of. The real explanation is, everything was looking fine to me. All the stats were green. I had monitored six different things based on the documentation that Amazon provided. All those things were fine. There were no alarms. It just died. I'm like, "What the heck's going on?" That's why I opened a ticket.
Ben:
It took them a while to find out what was going on. It took them, oh, I don't know, two or three hours because they were a little perplexed because everything looked fine. Really what it came down to was the CPU spikes that we had. We had some CPU spikes that went over 90% and this was not in their documentation, but apparently that's a really bad thing. We had enough of those spikes that it just gave up the ghost finally. They encouraged us to upgrade the cluster, which I did. Once that was all done and deployed, then everything was fine. I made a suggestion that they might update their documentation for monitoring that particular metric. They appreciated that suggestion.
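A hedged sketch of what an alarm on that kind of metric might look like with the Ruby CloudWatch SDK; the domain name, account ID, SNS topic, and threshold are hypothetical, not the alarms actually set up here.

```ruby
require "aws-sdk-cloudwatch"

# Hypothetical alarm on sustained CPU spikes for a managed Elasticsearch domain.
cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")

cloudwatch.put_metric_alarm(
  alarm_name: "es-cpu-spike",
  namespace: "AWS/ES",                   # Amazon Elasticsearch Service metrics
  metric_name: "CPUUtilization",
  dimensions: [
    { name: "DomainName", value: "my-search-domain" },   # assumed domain
    { name: "ClientId", value: "123456789012" }           # assumed account ID
  ],
  statistic: "Maximum",
  period: 300,                           # 5-minute windows
  evaluation_periods: 3,                 # alarm after 3 consecutive breaches
  threshold: 80.0,
  comparison_operator: "GreaterThanThreshold",
  alarm_actions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
)
```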
Ben:
After things were all good yesterday, I was decompressing and things were back to normal. I had done the backfill. I was feeling pretty good about where we were. It wasn't a hair-on-fire situation, right? The app has been architected so that even if we lost our search cluster, it's okay. The whole app doesn't die, right? You can still use Honeybadger. We're still processing errors. We're still sending alerts. People are still using the UI. The way we decided to ingest the data into the search cluster was to delay it, put it in a separate queue, so that we could still be processing data and could replay it when the cluster came back and I was ready for indexing.
Ben:
I had just spent several hours on building some pretty awesome, in my opinion, backfill scripts using SQS and Lambda. All I had to do was queue up all those things that didn't get processed and they got processed. They got back-filled, so yesterday afternoon, I was looking out my kitchen window and I was feeling pretty happy. I was like, "That went really, really well for having such a really bad thing happen."
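A sketch of what a queue-and-replay backfill like that might look like with the Ruby AWS SDK; the queue name, notice lookup, and indexing call are hypothetical, not the actual scripts described above.

```ruby
require "aws-sdk-sqs"
require "json"

# Park index jobs in an SQS queue while the cluster is down, then drain them
# once it's healthy again.
sqs = Aws::SQS::Client.new(region: "us-east-1")
queue_url = sqs.get_queue_url(queue_name: "search-backfill").queue_url

# Enqueue every notice that missed indexing during the outage, in batches of
# 10 (the SQS maximum per send_message_batch call).
unindexed_notice_ids.each_slice(10) do |batch|
  sqs.send_message_batch(
    queue_url: queue_url,
    entries: batch.map { |id| { id: id.to_s, message_body: JSON.dump(notice_id: id) } }
  )
end

# A Lambda subscribed to the queue then does the indexing, e.g.:
# def handler(event:, context:)
#   event["Records"].each do |record|
#     notice_id = JSON.parse(record["body"]).fetch("notice_id")
#     SearchCluster.index(Notice.find(notice_id))  # hypothetical indexing call
#   end
# end
```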
Josh:
That's awesome. Yeah. Yeah. I noticed yesterday that the outages we've been having lately don't usually even seem to be our fault. It's when Amazon has an issue, which I guess, the way you look at it, on one hand means we're at Amazon's mercy now. I think that's the other side of the story. But it is nice that we're not dealing with the failures you get if you're running your own box, where you're responsible for every little thing, like network failures, for instance. When we used to have DNS go out or something, or those types of things, it's nice not having those types of issues. I'd much rather be at Amazon's mercy, I think, than be at the mercy of myself.
Ben:
Right. Yeah.
Starr:
I don't know, this has a little bit of a nostalgic flavor to it, right? Just a random, oh, if your CPU usage goes over X amount, your cluster just dies. That's the Elasticsearch I know and love from back in the day. It was nice. It's nice to stay in touch with our roots every now and again.
Josh:
It seems like that would be the kind of thing they could at least have a default notification for. If they know that that's a terrible situation, why don't they just have an email that automatically gets sent to you: "Oh, we noticed you're not monitoring those sorts of things." I could see why they wouldn't want to, but it just seems like it would be a nice touch.
Ben:
Yeah. That's not the way Amazon does things.
Josh:
That's not Amazon. I know.
Ben:
Yeah. They're really a sharp-knives kind of company. It's like, "Here are all the tools, and we'll give you some good guidance, but you have to go and look for that guidance." I mean, literally, we have eight CloudWatch alarms set up for our Elasticsearch cluster. All of them came from the documentation where Amazon says, "Here, y...