It’s important that we provide users with the best experience. Part of that means keeping our service available even through hardware failures. And when things do go wrong, we need systems in place to monitor key metrics and send alerts to services and to our team. Initially, we chose Datadog as our monitoring solution because it was easy to set up and it integrated with the services we used. Then we started scaling our customers’ infrastructure to keep up with demand and watched our server count grow from 5 to 500+. This didn’t jibe with Datadog’s per-server pricing model, which increased our bill from $75 to $7,500+ per month. To move away, we needed a replacement that auto-discovered new servers, collected host and container metrics, alerted us on abnormal conditions, and made it easy to visualize data. We turned to the open-source world and discovered Prometheus, a monitoring solution originally built at SoundCloud.
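A setup covering those requirements might look roughly like the following `prometheus.yml` sketch (the job names, ports, and EC2 region here are illustrative assumptions, not our actual configuration):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Auto-discover hosts from EC2 instead of listing them by hand
  - job_name: 'node'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100          # node_exporter exposes host-level metrics here
  - job_name: 'cadvisor'
    ec2_sd_configs:
      - region: us-west-2
        port: 8080          # cAdvisor exposes per-container metrics here

# Alerting rules (the abnormal-condition checks) live in separate rule files
rule_files:
  - 'alert.rules'
```

Grafana, or Prometheus’s built-in expression browser, then covers the visualization requirement.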
The official blog from the team at Runnable.
A couple of weeks ago, Node.js released its latest LTS: version 6.9.0. I realized this was the case because one of our services broke: it used the nodejs:lts image and got upgraded by mistake. Inspired by this breakage, I decided to dig into what was new in this version of Node. In this blog post, I’ll walk you through some of the new ES6 additions to Node and how they will change the code you write.
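For a taste of what v6 enables, these ES6 features now run without any flags (an illustrative snippet, not code from our services):

```javascript
// Default parameters and rest arguments — no flags needed in Node 6
function greet(name = 'world', ...rest) {
  return `Hello, ${name}!` + (rest.length ? ` (+${rest.length} more)` : '');
}

// Object destructuring with defaults
const { port = 8080, host = 'localhost' } = { port: 3000 };

// Array spread
const parts = ['a', 'b'];
const all = [...parts, 'c'];

console.log(greet('Node'));     // Hello, Node!
console.log(`${host}:${port}`); // localhost:3000
console.log(all.join(''));      // abc
```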
In part I, I explored the less-obvious advantages of the microservices architecture that we’ve discovered while building Runnable. In this part, I’ll explain how the microservices architecture creates happier and healthier teams.
Frontend applications always have a multitude of user interactions and flows that can lead a user to a particular state. Sometimes these states are unintended, and errors happen. Errors can be incredibly difficult to track down, so a reliable process for finding the root cause of an error can save a lot of time and confusion. Our process involves using a few services in conjunction.
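As a sketch of the first step — capturing errors at all — a global handler can forward uncaught errors to a reporting endpoint. The `/errors` endpoint below is hypothetical; in practice an error-tracking service provides this wiring for you:

```javascript
// Build a structured report from the arguments window.onerror receives
function buildErrorReport(message, source, line, column, error) {
  return {
    message,
    source,
    line,
    column,
    // the stack trace is usually the key to finding the root cause
    stack: error && error.stack,
  };
}

// Attach globally in the browser (the '/errors' endpoint is a placeholder)
if (typeof window !== 'undefined') {
  window.onerror = function (message, source, line, column, error) {
    fetch('/errors', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(buildErrorReport(message, source, line, column, error)),
    });
    return false; // let the browser's default handler run as well
  };
}
```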
After our announcement of general availability, our team started focusing on improving our onboarding flow. A key bottleneck was the time it took for us to spin up infrastructure for each new user. This could take up to ten minutes, delaying their initial exposure to our product. We knew we were facing a common problem: reducing the amount of time that it takes to add resources and servers to your infrastructure.
No matter how high the click-through rates on your advertising are, they won’t translate into active users without an onboarding flow that helps users quickly understand the value of your product. This is what we’ve been focusing on now that Runnable is generally available. Here’s the approach we’ve been using to help more users “see the light”.
Recently, GitHub announced a totally new way for applications to integrate with its service. This will allow applications to act as independent entities on GitHub. Currently, applications must always impersonate a user who has the necessary permissions to perform a given action. It can be a headache managing which user an application needs to impersonate in order to do the work it needs to do. GitHub’s recent change affects how applications can receive webhooks, how they interact with users, and how they connect to GitHub. Here are three ways Integrations will be easier to implement and maintain.
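To sketch the difference: instead of acting with a user’s token, an Integration authenticates as itself by signing a short-lived JWT with its own private key and exchanging it for a per-installation access token. The helper below only builds the JWT claims; the id is a placeholder, and the actual RS256 signing and token exchange happen through a JWT library and the GitHub API:

```javascript
// Claims for the JWT an Integration signs with its private key.
// GitHub then issues an installation-scoped access token in exchange.
function buildIntegrationJwtPayload(integrationId, nowSeconds) {
  return {
    iat: nowSeconds,        // issued at
    exp: nowSeconds + 60,   // keep the token short-lived
    iss: integrationId,     // identifies the Integration, not a user
  };
}

// Example claims for a hypothetical Integration id
const payload = buildIntegrationJwtPayload(1234, Math.floor(Date.now() / 1000));
```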
When we first started deploying containers across multiple servers, we managed scheduling ourselves. We had to maintain cluster state and determine the best place to schedule a container. We had a solution, but it was not elegant or pretty. When Swarm came out, it promised to solve our scheduling woes. Unfortunately, using it in production hasn’t been as straightforward as we’d hoped. In this post, I’ll cover the problems we encountered and how we worked around them.