Monitoring microservices with Synthetic Transactions in Go
March 10, 2016
At Unacast we are building software that scales to process millions of proximity interactions each day. We believe the best approach for building scalable and agile systems is microservices; that is, we believe in small, smart and focused applications over large ones. It also enables us to continuously experiment with new technology, have fun, learn things and always choose the best tool for a specific task rather than make unnecessary or boring tradeoffs.
I have lately been looking into Go and wanted to gain some experience with building software in it. So I used this opportunity to build a real system using Go.
Monitoring Microservices
However, as you might already know, running and keeping track of multiple services is hard, and monitoring is therefore essential. Since most of our services run in Docker, we use Kubernetes to keep our services running if they crash unexpectedly, and we couple Kubernetes with DataDog to monitor all our software environments.
Yet, when bugs that do not crash a service find their way into production, Kubernetes' monitoring is of no use. With that in mind it is easy to see that we need some other type of monitoring to tell whether our services are healthy. We decided to experiment with a concept called Synthetic Transactions.
Synthetic Transactions
A synthetic transaction is a monitored transaction performed against a system running live in production. Such transactions are used to verify that the system at hand performs as expected.
In the world of e-commerce, a synthetic transaction can be a transaction that continuously tries to place an order and monitors whether that order succeeds. If it does not, it is an indicator that something is wrong and should get someone's attention immediately.
At Unacast we use synthetic transactions to monitor our Interactions API. The single purpose of the Interactions API is to accept and ingest interactions for further processing. Since we are a data-oriented company, we are in big trouble if the Interactions API starts failing silently.
Building a synthetic transaction tester
Usually we buy such services, and there are a lot of great tools out there, such as NewRelic Synthetics and Pingdom. But since the synthetic transactions have to know the inner workings of our infrastructure, we decided to try to build it ourselves.
There are several ways of building synthetic transactions. Ideally, they should represent a complete transaction. However, I would argue that it is smarter and more pragmatic to build step by step. In this post we will go through the first step of building a synthetic transaction tester. We will share the subsequent steps in future posts.
Step 0: Monitor for expected response
The first natural step is to create a monitor that runs regularly and checks if a known request gets the expected response.
Performing an HTTP request is simple and is easily done with the stdlib alone:
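A minimal sketch of such a check, assuming a hypothetical endpoint and payload schema (the SyntheticPayload field names here are illustrative, not the actual Interactions API schema):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// SyntheticPayload is the specific request object we want to send.
// The fields are illustrative placeholders for a proximity interaction.
type SyntheticPayload struct {
	DeviceID  string    `json:"deviceId"`
	BeaconID  string    `json:"beaconId"`
	Timestamp time.Time `json:"timestamp"`
}

// syntheticHttpRequest posts a known payload to the given URL and
// checks that the API answers with 202 Accepted.
func syntheticHttpRequest(url string) error {
	payload := SyntheticPayload{
		DeviceID:  "synthetic-device",
		BeaconID:  "synthetic-beacon",
		Timestamp: time.Now(),
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return fmt.Errorf("marshalling payload: %v", err)
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("request failed: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("unexpected status: got %d, want %d",
			resp.StatusCode, http.StatusAccepted)
	}
	return nil
}
```

Calling `syntheticHttpRequest` against the API's ingest endpoint then returns nil only when the service accepts the known payload.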
In the code above we specify a SyntheticPayload, the specific request object we want to send. We build and send the payload in syntheticHttpRequest and inspect the http.Response to check that the returned HTTP status code is 202. If it is not, or the request fails, we suspect that something is wrong with the API and return an error indicating that further action should be taken.
In the event that a synthetic transaction fails, we use DataDog, integrated with Slack and PagerDuty, to notify us that something fishy is going on. Specifically, the synthetic transaction tester sends Events to DataDog through their API. We did this using an unofficial DataDog Go library by zorkian, and it looked something like this:
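The original snippet is not reproduced here; as a library-free sketch, the same event can be posted directly against DataDog's v1 events endpoint, which the zorkian library wraps. The baseURL parameter is an assumption added so the call can be pointed at a test server:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ddEvent mirrors the fields of DataDog's v1 events API
// (POST /api/v1/events). Tags are what let alerting rules tell
// error events and non-critical environments apart.
type ddEvent struct {
	Title     string   `json:"title"`
	Text      string   `json:"text"`
	Tags      []string `json:"tags,omitempty"`
	AlertType string   `json:"alert_type,omitempty"` // e.g. "success" or "error"
}

// postEvent sends an event to DataDog. In production baseURL would be
// https://api.datadoghq.com; it is a parameter here for testability.
func postEvent(baseURL, apiKey string, ev ddEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/v1/events?api_key=%s", baseURL, apiKey)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("datadog returned status %d", resp.StatusCode)
	}
	return nil
}
```

An "OK" event would then be posted as e.g. `postEvent(baseURL, apiKey, ddEvent{Title: "synthetic transaction", Text: "OK", Tags: []string{"env:production"}, AlertType: "success"})`.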
This is a simple way of telling DataDog that everything is OK. We use the event tags to separate error events from events coming from non-critical environments. This is important because no one wants to get a call in the middle of the night because the development or staging environment is not working optimally.
Finally, we need to be able to schedule these requests. For that we used cron for Go. Putting all the parts together, we got something that looked like the following code snippet.
Disclaimer: The code above is just a simplification of how it can be implemented and does not show a complete implementation.
Monitoring the synthetic transaction tester
As you might have guessed, the synthetic transaction tester is also a microservice. So how should we proceed to monitor it? It obviously cannot monitor itself. The solution was to monitor the “OK” events we described earlier. If these events suddenly stop arriving at DataDog, we know that something is wrong and can react accordingly.
Further steps
A simple but powerful extension of step 0 is to log metrics such as response time for each request. Such external metrics will be a more accurate measure of response time than just calculating the process time internally in a service. It can also be used to trigger alerts if the system is slow to respond, indicating a more critical issue that requires further investigation.
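As a sketch, timing the check is just a matter of wrapping the request; here http.Get stands in for the synthetic POST shown earlier:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// timedCheck performs the synthetic request and also returns the
// round-trip time, so the duration can be shipped to DataDog as a
// metric alongside the OK/error event.
func timedCheck(url string) (time.Duration, error) {
	start := time.Now()
	resp, err := http.Get(url)
	elapsed := time.Since(start)
	if err != nil {
		return elapsed, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return elapsed, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return elapsed, nil
}
```

Because the clock runs around the whole request, the measurement includes network latency that an in-service timer would miss.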
In the future it will be natural to extend the synthetic transaction service to verify that data has been processed. In our case, interactions are processed and safely persisted when they reach one of our BigQuery tables after passing through App Engine, Pub/Sub and Dataflow. It is therefore natural for us to extend the synthetic transaction monitor to check and verify that the transactions have been persisted as expected.
In addition to verifying that transactions have been persisted as expected, we could also start to measure the expected processing time of an interaction and use this measurement to monitor whether our system as a whole works efficiently. We can also use the same numbers to verify that the system delivers as promised according to the SLA.
Finally, an extension could be to support live consumer-driven contract testing; that is, to explicitly check and verify that the response payload is correct. By doing so we can go to bed at night without worrying about whether we have broken the API for any of its consumers.
Enjoyed this post?
We are still learning and eager to share what we learn along the way. If you enjoyed this post, I recommend that you keep in touch, because there is a lot more to come. Also, check out some of the posts about microservices written by my awesome colleagues:
Heterogeneous Kubernetes-clusters on Google Cloud Platform with node-pools