A man comes to the foreign embassy and says: – I am a spy, I want to surrender.
He is asked: – Whose spy are you?
He says: – I am an American spy. – Well, then, you need to go to room #5.
He goes to the room #5: – I am an American spy, I want to surrender. – Do you have weapons? – Yes, I do. – Please go to room #7.
He goes to the room #7: – I’m a spy, I want to surrender, I have weapons. – Please go to room #10.
He goes to the room #10: – I’m a spy, I want to surrender, I have weapons. – Do you have communications equipment? – Yes, I do.
– Please go to room #20.
He arrives to the room #20:
– I’m a spy, I have weapons and communications equipment, and I want to surrender.
He is asked:
– Do you have a mission?
– Yes, I do.
– Well, then go and execute your mission. Stop distracting people from their work!
This joke is mainly about the bureaucracy of large organizations, but I find the joke fascinating for another reason. In a complex microservices architecture, the same bureaucracy can also take place.
An HTTP request is passed from one service to another then another. When the request is rejected, we often lose the entire trace and never know where it went wrong.
Let’s imagine a scenario where you work for an online shopping website. One customer bought a lamp from your site on Jan 5, 2023, but the lamp was never delivered. They reach out to customer support and ask what happened. Thus the nightmare begins.
“Buying a lamp” is an HTTP request to place an order. You reach out to the API gateway team, and they quickly find the request. They told you the request was correctly passed down to backend service A. There are no issues at the API gateway layer.
Then you go to team A and ask for their help. They finally found the request log by looking through all logs on Jan 5. Again, no failures. The request is successfully passed down to internal service B.
You have to do the same for services B, C, D. However, team D cannot find any log related to this HTTP request. No one knows where the request went, and no one knows why it failed.
No wonder why Elon Musk wants to shut down 80% of Twitter’s microservicesA hypothetical microservices architecture
What is Distributed tracing, and why OpenTelemetry?
The answer to the problem is distributed tracing. It is a method to observe the request propagate through microservices. With distributed tracing, a request leaves a trace of the footprint on each service it makes a request to. When you want to diagnose a request failure, you must find all the logs tagged with the request trace. It gives you a full picture of the request lifecycle.
Tracing HTTP Requests with OpenTelemetry
A real trace of Nylas request generated by OpenTelemetry
OpenTelemetry is the most popular tool for distributed tracing. From the dashboard, you can see the Trace ID, the journey of a request, and logs related to that trace. The trace also contains the latency of all service hops in the process. It becomes a lot easier to diagnose issues in a microservices system.
The essential components of OpenTelemetry are trace and span:
A trace is the entire life of a request. It starts when a user calls the first service and ends when the last service responds back to the user.
A span is a step the request takes. Multiple spans form a single trace. Span usually maps to a function. A service can have many functions, so it also can have multiple spans.
A span has attributes. You can attach custom info in a span, such as http.method, user_id, env, etc. Span can have a parent-child relationship. If span A is a child of span B, this means function A is invoked inside function B.
This is a typical trace for an HTTP request:
Typical trace for an HTTP request
The API gateway (blue) span is the parent (or grandparent) span of all other spans. It starts when receiving HTTP request from user and ends when the HTTP response is sent back. The span latency is 1000 milliseconds. Let’s break this down further:
API gateway has two dependencies; thus it has two child spans: auth function (red) and backend service handler (purple).
The backend service handler span has two child spans, each with its own child.
Note the backend service have three spans in total because a span maps to a function, not a service.
Like the backend service, the database adaptor service (green) and internal service (orange) also have their own span. They take 200 ms and 250 ms respectively.
Below is a trace of a Nylas request; look how similar it is to the previous example:
Tracing PubSub Flow with OpenTelemetry
If an HTTP request can be traced, can we also trace a PubSub flow?
The answer is yes. Let’s take a look at another example. This is a trace for a flow connected by message queues (Google PubSub):
Typical trace for a flow connected by message queue
Let’s break down the above diagram further:
Service 1 (blue) handles an event and sends PubSub message X. The span for service 1 ends.
Service 2 (red) receives PubSub message X and processes it.
Service 3 (purple) also receives PubSub message Y. It runs function A and sends PubSub message Y.
Service 4 (green) receives message Y and processes it.
Let’s see the Nylas trace of a PubSub flow:
The first flow is synchronous. The API gateway span is the ancestor of all other spans. Meanwhile, the second flow is asynchronous. There is no common ancestor. The entire trace looks like a relay race.
Inside the same Golang service, traces are passed by the ctx object. Between services, traces are passed by traceparent HTTP header. We will talk more about the traceparent header shortly.
Now we know the basic concept of OpenTelemetry. Let’s look at how to set up distributed tracing from scratch.
Setting up a Distributed Tracing From Scratch
Deploy OpenTelemetry collector
The first step is deploying OpenTelemetry collector. OpenTelemetry collector is a service that pulls traces from microservices and sends them to a monitoring dashboard (such as NewRelic, Jaeger, Prometheus, Coralogix).
This is how distributed tracing works in a microservice architecture:
In the diagram above, OpenTelemetry collector gathers traces from microservice A, B, C and then sends tracing to three monitoring services. OpenTelemetry itself does not store data, and neither does it provide a trace monitoring UI. You will need to pay for observability providers such as New Relic, Jaeger, or Coralogix.
Let’s look at the OpenTelemetry collector under the hood. It has three components:
Receivers receive trace data from microservices. It can be push or pull based.
Processors process traces before sending it to the monitor. It can filter and sample traces.
Exporters send data to monitoring services. It can also be push or pull based.
Let’s update our diagram with more details:
The collector has one receiver, three processors, and three exporters.
OpenTelemetry collector can be deployed with Helm:
After you run the helm install command, OpenTelemetry collector will be available in the url my-opentelemetry-collector.default.svc.cluster.local:4317. With the collector in place, we can start sending traces from microservices.
Setup tracing in a Go microservice
In a Go microservice, you need to initialize tracing in the main.go function:
package main
// Use service name as tracer name
var tracer = otel.Tracer("my-server")
func main() {
// Connect to collector
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
conn, err := grpc.DialContext(ctx, "my-opentelemetry-collector.default.svc.cluster.local:4317", grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
if err != nil {
panic(err)
}
// Set up a trace exporter
exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))
// Handle error and create tracer provider
if err != nil {
panic(err)
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("my-server"),
)),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
// Shut down tracing gracefully
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer provider: %v", err)
}
}()
// Start server
// ...
}
Now the server can send a trace to OpenTelemetry collector. Let’s look at how to pass traces between services.
Tracing HTTP (1.1) service
By W3C standard, trace data is passed from client to server using traceparent HTTP header:
Tracing version is 00. As of this blog, the latest version is still 00.
Trace id is 351772fff98e2108663b0d55f8eb5466. This does not change for the lifecycle of a request.
Parent span id is 46158e065b2f87cb. This is the span of the HTTP client who sent the request.
Trace flag is 01. As of this blog, the only valid flag is 01, which means the trace is sampled.
The full specification of HTTP tracing is written here: https://www.w3.org/TR/trace-context/. The good news is we don’t have to manually write or parse this header. It is already handled by OpenTelemetry library.
Server side
In Nylas, we use Go Fiber for all public APIs. Fiber has an OpenTelemetry middleware called otelfiber. It automatically parses traceparent header, and creates a new span with HTTP endpoint as its name.
The parsed trace and the new span are put in the fiber context (c *fiber.Ctx). You can call ctx = c.UserContext() to translate fiber context c into a Go context ctx. Then, you can create child spans from ctx.
Here is an example Fiber server with tracing enabled:
You will see a span created with name /hello. In helloWorld function, we created a child span named helloWorld.
Client side
HTTP client can serialize a Go context object into traceparent header automatically. When making a request, the HTTP client creates a new span with url as span name. Here is an example:
// Create client with otel
client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
// Create a request with Go context
req, err := http.NewRequestWithContext(ctx, "GET", "http://some-url.com", nil)
if err != nil {
return "", err
}
// Send request
res, err := client.Do(req)
if err != nil {
return "", err
}
defer res.Body.Close()
// Read response
body, err := io.ReadAll(res.Body)
if err != nil {
return "", err
}
fmt.Println(string(body))
A span http://some-url.com will be created. Its parent span is the one carried by Go context object.
Tracing grpc service
In grpc, trace is passed with metadata traceparent. Again, this is automatically handled by OpenTelemetry library.
Upon receiving a grpc request. The traceparent metadata is deserialized. Then the server automatically creates a span named <grpc-service-name>/<grpc-action-name>, and put it in the Go context object.
Similar to HTTP 1.1 service, grpc client also creates a span called <grpc-service-name>/<grpc-action-name>, and serialize it into traceparent attribute.
Tracing PubSub
Google PubSub does not offer a standard tracing library, so we have to write our own. We manually create a PubSub attribute called traceparent in each message. Then we handle serialization and deserialization with our own library.
Publish message
This is an example library for publishing a message:
I wrote a proof of concept for OpenTelemetry integration: https://github.com/quzhi1/OpenTelemetryPlayground. It contains two services: service A and service B. Service A can call service B in both synchronous and asynchronous ways. You can see how tracing works in both cases.
Conclusion
Diagnosing request failures in microservice is hard. We solved this problem by using distributed tracing with OpenTelemetry. OpenTelemetry collector gathers traces from all Nylas services, and sends to our observability provider (Coralogix). We figured out how to propagate trace with HTTP, grpc and Google PubSub. With OpenTelemetry, we have a complete picture of a request’s lifecycle.
Special thanks to all Nylanauts who helped with the OpenTelemetry integration:
Testing microservice architecture locally is difficult. At Nylas, we’ve created a local development environment using Minikube and Tilt. Let’s take a look at how we enable local testing for microservices at Nylas.
Learn how Nylas uses Kong API gateway to authenticate, control routing, and transform requests/responses.
Zhi Qu
Zhi is a staff engineer who leads the Developer Velocity team at Nylas. In 2021, he left Stripe and became a Nylanaut. Zhi is also a history buff who loves traveling.