This blog post is aimed at Golang developers who want to improve the observability of their services. It skips over the basics and jumps straight to advanced topics such as asynchronous structured logging, metrics with exemplars, tracing with TraceQL, pprof aggregation and continuous profiling, microbenchmarks and basic statistics with benchstat, blackbox performance tests, and a basic PID controller for determining a system's maximum load. We'll also briefly touch on current research in the observability field, including active causal profiling and passive critical path detection.
The Three Pillars of Observability: Logs, Metrics, and Traces
If you're reading this post, you most likely don't need a refresher on the basics of observability. Let's dive into the less obvious parts instead and focus on making it as easy as possible to move between the three main observability surfaces. We'll also discuss how to bring tracing into the mix, so that pprof data can be linked to traces and back.
If you're looking for a short and to-the-point introduction to monitoring fundamentals, as well as ways to quickly bring basic observability to your service, Cindy Sridharan's "Distributed Systems Observability" is a great place to start.
Structured Logging
Logging can become a bottleneck if you're not using a zero-allocation logging library. If you aren't already, consider zap or zerolog - both are excellent choices.
Go also has an ongoing proposal to add structured logging to the standard library: slog. Be sure to check it out and give feedback on the proposal!
Structured logs are essential for extracting data from your logs. Adopting a json or logfmt format simplifies ad-hoc troubleshooting and allows for quick-and-dirty graphs/alerts while you work on proper metrics. Most logging libraries also come with ready-made hooks for gRPC/HTTP clients and servers, as well as for common database clients, which greatly simplifies introducing them into an existing codebase.
If text-based formats feel too inefficient, there is plenty of room to optimize your logs. For example, zerolog supports a binary CBOR format, and Envoy has protobufs for its structured access logs.
In some cases, logging itself becomes the performance bottleneck. You don't want your service to get stuck just because Docker can't pull events from the stderr pipe fast enough with DEBUG logging enabled.
One solution is to sample your logs.
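For example, a minimal sketch using zerolog's built-in samplers (zap offers similar functionality), keeping only every 10th DEBUG event:

```go
package main

import (
	"os"

	"github.com/rs/zerolog"
)

func main() {
	// Sample only DEBUG events; INFO and above are always emitted.
	log := zerolog.New(os.Stderr).With().Timestamp().Logger().
		Sample(zerolog.LevelSampler{
			DebugSampler: &zerolog.BasicSampler{N: 10}, // keep every 10th DEBUG event
		})

	for i := 0; i < 100; i++ {
		log.Debug().Int("i", i).Msg("noisy debug event")
	}
	log.Info().Msg("important event, never sampled away")
}
```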
Alternatively, you can make their emission fully asynchronous, so they never block:
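A sketch using zerolog's diode writer - a non-blocking ring buffer that drops (and counts) messages instead of blocking the caller when the consumer can't keep up:

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/diode"
)

func main() {
	// Writes to the logger never block; if the 10k-entry buffer fills up,
	// messages are dropped and the callback reports how many were lost.
	wr := diode.NewWriter(os.Stderr, 10000, 10*time.Millisecond, func(missed int) {
		fmt.Fprintf(os.Stderr, "logger dropped %d messages\n", missed)
	})
	defer wr.Close()

	log := zerolog.New(wr).With().Timestamp().Logger()
	log.Info().Msg("this call returns immediately")
}
```

The obvious trade-off is that you may lose the most recent messages on a crash or under sustained overload, so keep the buffer size and the drop callback in mind.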
One additional note for those of you using Grafana and Loki: you likely want to set up derived fields. This way, you can extract fields (such as a trace ID) from log lines and turn them into links to arbitrary URLs.
Consider including a trace ID in every log message if the context indicates that tracing should be enabled. You’ll thank yourself later for that.
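For example, a small helper along these lines (a sketch assuming the OpenTelemetry SDK and zerolog; the helper name is made up):

```go
import (
	"context"

	"github.com/rs/zerolog"
	"go.opentelemetry.io/otel/trace"
)

// loggerWithTrace returns a child logger that stamps every entry with the
// trace_id from the context whenever the current span is sampled.
func loggerWithTrace(ctx context.Context, log zerolog.Logger) zerolog.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsSampled() && sc.HasTraceID() {
		return log.With().Str("trace_id", sc.TraceID().String()).Logger()
	}
	return log
}
```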
Metrics
Let's assume you're already using Prometheus-style metrics in your service. But what do you do when you see a spike on your graph and need to figure out what caused the slowdown (spoiler: it's probably the database)? Wouldn't it be great to jump from the metric directly to the trace of the slow request? If so, ExemplarAdder and ExemplarObserver are for you:
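Here's a minimal sketch with client_golang (the metric and label names are made up):

```go
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

var requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds", // hypothetical metric
	Help:    "Request latency.",
	Buckets: prometheus.DefBuckets,
}, []string{"path"})

func observeRequest(ctx context.Context, path string, elapsed time.Duration) {
	obs := requestDuration.WithLabelValues(path)
	if sc := trace.SpanContextFromContext(ctx); sc.IsSampled() {
		// Attach the trace ID as an exemplar so Grafana can jump from the
		// histogram bucket straight to the trace.
		obs.(prometheus.ExemplarObserver).ObserveWithExemplar(
			elapsed.Seconds(),
			prometheus.Labels{"trace_id": sc.TraceID().String()},
		)
		return
	}
	obs.Observe(elapsed.Seconds())
}
```

Keep in mind that exemplars are only exposed when the metrics endpoint is scraped in the OpenMetrics format (promhttp.HandlerOpts{EnableOpenMetrics: true}).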
Note that you can put not only a trace_id but also arbitrary key-value data into the labels, which is especially useful for multi-tenant environments where you can include user_id or team_id. This can be a cost-effective solution to the high-cardinality metrics problem.
Tracing
Tracing is essential for performance analysis in today's world, so most services have it enabled. The industry's journey to tracing has been a bumpy one, from OpenTracing to OpenCensus and now OpenTelemetry. We use a common setup with the OTEL libraries for trace emission and Grafana Tempo as the backend.
Trace emission in Flow's blockchain is different from a typical web backend: we don't rely on trace context being passed across trust boundaries. Instead of propagating TraceIDs, we deterministically construct them from the object being processed - a block hash or a transaction hash.
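A sketch of the idea (not our exact code): derive a stable 16-byte TraceID from the entity's hash, so that every component processing the same block ends up contributing spans to the same trace.

```go
import (
	"crypto/sha256"

	"go.opentelemetry.io/otel/trace"
)

// traceIDFromHash deterministically maps an object hash (e.g. a block ID or
// a transaction ID) onto an OTEL TraceID. A sketch, not Flow's exact code.
func traceIDFromHash(hash []byte) trace.TraceID {
	var tid trace.TraceID
	sum := sha256.Sum256(hash) // re-hash so any input length works
	copy(tid[:], sum[:16])
	return tid
}
```

The resulting TraceID can then be installed as the parent via trace.ContextWithSpanContext before starting spans.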
The issue with tracing is not emitting the data, but locating the "interesting" data afterwards. Grafana's default search capabilities are quite limited, and it can sometimes take minutes to find the desired trace.
TraceQL solves this issue by introducing a convenient way of searching traces. Finding a specific trace is now a breeze. Here are a few examples from the reference documentation:
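For example, queries along these lines find error traces from a given service, spans of a given operation that took longer than two seconds, or spans with a particular attribute value (the service and span names below are made up; double-check the exact syntax against the current docs):

```
{ resource.service.name = "execution-node" && status = error }
{ name = "ExecuteBlock" && duration > 2s }
{ span.http.status_code >= 500 }
```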
If you prefer video format, here is Joe Elliott’s introduction to TraceQL from GrafanaCon 2022.
Now, with TraceQL, we can be as specific as we desire:
If you're using Grafana, don't forget to link your log data source to your traces. This will make it easy to navigate between traces and logs:
You can speed up TraceQL queries by emitting common per-process attributes as "resources" rather than as span attributes: matching on resources is much faster than searching every span and can make a huge difference. For example, in the screenshot above we have a bug where env and network should have been emitted as resources instead of attributes.
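In OTEL terms, that means attaching such attributes to the tracer provider's Resource; a sketch (the attribute values are just examples):

```go
import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func newTracerProvider(exp sdktrace.SpanExporter) (*sdktrace.TracerProvider, error) {
	// Per-process constants belong on the Resource, so TraceQL can match them
	// with `resource.` selectors instead of scanning every span's attributes.
	res, err := resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("execution-node"), // hypothetical service name
		attribute.String("env", "production"),
		attribute.String("network", "mainnet"),
	))
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	), nil
}
```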
Profiling
The Go runtime offers excellent profiling capabilities. If you want to instrument your code beyond net/http/pprof, we highly recommend reading DataDog's "The Busy Developer's Guide to Go Profiling, Tracing and Observability". It goes into detail on all profiler types (CPU, Memory, Block, etc.) and covers low-level implementation details such as stack trace internals and the pprof format.
Many of the profiling features require a recent Go runtime to work quickly and accurately. If you intend to rely on profiling data in production, especially if you plan to use continuous profiling, please upgrade to Go 1.19.
Once you have the raw pprof file, you'll want to analyze it. However, go tool pprof -http=:5000 has its limits. The ideal solution would be to store profiles in a database that supports basic querying and filtering. We use Google's Cloud Profiler, but instead of relying on their somewhat limited client libraries, we take advantage of their "offline API," which allows us to ship existing .pprof files to Google:
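Roughly, it looks like this (a sketch against the auto-generated Go client for the Cloud Profiler v2 API; package paths and field names may differ between client library versions):

```go
import (
	"context"
	"os"

	profiler "cloud.google.com/go/cloudprofiler/apiv2"
	pb "cloud.google.com/go/cloudprofiler/apiv2/cloudprofilerpb"
)

// uploadPprof ships an existing .pprof file to Cloud Profiler through the
// "offline" CreateOfflineProfile API. A sketch; verify names against your
// version of the client library.
func uploadPprof(ctx context.Context, projectID, target, path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	client, err := profiler.NewProfilerClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	_, err = client.CreateOfflineProfile(ctx, &pb.CreateOfflineProfileRequest{
		Parent: "projects/" + projectID,
		Profile: &pb.Profile{
			ProfileType:  pb.ProfileType_CPU,
			Deployment:   &pb.Deployment{ProjectId: projectID, Target: target},
			ProfileBytes: raw,
		},
	})
	return err
}
```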
The main benefit of sending profiles to remote storage compared to simply storing them locally is the ability to aggregate multiple profiles into a single view.
Additionally, it gives us the ability to look at profile trends over time.
Sadly, Google's "offline" API is heavily rate- and size-limited and appears to be generally unsupported, so we are actively exploring alternatives. Since we are already using the Grafana stack, we are keeping an eye on Phlare's development; it looks like a very capable replacement:
Connecting profiling with tracing
We've invested a lot of effort into the user experience of performance analysis, ensuring smooth transitions between logs, traces, and metrics. Profiling data, however, currently presents a unique challenge. To address this, we're exploring pprof.Do (or, at a lower level, pprof.SetGoroutineLabels). This would let us create the connection between profiling and tracing that is currently missing.
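A sketch of what that looks like: wrap a span's work in pprof.Do so that every profiler sample taken while it runs carries the trace and span IDs as pprof labels (doExpensiveWork is a hypothetical helper):

```go
import (
	"context"
	"runtime/pprof"

	"go.opentelemetry.io/otel"
)

func processBlock(ctx context.Context, blockID string) error {
	ctx, span := otel.Tracer("example").Start(ctx, "processBlock")
	defer span.End()

	var err error
	// Every sample collected inside the closure is tagged with these labels,
	// so a pprof view can be filtered down to a single trace or span.
	pprof.Do(ctx, pprof.Labels(
		"trace_id", span.SpanContext().TraceID().String(),
		"span_id", span.SpanContext().SpanID().String(),
	), func(ctx context.Context) {
		err = doExpensiveWork(ctx, blockID) // hypothetical helper
	})
	return err
}
```

On the analysis side, the pprof UI can then slice samples by these labels, e.g. with -tagfocus.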
Labels have a few drawbacks: they are not supported for all profile types and can increase profile size, so be mindful of label cardinality.
You can add arbitrary labels here, which is especially useful in multi-tenant environments. For example, you can annotate pprofs with a tenant's UserID and TeamID. Even for single-tenant setups, annotating profiles with EndpointPath may provide additional insight into CPU usage.
fgprof
Go's default profiler has the downside of only being able to show either On-CPU or Off-CPU time. Felix Geisendörfer's profiler, fgprof, solves this issue by providing a single profile that captures both.
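Hooking it up next to the standard pprof endpoints is a one-liner; a sketch, assuming you already serve a mux for debug endpoints:

```go
import (
	"net/http"

	"github.com/felixge/fgprof"
)

func registerFgprof(mux *http.ServeMux) {
	// /debug/fgprof serves a wall-clock profile covering both On-CPU and Off-CPU time.
	mux.Handle("/debug/fgprof", fgprof.Handler())
}
```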
Continuous Profiling
Profiling has become cheap enough to run continuously in production, and many companies now provide libraries for doing so. Notable examples include Pyroscope, DataDog, and Google.
Rather than embedding a profiler into the codebase, Grafana Phlare employs an agent model that periodically scrapes Go's pprof HTTP endpoints located at /debug/pprof/:
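On the application side, all the agent needs is for the standard endpoints to be reachable, usually on a separate internal port; a minimal sketch:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a private port for the scraping agent.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service ...
	select {}
}
```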
eBPF-based Profiling
All profiling features above require some kind of instrumentation in the binary, whether it be an HTTP endpoint or a continuous profiling library. Recently, there has been a trend in observability towards instrumentation-less profiling enabled by eBPF.
For instance, Parca enables you to observe uninstrumented C, C++, Rust, Go, and more!
bpftrace/bcc profiling
Parca relies exclusively on perf_events, which is totally safe. However, if you try a naive profiling approach using uprobe/uretprobe with either bcc or bpftrace on a Go application, you will likely run into a SIGBUS due to Go's stack copying.
Microbenchmarks
Microbenchmarking in Go is a well-known practice, so there's not much to say about it. However, a couple of things are worth mentioning. Firstly, it's recommended to pass -benchmem and -run '^$' when running microbenchmarks: the former reports allocations, and the latter keeps regular tests from running alongside the benchmarks.
Secondly, benchmark results should only be considered valid when -count is greater than or equal to 10:
Finally, the benchstat tool (or similar) should always be used when analyzing even a single benchmark result - it improves readability and helps verify that there is no excessive noise in the environment:
Noise can be a serious problem in benchmarking environments. The LLVM project has great documentation on how to tune Linux systems to achieve less than 0.1% variation in benchmark results.
Additionally, when making comparisons, especially when claiming performance improvements, it is essential to show the statistics:
Don't be shy about adding custom metrics to your benchmarks to gain more insight into the critical sections behind your performance reports. Use b.ReportMetric to report them and b.ResetTimer to keep expensive setup out of the measurements.
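A sketch of both in action (the workload, helpers, and the tx/op metric are made up):

```go
import "testing"

func BenchmarkBlockExecution(b *testing.B) {
	blocks := generateTestBlocks(1000) // hypothetical, expensive setup
	b.ResetTimer()                     // don't charge the setup to the benchmark

	var txProcessed int
	for i := 0; i < b.N; i++ {
		txProcessed += executeBlock(blocks[i%len(blocks)]) // hypothetical workload
	}

	// Reported next to the standard ns/op and B/op columns.
	b.ReportMetric(float64(txProcessed)/float64(b.N), "tx/op")
}
```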
It would be nice to have an open-source or SaaS tool that could gather and track microbenchmark results over time for a Go repository; however, we did not find one.
Blackbox Performance Tests
While there are a variety of tools for end-to-end performance testing web applications (e.g., we use grafana/k6 for testing ours), complex systems such as compilers or databases require custom blackbox tests, also known as macrobenchmarks. Here are a few good examples of automated macrobenchmarks in large open source projects (the benchmark frameworks themselves are also open sourced).
Vitess, a distributed database, has a very comprehensive (albeit slightly noisy) nightly performance test that tracks both micro- and macro-benchmark results. The test framework is open sourced at https://github.com/vitessio/arewefastyet.
Rust is another good example of an excellent benchmark framework for all kinds of compiler-specific tests done on a per-commit basis. It is also open sourced at: https://github.com/rust-lang/rustc-perf.
Our end-to-end setup is relatively trivial compared to these two, but it still has a couple of interesting details. One of them is detecting the system's overload point. Since the blockchain is asynchronous and can queue transactions at multiple layers, in our end-to-end macrobenchmark we need to find the maximum transactions-per-second value the blockchain can handle without excessive latencies. We used to use TCP's additive-increase/multiplicative-decrease (AIMD) algorithm for that, but it was quite slow to converge. Recently, we switched to a PID controller (a PD controller, to be exact), which has the nice property of quickly converging to the desired queue size without overshooting too much:
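For the curious, here's a simplified sketch of such a controller (the gains and the error signal are illustrative, not our production tuning): on every tick it compares the observed transaction queue length with a target and nudges the TPS the load generator is allowed to send.

```go
// pdController adjusts the offered load (TPS) so that the observed
// transaction queue length converges to a target value. A simplified sketch.
type pdController struct {
	kp, kd    float64 // proportional and derivative gains (tuned per system)
	target    float64 // desired queue length
	prevError float64
}

// next returns the TPS to use for the following tick, given the current
// queue length, the current TPS, and the tick duration in seconds.
func (c *pdController) next(queueLen, currentTPS, dt float64) float64 {
	err := c.target - queueLen             // positive: we can push harder
	derivative := (err - c.prevError) / dt // dampens overshoot
	c.prevError = err

	tps := currentTPS + c.kp*err + c.kd*derivative
	if tps < 0 {
		tps = 0
	}
	return tps
}
```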
If you're interested in introducing PID controllers into your own systems, Philipp K. Janert's book, "Feedback Control for Computer Systems: Introducing Control Theory to Enterprise Programmers," is a great place to start. If you're more of a visual learner, the first couple of lectures from the "Understanding PID Control" YouTube playlist are also a great introduction. Just keep in mind that this is the MATLAB channel, so it goes deep quickly. You've been warned =)
Future work
Here we’ll go over things that we are planning on adding to our performance observability toolkit this year.
Bottleneck detection through slowing things down
It can be difficult to speed things up, but slowing things down is relatively straightforward. Therefore, one way to identify bottlenecks is to slow individual system components down by 1ms, 5ms, 50ms, 250ms, etc., and measure the benchmark results. Then, extrapolate the resulting function back to -1ms, -5ms, -50ms, etc. This method is approximate, but it works well for small values.
Causal profiler
A causal profiler is another active method and a more precise generalization of the previous one. Instead of slowing down the component of interest, it slows down everything else around it, thus emulating a speedup of the component being tested.
If you are interested in causal profiling, there is a good video introduction in "Coz: Finding Code that Counts with Causal Profiling" (SOSP'15).
Automatic critical path analysis
Analyzing the critical path of a large-scale distributed system can be challenging. In their OSDI'14 paper, "The Mystery Machine: End-to-End Performance Analysis of Large-Scale Internet Services," researchers from the University of Michigan and Facebook proposed a way to identify the critical path passively by observing logs (which would likely be traces nowadays). This approach has the benefit of stacking nicely with both tracing and profiling infrastructure.
Appendix A. AI-driven blogpost editor.
Not related to Golang observability, but generally applicable to those who write blog posts: English may not be your native language, so first drafts of blog posts can be quite painful for an editor to fix. This often requires multiple hours (and likely even a bottle of whiskey) to put back forgotten articles, rephrase sentences, fight the urge to send me a Conjunction Junction video, fix punctuation and idioms. Modern language models can take away much of this pain. For example, the first pass of editing this post was done by text-davinci-003 (talk about overkill!) with the following prompt:
You are a technical blog editor reviewing a post about "Improving Observability of Golang Services." Please, rewrite the following paragraph correcting spelling, idioms, punctuation, and word choices as necessary. Combine, split sentences, or rewrite to increase clarity when needed. Don't make it too dry, humor is fine.