Observability

10 open source tools compared. Sorted by stars. Scroll down for our analysis.

Tool	Stars	Velocity	Language	License	Score
Loki Horizontally-scalable, multi-tenant log aggregation	28.4k	+35/wk	Go	GNU Affero General Public License v3.0	73
Vector High-performance observability data pipeline	22.1k	+39/wk	Rust	Mozilla Public License 2.0	78
openobserve OpenObserve is an open-source observability platform for logs, metrics, traces, and frontend monitoring. A cost-effective alternative to Datadog, Splunk, and Elasticsearch with 140x lower storage costs and single binary deployment.	19.4k	+124/wk	TypeScript	GNU Affero General Public License v3.0	73
Fluentd Unified logging layer	13.5k	-	Ruby	Apache License 2.0	85
phoenix AI Observability & Evaluation	10.3k	+85/wk	Python	-	69
opentelemetry-collector-contrib Contrib repository for the OpenTelemetry Collector	4.8k	+17/wk	Go	Apache License 2.0	79
latitude-llm Latitude is the open-source ai monitoring platform.	4.2k	+93/wk	TypeScript	LGPL-3.0	71
opentelemetry-ebpf-profiler The production-scale datacenter profiler (C/C++, Go, Rust, Python, Java, NodeJS, .NET, PHP, Ruby, Perl, ...)	3.1k	+5/wk	Go	Apache License 2.0	71
opentelemetry-java-instrumentation OpenTelemetry auto-instrumentation and instrumentation libraries for Java	2.6k	+4/wk	Java	Apache License 2.0	75
docker-otel-lgtm An OpenTelemetry backend in a Docker container image	1.9k	+8/wk	Shell	Apache License 2.0	71

Our Analysis

Loki 28.4k★

Loki collects logs from your infrastructure without indexing their content. Think of it as Elasticsearch for logs, except that it indexes only metadata labels (such as service name, environment, and pod) rather than the full text of every log line, which makes it dramatically cheaper to run and simpler to operate. Self-hosting is free under AGPL-3.0. You get the full log aggregation engine, the LogQL query language, alerting integration with Grafana, and multi-tenant support. It's designed to run alongside Prometheus (metrics) and Tempo (traces) as part of the full Grafana observability stack. Grafana Cloud offers a free tier with 50GB of logs per month, which is generous for small projects; beyond that, paid cloud pricing is usage-based. The catch: because Loki doesn't build a full-text index, searching for a specific string across millions of log lines is slower than in Elasticsearch. You need to know which labels to filter by first. If your debugging workflow is 'grep for this error message across everything,' Loki will frustrate you. Also, the AGPL license means that if you modify Loki and offer it as a network service, you must make your modified source available under the same license.
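To make the label-versus-full-text trade-off concrete, here is a toy Python model. It is not Loki's storage engine or API; the `LabelIndexedStore` class and its methods are hypothetical names invented for this sketch. Only label sets are indexed, so a text search has to scan every line in the streams that the labels select.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store (hypothetical): indexes label sets only, never the log text."""

    def __init__(self):
        # Maps a frozen label set (one "stream") to its raw log lines.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Return (matching lines, number of lines scanned)."""
        wanted = set(selector.items())
        hits, scanned = [], 0
        for stream_labels, lines in self.streams.items():
            # Cheap step: keep only streams whose labels match the selector.
            if not wanted <= stream_labels:
                continue
            # Expensive step: brute-force substring scan over the stream.
            for line in lines:
                scanned += 1
                if contains in line:
                    hits.append(line)
        return hits, scanned


store = LabelIndexedStore()
store.push({"service": "api", "env": "prod"}, "GET /health 200")
store.push({"service": "api", "env": "prod"}, "POST /orders 500 timeout")
store.push({"service": "worker", "env": "prod"}, "job 17 failed: timeout")
store.push({"service": "worker", "env": "dev"}, "job 3 ok")

# Filtering by label first keeps the scan small.
narrow_hits, narrow_scanned = store.query({"service": "api"}, "timeout")
# An empty selector is the "grep everything" case: every line gets scanned.
broad_hits, broad_scanned = store.query({}, "timeout")
print(narrow_hits, narrow_scanned)
print(len(broad_hits), broad_scanned)
```

In LogQL terms, the narrow query corresponds roughly to `{service="api"} |= "timeout"`: the label selector picks the streams and the line filter scans what is left.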

Vector 22.1k★

Vector is a high-performance pipeline that collects, transforms, and routes logs, metrics, and traces across your infrastructure. It's the plumbing between your applications and your observability stack (Elasticsearch, Datadog, Grafana, whatever you use). Rust-based, MPL 2.0 license. Built by the team behind Timber (now part of Datadog). Single binary, ~10MB, handles millions of events per second on modest hardware. Supports 100+ sources and sinks: pull from syslog, Kafka, files, Kubernetes; push to S3, ClickHouse, Loki, Splunk. The transform layer lets you filter, parse, enrich, and route data using a built-in language called VRL (Vector Remap Language). Fully free. No paid tier, no hosted version. Datadog acquired Timber but kept Vector open source. MPL 2.0 means you can use it commercially; you just can't modify Vector's source files, distribute the result, and keep those changes closed. Solo through enterprise: free at every scale. The Rust performance means you rarely need to think about Vector's resource usage. One instance handles what would take a cluster of Logstash nodes. The catch: VRL is powerful, but it's a custom DSL you have to learn. If your team already knows Logstash configs or Fluentd plugins, there's a migration cost. And while Datadog keeping it open source is great, the deepest integration is naturally with Datadog's platform.

openobserve 19.4k★

OpenObserve handles logs, metrics, traces, and frontend monitoring in one tool. It pitches itself as a Datadog and Splunk alternative, but the real story is the storage architecture: it writes Parquet columnar files to S3 instead of running an Elasticsearch cluster, which the team claims cuts storage cost by 140x. Built in Rust, ships as a single binary, OpenTelemetry-native, AGPL. Single-binary mode is up and running in under two minutes and handles a real workload before you need to scale. High-availability mode adds clustering and federated multi-region search, but that side trends toward enterprise features. Querying is SQL for logs and traces, PromQL or SQL for metrics. No proprietary query language to learn. Solo and small teams: self-host the single binary on a $20 VPS and forget about Datadog bills. Mid-sized teams: HA mode plus S3 storage scales to terabytes per day for a fraction of what hosted observability costs. Large teams: cluster federation and SSO are paid enterprise add-ons. Worth pricing against Datadog's renewal quote. The catch: AGPL means that if you modify the code and use it commercially, your changes have to follow AGPL terms. If you embed OpenObserve in a hosted product, talk to a lawyer first.
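As a rough illustration of what "SQL for logs" means in practice, the sketch below runs an ordinary aggregate query over log rows. It uses Python's built-in SQLite as a stand-in, not OpenObserve's query engine; the `app_logs` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a log store that can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_logs (ts INTEGER, service TEXT, level TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO app_logs VALUES (?, ?, ?, ?)",
    [
        (1, "api", "error", "upstream timeout"),
        (2, "api", "info", "request ok"),
        (3, "worker", "error", "job failed"),
        (4, "api", "error", "upstream timeout"),
    ],
)

# Count errors per service with plain SQL: no custom query language needed.
rows = conn.execute(
    """
    SELECT service, COUNT(*) AS errors
    FROM app_logs
    WHERE level = 'error'
    GROUP BY service
    ORDER BY errors DESC
    """
).fetchall()
print(rows)
```

The point is only that anyone who already writes `SELECT ... GROUP BY` can ask this kind of question; the exact SQL dialect and schema a real deployment exposes may differ.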

Fluentd 13.5k★

Fluentd takes logs in from applications, servers, containers, and cloud services, transforms them if needed, and routes them to whatever storage or analysis tool you use. Fully free under Apache 2.0. CNCF graduated project. 700+ community plugins cover every source and destination you can think of. The architecture is simple: input plugins (where logs come from), filter plugins (transform/parse), output plugins (where logs go). Treasure Data (the company behind Fluentd) offers enterprise support and its own managed log analytics platform, but Fluentd itself is completely free. The catch: Fluentd is written in Ruby, and in high-throughput scenarios it can be resource-heavy. That's why Fluent Bit exists: a lightweight, C-based alternative from the same project. For Kubernetes, most people run Fluent Bit as a DaemonSet (one per node) that forwards to a central Fluentd instance. The plugin ecosystem is powerful, but plugin quality varies; some community plugins are abandoned. And debugging Fluentd configuration issues when logs aren't flowing is tedious.
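The input, filter, output shape described above can be sketched in a few lines of Python. This is a conceptual model only, not Fluentd's plugin API: the function names, the `app.access` tag, and the sample lines are hypothetical, and real Fluentd plugins are Ruby classes wired together by a config file.

```python
import json


def tail_input(raw_lines):
    """Input stage: turn raw lines into (tag, record) events."""
    for line in raw_lines:
        yield "app.access", {"message": line}


def parse_json_filter(events):
    """Filter stage: parse JSON messages and drop events that fail to parse."""
    for tag, record in events:
        try:
            parsed = json.loads(record["message"])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            yield tag, parsed


def route_output(events, sinks):
    """Output stage: deliver each event to the sink registered for its tag."""
    for tag, record in events:
        sinks.setdefault(tag, []).append(record)
    return sinks


raw = [
    '{"level": "error", "msg": "db timeout"}',
    "not json",
    '{"level": "info", "msg": "ok"}',
]
sinks = route_output(parse_json_filter(tail_input(raw)), {})
print(sinks)
```

Each stage only sees the events handed to it by the previous one, which is also why a misconfigured filter can silently swallow logs: the unparseable line above simply never reaches the output.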

phoenix 10.3k★

Phoenix is self-hostable observability for AI apps. When an LLM feature misbehaves in production, this is how you see why. It traces every model call your app makes, captures the prompts and responses, and lets you run evals to score output quality over time. Arize built it on OpenTelemetry, and it runs in a notebook, a container, or on your own server. Self-hosting is the point, and it delivers. Run the container, point your app's tracing at it, and you get traces, datasets, experiments, and prompt management in one UI with nothing leaving your infrastructure. That matters when your prompts carry customer data you can't ship to a vendor. The ops burden is moderate. This is a real service to keep running, not a library you import. Solo developers and small teams should self-host this and skip the LLM-observability SaaS bills from the LangSmith and Datadog tier. You get tracing and evals for the cost of a container. Larger teams already paying for Arize's hosted platform get the managed version: the same engine as the open release, minus the ops work. The catch is the license. Phoenix is Elastic License 2.0, not true open source. You can self-host and use it freely, but you cannot stand it up as a competing hosted service. For anyone using it to debug their own app, that line never gets crossed. Just know it is source-available, not MIT.

opentelemetry-collector-contrib 4.8k★

The OpenTelemetry Collector is the vendor-neutral pipeline for your telemetry data: metrics, logs, traces all flowing through one binary. This contrib repo adds the receivers, exporters, and processors that make it actually useful in production. Prometheus, Jaeger, Kafka, AWS CloudWatch, Datadog, and dozens more. Setup ranges from "download a binary and point it at your backend" to "build a custom distro with exactly the components you need." The Collector Builder tool lets you compile a purpose-built binary with only what you use. No bloat, no unused code listening on ports. Solo devs running a few services: plug this in front of Grafana Cloud's free tier and you have production-grade observability for zero dollars. Teams already paying for Datadog or New Relic: this is how you collect once and ship anywhere, or migrate vendors without re-instrumenting everything. The catch: configuration is YAML and the docs assume you already know what a receiver and exporter are. The learning curve is real if you are new to observability pipelines.
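For readers new to the receiver/processor/exporter vocabulary, the sketch below mirrors the general layout of a Collector YAML config as a Python dict and checks that every pipeline only references components that are actually defined. The `undefined_components` helper is hypothetical (it is not part of the Collector), the component settings are left empty for brevity, and the `loki` reference is deliberately left undefined to show what the check catches.

```python
def undefined_components(config):
    """Return (pipeline, section, name) for each pipeline reference with no definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for name in pipeline.get(section, []):
                if name not in defined:
                    problems.append((pipeline_name, section, name))
    return problems


# Same shape as the YAML file: components are declared at the top level,
# then wired into per-signal pipelines under service.pipelines.
config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {"debug": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["debug"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["loki"],  # deliberately undefined in this sketch
            },
        }
    },
}
problems = undefined_components(config)
print(problems)
```

Data flows receiver to processor to exporter within each pipeline, so a component that is declared but never listed in a pipeline does nothing, and one that is listed but never declared is a config error.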

latitude-llm 4.2k★

Latitude watches your AI agents the way Datadog watches your servers. When you ship an app built on LLMs, you lose visibility the moment a prompt leaves your code: you can't see why an agent went off the rails or which step failed. Latitude captures every trace, groups failures into issues you can actually act on, and runs evaluations so you know when output quality drifts. The core is open source under LGPL-3.0. Self-hosting is real work but well-trodden: it's a TypeScript app you run yourself, and you bring your own database and infrastructure. The payoff is that all your prompt and trace data stays on your hardware, which matters if you're handling anything sensitive. If you'd rather skip the plumbing, the hosted cloud has a free Starter tier with 20,000 credits a month. Solo devs and small teams testing an AI feature should start on the free cloud tier and self-host once data control or volume matters. The paid Pro plan runs $99/mo for 100,000 credits, which is the call once you're past hobby scale. Larger teams comparing this to LangSmith or Langfuse get an actually open core instead of an open-core tease. The catch: LLM observability is a crowded space and Latitude is younger than LangSmith. The trace and eval features are strong, but you're betting on a newer project. Self-hosting also means you own the upgrade treadmill and the storage bill as your trace volume grows.

opentelemetry-ebpf-profiler 3.1k★

This profiler attaches to your Linux system via eBPF and captures stack traces across every running process without touching your application code. No agents to install, no libraries to load, no recompilation. It supports C/C++, Go, Rust, Python, Java, Node.js, PHP, Ruby, Perl, and the dotnet runtime. All of it runs at roughly 1% CPU overhead. The "no instrumentation" part is what matters. Traditional profilers (pprof, py-spy, async-profiler) require you to pick a language and instrument that specific runtime. This profiler sees everything: kernel space, system libraries, and application code in one unified stack trace. For debugging performance issues that cross language boundaries or involve system calls, nothing else gives you this view. You need Linux kernel 5.4 or newer (4.19 with a specific patch), and it runs on amd64 and arm64. It feeds into the OpenTelemetry ecosystem, so your profiling data lands in whatever backend you already use for traces and metrics (Grafana, Jaeger, and the like). Solo developers probably do not need continuous profiling. Teams running multi-service production systems will find this indispensable. The catch: Linux only. No macOS, no Windows. And eBPF profiling requires elevated permissions, which means your security team will have opinions about deploying it in production.
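Before trying the profiler on a host, it can help to check the requirements stated above (Linux, kernel 5.4 or newer, amd64 or arm64). The function below is a minimal sketch of that check; its name and the set of architecture strings are assumptions for illustration, and it intentionally treats the patched-4.19 case as unsupported because a version string alone cannot confirm the patch.

```python
import platform
import re

# Architecture names as commonly reported by platform.machine(); an assumption.
SUPPORTED_ARCHES = {"x86_64", "amd64", "aarch64", "arm64"}
MIN_KERNEL = (5, 4)


def meets_stated_requirements(system, release, machine):
    """Check OS, CPU architecture, and kernel version against the stated minimums."""
    if system != "Linux":
        return False
    if machine.lower() not in SUPPORTED_ARCHES:
        return False
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= MIN_KERNEL


# Check the machine this script runs on.
print(
    meets_stated_requirements(
        platform.system(), platform.release(), platform.machine()
    )
)
```

Passing this check says nothing about permissions: loading eBPF programs still needs elevated privileges on the host.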

opentelemetry-java-instrumentation 2.6k★

This is how you get OpenTelemetry auto-instrumentation for Java. One JAR file, one JVM flag, and it auto-instruments your Spring Boot, Kafka, gRPC, JDBC, and dozens of other libraries. Zero code changes. CNCF project, Apache 2.0, completely free. The agent itself is trivial to deploy. Add it to your JVM startup, point it at an OTLP endpoint, done. The real ops burden is the backend: you need somewhere to send the data. OpenTelemetry Collector plus Jaeger or Grafana Tempo is the common self-hosted stack. That's a meaningful setup, but it's a one-time cost shared across all your services. Solo devs and small teams can point it at a managed backend (Grafana Cloud free tier, Honeycomb, Datadog) and skip the infrastructure entirely. Larger teams running their own Grafana/Tempo stack get full control and zero per-host licensing. The agent is vendor-neutral by design, so you're never locked to one backend. The catch: it's Java-only. If you're running a polyglot stack, you need separate OpenTelemetry agents for Python, Node, Go, etc. And "zero code changes" means "zero code changes until you need custom spans," at which point you're adding SDK calls anyway.
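The "one JAR, one JVM flag" setup can be sketched as a small launcher. The `-javaagent:` flag and the `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` environment variables are standard; the `build_launch` helper, the file paths, the service name, and the local endpoint are hypothetical values chosen for illustration.

```python
import shlex


def build_launch(agent_jar, app_jar, service_name, otlp_endpoint):
    """Return (extra environment, command) for starting a JVM with the agent attached."""
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,
    }
    # The single JVM flag: -javaagent pointing at the downloaded agent JAR.
    command = ["java", f"-javaagent:{agent_jar}", "-jar", app_jar]
    return env, command


env, command = build_launch(
    agent_jar="/opt/otel/opentelemetry-javaagent.jar",  # hypothetical path
    app_jar="app.jar",  # hypothetical application
    service_name="checkout",  # hypothetical service name
    otlp_endpoint="http://localhost:4318",  # assumed local OTLP/HTTP endpoint
)

# Print the equivalent shell line instead of starting a JVM.
env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
print(env_prefix, shlex.join(command))
```

To actually launch the service, the same values could be passed to `subprocess.run(command, env={**os.environ, **env})`; nothing in the application code changes.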

docker-otel-lgtm 1.9k★

Grafana's docker-otel-lgtm is a single Docker image that bundles the whole LGTM stack plus an OpenTelemetry Collector. LGTM stands for Loki (logs), Grafana (dashboards), Tempo (traces), and Mimir/Prometheus (metrics), with Pyroscope (continuous profiling) thrown in. Apache 2.0 licensed. The point of the image is one command: `docker pull grafana/otel-lgtm && ./run-lgtm.sh`. Port 3000 is Grafana; everything else is pre-wired underneath. You point your app's OpenTelemetry exporter at the collector and you have a working observability backend in under a minute. Perfect for trying OTel pipelines, demoing a feature, or developing instrumentation locally. Solo developers and small teams use this for local dev and CI test environments. For production, Grafana explicitly tells you to run the components separately or pay for Grafana Cloud. There is no HA, no multi-tenancy, and no persistence guarantee in the bundled image. The catch: the production warning is not optional. The image is built for convenience, not durability. Use it to learn OpenTelemetry, build instrumentation, or demo dashboards. When you go to production, plan to split Loki, Mimir, Tempo, and Grafana into separate deployments with their own storage backends. That is a different project, not a Docker run command.