Introduction

Observability has become a cornerstone of modern distributed systems, allowing teams to understand and debug their applications in real time. At Oligo, we've built our runtime sensors to operate in our customers' production clusters in a way that is not only reliable but also efficient and effective. This post takes an in-depth look at how we leverage observability to refine and enhance our sensor's performance.

Through multiple observability revolutions, we've developed the most efficient and lightweight runtime sensor in the industry, striking the right balance between depth of insight and minimal overhead.

That's how Oligo's sensor has become the industry-leading runtime security sensor in terms of both value and resource efficiency. Other security sensors in the industry consume over 1GB of memory, while the Oligo sensor averages less than 400MB.

Each breakthrough felt like opening a new set of eyes, revealing insights that propelled us to the next level of performance and maturity.

In this post, we'll explore why observability matters for runtime sensors, the different layers of observability we've adopted, and how a data-driven approach has helped us iterate rapidly and improve decision-making.

Why Observability Matters—Especially for Runtime Sensors

Observability is essential for understanding system behavior, diagnosing issues, and optimizing performance. When sensors run in production, visibility isn't optional—it’s the only way to ensure they work as expected.
Key reasons why observability is critical:

  1. Minimal impact on customer resources – Our sensor must run with very low overhead so it never disrupts customer workloads.
  2. Variability in conditions – We're installed in many clusters, each behaving differently. Observability shows us how the sensor performs across these environments.
  3. Data-driven debugging and diagnosis – Before these revolutions, debugging was based on assumptions. Now we rely on real data to quickly pinpoint and resolve issues, eliminating false hypotheses instantly.

We also send observability data from all of our sensors to Oligo Cloud. To do this without burdening the system, we've optimized the telemetry stream to the point where its impact is negligible: less than 1KB/s per sensor on average. This allows us to continuously collect powerful insights for improvement without compromising performance.

Logs

Logging is the first step in observability. It provides direct visibility into system behavior. But naive logging can create bloated, inefficient logs. There are some well-known best practices, such as structured logging, using appropriate log levels, and centralizing logs. However, we'll focus on some of the more advanced practices that have significantly improved our observability:

  1. Log messages without parameters – Store parameters in structured fields (e.g., JSON) for better indexing and searching (see the sketch after this list).
  2. Use rate-limited logging – Avoid flooding logs with repetitive messages. Oligo's sensor limits log frequency to capture meaningful events without performance overhead. The previous point is also a prerequisite for efficient rate-limited logging.
  3. Sending eBPF logs – Our eBPF code is a critical part of our sensor, and we needed observability into potential problems in the code itself. We structure and categorize these logs efficiently, send them to user mode through ring buffers, and ensure only the important ones are sent back home (the second sketch below shows the user-mode side).
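
To make the first two practices concrete, here's a minimal sketch in Go. It is not our actual implementation, and the helper names are hypothetical; the point is that a parameter-free message doubles as a stable rate-limit key.

// Structured, parameter-free logging with per-message rate limiting (sketch).
package main

import (
	"log/slog"
	"os"
	"sync"

	"golang.org/x/time/rate"
)

// rateLimitedLogger lets each distinct message through a few times per
// second. Because messages carry no parameters, the message string itself
// is a stable key - this is why practice 1 is a prerequisite for practice 2.
type rateLimitedLogger struct {
	logger   *slog.Logger
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newRateLimitedLogger() *rateLimitedLogger {
	return &rateLimitedLogger{
		logger:   slog.New(slog.NewJSONHandler(os.Stdout, nil)),
		limiters: make(map[string]*rate.Limiter),
	}
}

func (l *rateLimitedLogger) Info(msg string, args ...any) {
	l.mu.Lock()
	lim, ok := l.limiters[msg]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1), 5) // 1 event/s, burst of 5, per message
		l.limiters[msg] = lim
	}
	l.mu.Unlock()
	if lim.Allow() {
		l.logger.Info(msg, args...)
	}
}

func main() {
	log := newRateLimitedLogger()
	for i := 0; i < 1000; i++ {
		// Constant message; the variable data lives in structured fields.
		log.Info("failed to attach probe", "pid", 1234+i, "retries", 3)
	}
}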

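For the third practice, here's a hedged sketch of the user-mode half using the open-source cilium/ebpf library. The map, event layout, and severity threshold are illustrative; in practice they must match what the kernel-side eBPF code writes into the ring buffer.

// Draining categorized log events that eBPF code pushed into a ring buffer.
package sensor

import (
	"bytes"
	"encoding/binary"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/ringbuf"
)

// bpfLogEvent mirrors a hypothetical C struct emitted by the eBPF program.
type bpfLogEvent struct {
	Level   uint32   // severity assigned in kernel code
	Code    uint32   // categorized event code instead of a free-form string
	Payload [64]byte // small, fixed-size context blob
}

// drainBPFLogs reads events from the given BPF ring buffer map (obtained
// from a loaded eBPF collection, not shown) and forwards only the
// high-severity ones.
func drainBPFLogs(logsMap *ebpf.Map) error {
	rd, err := ringbuf.NewReader(logsMap)
	if err != nil {
		return err
	}
	defer rd.Close()

	for {
		record, err := rd.Read()
		if err != nil {
			return err
		}
		var ev bpfLogEvent
		if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &ev); err != nil {
			continue // skip malformed samples
		}
		if ev.Level >= 2 { // only important events are sent back home
			log.Printf("ebpf event: level=%d code=%d", ev.Level, ev.Code)
		}
	}
}
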
But logs alone weren't enough. They're expensive to store and query, and they don't provide a high-level overview of what's going on.

Metrics

Metrics provide a high-level view of performance. They help us detect trends and anomalies at a glance. We collect two main types:

  1. System Metrics – For example: CPU usage, memory consumption, disk I/O, and network throughput.
  2. Application-Specific Metrics – For example: Monitored process counts, bytes sent/received.

We rely on Prometheus for metric collection and storage, and on Grafana for visualization. We've built sophisticated dashboards that speed up investigations and extract real value from existing metrics. For example, one dashboard charts the sensor's memory and CPU consumption in one of our customers' environments.
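
As a flavor of the application-specific side, here's a minimal sketch using Prometheus's official Go client. The metric names are illustrative, not our actual metric set.

// Registering and exposing application-specific metrics (sketch).
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	monitoredProcesses = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "sensor_monitored_processes",
		Help: "Number of processes currently monitored by the sensor.",
	})
	bytesSent = promauto.NewCounter(prometheus.CounterOpts{
		Name: "sensor_bytes_sent_total",
		Help: "Total bytes sent by the sensor to the backend.",
	})
)

func main() {
	monitoredProcesses.Set(42) // the real sensor updates this as processes come and go
	bytesSent.Add(512)

	// Expose /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}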

Pro Tip: Before adding metrics, list the questions you want to answer. Then design the dashboard and its panels. Only after that, add the actual metrics in code. This mirrors the structured approach used in web development: requirements → UI → backend.

Yet even with metrics, some issues remained hidden. We needed deeper insights. That's why we added profiling.

CPU & Memory Profiling

CPU and memory are among the most important considerations when building a runtime sensor, and at Oligo we've prioritized optimizing both to an enterprise-grade standard. Metrics show trends, but they don't explain why CPU or memory usage is high. That's where profiling comes in. We use continuous profiling, which captures trends over time instead of one-off snapshots - a game changer!

Results are displayed as flamegraphs, clearly highlighting which functions consume the most CPU and memory. Continuous profiling has driven huge performance gains, and we monitor the profiles closely to catch any degradation early.
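
As an illustration (one common setup, not necessarily our exact stack), exposing Go's built-in pprof endpoints lets a continuous profiler such as Parca or Grafana Pyroscope sample CPU and heap profiles on a schedule and render them as flamegraphs:

// Exposing standard pprof endpoints for a continuous profiler to scrape.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// CPU profile:  GET /debug/pprof/profile?seconds=10
	// Heap profile: GET /debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
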
But something was still missing - historical analysis and deeper data context. That’s why we added event warehousing.

Event Warehousing

Observability isn’t just about real-time monitoring—historical analysis is just as important. We aggregate sensor-generated data into an event warehouse for OLAP (Online Analytical Processing).
With event warehousing, we significantly expanded our ability to analyze and aggregate vast amounts of data, enabling us to run complex queries that were impractical with logs alone.
Key benefits include:

  • Data deduplication – Identify redundant events to optimize storage.
  • Data quality insights – Ensure we collect accurate, meaningful telemetry.
  • Anomaly detection – Spot unexpected behavior over time.

Event warehousing serves as a crucial complement to logs and metrics, helping us detect issues that only become apparent when examining data at scale. By uncovering patterns and anomalies in massive datasets, we can proactively pinpoint inefficiencies, accelerate problem resolution, and apply precise optimizations. This approach dramatically improves both data throughput and quality, enhancing our observability pipeline's efficiency.
For instance, we can easily see how data production is distributed across container images.
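
As an illustration of the kind of query this enables, here's a hedged sketch; the table, column names, and the choice of a ClickHouse driver are hypothetical stand-ins for our actual warehouse.

// Share of total events produced by each container image (sketch).
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/ClickHouse/clickhouse-go/v2" // hypothetical warehouse driver
)

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://warehouse:9000/sensor_events")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT image,
		       count(*) AS events,
		       round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct
		FROM events
		GROUP BY image
		ORDER BY events DESC
		LIMIT 20`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var image string
		var events uint64
		var pct float64
		if err := rows.Scan(&image, &events, &pct); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-60s %10d %6.2f%%\n", image, events, pct)
	}
}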

That’s something that can’t be easily checked with logs or metrics. In many cases, it's not even clear what questions to ask. But once you have the data, you can query almost anything, uncovering insights like this one.

Side Note: Why We Didn’t Need Observability Traces

Unlike complex microservice architectures, our sensor runs as a small number of services with minimal inter-service communication. That means distributed traces aren't a necessary part of our observability stack: logs, metrics, profiling, and event warehousing provide all the visibility we need.

Summary

Observability transformed how we develop and improve our runtime sensor at Oligo. Through multiple observability revolutions, we:

  • Refined our logging strategy with structured logs, rate-limiting, and eBPF integration.
  • Built advanced metrics dashboards using Prometheus and Grafana to speed up troubleshooting.
  • Optimized sensor performance with continuous CPU and memory profiling.
  • Enhanced data-driven decision-making through event warehousing and large-scale analytics.

That's how we developed the most efficient and lightweight runtime sensor in the industry.
These techniques aren't just for customers - they power our own Oligo Cloud as well. The result? Faster insights, better performance, and the ability to fix issues before they become problems.

Observability isn't just monitoring - it's the key to building, improving, and maintaining world-class software.

What's Next

We're not stopping here! The next phases of our observability journey are already in motion, and we're tackling two major areas that will take our insights to the next level:

  1. Data Completeness Monitoring – We’re building mechanisms to track every step in our observability pipeline, ensuring no data goes missing. This will allow us to detect and pinpoint where data drops occur, helping us refine our pipeline and improve reliability.
  2. eBPF Observability – We're building observability capabilities for our eBPF components, including program latency monitoring, BPF map utilization, and custom metrics. This will give us unprecedented visibility into performance bottlenecks and let us fine-tune our sensor like never before.

These enhancements are set to push the boundaries of what’s possible with observability, making our runtime sensor even more powerful. Stay tuned - exciting things are coming!

At Oligo, we're always pushing observability forward. What challenges have you faced in production systems? Let’s discuss!

Nathan Quibech
Software Architect at Oligo Security
