Agent health states
The safety monitor tracks the agent’s health in real time and assigns it one of three states:
GREEN
Normal operation. Instrumentation is active and all metrics are within configured thresholds.
YELLOW
Moderate overhead detected. Instrumentation continues, but the agent has logged a warning. You should review your probe configuration.
RED
Safety threshold exceeded. All instrumentation is immediately suspended for the duration of cooldownSec, then automatically resumes.
Guardrail options
The following options control when the safety monitor triggers. All options can be set in code or via environment variables, with environment variables taking priority.
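The precedence rule above (environment variable first, then the value set in code, then the default) can be sketched as follows. This is a minimal illustration, not HyperProbe's implementation: the function name `resolveNumericOption` is hypothetical, and ignoring an unparsable environment value is an assumption. The only real name used is `HYPERPROBE_MAX_OBJECT_DEPTH`, which is listed under Data size limits later on this page.

```typescript
// Illustrative precedence: environment variable > value set in code > default.
// resolveNumericOption is a hypothetical helper, not part of the agent's API.
function resolveNumericOption(
  envValue: string | undefined,
  codeValue: number | undefined,
  defaultValue: number
): number {
  if (envValue !== undefined && envValue.trim() !== "") {
    const parsed = Number(envValue);
    if (!isNaN(parsed)) {
      return parsed; // the environment variable wins when it parses as a number
    }
    // Assumption: an unparsable environment value is ignored, not treated as fatal.
  }
  return codeValue !== undefined ? codeValue : defaultValue;
}

// The environment is passed in explicitly (in Node.js this could be process.env).
const exampleEnv: Record<string, string | undefined> = {
  HYPERPROBE_MAX_OBJECT_DEPTH: "5",
};

// Code asks for 2, the documented default is 3, the environment says 5.
const maxObjectDepth = resolveNumericOption(
  exampleEnv["HYPERPROBE_MAX_OBJECT_DEPTH"],
  2,
  3
);
```

Here `maxObjectDepth` resolves to the environment value because it is set and parses cleanly. Without the environment entry it would fall back to the value from code.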
Rate limit (hitsPerSec)
Default: 10 hits/second
The maximum number of probe capture events the agent processes per second across all active probes. If this limit is reached, additional hits are dropped until the next second. This prevents a high-frequency probe from overwhelming the telemetry pipeline.
Reduce this value on services that are sensitive to even a small amount of extra processing. Increase it when you are actively debugging and need more capture throughput.
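To make the drop-until-next-second behavior concrete, here is a minimal sketch of a fixed one-second window limiter. It assumes windows are aligned to whole seconds and that the caller supplies the current time in milliseconds. The class and method names (`HitLimiter`, `tryAcquire`) are hypothetical and do not describe the agent's internals.

```typescript
// Hypothetical sketch of a per-second hit limiter shared by all probes.
class HitLimiter {
  private readonly hitsPerSec: number;
  private windowStartMs = 0; // start of the current one-second window
  private hitsInWindow = 0;  // hits accepted in the current window

  constructor(hitsPerSec: number = 10) {
    this.hitsPerSec = hitsPerSec;
  }

  // Returns true if the hit may be processed, false if it should be dropped.
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      // A new window has started: align it to the second and reset the counter.
      this.windowStartMs = nowMs - (nowMs % 1000);
      this.hitsInWindow = 0;
    }
    if (this.hitsInWindow >= this.hitsPerSec) {
      return false; // limit reached, drop until the next second
    }
    this.hitsInWindow += 1;
    return true;
  }
}
```

Passing the clock in as an argument keeps the sketch deterministic. A real agent would more likely read a monotonic clock itself.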
Bandwidth cap (bandwidthKbPerSec)
Default: 200 KB/second
The maximum amount of telemetry data the agent sends to the broker per second. If the agent captures large objects or has many active probes and exhausts this budget, it stops sending data until the budget resets at the start of the next second.
This limit protects both your network and the broker from being flooded by snapshot payloads. If you consistently hit the bandwidth cap, reduce the data size limits (see Data size limits) rather than raising the bandwidth cap.
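As a rough illustration of how a per-second byte budget plays out, the sketch below walks a queue of payload sizes and stops at the first one that would exceed the budget. It assumes 1 KB means 1024 bytes and that sending stops entirely once a payload does not fit. Both are assumptions, and `selectPayloadsWithinBudget` is a hypothetical name.

```typescript
// Illustrative only: returns the indexes of payloads that fit in one second's budget.
function selectPayloadsWithinBudget(
  payloadSizesBytes: number[],
  bandwidthKbPerSec: number = 200
): number[] {
  const budgetBytes = bandwidthKbPerSec * 1024; // assumption: 1 KB = 1024 bytes
  let usedBytes = 0;
  const sentIndexes: number[] = [];
  for (let i = 0; i < payloadSizesBytes.length; i++) {
    const size = payloadSizesBytes[i];
    if (usedBytes + size > budgetBytes) {
      break; // budget exhausted: stop sending until the next second
    }
    usedBytes += size;
    sentIndexes.push(i);
  }
  return sentIndexes;
}
```

With the default 200 KB budget, payloads of 100 KB and 80 KB fit, but a third payload of 50 KB does not. This is why shrinking snapshots through the data size limits tends to help more than raising the cap.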
Event-loop lag threshold (maxLagMs)
Default: 50 ms
The maximum acceptable event-loop lag before the agent enters the RED state. The safety monitor measures lag continuously. If the measured lag exceeds maxLagMs, instrumentation is suspended immediately.
Lower this value on latency-sensitive services where even 50 ms of additional event-loop pressure is unacceptable. Raise it on batch processing services where some lag is expected.
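Event-loop lag is commonly estimated as the difference between when a timer was due and when its callback actually ran. The source does not say how the agent measures lag, so the following is a sketch of that general technique with hypothetical function names.

```typescript
// Lag = how late a timer callback fired relative to when it was due.
function eventLoopLagMs(
  scheduledAtMs: number,
  delayMs: number,
  firedAtMs: number
): number {
  const dueAtMs = scheduledAtMs + delayMs;
  return Math.max(0, firedAtMs - dueAtMs); // early or on-time firing counts as zero lag
}

// True when the measured lag exceeds the threshold (default maxLagMs is 50).
function exceedsLagThreshold(lagMs: number, maxLagMs: number = 50): boolean {
  return lagMs > maxLagMs;
}
```

For example, a 10 ms timer scheduled at t = 1000 ms that fires at t = 1075 ms has 65 ms of lag, which exceeds the default threshold. In Node.js, `monitorEventLoopDelay` from `perf_hooks` provides a built-in histogram of the same quantity.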
Pause budget (pauseBudgetMs)
Default: 20 ms/second
The total amount of time per second the agent is allowed to spend on instrumentation-related pauses. The agent tracks cumulative pause duration within each one-second window. If the cumulative duration exceeds pauseBudgetMs, the agent enters the RED state.
This is the most direct measure of the overhead the agent introduces. If you run a latency-sensitive application, set this to a value well below your P99 latency budget for probe operations.
Cooldown duration (cooldownSec)
Default: 10 seconds
How long the agent stays suspended after entering the RED state. After the cooldown, the agent resumes instrumentation automatically.
Increase this value if you want the agent to stay out of the way for longer after a safety trigger, giving your application more time to stabilize before probes resume. Decrease it for development environments where you want faster recovery.
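The RED trigger and cooldown cycle described in this section can be sketched as a small state machine. This is an assumption-laden illustration, not the agent's code: the class name `SafetyMonitorSketch` is hypothetical, samples arriving during the cooldown are ignored, and YELLOW is left out because this page does not specify when it is entered.

```typescript
type HealthState = "GREEN" | "YELLOW" | "RED";

interface GuardrailConfig {
  maxLagMs: number;
  pauseBudgetMs: number;
  cooldownSec: number;
}

// Defaults as documented on this page.
const DEFAULT_GUARDRAILS: GuardrailConfig = {
  maxLagMs: 50,
  pauseBudgetMs: 20,
  cooldownSec: 10,
};

class SafetyMonitorSketch {
  private state: HealthState = "GREEN";
  private suspendedUntilMs = 0;
  private readonly config: GuardrailConfig;

  constructor(config: GuardrailConfig = DEFAULT_GUARDRAILS) {
    this.config = config;
  }

  // Feed one sample (current time, measured lag, pause time used this second).
  update(nowMs: number, lagMs: number, pauseMsThisSecond: number): HealthState {
    if (this.state === "RED") {
      if (nowMs < this.suspendedUntilMs) {
        return "RED"; // still cooling down; assumption: samples are ignored meanwhile
      }
      this.state = "GREEN"; // cooldown elapsed, instrumentation resumes
    }
    const lagExceeded = lagMs > this.config.maxLagMs;
    const pauseExceeded = pauseMsThisSecond > this.config.pauseBudgetMs;
    if (lagExceeded || pauseExceeded) {
      this.state = "RED";
      this.suspendedUntilMs = nowMs + this.config.cooldownSec * 1000;
    }
    return this.state;
  }

  instrumentationActive(): boolean {
    return this.state !== "RED";
  }
}
```

With the defaults, a 60 ms lag sample at t = 1 s suspends instrumentation until t = 11 s, after which the next healthy sample returns the monitor to GREEN.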
Data size limits
In addition to the rate-based guardrails, HyperProbe limits how much data is captured in each snapshot. These limits prevent large objects from causing memory pressure or OOM conditions in the host application.

| Option | Env variable | Default | What it controls |
|---|---|---|---|
| maxObjectDepth | HYPERPROBE_MAX_OBJECT_DEPTH | 3 | How many levels deep the agent serializes nested objects |
| maxArrayLength | HYPERPROBE_MAX_ARRAY_LENGTH | 3 | Maximum number of array elements captured per array |
| maxObjectProperties | HYPERPROBE_MAX_OBJECT_PROPERTIES | 50 | Maximum number of properties captured per object |
| maxStringLength | HYPERPROBE_MAX_STRING_LENGTH | 1024 | Maximum character length for captured string values |
| stackFrameDepth | HYPERPROBE_STACK_FRAME_DEPTH | 3 | Number of call stack frames captured per snapshot |

These limits apply per captured variable, not per snapshot. For example, with the default maxObjectDepth of 3, a snapshot with 10 variables serializes each one up to 3 levels deep independently. Separately, the agent caps the overall snapshot size at 2 MB before transmission.
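To show how the per-variable limits interact, here is a minimal sketch that truncates a single captured value using the documented defaults. The function name `truncateValue`, the `"[truncated]"` placeholder, and the choice to count the variable itself as the first level are all assumptions made for illustration. The 2 MB snapshot cap and stackFrameDepth are not modeled.

```typescript
interface SizeLimits {
  maxObjectDepth: number;
  maxArrayLength: number;
  maxObjectProperties: number;
  maxStringLength: number;
}

// Defaults taken from the table above.
const DEFAULT_LIMITS: SizeLimits = {
  maxObjectDepth: 3,
  maxArrayLength: 3,
  maxObjectProperties: 50,
  maxStringLength: 1024,
};

// Returns a truncated copy of one captured variable.
function truncateValue(
  value: unknown,
  limits: SizeLimits = DEFAULT_LIMITS,
  depth: number = 0
): unknown {
  if (typeof value === "string") {
    return value.length > limits.maxStringLength
      ? value.slice(0, limits.maxStringLength)
      : value;
  }
  if (value === null || typeof value !== "object") {
    return value; // numbers, booleans, undefined and similar pass through
  }
  if (depth >= limits.maxObjectDepth) {
    return "[truncated]"; // placeholder text is an assumption
  }
  if (Array.isArray(value)) {
    return value
      .slice(0, limits.maxArrayLength)
      .map((item) => truncateValue(item, limits, depth + 1));
  }
  const source = value as Record<string, unknown>;
  const result: Record<string, unknown> = {};
  const keys = Object.keys(source).slice(0, limits.maxObjectProperties);
  for (const key of keys) {
    result[key] = truncateValue(source[key], limits, depth + 1);
  }
  return result;
}
```

Because the depth counter restarts at zero for every variable, each variable gets its own full depth allowance regardless of how many others the snapshot contains. The depth cutoff also ends recursion on cyclic objects.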
