Configuring HyperProbe safety guardrails for production

HyperProbe is designed to be safe to run in production without human supervision. The agent continuously monitors its own overhead and automatically suspends all instrumentation if it detects that probes are affecting your application’s performance. You can tune the thresholds that trigger this protection, the limits that cap how much data the agent collects, and the duration of the automatic recovery period, all through configuration options or environment variables.

Agent health states

The safety monitor tracks the agent’s health in real time and assigns it one of three states:

GREEN

Normal operation. Instrumentation is active and all metrics are within configured thresholds.

YELLOW

Moderate overhead detected. Instrumentation continues, but the agent logs a warning. Review your probe configuration.

RED

Safety threshold exceeded. All instrumentation is suspended immediately for `cooldownSec` seconds, then resumes automatically.

When the agent enters the RED state, it logs the reason and suspends all active probes. After the cooldown period, it resumes and re-applies any probes that have not yet hit their limit. Your application is never paused; only the instrumentation is suspended.
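The state transitions described above can be sketched as a small state machine. This is an illustrative model only, not the agent's actual implementation: the `SafetyMonitor` class, the `Sample` shape, and in particular the `warnRatio` field are assumptions, because the threshold that triggers YELLOW is not specified here.

```typescript
// Hypothetical model of the three health states; not the agent's real code.
type Health = 'GREEN' | 'YELLOW' | 'RED';

interface Sample {
  lagMs: number;   // measured event-loop lag
  pauseMs: number; // instrumentation pause time in the current second
}

interface Limits {
  maxLagMs: number;
  pauseBudgetMs: number;
  cooldownSec: number;
  warnRatio: number; // ASSUMPTION: fraction of a limit treated as "moderate overhead"
}

class SafetyMonitor {
  private redUntilMs = 0;
  private readonly limits: Limits;

  constructor(limits: Limits) {
    this.limits = limits;
  }

  // Returns the health state for a sample taken at time nowMs.
  evaluate(sample: Sample, nowMs: number): Health {
    // While the cooldown is running, stay RED regardless of new samples.
    if (nowMs < this.redUntilMs) return 'RED';

    const { maxLagMs, pauseBudgetMs, cooldownSec, warnRatio } = this.limits;
    if (sample.lagMs > maxLagMs || sample.pauseMs > pauseBudgetMs) {
      // A hard limit was exceeded: suspend for the cooldown period.
      this.redUntilMs = nowMs + cooldownSec * 1000;
      return 'RED';
    }
    if (sample.lagMs > maxLagMs * warnRatio || sample.pauseMs > pauseBudgetMs * warnRatio) {
      return 'YELLOW';
    }
    return 'GREEN';
  }
}
```

In this sketch, samples that arrive during the cooldown are ignored, so the monitor resumes from a clean state once `cooldownSec` has elapsed.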

Guardrail options

The following options control when the safety monitor triggers. All options can be set in code or via environment variables, with environment variables taking priority.

Code

import { HyperProbe } from '@hyperprobe/node-sdk';

HyperProbe.start({
  serviceId: 'my-service',
  environment: 'production',
  brokerUrl: 'grpc://broker.example.com:50051',
  commitSha: process.env.GIT_COMMIT,

  hitsPerSec: 10,         // default: 10
  bandwidthKbPerSec: 200, // default: 200
  maxLagMs: 50,           // default: 50
  pauseBudgetMs: 20,      // default: 20
  cooldownSec: 10,        // default: 10
});

Environment variables

HYPERPROBE_HITS_SEC=10
HYPERPROBE_BANDWIDTH_KB_SEC=200
HYPERPROBE_MAX_LAG_MS=50
HYPERPROBE_PAUSE_BUDGET_MS=20
HYPERPROBE_COOLDOWN_SEC=10
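
To make the precedence rule concrete, here is a minimal sketch of how a value might be resolved when an option is set in both places. The helper names `resolveNumber` and `resolveGuardrails` are hypothetical and not part of the SDK. The handling of unparsable values is also an assumption: the sketch falls back to the code value, but the real agent may behave differently.

```typescript
// Hypothetical helpers illustrating "environment variables take priority".
type Env = Record<string, string | undefined>;

interface Guardrails {
  hitsPerSec: number;
  bandwidthKbPerSec: number;
  maxLagMs: number;
  pauseBudgetMs: number;
  cooldownSec: number;
}

// Documented defaults for the five guardrail options.
const DEFAULTS: Guardrails = {
  hitsPerSec: 10,
  bandwidthKbPerSec: 200,
  maxLagMs: 50,
  pauseBudgetMs: 20,
  cooldownSec: 10,
};

function resolveNumber(env: Env, envName: string, codeValue: number | undefined, fallback: number): number {
  const raw = env[envName];
  if (raw !== undefined && raw.trim() !== '') {
    const parsed = Number(raw);
    // ASSUMPTION: an unparsable value is ignored rather than treated as an error.
    if (Number.isFinite(parsed)) return parsed;
  }
  // No usable environment value: use the code value, then the default.
  return codeValue ?? fallback;
}

function resolveGuardrails(env: Env, code: Partial<Guardrails>): Guardrails {
  return {
    hitsPerSec: resolveNumber(env, 'HYPERPROBE_HITS_SEC', code.hitsPerSec, DEFAULTS.hitsPerSec),
    bandwidthKbPerSec: resolveNumber(env, 'HYPERPROBE_BANDWIDTH_KB_SEC', code.bandwidthKbPerSec, DEFAULTS.bandwidthKbPerSec),
    maxLagMs: resolveNumber(env, 'HYPERPROBE_MAX_LAG_MS', code.maxLagMs, DEFAULTS.maxLagMs),
    pauseBudgetMs: resolveNumber(env, 'HYPERPROBE_PAUSE_BUDGET_MS', code.pauseBudgetMs, DEFAULTS.pauseBudgetMs),
    cooldownSec: resolveNumber(env, 'HYPERPROBE_COOLDOWN_SEC', code.cooldownSec, DEFAULTS.cooldownSec),
  };
}
```

In a Node process you would pass `process.env` as the first argument. With `HYPERPROBE_HITS_SEC=5` set and `hitsPerSec: 20` in code, this sketch resolves `hitsPerSec` to 5.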

Rate limit (`hitsPerSec`)

Default: 10 hits/second

The maximum number of probe capture events the agent processes per second across all active probes. If this limit is reached, additional hits are dropped until the next second. This prevents a high-frequency probe from overwhelming the telemetry pipeline. Reduce this value on services where even a small amount of extra processing is sensitive. Increase it when you are actively debugging and need more capture throughput.
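
The drop-until-next-second behavior can be pictured as a fixed one-second window counter. The sketch below is an assumption about the mechanism, not the agent's source: the `HitLimiter` class is hypothetical, and the real agent may use a different windowing scheme.

```typescript
// Hypothetical fixed-window limiter shared by all probes.
class HitLimiter {
  private windowStartMs = -1;
  private count = 0;
  private readonly hitsPerSec: number;

  constructor(hitsPerSec: number) {
    this.hitsPerSec = hitsPerSec;
  }

  // Returns true if the hit may be processed, false if it should be dropped.
  tryAcquire(nowMs: number): boolean {
    // ASSUMPTION: windows are aligned to whole seconds.
    const windowStart = Math.floor(nowMs / 1000) * 1000;
    if (windowStart !== this.windowStartMs) {
      // A new second has started: reset the counter.
      this.windowStartMs = windowStart;
      this.count = 0;
    }
    if (this.count >= this.hitsPerSec) return false;
    this.count += 1;
    return true;
  }
}
```

A dropped hit is simply not captured; nothing is queued for later, which is what keeps the overhead bounded.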

Bandwidth cap (`bandwidthKbPerSec`)

Default: 200 KB/second

The maximum amount of telemetry data the agent sends to the broker per second. When large captured objects or many active probes exhaust this budget, the agent stops sending data until the budget resets at the start of the next second. This limit protects both your network and the broker from being flooded by snapshot payloads. If you consistently hit the bandwidth cap, reduce the data size limits (see Data size limits) rather than raising the bandwidth cap.

Event-loop lag threshold (`maxLagMs`)

Default: 50 ms

The maximum acceptable event-loop lag before the agent enters the RED state. The safety monitor measures lag continuously. If the measured lag exceeds `maxLagMs`, instrumentation is suspended immediately. Lower this value on latency-sensitive services where even 50 ms of additional event-loop pressure is unacceptable. Raise it on batch processing services where some lag is expected.
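
How the agent measures lag is not described here. One common technique in Node.js is to schedule a repeating timer and record how much later than expected each callback runs; the sketch below illustrates that idea. The functions `computeLagMs`, `exceedsLagThreshold`, and `watchLag` are hypothetical names, and this is not necessarily the method HyperProbe uses.

```typescript
// Lag is how much later than scheduled a timer callback actually ran.
function computeLagMs(scheduledAtMs: number, intervalMs: number, firedAtMs: number): number {
  return Math.max(0, firedAtMs - (scheduledAtMs + intervalMs));
}

// Mirrors the documented rule: only lag strictly above maxLagMs trips the guard.
function exceedsLagThreshold(lagMs: number, maxLagMs: number): boolean {
  return lagMs > maxLagMs;
}

// Wiring sketch: sample lag every intervalMs and report it to a callback.
// Returns a function that stops the sampling.
function watchLag(intervalMs: number, onLag: (lagMs: number) => void): () => void {
  let scheduledAt = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onLag(computeLagMs(scheduledAt, intervalMs, now));
    scheduledAt = now;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Node.js also ships `monitorEventLoopDelay` in the `perf_hooks` module, which records event-loop delay in a histogram and may be a better fit than a hand-rolled timer.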

Pause budget (`pauseBudgetMs`)

Default: 20 ms/second

The total time per second that the agent may spend on instrumentation-related pauses. The agent tracks cumulative pause duration within each one-second window and enters the RED state if the total exceeds `pauseBudgetMs`. This is the most direct measure of the overhead the agent introduces. If your application is latency-sensitive, set this to a value well below your P99 latency budget for probe operations.

Cooldown duration (`cooldownSec`)

Default: 10 seconds

How long the agent stays suspended after entering the RED state. After the cooldown, the agent resumes instrumentation automatically. Increase this value if you want the agent to stay out of the way for longer after a safety trigger, giving your application more time to stabilize before probes resume. Decrease it for development environments where you want faster recovery.

Data size limits

In addition to the rate-based guardrails, HyperProbe limits how much data is captured in each snapshot. These limits prevent large objects from causing memory pressure or OOM conditions in the host application.

| Option | Env variable | Default | What it controls |
| --- | --- | --- | --- |
| `maxObjectDepth` | `HYPERPROBE_MAX_OBJECT_DEPTH` | `3` | How many levels deep the agent serializes nested objects |
| `maxArrayLength` | `HYPERPROBE_MAX_ARRAY_LENGTH` | `3` | Maximum number of array elements captured per array |
| `maxObjectProperties` | `HYPERPROBE_MAX_OBJECT_PROPERTIES` | `50` | Maximum number of properties captured per object |
| `maxStringLength` | `HYPERPROBE_MAX_STRING_LENGTH` | `1024` | Maximum character length for captured string values |
| `stackFrameDepth` | `HYPERPROBE_STACK_FRAME_DEPTH` | `3` | Number of call stack frames captured per snapshot |

Apart from `stackFrameDepth`, which applies once per snapshot, these limits apply per captured variable, not per snapshot. With the default `maxObjectDepth` of 3, a snapshot with 10 variables serializes each one up to 3 levels deep independently. The overall snapshot size is capped at 2 MB by the agent before transmission.
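
To illustrate how the per-variable limits interact, here is a minimal sketch of a truncating capture function. It is an assumption-laden illustration, not the agent's serializer: the `capture` function, the `SizeLimits` type, and the `'[truncated]'` placeholder are hypothetical, and the sketch counts the top-level value as depth 0, which the agent may not.

```typescript
interface SizeLimits {
  maxObjectDepth: number;
  maxArrayLength: number;
  maxObjectProperties: number;
  maxStringLength: number;
}

// Documented defaults from the table above.
const DEFAULT_LIMITS: SizeLimits = {
  maxObjectDepth: 3,
  maxArrayLength: 3,
  maxObjectProperties: 50,
  maxStringLength: 1024,
};

// Returns a copy of value, cut off at the configured limits.
function capture(value: unknown, limits: SizeLimits = DEFAULT_LIMITS, depth = 0): unknown {
  if (typeof value === 'string') return value.slice(0, limits.maxStringLength);
  if (value === null || typeof value !== 'object') return value;

  // ASSUMPTION: containers nested deeper than the limit become a placeholder.
  if (depth >= limits.maxObjectDepth) return '[truncated]';

  if (Array.isArray(value)) {
    // Keep only the first maxArrayLength elements.
    return value.slice(0, limits.maxArrayLength).map((item) => capture(item, limits, depth + 1));
  }

  const out: Record<string, unknown> = {};
  // Keep only the first maxObjectProperties own enumerable keys.
  for (const key of Object.keys(value).slice(0, limits.maxObjectProperties)) {
    out[key] = capture((value as Record<string, unknown>)[key], limits, depth + 1);
  }
  return out;
}
```

Because the depth limit bounds recursion, a sketch like this also terminates on objects that reference themselves. Each captured variable would be passed through `capture` separately, which is why the limits apply per variable.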

Tuning for your environment

High-traffic production service

On a service handling thousands of requests per second, tighten the rate and lag limits to minimize any performance impact:

HyperProbe.start({
  serviceId: 'high-traffic-api',
  environment: 'production',
  brokerUrl: 'grpc://broker.example.com:50051',
  commitSha: process.env.GIT_COMMIT,

  hitsPerSec: 5,          // fewer captures per second
  bandwidthKbPerSec: 100, // tighter bandwidth cap
  maxLagMs: 30,           // more sensitive to event-loop lag
  pauseBudgetMs: 10,      // tighter pause budget
  cooldownSec: 30,        // longer recovery period after RED
});

Pair this configuration with precise probe conditions to ensure each hit is intentional rather than relying on the rate limiter to shed load.

Active debugging session

When you are actively investigating a bug and want richer, faster captures, you can relax the defaults:

HyperProbe.start({
  serviceId: 'staging-api',
  environment: 'staging',
  brokerUrl: 'grpc://broker.example.com:50051',
  commitSha: process.env.GIT_COMMIT,

  hitsPerSec: 50,
  bandwidthKbPerSec: 500,
  maxLagMs: 200,
  pauseBudgetMs: 50,
  cooldownSec: 5,

  // Capture more data per snapshot
  maxObjectDepth: 5,
  maxArrayLength: 10,
  stackFrameDepth: 10,
  maxStringLength: 4096,
});

Do not use relaxed safety settings in production. Higher limits mean the agent can introduce more overhead before the safety monitor intervenes. Reserve these settings for staging or local environments.
