Skip to main content

Prometheus Metrics

The transport ships built-in Prometheus metrics covering throughput, handler latency, consumer lag, publish errors, dead letters, and connection health. Metrics are written to a prom-client registry — the de-facto standard in the NestJS ecosystem — so any /metrics endpoint exporter picks them up without additional wiring.

Why this exists

Traces tell you what happened to one message; metrics tell you what's happening to the system as a whole. NATS JetStream is a queue, a stream store, and an RPC bus rolled into one, and operators need to know things like "is my consumer falling behind?" or "what's the p99 handler latency for orders.created over the last hour?" Those are aggregate questions — Prometheus territory, not APM territory.

The library exposes everything an operator typically alerts on:

  • Throughput: messages received, processed, published, and dead-lettered per second.
  • Latency: handler duration, publish duration, and RPC round-trip duration as histograms.
  • Lag: consumer num_pending, num_ack_pending, and stream size as gauges.
  • Health: connection state, RPC timeouts, transport errors classified by context.

Default labels stay bounded (declared patterns, enum-typed statuses, named kinds), so dashboards stay performant as your handler count grows.

Setup

Install the optional peer dependency once:

pnpm add prom-client

The transport declares prom-client as an optional peer. If you do not enable metrics, the package is never imported. Tested against prom-client@^15.

Enable metrics in forRoot:

import { JetstreamModule } from '@horizon-republic/nestjs-jetstream';

@Module({
imports: [
JetstreamModule.forRoot({
name: 'orders',
servers: ['nats://localhost:4222'],
metrics: true,
}),
],
})
export class AppModule {}

…or with forRootAsync:

JetstreamModule.forRootAsync({
name: 'orders',
imports: [ConfigModule],
inject: [ConfigService],
useFactory: (config: ConfigService) => ({
servers: config.get<string[]>('NATS_SERVERS')!,
metrics: true,
}),
})

That's the whole integration. Metrics write to prom-client's global register. To expose them at /metrics, pair the transport with @willsoto/nestjs-prometheus:

import { PrometheusModule } from '@willsoto/nestjs-prometheus';

@Module({
imports: [
PrometheusModule.register(), // exposes /metrics
JetstreamModule.forRoot({
name: 'orders',
servers: ['nats://localhost:4222'],
metrics: true, // writes to the same global register
}),
],
})
export class AppModule {}

curl localhost:3000/metrics returns your application's metrics and the transport's metrics in a single Prometheus exposition.

Full configuration

import { Registry } from 'prom-client';

JetstreamModule.forRoot({
// ...
metrics: {
register: customRegister, // default: prom-client's global register
prefix: 'jetstream_', // default: 'jetstream_'
defaultLabels: { service: 'orders', env: 'prod' },
pollInterval: 15_000, // default: 15s; set 0 to disable gauge polling
buckets: {
handlerDuration: [0.001, 0.01, 0.1, 1, 10],
publishDuration: [0.001, 0.01, 0.1, 1, 10],
rpcDuration: [0.001, 0.01, 0.1, 1, 10],
},
},
})

Disabled by default — zero overhead

When the metrics option is omitted or set to false:

  • prom-client is never imported (the dynamic import() only runs when metrics is truthy).
  • The transport's hot paths add ~30 nanoseconds per message (a single Map.get to check if a listener exists) — effectively free.

Production deployments that don't need metrics pay nothing for the feature.

Metric catalog

All metric names are prefixed with jetstream_ (configurable). defaultLabels from the config are merged into every metric.

Counters

NameLabelsDescription
jetstream_messages_received_totalstream, subject, kindMessages routed to a handler.
jetstream_messages_processed_totalstream, subject, kind, statusHandler invocations that completed. statussuccess, error, retried, terminated.
jetstream_messages_unhandled_totalsubject (literal <unmatched>)Messages with no matching handler.
jetstream_messages_dead_letter_totalstream, subjectMessages that exhausted all delivery attempts.
jetstream_publish_totalsubject, kind, statusClient-side publish operations. statussuccess, error.
jetstream_rpc_timeout_totalsubjectRPC calls that exceeded the timeout deadline.
jetstream_consumer_recovered_totalkindSelf-healing recoveries after consume-loop failures.
jetstream_errors_totalcontextTransport-level errors. contextconnection, codec, publish, consume, handler, shutdown, other.
jetstream_metrics_poll_errors_totaltargetErrors hit while polling for gauge data. targetconsumer.info, stream.info, jsm.connect.

Histograms

NameLabelsSource
jetstream_handler_duration_secondsstream, subject, kind, statusWall-clock duration from handler entry to settlement.
jetstream_publish_duration_secondssubject, kind, statusWall-clock duration of client publish operations.
jetstream_rpc_duration_secondssubject, statusFull RPC round-trip from caller's perspective. status includes timeout.

Default buckets (in seconds): [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]. Cover sub-millisecond RPC up to ten-second batch handlers. Override via metrics.buckets.

Gauges (polled)

Refreshed every pollInterval ms by querying JetStreamManager.consumers.info() and JetStreamManager.streams.info() for every consumer this service owns.

NameLabelsSource field
jetstream_consumer_num_pendingstream, consumer, kindConsumerInfo.num_pending
jetstream_consumer_num_ack_pendingstream, consumer, kindConsumerInfo.num_ack_pending
jetstream_consumer_num_redeliveredstream, consumer, kindConsumerInfo.num_redelivered
jetstream_consumer_num_waitingstream, consumer, kindConsumerInfo.num_waiting
jetstream_stream_messagesstreamStreamInfo.state.messages
jetstream_stream_bytesstreamStreamInfo.state.bytes
jetstream_connection_upserver1 while connected, 0 after disconnect.

Setting pollInterval: 0 (or false) disables the polling loop entirely. Counter and histogram metrics continue to update from event hooks.

Label values

Every label value is bounded — no free-form data ever reaches a label. The subject label is sourced from the declared pattern in @EventPattern / @MessagePattern, not from the wire NATS subject, so cardinality is bounded by your handler count.

  • kindevent, command, broadcast, ordered.
  • status on handler metrics — success, error, retried, terminated.
  • status on publish metrics — success, error.
  • status on RPC round-trip metrics — success, error, timeout.
  • context on jetstream_errors_totalconnection, codec, publish, consume, handler, shutdown, other.

PromQL examples

Throughput — messages received per second by stream kind:

sum by (kind) (rate(jetstream_messages_received_total[1m]))

p99 handler latency for a specific event:

histogram_quantile(
0.99,
sum by (le) (
rate(jetstream_handler_duration_seconds_bucket{subject="orders.created"}[5m])
)
)

Error rate per handler:

sum by (subject) (rate(jetstream_messages_processed_total{status="error"}[5m]))
/ sum by (subject) (rate(jetstream_messages_processed_total[5m]))

Alert if consumer lag exceeds 10k pending messages:

jetstream_consumer_num_pending > 10000

Alert if NATS connection has been down for 5 minutes:

jetstream_connection_up == 0

p95 RPC round-trip latency excluding timeouts:

histogram_quantile(
0.95,
sum by (subject, le) (
rate(jetstream_rpc_duration_seconds_bucket{status!="timeout"}[5m])
)
)

Polling behavior

The polling loop pulls consumer.info() and streams.info() from the NATS server at the configured pollInterval. The loop is deliberately conservative:

  • Backpressure: if a previous tick has not completed by the time the next interval fires, the new tick is skipped (no queueing). A warn log is emitted so operators can tell when the configured interval is too aggressive for the load.
  • Per-target error isolation: a failing consumer.info() call increments jetstream_metrics_poll_errors_total{target="consumer.info"} but does not abort the remainder of the cycle. Streams and other consumers in the same tick still update.
  • Graceful shutdown: OnModuleDestroy cancels the timer and awaits the in-flight tick before resolving, so the process exits cleanly.
  • Connection-loss tolerance: while NATS is disconnected, polling fails fast and increments the poll-error counter. Gauges become stale (not zero) — which is the correct semantic: we do not know the values, so we do not lie about them.

The Command (RPC) consumer is only polled in JetStream RPC mode. Core RPC mode does not create a JetStream stream for commands, so there is nothing to poll. Ordered consumers are ephemeral and do not have a stable durable name, so they are excluded from polling — use jetstream_messages_processed_total{kind="ordered"} to monitor ordered throughput instead.

Cardinality safety

Prometheus performance degrades sharply with high cardinality, so the design avoids unbounded labels:

  • subject uses the declared pattern (e.g. orders.created) from @EventPattern, never the wire NATS subject. Subject wildcards are not supported in handler patterns — the router uses exact-match lookup — so declared and wire subjects coincide today. Pinning to the declared form future-proofs the metric.
  • kind, status, and context are all enum-typed with a small bounded set of values.
  • stream and consumer are deterministic functions of serviceName and StreamKind.
  • server is bounded by the NATS cluster size.
  • For unmatched messages, subject="<unmatched>" is used as a single sentinel rather than the actual subject — preventing an attacker from blowing up cardinality by publishing to random subjects.

If you set defaultLabels with high-cardinality values (e.g. per-request IDs), Prometheus performance is on you — the transport never injects unbounded labels itself.

Performance characteristics

Per-message overhead with metrics enabled:

  • performance.now() at handler entry.
  • EventBus.emit(HandlerCompleted, ...) after settlement — Map.get + callback invocation.
  • PatternRegistry.resolveDeclared() (a Map.get) inside the metrics service.
  • Counter.inc() + 1× Histogram.observe() in prom-client.

Aggregate cost: ~5–10 microseconds per message on modern hardware. For a service handling 10,000 messages per second, that is ~50ms of CPU time per second of wall clock — well under 5% overhead and dominated by the NATS round-trip itself.

Polling cost: 1 consumer.info + 1 streams.info per consumer kind every pollInterval ms. For a service with both event and RPC handlers in JetStream mode, that's roughly 0.27 NATS requests per second at the default 15s interval — negligible.

Memory: each metric family allocates roughly 200 bytes per label combination. With ~100 declared subjects × 4 statuses × 4 kinds, all 19 metric families together stay under ~10 MB of heap.

Custom hooks alongside metrics

Adding metrics: true does not interfere with user-provided hooks. The transport delivers events to both your hooks and the metrics service:

JetstreamModule.forRoot({
name: 'orders',
servers: ['nats://localhost:4222'],
metrics: true,
hooks: {
[TransportEvent.DeadLetter]: async (info) => {
await sentry.captureMessage(`Dead letter: ${info.subject}`, { extra: info });
},
},
});

The HandlerCompleted, Published, and RpcCompleted events used internally by the metrics service are also part of the public TransportHooks surface — register your own listeners if you want to add custom alerting on top of the built-in counters.

Multi-registry deployments

If your application exposes multiple /metrics endpoints (e.g. one per tenant), pass a dedicated Registry instance per metrics endpoint. The transport writes to whatever registry you supply:

import { Registry } from 'prom-client';

const tenantRegister = new Registry();

JetstreamModule.forRoot({
// ...
metrics: { register: tenantRegister },
});

Multiple JetstreamModule instances are not currently supported (the library expects a single connection per service), so multi-registry use cases are limited to running the transport once with metrics scoped to a non-default registry.