Prometheus Metrics

Prometheus is an open-source monitoring system that collects and stores metrics as time series data. Its stable API is supported by a wide range of systems and tools, including Grafana.

Enterprise Feature

This feature is only available to our Enterprise tier customers. To get started, please reach out to our Enterprise team.

APIs

Groq exposes Prometheus metrics about your organization's usage through VictoriaMetrics. It supports most Prometheus querying API paths:

  • /api/v1/query
  • /api/v1/query_range
  • /api/v1/series
  • /api/v1/labels
  • /api/v1/label/<label_name>/values
  • /api/v1/status/tsdb

MetricsQL

Prometheus queries against Groq endpoints use MetricsQL, a query language that extends Prometheus's native PromQL.
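
MetricsQL is backwards compatible with PromQL, so standard expressions work unchanged and MetricsQL extensions are also available. As an illustration (assuming the requests:rate1m metric listed under Available Metrics below), a standard aggregation such as sum by (model) (requests:rate1m) works as-is, and a MetricsQL-specific operator such as default can fill gaps, e.g. requests:rate1m default 0.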

Querying

Queries can be sent to the following endpoint:

https://api.groq.com/v1/metrics/prometheus

To authenticate, provide your Groq API key in a header of the form Authorization: Bearer <your-api-key>.
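
As a minimal sketch of what a query looks like (assuming Python with the requests library, the requests:rate1m metric described under Available Metrics below, and that the Prometheus API paths listed above are appended to the endpoint, as a Prometheus datasource would do):

  import os
  import time
  import requests

  # Groq's Prometheus-compatible query endpoint (see above).
  BASE_URL = "https://api.groq.com/v1/metrics/prometheus"

  # Authenticate with your Groq API key as a Bearer token.
  # Assumes GROQ_API_KEY is set in the environment.
  headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

  # Instant query: average requests per second over the last minute, per model.
  resp = requests.get(
      f"{BASE_URL}/api/v1/query",
      params={"query": "sum by (model) (requests:rate1m)"},
      headers=headers,
      timeout=30,
  )
  resp.raise_for_status()
  for result in resp.json()["data"]["result"]:
      print(result["metric"].get("model"), result["value"][1])

  # Range query: the same expression evaluated over the last hour at 60s resolution.
  now = int(time.time())
  resp = requests.get(
      f"{BASE_URL}/api/v1/query_range",
      params={
          "query": "sum by (model) (requests:rate1m)",
          "start": now - 3600,
          "end": now,
          "step": "60s",
      },
      headers=headers,
      timeout=30,
  )
  resp.raise_for_status()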

Grafana

If you run Grafana, you can add Groq metrics as a Prometheus datasource:

  1. Add a new Prometheus datasource in Grafana by navigating to Settings -> Data Sources -> Add data source -> Prometheus.
  2. Enter the following URL under HTTP -> URL: https://api.groq.com/v1/metrics/prometheus
  3. Set the Authorization header to your Groq API key:
     • Go to Custom HTTP Headers -> Add Header
     • Header: Authorization
     • Value: Bearer <your-api-key>
  4. Save & Test.
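
If you manage Grafana programmatically rather than through the UI, an equivalent data source can be created via Grafana's HTTP API. The sketch below is an illustration only: it assumes a Grafana service account token with permission to create data sources, a hypothetical Grafana URL, and uses Grafana's custom-HTTP-header fields (httpHeaderName1 / httpHeaderValue1) to carry the Groq API key.

  import os
  import requests

  GRAFANA_URL = "https://grafana.example.com"  # hypothetical Grafana instance URL

  resp = requests.post(
      f"{GRAFANA_URL}/api/datasources",
      # Assumes GRAFANA_TOKEN and GROQ_API_KEY are set in the environment.
      headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
      json={
          "name": "Groq Metrics",
          "type": "prometheus",
          "access": "proxy",
          "url": "https://api.groq.com/v1/metrics/prometheus",
          # Custom HTTP header that forwards your Groq API key on every query.
          "jsonData": {"httpHeaderName1": "Authorization"},
          "secureJsonData": {"httpHeaderValue1": f"Bearer {os.environ['GROQ_API_KEY']}"},
      },
      timeout=30,
  )
  resp.raise_for_status()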

Available Metrics

Groq provides the following metrics:

Request Metrics

  • requests:increase1m
    • The number of requests made within a minute
  • requests:rate1m
    • The average number of requests per second over a given minute

Broken out by model and status_code.
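
These labels follow standard PromQL selector and grouping syntax, so (as an illustration, with <model-name> as a placeholder) you can filter with requests:increase1m{model="<model-name>"} or aggregate with sum by (status_code) (requests:rate1m).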

Latency Metrics

  • e2e_latency_seconds:{percentile}:rate5m
    • End-to-end latency at the given percentile (P99, P95, or P50), averaged over a 5-minute window
  • ttft_latency_seconds:{percentile}:rate5m
    • Time-to-first-token latency at the given percentile (P99, P95, or P50), averaged over a 5-minute window
  • queue_latency_seconds:{percentile}:rate5m
    • Queue latency (the time a request spends in the queue before being processed) at the given percentile (P99, P95, or P50), averaged over a 5-minute window

Broken out by model.

Token Metrics

  • tokens_in:{percentile}:rate5m
    • Input token count at the given percentile (P99, P95, or P50), averaged over a 5-minute window
  • tokens_out:{percentile}:rate5m
    • Output token count at the given percentile (P99, P95, or P50), averaged over a 5-minute window

Broken out by model.

In addition to querying the APIs directly, you can view a handful of curated charts in our console on the Metrics page.
