Prometheus is an open-source monitoring system that collects and stores metrics as time series data. Its stable API is compatible with a range of systems and tools like Grafana.
This feature is only available to our Enterprise tier customers. To get started, please reach out to our Enterprise team.
Groq exposes Prometheus metrics about your organization's usage through VictoriaMetrics. It supports most Prometheus querying API paths:
- /api/v1/query
- /api/v1/query_range
- /api/v1/series
- /api/v1/labels
- /api/v1/label/<label_name>/values
- /api/v1/status/tsdb

Prometheus queries against Groq endpoints use MetricsQL, a query language that extends Prometheus's native PromQL.
Queries can be sent to the following endpoint:
https://api.groq.com/v1/metrics/prometheus
To authenticate, provide your Groq API key in the Authorization header using the Authorization: Bearer <your-api-key> format.
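As a quick way to test access, the sketch below sends an instant query with Python's requests library. It assumes the Prometheus API paths listed above are appended to the base URL, and that your key is stored in a GROQ_API_KEY environment variable; both of those details are assumptions, not prescribed names. The query itself is the total-requests example shown later in this page.

```python
import os
import requests

# Assumption: the Prometheus API paths listed above are appended to this base URL.
BASE_URL = "https://api.groq.com/v1/metrics/prometheus"

# Assumption: the API key is stored in the GROQ_API_KEY environment variable.
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

# Instant query: total request rate across all models and projects
# (see the metric list and example queries below).
resp = requests.get(
    f"{BASE_URL}/api/v1/query",
    headers=HEADERS,
    params={"query": "sum(model_project_id_status_code:requests:rate5m)"},
)
resp.raise_for_status()
print(resp.json()["data"]["result"])
```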
If you run Grafana, you can add Groq metrics as a Prometheus datasource:
Set the data source URL to https://api.groq.com/v1/metrics/prometheus and set the Authorization header to your Groq API key: Authorization: Bearer <your-api-key>.

All metrics are broken out by model and project id. Some metrics are also broken out by status code and le (for use with histogram_quantile). Metric names are prefixed with their labels and provided as rate5m (a rate over a 5-minute window).
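If you prefer to explore the available label values programmatically rather than through Grafana, a minimal sketch using the /api/v1/label/<label_name>/values path listed above (same base-URL and environment-variable assumptions as the earlier example) might look like this:

```python
import os
import requests

BASE_URL = "https://api.groq.com/v1/metrics/prometheus"  # assumed base for the API paths
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}  # assumed env var name

# List the values of the `model` label described above; the same call works
# for the other label names on your metrics.
resp = requests.get(f"{BASE_URL}/api/v1/label/model/values", headers=HEADERS)
resp.raise_for_status()
print(resp.json()["data"])
```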
In addition to using the APIs directly, you can view a handful of curated charts in our console on the Metrics page.
Groq provides the following metrics:
- model_project_id_status_code:requests:rate5m
- le_model_project_id:tokens_in_bucket:rate5m
- le_model_project_id:tokens_out_bucket:rate5m
- model_project_id:tokens_in:rate5m
- model_project_id:tokens_out:rate5m
- le_model_project_id:queue_latency_seconds_bucket:rate5m
- le_model_project_id:ttft_latency_seconds_bucket:rate5m
- le_model_project_id:e2e_latency_seconds_bucket:rate5m
- le_model_project_id:prompt_cache_hits_bucket:rate5m
- model_project_id:prompt_cache_hits:rate5m
- model_project_id:prompt_cache_misses:rate5m

Total requests across all models and projects:
sum(model_project_id_status_code:requests:rate5m)
P99 E2E latency across all projects, for a specific model:
histogram_quantile(0.99, sum by(le) (le_model_project_id:e2e_latency_seconds_bucket:rate5m{model="llama-3.1-8b-instant"}))
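To chart that same P99 latency over time rather than at a single instant, a hedged sketch using the /api/v1/query_range path (same base-URL and environment-variable assumptions as the earlier examples; the one-hour window and 5-minute step are arbitrary choices) could look like:

```python
import os
import time
import requests

BASE_URL = "https://api.groq.com/v1/metrics/prometheus"  # assumed base for the API paths
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}  # assumed env var name

# P99 E2E latency for one model, evaluated over the last hour at a 5-minute step.
query = (
    'histogram_quantile(0.99, sum by(le) '
    '(le_model_project_id:e2e_latency_seconds_bucket:rate5m{model="llama-3.1-8b-instant"}))'
)
now = int(time.time())
resp = requests.get(
    f"{BASE_URL}/api/v1/query_range",
    headers=HEADERS,
    params={"query": query, "start": now - 3600, "end": now, "step": "5m"},
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for timestamp, value in series["values"]:
        print(timestamp, value)
```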