# Architecture
Guiding principles:
- Grafana Cloud is used to reduce the amount of infrastructure to maintain. Nothing prevents using an on-premises Grafana stack (although this wasn’t tested)
- You are encouraged to tweak the default dashboards and panels, and to add new Vector transformations, metrics and alerts
- The components are optional and are considered to be building blocks that you can mix and match. You do not have to use the PostgreSQL job cache if you can live without the TreeView panel and do not need to ask complex questions that require ad hoc SQL queries. If you do not need to visualize orchestrations as traces, you can drop the Tempo Relay service and Grafana Agent
- Some questions are easier to answer with SQL queries, some with LogQL (see the SQL sketch after this list)
- Once configured, Salt Master settings don’t need to be changed to add a new metric, transformation or log destination. All transforms are done outside of Salt.
- Salt orchestrations are the primary units of work. For now this means that execution module calls and `state.apply`/`highstate` aren’t supported by the event transformations and the default Grafana dashboards shipped with the project. Future iterations might add support for module calls and highstates. We have other interesting ideas once #62683 is implemented.
- It might be possible to replace the Tempo Relay service and Grafana Agent with just Vector, once it gains better support for traces (see #1444, #12029 and #14747)
- PostgreSQL is used because Loki can’t store large blobs of data (the limit is 64 KB). Salt job returns (especially for orchestrations) can easily exceed that
- Although it wasn’t tested, the system should support multiple Salt masters, except that each master should have its own dedicated Job Cache DB, unless the cache is intentionally shared for an HA setup
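For example, a question like “which orchestration runs finished in the last day, and what exactly did they return?” is a natural fit for an ad hoc SQL query against the job cache DB. Below is a minimal sketch, assuming a psycopg2 connection and a pgjsonb-style `salt_returns` table; the two custom views mentioned above may expose different table and column names.

```python
# A hedged sketch, not the project's actual schema: it assumes a pgjsonb-style
# `salt_returns` table (jid, fun, success, full_ret, alter_time). The custom
# views shipped with the project may expose friendlier names.
import json

import psycopg2

conn = psycopg2.connect("dbname=salt user=salt host=localhost")
with conn, conn.cursor() as cur:
    # "Which orchestration runs finished in the last 24 hours, and did they succeed?"
    cur.execute(
        """
        SELECT jid, success, full_ret
        FROM salt_returns
        WHERE fun = 'runner.state.orchestrate'
          AND alter_time > now() - interval '1 day'
        ORDER BY alter_time DESC
        """
    )
    for jid, success, full_ret in cur.fetchall():
        # full_ret is a JSONB document, returned by psycopg2 as a Python dict
        print(jid, success, json.dumps(full_ret)[:200])
conn.close()
```

A LogQL query over the summarized returns in Loki can answer the “how many jobs failed lately” flavour of this question, but digging into the nested return data is where SQL shines, since the full returns only live in PostgreSQL.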
How the full system is wired together:

- Salt Master controls one or more minions and runs various orchestration jobs
- JSONB Job Cache (master returner) sends job returns to a PostgreSQL database. The DB schema has two custom views that make ad hoc queries easier

- Vector Engine sends Salt events to Vector (as JSON over TCP); a rough sketch of such an engine is shown after this list
- Vector transforms incoming events:
    - Injects a random traceID field, adds some tags
    - Forwards events to the Tempo Relay service via HTTP
    - Sends summarized job returns (without the nested `data` field) to Loki
    - Generates two Prometheus metrics (job duration and job status)

- Tempo Relay service transforms orchestration job returns into traces (using opentelemetry-sdk) and submits them to Grafana Agent; a sketch of this mapping is shown below as well
- Grafana Agent sends traces to Grafana Tempo
- Grafana uses all four data sources (Loki, Prometheus, Tempo, PostgreSQL) to display the various dashboards and to navigate between them
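The Vector Engine step boils down to streaming the master event bus to a TCP socket as newline-delimited JSON, which a Vector `socket` source can consume. The sketch below is an assumption-laden stand-in for the engine shipped with the project: the host, port and event field names follow Salt’s usual conventions, not the project’s actual code.

```python
# A minimal Salt engine sketch, NOT the project's real Vector Engine.
# It forwards every master event as one JSON line over TCP, matching a Vector
# `socket` source configured for newline-delimited JSON decoding.
import json
import socket

import salt.utils.event


def start(host="127.0.0.1", port=9000):
    """Salt engines expose a start() entry point; this one runs until killed."""
    event_bus = salt.utils.event.get_event(
        "master", opts=__opts__, listen=True  # __opts__ is injected by Salt's loader
    )
    with socket.create_connection((host, port)) as sock:
        for event in event_bus.iter_events(full=True):
            if not event:
                continue
            line = json.dumps({"tag": event["tag"], "data": event["data"]})
            sock.sendall(line.encode("utf-8") + b"\n")
```

Keeping the forwarding this dumb is what makes the “Salt Master settings don’t need to be changed” principle work: tagging, routing and metric generation all live in Vector’s configuration, not on the master.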
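For the Tempo Relay step, the core idea is mapping one orchestration job return onto a root span with one child span per state result, then handing the spans to Grafana Agent over OTLP. The sketch below uses opentelemetry-sdk as described above, but the endpoint, the service name and the shape of the incoming `event` dict are assumptions, not the relay’s actual interface.

```python
# A hedged sketch of the Tempo Relay's core idea, not its actual implementation.
# Assumptions: Grafana Agent accepts OTLP/gRPC on localhost:4317, and the incoming
# `event` dict looks like a Salt orchestration return:
#   {"jid": ..., "fun": ..., "success": ..., "return": {state_id: {...}, ...}}
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "salt-orchestration"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("tempo-relay")


def job_return_to_trace(event: dict) -> None:
    """Emit one root span per orchestration run and one child span per state result."""
    with tracer.start_as_current_span(event["fun"]) as root:
        root.set_attribute("salt.jid", event["jid"])
        root.set_attribute("salt.success", bool(event.get("success", False)))
        for state_id, state in event.get("return", {}).items():
            with tracer.start_as_current_span(state_id) as child:
                child.set_attribute("salt.state.result", bool(state.get("result", False)))
                child.set_attribute("salt.state.duration_ms", float(state.get("duration", 0.0)))
```

The real relay presumably reuses the traceID that Vector injects so logs and traces can be correlated, and the SDK also accepts explicit start and end timestamps, which is how recorded state durations can be turned into realistic span timings.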