External Agents

Overview

External agents are long-running processes that connect to JustIAM via a persistent gRPC stream and execute scheduled tasks on behalf of a tenant. Unlike the built-in inline worker (which runs scripts inside the backend process), agents:

  • Run outside the JustIAM cluster — on-premises, in a private cloud, or in any Kubernetes namespace you control.
  • Require no inbound connectivity — the agent dials out to the backend; no port forwarding or firewall rules are needed.
  • Hold no credentials — task secrets and config are pre-resolved by the backend and injected into each TaskAssignment message. The agent never touches your database.

This makes agents ideal for tasks that need access to internal infrastructure (private APIs, on-premises directories, VPN-only services) without exposing those resources to the JustIAM cluster.


Architecture

  ┌───────────────────────┐
  │  JustIAM cluster      │
  │                       │      gRPC stream       ┌──────────────┐
  │  ┌─────────┐          │◄───────────────────────┤ agent binary │
  │  │ Backend │          │     TaskAssignment     │              │
  │  │ :9090   │          │───────────────────────►│  (anywhere)  │
  │  └─────────┘          │      TaskResult        └──────────────┘
  └───────────────────────┘
  1. The agent connects to BACKEND_GRPC_ADDR (gRPC, TLS or plaintext).
  2. On connect it sends a RegisterAgent message containing its slug, worker type, and max concurrency.
  3. The backend dispatches TaskAssignment messages to the agent. The assignment includes the compiled script, injected secrets and config — everything needed to run without database access.
  4. The agent executes the script and sends back a TaskResult (output + status).
  5. A heartbeat ping is exchanged every 20 s to keep the stream alive through load balancers and firewalls.
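The register/assign/result exchange above can be sketched with simplified stand-in types. These structs and field names are illustrative, not the real protobuf definitions:

```go
package main

import "fmt"

// Simplified stand-ins for the real protobuf messages (illustrative only).
type RegisterAgent struct {
	Slug          string // tenant slug, or "*" for shared agents
	WorkerType    string // e.g. "task"
	MaxConcurrent int
}

type TaskAssignment struct {
	TaskID  string
	Script  string            // compiled script, ready to run
	Secrets map[string]string // pre-resolved by the backend
}

type TaskResult struct {
	TaskID string
	Status string // "success" or "error"
	Output string
}

// execute runs one assignment and produces the result sent back on the stream.
func execute(a TaskAssignment) TaskResult {
	// A real agent would run a.Script in the Yaegi interpreter here.
	return TaskResult{TaskID: a.TaskID, Status: "success", Output: "done"}
}

func main() {
	reg := RegisterAgent{Slug: "acme", WorkerType: "task", MaxConcurrent: 4}
	fmt.Printf("register: %+v\n", reg)

	res := execute(TaskAssignment{TaskID: "t-1", Script: "..."})
	fmt.Printf("result: %+v\n", res)
}
```

The key property to notice: TaskAssignment carries everything the script needs, so the agent holds no credentials of its own.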

Agent tokens

Each agent authenticates with a per-tenant agent token. Tokens are managed from Infrastructure → Agent Tokens in the admin UI, or via the API / Terraform.

Create via admin UI:

  1. Go to Infrastructure → Agent Tokens.
  2. Click New Token, enter a descriptive name (e.g. on-prem-agent).
  3. Copy the token value — it is shown once and not stored in plain text.

List via API:

GET /api/v1/agent-tokens?page=1&limit=25&search=on-prem
Authorization: Bearer <admin-token>

Returns a paginated list of active tokens. Response shape:

{
  "data": [ /* AgentTokenRecord objects */ ],
  "total": 3,
  "page": 1,
  "limit": 25
}

Create via API:

POST /api/v1/agent-tokens
Authorization: Bearer <admin-token>
Content-Type: application/json

{"name": "on-prem-agent"}

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "on-prem-agent",
  "token": "agent_fe04c34fe8070a1d8face1d4787197d03b03cfee25ed9be4dbe9509d50f91024",
  "created_at": "2026-04-21T00:00:00Z"
}

Create via Terraform:

resource "justiam_agent_token" "on_prem" {
  name = "on-prem-agent"
}

# Store the raw token in a secret manager or output it once:
output "agent_token" {
  value     = justiam_agent_token.on_prem.token
  sensitive = true
}

Enabling agent execution for a tenant

Set worker_mode = agent in the tenant's configuration:

Kubernetes Secret (justiam-tenant-<slug>):

apiVersion: v1
kind: Secret
metadata:
  name: justiam-tenant-acme
  namespace: justiam-mt
type: Opaque
stringData:
  db_url: "postgres://acme:secret@postgres:5432/acme?sslmode=require"
  jwt_secret: "long-random-string"
  worker_mode: "agent"
  worker_agent_pool: "dedicated"   # "shared" or "dedicated"

Agent pool types

  Pool        Description
  shared      Tasks are dispatched to the platform-wide shared agent pool. Individual agent instances are not exposed per-tenant.
  dedicated   Tasks are dispatched only to agents whose TENANT_SLUG matches this tenant's slug.

When worker_mode = agent:

  • Scheduled tasks are dispatched according to worker_agent_pool (see table above).
  • If no agent with capacity is available the task run is marked error immediately (no silent fallback to inline execution).

Running an agent

Binary environment variables

Variable Default Description
AGENT_TOKEN (required for dedicated agents) Token created in Infrastructure → Agent Tokens. Not needed when SHARED_AGENT_PLATFORM_KEY is set.
SHARED_AGENT_PLATFORM_KEY (unset) Hex-encoded 32-byte HMAC key for shared agents. When set, the agent self-registers and AGENT_TOKEN is not required.
AGENT_NAME POD_NAME or hostname Stable, human-readable identity shown in the Workers UI and stored in agent_connections. Set to metadata.name via fieldRef in k8s; for dedicated agents provisioned by the controlplane it is set to the node name (e.g. default).
BACKEND_GRPC_ADDR localhost:9090 host:port of the backend gRPC server
BACKEND_URL http://localhost:8080 HTTP base URL for IDP API calls made by task scripts (idp.* helpers)
GRPC_TLS auto true / false / auto; auto enables TLS when the address ends in :443
TENANT_SLUG * Tenant this agent serves. * means cross-tenant (shared).
TENANT_HOST_SUFFIX justiam.com Domain suffix used to build the HTTP Host header, e.g. acme.justiam.com
WORKER_TYPE task Worker category; must match the tenant's worker_type (currently task)
MAX_CONCURRENT 4 Maximum number of tasks running simultaneously on this agent
AGENT_ID_FILE /data/agent-id Path to persist the stable agent UUID across restarts
VERSION injected by ldflags Version string reported to the backend and shown in the Workers UI. Automatically set when building with make build-agent VERSION=x.y.z.
OTEL_ENABLED false Set to true to export execution spans to Tempo via the backend relay
OTEL_SERVICE_NAME justiam-agent Service name shown in Grafana traces (e.g. agent-acme-default)
OTEL_RESOURCE_ATTRIBUTES (unset) Extra resource attributes, e.g. k8s.namespace.name=justiam-mt; required for Loki related-logs links in Tempo
OTEL_SAMPLE_RATIO 1.0 Fraction of traces to sample (0.0–1.0)

Kubernetes deployment — tenant-specific agent

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-acme
  namespace: acme-infra   # your own namespace, outside the JustIAM cluster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: justiam-agent
  template:
    metadata:
      labels:
        app: justiam-agent
    spec:
      volumes:
        - name: agent-id
          emptyDir: {}
      containers:
        - name: agent
          image: marcportabellaclotet/justiam-mt-agent:latest
          imagePullPolicy: Always
          volumeMounts:
            - name: agent-id
              mountPath: /data
          env:
            - name: AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: justiam-agent-acme
                  key: agent_token
            - name: BACKEND_GRPC_ADDR
              value: "grpc.justiam.com:443"
            - name: BACKEND_URL
              value: "https://acme.justiam.com"
            - name: TENANT_SLUG
              value: "acme"
            - name: MAX_CONCURRENT
              value: "4"
            - name: AGENT_ID_FILE
              value: "/data/agent-id"

Kubernetes deployment — shared (cross-tenant) agent

A shared agent serves tasks for any tenant that has worker_mode = agent but no dedicated agent available. Shared agents use HMAC self-registration — they obtain a short-lived token automatically at startup rather than requiring a static AGENT_TOKEN secret.

Initial setup (one-time)

In the controlplane UI go to Infrastructure → Shared Agent Pool → Initialise. This will:

  1. Generate a 32-byte platform key and store it in a justiam-agent-platform Kubernetes Secret.
  2. Create the agent-shared Deployment (2 replicas, auto-scaling configured).
  3. Create a HorizontalPodAutoscaler.

Alternatively use the API:

POST /api/shared-agent-pool
Authorization: Bearer <controlplane-session>
Content-Type: application/json

{
  "min_replicas": 2,
  "max_replicas": 10,
  "cpu_target_pct": 70
}

Manual Kubernetes deployment

If you manage the Deployment yourself, use SHARED_AGENT_PLATFORM_KEY instead of AGENT_TOKEN. The AGENT_NAME is used as the stable pod identity in the platform DB:

env:
  - name: SHARED_AGENT_PLATFORM_KEY
    valueFrom:
      secretKeyRef:
        name: justiam-agent-platform
        key: platform_key
  - name: AGENT_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: BACKEND_GRPC_ADDR
    value: "backend.justiam-mt.svc:9090"
  - name: BACKEND_URL
    value: "http://backend.justiam-mt.svc:8080"
  - name: TENANT_HOST_SUFFIX
    value: "justiam.com"
  # TENANT_SLUG omitted → defaults to "*"
  - name: MAX_CONCURRENT
    value: "4"

How self-registration works

  1. On startup the agent calls POST /api/agent/v1/register signed with HMAC-SHA256(platform_key, "agent_register:<name>:<unix_timestamp>").
  2. The backend verifies the signature and timestamp window (±5 min) and inserts a short-lived token (4h TTL) into shared_agent_tokens.
  3. The agent uses the token for gRPC auth. At ~75% of TTL (3 h) it renews automatically in a background goroutine — no restarts needed.
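The signature in step 1 can be computed with the standard library. The payload format comes from the description above; the hex key decoding and parameter names here are assumptions:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// signRegistration computes HMAC-SHA256(platform_key,
// "agent_register:<name>:<unix_timestamp>") and returns it hex-encoded.
func signRegistration(platformKeyHex, agentName string, ts int64) (string, error) {
	key, err := hex.DecodeString(platformKeyHex)
	if err != nil {
		return "", err
	}
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "agent_register:%s:%d", agentName, ts)
	return hex.EncodeToString(mac.Sum(nil)), nil
}

func main() {
	sig, _ := signRegistration(
		"00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff",
		"agent-shared-abc12", time.Now().Unix())
	fmt.Println("signature:", sig)
}
```

Because the timestamp is part of the signed payload, a captured request is only replayable inside the ±5 min verification window.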

Key rotation

Rotate the platform key from the controlplane UI (Infrastructure → Shared Agent Pool → Rotate Key) or via the API:

POST /api/shared-agent-pool/rotate-key
Authorization: Bearer <controlplane-session>

This generates a new key, updates the justiam-agent-platform Secret, and triggers a graceful rolling restart of the agent-shared Deployment (respects maxUnavailable: 1). In-flight tasks on surviving pods complete normally; each restarted pod immediately registers with the new key.

Note: When BACKEND_GRPC_ADDR points to an internal cluster hostname (no TLS), set GRPC_TLS=false explicitly. When using the public gRPC ingress (:443), TLS is enabled automatically.
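The GRPC_TLS resolution rule described in the note can be sketched as a small decision function (illustrative; the real binary may parse the setting differently):

```go
package main

import (
	"fmt"
	"strings"
)

// useTLS resolves the GRPC_TLS setting: explicit true/false wins; "auto"
// (the default) enables TLS only when the address ends in ":443".
func useTLS(grpcTLS, addr string) bool {
	switch grpcTLS {
	case "true":
		return true
	case "false":
		return false
	default: // "auto" or unset
		return strings.HasSuffix(addr, ":443")
	}
}

func main() {
	fmt.Println(useTLS("auto", "grpc.justiam.com:443"))       // public ingress: TLS on
	fmt.Println(useTLS("auto", "backend.justiam-mt.svc:9090")) // in-cluster: plaintext
}
```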


Viewing connected agents

Open Workers in the admin UI to see:

  • Current worker mode and live connected agents
  • All connected agents with agent ID, slug, worker type, max concurrency, and active task count

Or query the API directly:

GET /api/v1/workers
Authorization: Bearer <admin-token>

Selecting agents for tasks (dispatch algorithm)

When a task is due the backend selects the least-loaded eligible agent within the configured pool:

Dedicated pool (worker_agent_pool: dedicated)

  1. Only tenant-specific agents (TENANT_SLUG = <slug>) are considered.
  2. The agent with the lowest active-task count is chosen (reservoir sampling on ties).
  3. If no dedicated agent has capacity the task run is marked error (no silent fallback).

Shared pool (worker_agent_pool: shared)

  1. Only shared agents (TENANT_SLUG = *) are considered.
  2. The least-loaded eligible agent is chosen.

Global concurrency cap

task_max_concurrent sets an optional cross-replica cap on the total number of concurrently running tasks for the tenant across all pools (0 = unlimited). Configure it from the controlplane UI or via PUT /tenants/{slug}/worker-limit.
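The cap semantics reduce to a one-line check (illustrative; the real check is enforced across all backend replicas, not in a single process):

```go
package main

import "fmt"

// allowDispatch applies the tenant-wide cap: task_max_concurrent = 0 means
// unlimited; otherwise dispatch only while running < cap.
func allowDispatch(running, maxConcurrent int) bool {
	return maxConcurrent == 0 || running < maxConcurrent
}

func main() {
	fmt.Println(allowDispatch(50, 0))  // true: 0 = unlimited
	fmt.Println(allowDispatch(9, 10))  // true: under the cap
	fmt.Println(allowDispatch(10, 10)) // false: cap reached
}
```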

Multi-replica accuracy: Agent connections are persisted to each tenant's database (agent_connections table). The Workers page always shows all connected agents across every backend replica, not just those on the replica serving the current HTTP request.


Security considerations

  • Agent tokens are stored as SHA-256 hashes — the raw token cannot be recovered from the database.
  • Revoke a token immediately from Infrastructure → Agent Tokens → Revoke if it is compromised. Connected agents using that token are rejected on their next registration attempt.
  • Task scripts run inside the Yaegi interpreter with the same package allowlist as the inline worker (fmt, strings, time, and other standard-library packages; no os/exec, syscall, or raw network sockets).
  • The gRPC stream is protected by TLS when connecting through a public ingress. Inside the cluster (agent ↔ backend) plaintext is acceptable since the traffic is within the private network.
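The hash-only storage in the first bullet means token verification works by re-hashing the presented value and comparing digests. A minimal sketch of that path (the storage and comparison details are assumptions):

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest that would be stored at creation
// time; the raw token itself is never persisted.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// verifyToken hashes the presented token and compares digests in constant time.
func verifyToken(presented, storedHash string) bool {
	h := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(h), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("agent_fe04...") // what the database holds
	fmt.Println(verifyToken("agent_fe04...", stored)) // true
	fmt.Println(verifyToken("wrong-token", stored))   // false
}
```

Because only the digest is stored, the "shown once" rule in the admin UI is a hard guarantee: a lost raw token can only be replaced, never recovered.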

Backend gRPC server configuration

Variable Default Description
GRPC_PORT 9090 Port the backend listens on for agent connections. Set to empty to disable the gRPC server entirely.

The gRPC server sends keepalive PINGs every 30 s and enforces a minimum client ping interval of 15 s. Agents send PINGs every 20 s. These defaults survive most load balancer idle timeouts.