The module lifecycle stageGeneral Availability
The module has requirements for installation

The AI assistant is a chat built into the web interface that helps you explore cluster state, resources and documentation. It is optional and turns on when a Kubernetes secret named assistant exists in the d8-console namespace.

Experimental feature. The assistant sends cluster data (resource manifests, logs, the Deckhouse queue, etc.) to the LLM you configure. Only connect models and providers you trust with that data.

How it works

  • The assistant service and its MCP server are deployed only when the assistant secret exists in the d8-console namespace.
  • The assistant reads LLM credentials from that secret and talks to the model over an OpenAI-compatible API (/v1/chat/completions, streaming).
  • The chat button (✨) appears in the top bar when the assistant is enabled, or when you may create and update secrets in the d8-console namespace (then a setup form opens instead of the chat).

The model endpoint must be OpenAI-compatible and support tool/function calling — the assistant invokes tools (cluster access, documentation search) through function calling.

Configuring from the UI

Open the chat with the ✨ button and fill in the form (or click the “Assistant settings” gear if the chat is already configured):

  • API base URL — base URL of the OpenAI-compatible API, including the version suffix (usually /v1).
  • Model — model name as the endpoint expects it (e.g. gpt-4o).
  • API key — authorization key/token. For a model without auth, enter any non-empty string. When editing, leave it blank to keep the current key.
  • Token limit — model response limit (max_completion_tokens).
  • Context window (tokens) — the model’s full context size; powers the token usage meter. May be left blank.
  • Context budget (tokens) — threshold above which old tool output is trimmed to avoid hitting the model limit. 0 disables it.

Values are stored in the secret; updating an existing secret needs no redeploy.

Cluster modifications

By default the assistant is read-only: it inspects resources, logs, documentation, the Deckhouse queue, and so on. Creating, updating, and deleting Kubernetes resources is disabled.

Below the input field the chat has an “Allow changes” toggle (labeled «Разрешить изменения» in Russian). When you turn it on before sending a message, the assistant may create, update, and delete cluster resources. The server enforces this flag — it is not only a model instruction.

Access rights. All cluster operations run as the current user, using the same session token as the rest of Console. The assistant does not gain extra privileges: if you cannot create or change a resource in the UI, the assistant cannot do it either.

With modifications allowed, the assistant performs a dry run first and asks for explicit confirmation before applying a real change.

assistant secret schema

The secret type is Opaque. Keys:

Key Required Default Description
api_key yes LLM authorization key/token.
base_url yes OpenAI-compatible API base URL, including /v1.
model yes Model name as the endpoint expects it.
max_tokens no 40000 Response token limit.
context_window no 0 (unknown) Full model context size (for the usage meter).
max_context_tokens no 300000 Prompt budget; 0 disables context trimming.

Example: external model (OpenAI GPT-4o)

kubectl -n d8-console create secret generic assistant \
  --from-literal=api_key='sk-proj-...' \
  --from-literal=base_url='https://api.openai.com/v1' \
  --from-literal=model='gpt-4o' \
  --from-literal=context_window='128000'

The external endpoint must be reachable from the assistant pod (internet egress).

Example: in-cluster model

If an OpenAI-compatible serving (vLLM, Ollama, etc.) runs in the cluster, point the assistant at its in-cluster Service URL:

kubectl -n d8-console create secret generic assistant \
  --from-literal=api_key='not-needed' \
  --from-literal=base_url='http://<service>.<model-namespace>.svc.cluster.local:8000/v1' \
  --from-literal=model='<served-model-name>'
  • model — the model name as the serving reports it (for vLLM, --served-model-name).
  • api_key — if the serving needs no auth, still provide any non-empty string.

Managing context size

The conversation history lives in the browser and is sent to the model in full on every request. Cluster tools (such as listing resources) can return very large payloads that accumulate in the context and quickly hit the model’s context window limit (a 400 … maximum context length error). To avoid this, the assistant applies several optimizations.

Kubernetes output slimming

The bulkiest, low-signal fields are stripped from k8s_list_resources and k8s_get_resource results:

  • metadata.managedFields — server-side apply bookkeeping;
  • the kubectl.kubernetes.io/last-applied-configuration annotation — a manifest duplicate.

Meaningful spec/status and other metadata are kept. Non-Kubernetes payloads are passed through unchanged.

Per-tool-result cap

Any single tool result is capped at roughly 50,000 characters. A truncation note is appended hinting to narrow the request with filters (namespace / label_selector / field_selector / limit) so the model asks again more precisely.

Context guard (max_context_tokens)

Before each model call the assistant estimates the total prompt size. If it exceeds max_context_tokens (default 300000), old tool output is replaced with a stub — starting from the earliest — while the current step’s results are preserved. This compacts the context without losing the latest useful result.

The guard is on by default; set max_context_tokens: 0 to disable it (e.g. if the model trims context on its own).

Token usage meter

After each round the model reports actual token usage. When context_window is set (e.g. 128000 for gpt-4o), a thin “used / available” meter appears above the input with a color indication (green → amber → red) and a hover tooltip. Without context_window, only the used-token counter is shown.