The module lifecycle stage: General Availability
The module has requirements for installation
The AI assistant is a chat built into the web interface that helps you explore
cluster state, resources and documentation. It is optional and turns on when a
Kubernetes secret named assistant exists in the d8-console namespace.
Experimental feature. The assistant sends cluster data (resource manifests, logs, the Deckhouse queue, etc.) to the LLM you configure. Only connect models and providers you trust with that data.
How it works
- The assistant service and its MCP server are deployed only when the
assistantsecret exists in thed8-consolenamespace. - The assistant reads LLM credentials from that secret and talks to the model
over an OpenAI-compatible API (
/v1/chat/completions, streaming). - The chat button (✨) appears in the top bar when the assistant is enabled, or
when you may create and update secrets in the
d8-consolenamespace (then a setup form opens instead of the chat).
The model endpoint must be OpenAI-compatible and support tool/function calling — the assistant invokes tools (cluster access, documentation search) through function calling.
Configuring from the UI
Open the chat with the ✨ button and fill in the form (or click the “Assistant settings” gear if the chat is already configured):
- API base URL — base URL of the OpenAI-compatible API, including the version
suffix (usually
/v1). - Model — model name as the endpoint expects it (e.g.
gpt-4o). - API key — authorization key/token. For a model without auth, enter any non-empty string. When editing, leave it blank to keep the current key.
- Token limit — model response limit (
max_completion_tokens). - Context window (tokens) — the model’s full context size; powers the token usage meter. May be left blank.
- Context budget (tokens) — threshold above which old tool output is trimmed
to avoid hitting the model limit.
0disables it.
Values are stored in the secret; updating an existing secret needs no redeploy.
Cluster modifications
By default the assistant is read-only: it inspects resources, logs, documentation, the Deckhouse queue, and so on. Creating, updating, and deleting Kubernetes resources is disabled.
Below the input field the chat has an “Allow changes” toggle (labeled «Разрешить изменения» in Russian). When you turn it on before sending a message, the assistant may create, update, and delete cluster resources. The server enforces this flag — it is not only a model instruction.
Access rights. All cluster operations run as the current user, using the same session token as the rest of Console. The assistant does not gain extra privileges: if you cannot create or change a resource in the UI, the assistant cannot do it either.
With modifications allowed, the assistant performs a dry run first and asks for explicit confirmation before applying a real change.
assistant secret schema
The secret type is Opaque. Keys:
| Key | Required | Default | Description |
|---|---|---|---|
api_key |
yes | — | LLM authorization key/token. |
base_url |
yes | — | OpenAI-compatible API base URL, including /v1. |
model |
yes | — | Model name as the endpoint expects it. |
max_tokens |
no | 40000 |
Response token limit. |
context_window |
no | 0 (unknown) |
Full model context size (for the usage meter). |
max_context_tokens |
no | 300000 |
Prompt budget; 0 disables context trimming. |
Example: external model (OpenAI GPT-4o)
kubectl -n d8-console create secret generic assistant \
--from-literal=api_key='sk-proj-...' \
--from-literal=base_url='https://api.openai.com/v1' \
--from-literal=model='gpt-4o' \
--from-literal=context_window='128000'The external endpoint must be reachable from the assistant pod (internet egress).
Example: in-cluster model
If an OpenAI-compatible serving (vLLM, Ollama, etc.) runs in the cluster, point the assistant at its in-cluster Service URL:
kubectl -n d8-console create secret generic assistant \
--from-literal=api_key='not-needed' \
--from-literal=base_url='http://<service>.<model-namespace>.svc.cluster.local:8000/v1' \
--from-literal=model='<served-model-name>'model— the model name as the serving reports it (for vLLM,--served-model-name).api_key— if the serving needs no auth, still provide any non-empty string.
Managing context size
The conversation history lives in the browser and is sent to the model in
full on every request. Cluster tools (such as listing resources) can return
very large payloads that accumulate in the context and quickly hit the model’s
context window limit (a 400 … maximum context length error). To avoid this,
the assistant applies several optimizations.
Kubernetes output slimming
The bulkiest, low-signal fields are stripped from k8s_list_resources and
k8s_get_resource results:
metadata.managedFields— server-side apply bookkeeping;- the
kubectl.kubernetes.io/last-applied-configurationannotation — a manifest duplicate.
Meaningful spec/status and other metadata are kept. Non-Kubernetes payloads
are passed through unchanged.
Per-tool-result cap
Any single tool result is capped at roughly 50,000 characters. A truncation
note is appended hinting to narrow the request with filters
(namespace / label_selector / field_selector / limit) so the model asks
again more precisely.
Context guard (max_context_tokens)
Before each model call the assistant estimates the total prompt size. If it
exceeds max_context_tokens (default 300000), old tool output is replaced
with a stub — starting from the earliest — while the current step’s results
are preserved. This compacts the context without losing the latest useful
result.
The guard is on by default; set max_context_tokens: 0 to disable it (e.g. if
the model trims context on its own).
Token usage meter
After each round the model reports actual token usage. When context_window is
set (e.g. 128000 for gpt-4o), a thin “used / available” meter appears above the
input with a color indication (green → amber → red) and a hover tooltip. Without
context_window, only the used-token counter is shown.