
Deploy and configure LibreChat on Kubernetes

In this article you will learn how to deploy LibreChat on Kubernetes and how to configure cloud-based LLM providers as well as custom providers.

LibreChat configuration structure

LibreChat is distributed as a container image, available from its GitHub repository.

In order to run it, you have two options:

  • Docker Compose: the primary deployment target.
  • Kubernetes using the Helm chart maintained in the GitHub repository (see the sketch below).
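
For the Kubernetes option, a common pattern is to put LibreChat’s environment variables (described in the next paragraph) in a Kubernetes Secret and reference it from the Helm values. The sketch below is illustrative only: the Secret name and the in-cluster MongoDB URI are assumptions, and the variable names come from the LibreChat .env example.

apiVersion: v1
kind: Secret
metadata:
  name: librechat-env              # assumed name; reference it from your Helm values
  namespace: librechat             # assumed namespace
type: Opaque
stringData:
  MONGO_URI: "mongodb://librechat-mongodb:27017/LibreChat"   # assumed in-cluster MongoDB service
  JWT_SECRET: "change-me"
  JWT_REFRESH_SECRET: "change-me-too"
  OPENROUTER_API_KEY: "sk-or-..."                            # used by the OpenRouter endpoint below
  MISTRAL_API_KEY: "..."                                     # used by the Mistral endpoint below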

Historically, LibreChat was configured through a .env file in which you set a predefined list of environment variables.

Due to the constant addition of new options, an additional YAML-based configuration file (librechat.yaml) has been introduced.
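
The structure of this librechat.yaml file is simple; a minimal skeleton looks like the following (the version number is indicative, check the documentation for the schema version your release expects):

version: 1.2.1      # config schema version; adjust to the one your LibreChat release expects
cache: true
endpoints:
  custom:
    # the provider entries shown in the next sections go here
    - name: "OpenRouter"
      # ...
    - name: "Mistral"
      # ...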

Architecture

Choose your LLM provider

The two providers below both offer a free plan without requiring a credit card.

OpenRouter

On OpenRouter, you can filter the model catalog to list only the free models.

The OpenRouter configuration should look like this:

  endpoints:
    custom:
      - name: "OpenRouter"
        # For `apiKey` and `baseURL`, you can use environment variables that you define.
        # recommended environment variables:
        apiKey: "{{ requiredEnv "OPENROUTER_API_KEY" }}"
        baseURL: "https://openrouter.ai/api/v1"
        models:
          default: ["deepseek/deepseek-r1-0528:free", "mistralai/devstral-small-2505:free", 'qwen/qwen3-235b-a22b:free', 'nvidia/llama-3.3-nemotron-super-49b-v1:free', 'meta-llama/llama-4-maverick:free']
          fetch: false
        titleConvo: true
        titleModel: "meta-llama/llama-3-70b-instruct"
        # Recommended: Drop the stop parameter from the request as Openrouter models use a variety of stop tokens.
        dropParams: ["stop"]
        modelDisplayLabel: "OpenRouter"

Don’t forget to create an API key and expose it through the OPENROUTER_API_KEY environment variable referenced above.

Mistral AI

Mistral AI is a French AI company that also offers a very interesting free plan (something like one billion tokens, which is more than enough for a single user or for testing purposes).

Have a look at the Mistral API console, create an account, and generate an API key (exposed below as MISTRAL_API_KEY).

The Mistral configuration should look like this:

    - name: "Mistral"
      apiKey: "${MISTRAL_API_KEY}"
      baseURL: "https://api.mistral.ai/v1"
      models:
        default: ["mistral-tiny", "mistral-small", "mistral-medium", "mistral -large-latest"]
        fetch: true
      titleConvo: true
      titleModel: "mistral-tiny"
      modelDisplayLabel: "Mistral"
      dropParams: ["stop", "user", "frequency_penalty", "presence_penalty"]

Use a local provider

This use case requires an additional component that provides a single entry point, such as Portkey or LiteLLM. I personally use LiteLLM, which is easier to configure.
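
As a sketch, assuming LiteLLM runs as its own deployment in the cluster (the ollama and vllm service names are assumptions), its proxy configuration maps friendly model names to the local backends:

model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3              # route requests to a local Ollama instance
      api_base: http://ollama:11434     # assumed in-cluster Ollama service
  - model_name: mistral-7b-local
    litellm_params:
      model: openai/mistral-7b          # vLLM exposes an OpenAI-compatible API
      api_base: http://vllm:8000/v1     # assumed in-cluster vLLM service

LibreChat then only needs one extra custom endpoint entry, exactly like the OpenRouter and Mistral ones above, with baseURL pointing to the LiteLLM service (port 4000 by default).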

Inference engine

Depending on your local GPU infrastructure and the number of users, there are plenty of inference engines to choose from.

Ollama

For a very small number of users, Ollama is a good first choice, as it offers a long list of supported models. The drawback is that it is designed to run on small GPU cards, so models are usually heavily quantized to fit on commodity hardware. They are therefore fairly slow to respond and only suitable for a handful of users.
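
Since Ollama exposes an OpenAI-compatible API under /v1, you can also skip the proxy for a quick test and declare it directly as a custom endpoint; the service name and model list below are assumptions:

    - name: "Ollama"
      apiKey: "ollama"                      # Ollama does not check the key; any placeholder works
      baseURL: "http://ollama:11434/v1"     # assumed in-cluster Ollama service; /v1 is its OpenAI-compatible API
      models:
        default: ["llama3.1", "mistral"]    # whatever models you have pulled locally
        fetch: true
      modelDisplayLabel: "Ollama"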

vLLM

vLLM is currently the best-known inference engine, as it supports nearly all models and provides efficient batching for multi-user workloads.
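
As a sketch, a single-replica vLLM deployment on Kubernetes could look like the following, assuming the NVIDIA device plugin is installed; the image tag and model name are only examples:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest           # official OpenAI-compatible server image
          args:
            - "--model"
            - "mistralai/Mistral-7B-Instruct-v0.3" # example model; pick one that fits your GPU
          ports:
            - containerPort: 8000                  # vLLM serves its OpenAI-compatible API here
          resources:
            limits:
              nvidia.com/gpu: 1                    # requires the NVIDIA device plugin

For gated models you will typically also need to provide a Hugging Face token to the container.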

Authentication

As I already run a Keycloak instance with Cisco Duo (see my older articles for that), I use LibreChat’s OpenID Connect support.
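
A minimal sketch of the corresponding settings, packaged here as a Kubernetes Secret; the variable names follow the LibreChat OpenID documentation, while the issuer URL, client ID and secrets are placeholders for your own Keycloak realm:

apiVersion: v1
kind: Secret
metadata:
  name: librechat-oidc
type: Opaque
stringData:
  ALLOW_SOCIAL_LOGIN: "true"
  OPENID_ISSUER: "https://keycloak.example.com/realms/myrealm"   # your Keycloak realm issuer URL
  OPENID_CLIENT_ID: "librechat"
  OPENID_CLIENT_SECRET: "change-me"
  OPENID_SESSION_SECRET: "change-me-too"
  OPENID_SCOPE: "openid profile email"
  OPENID_CALLBACK_URL: "/oauth/openid/callback"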

Enhance LibreChat with RAG, MCP servers…

And there is more functionality, like RAG (upload PDF files so the models can answer using your documents), image generation, agents, MCP servers…
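
For example, MCP servers are declared in the same librechat.yaml file. A minimal sketch, with illustrative server names, URL and package:

mcpServers:
  everything:
    type: sse
    url: http://mcp-everything:3001/sse             # assumed in-cluster MCP server
  filesystem:
    type: stdio
    command: npx
    args:
      - -y
      - "@modelcontextprotocol/server-filesystem"
      - /data                                       # directory exposed to the model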

Feel free to contact me if you need assistance :-)
