Launching an OpenAI-Compatible Server Using the Gemma Model with llama-cpp-python and Accessing It from Spring AI

We will try Google's Gemma in the same way as described in "Generating Text with a Local LLM and Setting Up an OpenAI-Compatible Server Using llama-cpp-python".

Installing llama-cpp-python

First, create a virtual environment.

mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate

Install llama-cpp-python along with the server.

CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'


If you encounter errors on an Apple Silicon Mac, try the setup described at https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md.

Support for chat_format="gemma" was added in v0.2.48, so use that version or later.
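
To confirm that the installed package meets the version requirement above, a small check like the following can be run inside the virtual environment. This is only a sketch: parse_version and meets_minimum are hypothetical helpers written for this article, and the parsing handles plain dotted versions only.

```python
from importlib import metadata

# Minimum release that, according to this article, supports chat_format="gemma".
MINIMUM = (0, 2, 48)


def parse_version(text):
    """Turn '0.2.48' into (0, 2, 48); trailing non-digits in a part are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(text, minimum=MINIMUM):
    """Return True if the dotted version string is at least the minimum."""
    return parse_version(text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("llama-cpp-python")
    except metadata.PackageNotFoundError:
        print("llama-cpp-python is not installed in this environment")
    else:
        status = "OK" if meets_minimum(installed) else "too old for chat_format=gemma"
        print(f"llama-cpp-python {installed}: {status}")
```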


Downloading Gemma

sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models

Since the 7B model has a large file size, we will download the 2B model.


Download gemma-2b-it.gguf to /opt/models/.
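
After a multi-gigabyte download it can be worth confirming that the file is really a GGUF model and not, say, an HTML error page. GGUF files begin with the four bytes "GGUF", so a minimal sanity check could look like this. The helper name looks_like_gguf is made up for this sketch, and the path is the one passed to the server later in this article.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # Path used by the server command in this article.
    model = "/opt/models/gemma-2b-it.gguf"
    if looks_like_gguf(model):
        print(model, "looks like a GGUF file")
    else:
        print(model, "is missing or is not a GGUF file")
```

This only inspects the header; it does not verify that the download is complete.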

Starting the OpenAI-Compatible Server

Start the server with the following command. You need to specify --chat_format=gemma.

python3 -m llama_cpp.server --chat_format=gemma --model /opt/models/gemma-2b-it.gguf --n_gpu_layers 1
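
For intuition on why the chat format matters: the server has to turn the OpenAI-style messages list into a single prompt string, and Gemma's instruction-tuned models expect each turn to be wrapped in <start_of_turn> and <end_of_turn> markers. The sketch below approximates that conversion. format_gemma_prompt is a hypothetical helper, not llama-cpp-python's actual implementation, and it handles only user and assistant messages.

```python
def format_gemma_prompt(messages):
    """Render OpenAI-style messages into Gemma's turn markup (illustrative only)."""
    # Gemma labels the assistant side of the conversation "model".
    # System messages are not handled in this sketch and raise KeyError.
    role_names = {"user": "user", "assistant": "model"}
    parts = []
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Leave a model turn open so that the model generates the reply.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(format_gemma_prompt([{"role": "user", "content": "Give me a joke."}]))
```

With a mismatched chat format the model would receive markers it was not trained on, which is presumably why the flag is required here.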

You can check the API documentation from the following URL.


OpenAI's "Create chat completion" API requires the model parameter, but the llama-cpp-python server appears to accept requests without it.
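
The same kind of request can be sent from Python using only the standard library. The following is a minimal sketch that assumes the server is listening on http://localhost:8000 as in this article. build_chat_payload and chat are hypothetical helper names, and the model field is left out unless one is passed explicitly.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used throughout this article


def build_chat_payload(prompt, model=None):
    """Build a chat completion request body; "model" is omitted when not given."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    if model is not None:
        payload["model"] = model
    return payload


def chat(prompt, model=None, base_url=BASE_URL):
    """POST the payload to the chat completions endpoint and return parsed JSON."""
    data = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        print(json.dumps(chat("Give me a joke."), indent=2, ensure_ascii=False))
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        print("Could not reach the server:", exc)
```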

Accessing with curl

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "Give me a joke."}
   ]
 }' | jq .

A joke was returned.

{
  "id": "chatcmpl-79f5ae4c-cf47-494c-a82c-a7e3747ab463",
  "object": "chat.completion",
  "created": 1708846379,
  "model": "/opt/models/gemma-2b-it.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Why did the scarecrow win an award?\n\nBecause he was outstanding in his field!",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 18,
    "total_tokens": 32
  }
}

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "日本の首都はどこですか?"}
   ]
 }' | jq .

It seems to work in Japanese as well. The prompt asks "What is the capital of Japan?", and the reply says that the capital is Tokyo, located in eastern Japan.

{
  "id": "chatcmpl-3f111b5e-4244-4cfc-9818-d23b8d04ccb2",
  "object": "chat.completion",
  "created": 1708846400,
  "model": "/opt/models/gemma-2b-it.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "日本の首都は東京です。東京は日本の東部に位置し、日本を代表する都市です。",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 22,
    "total_tokens": 36
  }
}

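When calling the server from a program rather than through jq, the fields of interest are usually the reply text, the finish reason, and the token usage. The sketch below pulls those out of a parsed response. summarize_completion is a hypothetical helper, and the sample dictionary is abridged from the first response shown above.

```python
def summarize_completion(response):
    """Extract the reply text, finish reason, and token count from a response dict."""
    choice = response["choices"][0]  # this sketch only looks at the first choice
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# Abridged copy of the joke response shown earlier in this article.
sample = {
    "object": "chat.completion",
    "model": "/opt/models/gemma-2b-it.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "content": "Why did the scarecrow win an award?\n\n"
                           "Because he was outstanding in his field!",
                "role": "assistant",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 18, "total_tokens": 32},
}

summary = summarize_completion(sample)
print(summary["text"])
print(summary["finish_reason"], summary["total_tokens"])
```
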
Accessing with Spring AI

Let's access it from an app using Spring AI. Since it is OpenAI-compatible, you can use the OpenAI Chat Client of Spring AI.

Here is a sample app.


git clone https://github.com/making/hello-spring-ai
cd hello-spring-ai
./mvnw clean package -DskipTests=true
java -jar target/hello-spring-ai-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy

$ curl localhost:8080
What do you call a boomerang that won't come back?

A stick.

The app itself is written for OpenAI, but because llama-cpp-python exposes an OpenAI-compatible API, you can switch it to Gemma just by changing the base-url and api-key properties.

If you don't care about compatibility with the OpenAI API and just want to use Gemma, you can also use Ollama via spring-ai-ollama.

