Warning
This article was automatically translated by OpenAI (gpt-4o). It may be edited eventually, but please be aware that it may contain incorrect information at this time.
We will try Google's Gemma in the same way as described in "Generating Text with a Local LLM and Setting Up an OpenAI-Compatible Server Using llama-cpp-python".
Table of Contents
Installing llama-cpp-python
First, create a virtual environment.
mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate
Install llama-cpp-python along with the server.
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
Note
If you encounter errors on an Apple Silicon Mac, try the setup described at https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md.
chat_format="gemma"
was supported in the following commit, so use v0.2.48 or later.
https://github.com/abetlen/llama-cpp-python/commit/251a8a2cadb4c0df4671062144d168a7874086a2
Downloading Gemma
sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models
Since the 7B model has a large file size, we will download the 2B model.
https://huggingface.co/google/gemma-2b/tree/main
Download gemma-2b.gguf
to /opt/models/
.
Starting the OpenAI-Compatible Server
Start the server with the following command. You need to specify --chat_format=gemma
.
python3 -m llama_cpp.server --chat_format=gemma --model /opt/models/gemma-2b-it.gguf --n_gpu_layers 1
You can check the API documentation from the following URL.
The "Create chat completion" API of OpenAI requires the
model
parameter, but it seems that themodel
parameter is not necessary for llama-cpp-python.
Accessing with curl
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Give me a joke."}
]
}' | jq .
A joke was returned.
{
"id": "chatcmpl-79f5ae4c-cf47-494c-a82c-a7e3747ab463",
"object": "chat.completion",
"created": 1708846379,
"model": "/opt/models/gemma-2b-it.gguf",
"choices": [
{
"index": 0,
"message": {
"content": "Why did the scarecrow win an award?\n\nBecause he was outstanding in his field!",
"role": "assistant"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 18,
"total_tokens": 32
}
}
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "日本の首都はどこですか?"}
]
}' | jq .
It seems to work in Japanese as well.
{
"id": "chatcmpl-3f111b5e-4244-4cfc-9818-d23b8d04ccb2",
"object": "chat.completion",
"created": 1708846400,
"model": "/opt/models/gemma-2b-it.gguf",
"choices": [
{
"index": 0,
"message": {
"content": "日本の首都は東京です。東京は日本の東部に位置し、日本を代表する都市です。",
"role": "assistant"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 22,
"total_tokens": 36
}
}
Accessing with Spring AI
Let's access it from an app using Spring AI. Since it is OpenAI-compatible, you can use the OpenAI Chat Client of Spring AI.
Here is a sample app.
https://github.com/making/hello-spring-ai
git clone https://github.com/making/hello-spring-ai
cd hello-spring-ai
./mvnw clean package -DskipTests=true
java -jar target/hello-spring-ai-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
$ curl localhost:8080
What do you call a boomerang that won't come back?
A stick.
This app itself is for OpenAI, but the advantage of using llama-cpp-python is that you can use Gemma just by changing the properties.
If you don't care about compatibility with the OpenAI API and just want to use Gemma, you can also use Ollama via spring-ai-ollama.