Warning
This article was automatically translated by OpenAI (gpt-4o). It may be edited eventually, but please be aware that it may contain incorrect information at this time.
"Launching an OpenAI-Compatible Server Using Gemma Model with llama-cpp-python and Access It from Spring AI" in the same way, we will try Meta's Llama 3.
Installing llama-cpp-python
First, create a virtual environment.
mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate
Install llama-cpp-python along with the server.
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
Note
If you encounter errors on an Apple Silicon Mac, try the setup described at https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md.
chat_format="llama-3" was supported in the following commit, so use v0.2.64 or higher.
https://github.com/abetlen/llama-cpp-python/commit/8559e8ce88b7c7343004eeccb7333b806034b01c
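To confirm that the installed version is new enough, you can check it from Python (a minimal sketch; llama_cpp exposes its version as __version__):

# Check that the installed llama-cpp-python is v0.2.64 or later,
# which is required for chat_format="llama-3".
import llama_cpp

print(llama_cpp.__version__)  # expect 0.2.64 or higher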
Downloading Llama 3
sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models
Models used with llama-cpp-python must be in GGUF format, so we will use the following model that has already been converted to GGUF.
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
Download Meta-Llama-3-8B-Instruct.Q4_K_M.gguf to /opt/models/.
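If you prefer to script the download, here is one way to do it (an illustrative sketch using the huggingface_hub package, which you would need to install with pip first):

# Illustrative sketch: download the quantized GGUF model to /opt/models/.
# Requires `pip install huggingface_hub`.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    local_dir="/opt/models",
)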
Starting the OpenAI-Compatible Server
Start the server with the following command. You need to specify --chat_format=llama-3.
python3 -m llama_cpp.server --chat_format=llama-3 --model /opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --n_gpu_layers 1
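For reference, the same options map directly onto the Llama class if you want to call the model from Python without going through the server (a minimal sketch):

# Minimal sketch: load the GGUF model directly with the llama-3 chat format,
# mirroring the server flags above.
from llama_cpp import Llama

llm = Llama(
    model_path="/opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    chat_format="llama-3",
    n_gpu_layers=1,
)
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a joke."}]
)
print(result["choices"][0]["message"]["content"])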
You can check the API documentation at http://localhost:8000/docs.
The OpenAI "Create chat completion" API requires the
modelparameter, but it seems that themodelparameter is not necessary for llama-cpp-python.
Accessing with curl
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Give me a joke."}
    ]
  }' | jq .
A joke was returned.
{
  "id": "chatcmpl-755f1cc0-39e1-408c-b119-11034492b500",
  "object": "chat.completion",
  "created": 1715328366,
  "model": "/opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(Wait for it...)\n\nBecause it was two-tired!\n\nHope that made you laugh!",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 32,
    "total_tokens": 49
  }
}
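The same request can also be made with the official openai Python client by pointing base_url at the local server (a sketch; the client requires a model argument, so pass a placeholder, which llama-cpp-python does not seem to need):

# Sketch: the equivalent request via the official openai Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; the server answers with its loaded GGUF model
    messages=[{"role": "user", "content": "Give me a joke."}],
)
print(response.choices[0].message.content)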
Accessing with Spring AI
Let's access it from an app using Spring AI.
Since the server is OpenAI-compatible, you can use Spring AI's OpenAI Chat Client.
Here is a sample app.
https://github.com/making/hello-spring-ai
git clone https://github.com/making/hello-spring-ai
cd hello-spring-ai
./mvnw clean package -DskipTests=true
java -jar target/hello-spring-ai-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
$ curl localhost:8080
Here's one:
Why couldn't the bicycle stand up by itself?
(Wait for it...)
Because it was two-tired!
Hope that made you smile! Do you want to hear another one?
This app itself targets OpenAI, but the advantage of using llama-cpp-python is that you can switch to Llama 3 just by changing the properties, as shown above.
If you don't care about compatibility with the OpenAI API and just want to use Llama 3, you can also use Ollama via spring-ai-ollama.