---
title: llama-cpp-pythonを使ってローカルLLMでテキスト生成とOpenAI互換サーバーを立てる
tags: ["Python", "llama.cpp", "OpenAI", "Machine Learning", "MPS", "LangChain", "LlamaIndex", "Spring AI"]
categories: ["AI", "LLM", "llama.cpp"]
date: 2023-11-17T08:03:02Z
updated: 2023-11-22T07:42:00Z
---

[llama.cpp](https://github.com/ggerganov/llama.cpp)はMetaのLLaMAモデルのC/C++版でラップトップで。
元々はApple SiliconのMacで動かす用だったようですが、LinuxやWindowsで大丈夫みたいです。Apple SiliconのMetalが利用できます。

このllama.cppのPythonバインディングである[llama-cpp-python](https://github.com/abetlen/llama-cpp-python)を試してみます。
llama-cpp-pythonは付加機能としてOpenAI互換のサーバーを立てることができます。


試した環境はこちらです

<img width="682" src="https://github.com/making/blog.ik.am/assets/106908/f8c9ef7e-bc21-47f7-86ca-0c0752ddd9c4">


**目次**
<!-- toc -->

### llama-cpp-pythonのインストール

まずはvenvを作成します。

```
mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate
```

llama-cpp-pythonのインストールします。serverも一緒にインストールします。

```
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
```

> ℹ️ Apple SiliconのMac上でエラーが出る場合は、 https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md のセットアップを試してください。

### Modelのダウンロード

Hugging Face上のモデルをダウンロードします。GGUF形式である必要があります。


自分が試したのは以下の二つです。

* https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF
* https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF


```
sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models
```

ここでは https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf を使用します。

```
cd /opt/models/
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
cd $HOME/work/llm
```

### 簡単なテキスト生成

llama.cppを直接使って簡単なテキストを生成するコードを書きます。


#### llama-cpp-pythonで簡単なテキスト生成

```python
cat <<EOF > demo.py
from llama_cpp import Llama
llm = Llama(model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers = 1)
output = llm("""
[INST] Give me a joke. [/INST]
""")
print(output['choices'][0]['text'])
EOF
```

実行します。何かジョークを言ってくれました。

```
$ python3 demo.py 2> /dev/null
Why don't scientists trust atoms? 
Because they make up everything!
```

#### LangChainで簡単なテキスト生成

https://python.langchain.com/docs/integrations/providers/llamacpp

```
pip install langchain llama-cpp-python
```

```python
cat <<EOF > demo-langchain.py
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers = 1,
    verbose = False,
)

print(llm("Give me a joke."))
EOF
```

```
$ python demo-langchain.py 

A: Ah, humor! A delightful way to lighten the mood. Here's one for you: 
Why don't scientists trust atoms? Because they make up everything!
```

#### LamaIndexで簡単なテキスト生成

https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html

```
pip install llama-index llama-cpp-python
```


```python
cat <<EOF > demo-llamaindex.py
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
llm = LlamaCPP(
    model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_kwargs = {"n_gpu_layers": 1},
    messages_to_prompt = messages_to_prompt,
    completion_to_prompt = completion_to_prompt,
    verbose = False,
)

print(llm.complete("Give me a joke."))
EOF
```

```
$ python demo-llamaindex.py 
 Sure! Here's a joke for you:

Why don't scientists trust atoms? Because they make up everything!
```

### OpenAI互換サーバーの起動

以下のコマンドでサーバーを立ち上げます。

```
python3 -m llama_cpp.server --model /opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --n_gpu_layers 1
```

以下からAPIドキュメントを確認できます。

http://localhost:8000/docs

#### curlでアクセス

先ほどと同じインプットを今度はAPI経由で問い合わせます。

> OpenAIの["Create chat completion" API](https://platform.openai.com/docs/api-reference/chat/create)は`model`パラメータが必須ですが、<br>
> llama-cpp-pythonの方は`model`パラメータはなくても良いみたいです。

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "Give me a joke."}
   ]
 }' | jq .
```

同じジョークが返ってきました。

```json
{
  "id": "chatcmpl-93596a2a-6933-4c40-82a4-2fed63bc5dda",
  "object": "chat.completion",
  "created": 1700207715,
  "model": "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " Why don't scientists trust atoms? Because they make up everything!",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 14,
    "total_tokens": 31
  }
}
```

[OpenAIのドキュメント](https://platform.openai.com/docs/guides/text-generation/chat-completions-api)と同じ例を試します。


```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "model": "gpt-3.5-turbo",
   "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
      {"role": "user", "content": "Where was it played?"}
   ]
 }' | jq .
```

合っていそうな回答が返りました。`model`パラメータは何を入れても良いみたいです。

```json
{
  "id": "chatcmpl-94d6afac-f029-4284-a8e3-7e01e2ab569e",
  "object": "chat.completion",
  "created": 1700207737,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " The 2020 World Series was played in Arlington, Texas. It was the first time that the World Series has been played in October since 1988, and it is the first time it has been held in October in the American League Championship Series format.",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 68,
    "completion_tokens": 57,
    "total_tokens": 125
  }
}
```

#### PythonのOpenAIクライアントからアクセス

https://platform.openai.com/docs/libraries/python-library

```
pip install openai
```

```python
cat <<EOF > demo-openai.py
from openai import OpenAI

client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy")

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)

print(response)
EOF
```

```
$ python demo-openai.py
ChatCompletion(id='chatcmpl-6684c9c9-5432-4056-8ae4-71b8bdf7cee5', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The 2020 Major League Baseball World Series was played at Globe Life Field in Arlington, Texas. It was the first time that the series had been held in a single ballpark since 1973.', role='assistant', function_call=None, tool_calls=None))], created=1700209988, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=68, total_tokens=116))
```

#### LangChainからアクセス

https://python.langchain.com/docs/integrations/platforms/openai

```
pip install langchain openai
```
```python
cat <<EOF > demo-langchain.py
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage
chat = ChatOpenAI(openai_api_base = "http://localhost:8000/v1", openai_api_key = "dummy")
messages = [
    SystemMessage(content = "You are a helpful assistant."),
    HumanMessage(content = "Who won the world series in 2020?"),
    AIMessage(content = "The Los Angeles Dodgers won the World Series in 2020."),
    HumanMessage(content = "Where was it played?")
]
result = chat(messages)
print(result)
EOF
```

```
$ python demo-langchain.py
content=' The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was the first time the series was played in a neutral site due to the COVID-19 pandemic.'
```


#### LlamaIndexからアクセス

https://docs.llamaindex.ai/en/stable/examples/llm/openai.html

```
pip install llama-index
```

```python
cat <<EOF > demo-llamaindex.py
from llama_index.llms import ChatMessage, OpenAI

llm = OpenAI(api_base = "http://localhost:8000/v1", api_key = "dummy")
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Who won the world series in 2020?"),
    ChatMessage(role="assistant", content="The Los Angeles Dodgers won the World Series in 2020."),
    ChatMessage(role="user", content="Where was it played?"),
]
resp = llm.chat(messages)
print(resp)
EOF
```

```
$ python demo-llamaindex.py 
assistant:  The 2020 World Series was played at Globe Life Field in Arlington, Texas.
```

#### Spring AIからアクセス

https://docs.spring.io/spring-ai/reference/getting-started.html

```
git clone https://github.com/rd-1-2022/ai-openai-helloworld.git
cd ai-openai-helloworld
./mvnw clean package -DskipTests

java -jar target/ai-openai-helloworld-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
```

```
$ curl --get  --data-urlencode 'message=Tell me a joke about a cow.' http://localhost:8080/ai/simple 

{"completion":" What do you call a cow with no legs? Ground beef!"}
```