llama-cpp-pythonを使ってローカルLLMでテキスト生成とOpenAI互換サーバーを立てる

llama.cppはMetaのLLaMAモデルのC/C++版でラップトップで。元々はApple SiliconのMacで動かす用だったようですが、LinuxやWindowsで大丈夫みたいです。Apple SiliconのMetalが利用できます。

このllama.cppのPythonバインディングであるllama-cpp-pythonを試してみます。 llama-cpp-pythonは付加機能としてOpenAI互換のサーバーを立てることができます。

試した環境はこちらです

llama-cpp-pythonのインストール

まずはvenvを作成します。

mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate

llama-cpp-pythonのインストールします。serverも一緒にインストールします。

CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'

ℹ️ Apple SiliconのMac上でエラーが出る場合は、 https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md のセットアップを試してください。

Modelのダウンロード

Hugging Face上のモデルをダウンロードします。GGUF形式である必要があります。

自分が試したのは以下の二つです。

sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models

ここでは https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf を使用します。

cd /opt/models/
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
cd $HOME/work/llm

簡単なテキスト生成

llama.cppを直接使って簡単なテキストを生成するコードを書きます。

llama-cpp-pythonで簡単なテキスト生成

cat <<EOF > demo.py
from llama_cpp import Llama
llm = Llama(model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers = 1)
output = llm("""
[INST] Give me a joke. [/INST]
""")
print(output['choices'][0]['text'])
EOF

実行します。何かジョークを言ってくれました。

$ python3 demo.py 2> /dev/null
Why don't scientists trust atoms? 
Because they make up everything!

LangChainで簡単なテキスト生成

https://python.langchain.com/docs/integrations/providers/llamacpp

pip install langchain llama-cpp-python

cat <<EOF > demo-langchain.py
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers = 1,
    verbose = False,
)

print(llm("Give me a joke."))
EOF

$ python demo-langchain.py 

A: Ah, humor! A delightful way to lighten the mood. Here's one for you: 
Why don't scientists trust atoms? Because they make up everything!

LamaIndexで簡単なテキスト生成

https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html

pip install llama-index llama-cpp-python

cat <<EOF > demo-llamaindex.py
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
llm = LlamaCPP(
    model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_kwargs = {"n_gpu_layers": 1},
    messages_to_prompt = messages_to_prompt,
    completion_to_prompt = completion_to_prompt,
    verbose = False,
)

print(llm.complete("Give me a joke."))
EOF

$ python demo-llamaindex.py 
 Sure! Here's a joke for you:

Why don't scientists trust atoms? Because they make up everything!

OpenAI互換サーバーの起動

以下のコマンドでサーバーを立ち上げます。

python3 -m llama_cpp.server --model /opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --n_gpu_layers 1

以下からAPIドキュメントを確認できます。

http://localhost:8000/docs

curlでアクセス

先ほどと同じインプットを今度はAPI経由で問い合わせます。

OpenAIの"Create chat completion" APIはmodelパラメータが必須ですが、
llama-cpp-pythonの方はmodelパラメータはなくても良いみたいです。

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "Give me a joke."}
   ]
 }' | jq .

同じジョークが返ってきました。

{
  "id": "chatcmpl-93596a2a-6933-4c40-82a4-2fed63bc5dda",
  "object": "chat.completion",
  "created": 1700207715,
  "model": "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " Why don't scientists trust atoms? Because they make up everything!",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 14,
    "total_tokens": 31
  }
}

OpenAIのドキュメントと同じ例を試します。

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "model": "gpt-3.5-turbo",
   "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
      {"role": "user", "content": "Where was it played?"}
   ]
 }' | jq .

合っていそうな回答が返りました。modelパラメータは何を入れても良いみたいです。

{
  "id": "chatcmpl-94d6afac-f029-4284-a8e3-7e01e2ab569e",
  "object": "chat.completion",
  "created": 1700207737,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " The 2020 World Series was played in Arlington, Texas. It was the first time that the World Series has been played in October since 1988, and it is the first time it has been held in October in the American League Championship Series format.",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 68,
    "completion_tokens": 57,
    "total_tokens": 125
  }
}

PythonのOpenAIクライアントからアクセス

https://platform.openai.com/docs/libraries/python-library

pip install openai

cat <<EOF > demo-openai.py
from openai import OpenAI

client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy")

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
  ]
)

print(response)
EOF

$ python demo-openai.py
ChatCompletion(id='chatcmpl-6684c9c9-5432-4056-8ae4-71b8bdf7cee5', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The 2020 Major League Baseball World Series was played at Globe Life Field in Arlington, Texas. It was the first time that the series had been held in a single ballpark since 1973.', role='assistant', function_call=None, tool_calls=None))], created=1700209988, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=68, total_tokens=116))

LangChainからアクセス

https://python.langchain.com/docs/integrations/platforms/openai

pip install langchain openai

cat <<EOF > demo-langchain.py
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage
chat = ChatOpenAI(openai_api_base = "http://localhost:8000/v1", openai_api_key = "dummy")
messages = [
    SystemMessage(content = "You are a helpful assistant."),
    HumanMessage(content = "Who won the world series in 2020?"),
    AIMessage(content = "The Los Angeles Dodgers won the World Series in 2020."),
    HumanMessage(content = "Where was it played?")
]
result = chat(messages)
print(result)
EOF

$ python demo-langchain.py
content=' The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was the first time the series was played in a neutral site due to the COVID-19 pandemic.'

LlamaIndexからアクセス

https://docs.llamaindex.ai/en/stable/examples/llm/openai.html

pip install llama-index

cat <<EOF > demo-llamaindex.py
from llama_index.llms import ChatMessage, OpenAI

llm = OpenAI(api_base = "http://localhost:8000/v1", api_key = "dummy")
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Who won the world series in 2020?"),
    ChatMessage(role="assistant", content="The Los Angeles Dodgers won the World Series in 2020."),
    ChatMessage(role="user", content="Where was it played?"),
]
resp = llm.chat(messages)
print(resp)
EOF

$ python demo-llamaindex.py 
assistant:  The 2020 World Series was played at Globe Life Field in Arlington, Texas.

Spring AIからアクセス

https://docs.spring.io/spring-ai/reference/getting-started.html

git clone https://github.com/rd-1-2022/ai-openai-helloworld.git
cd ai-openai-helloworld
./mvnw clean package -DskipTests

java -jar target/ai-openai-helloworld-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy

$ curl --get  --data-urlencode 'message=Tell me a joke about a cow.' http://localhost:8080/ai/simple 

{"completion":" What do you call a cow with no legs? Ground beef!"}