llama.cppはMetaのLLaMAモデルのC/C++版でラップトップで。 元々はApple SiliconのMacで動かす用だったようですが、LinuxやWindowsで大丈夫みたいです。Apple SiliconのMetalが利用できます。
このllama.cppのPythonバインディングであるllama-cpp-pythonを試してみます。 llama-cpp-pythonは付加機能としてOpenAI互換のサーバーを立てることができます。
試した環境はこちらです
目次
llama-cpp-pythonのインストール
まずはvenvを作成します。
mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate
llama-cpp-pythonのインストールします。serverも一緒にインストールします。
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
ℹ️ Apple SiliconのMac上でエラーが出る場合は、 https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md のセットアップを試してください。
Modelのダウンロード
Hugging Face上のモデルをダウンロードします。GGUF形式である必要があります。
自分が試したのは以下の二つです。
- https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF
- https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF
sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models
ここでは https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf を使用します。
cd /opt/models/
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
cd $HOME/work/llm
簡単なテキスト生成
llama.cppを直接使って簡単なテキストを生成するコードを書きます。
llama-cpp-pythonで簡単なテキスト生成
cat <<EOF > demo.py
from llama_cpp import Llama
llm = Llama(model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers = 1)
output = llm("""
[INST] Give me a joke. [/INST]
""")
print(output['choices'][0]['text'])
EOF
実行します。何かジョークを言ってくれました。
$ python3 demo.py 2> /dev/null
Why don't scientists trust atoms?
Because they make up everything!
LangChainで簡単なテキスト生成
https://python.langchain.com/docs/integrations/providers/llamacpp
pip install langchain llama-cpp-python
cat <<EOF > demo-langchain.py
from langchain.llms import LlamaCpp
llm = LlamaCpp(
model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
n_gpu_layers = 1,
verbose = False,
)
print(llm("Give me a joke."))
EOF
$ python demo-langchain.py
A: Ah, humor! A delightful way to lighten the mood. Here's one for you:
Why don't scientists trust atoms? Because they make up everything!
LamaIndexで簡単なテキスト生成
https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html
pip install llama-index llama-cpp-python
cat <<EOF > demo-llamaindex.py
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
messages_to_prompt,
completion_to_prompt,
)
llm = LlamaCPP(
model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
model_kwargs = {"n_gpu_layers": 1},
messages_to_prompt = messages_to_prompt,
completion_to_prompt = completion_to_prompt,
verbose = False,
)
print(llm.complete("Give me a joke."))
EOF
$ python demo-llamaindex.py
Sure! Here's a joke for you:
Why don't scientists trust atoms? Because they make up everything!
OpenAI互換サーバーの起動
以下のコマンドでサーバーを立ち上げます。
python3 -m llama_cpp.server --model /opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --n_gpu_layers 1
以下からAPIドキュメントを確認できます。
curlでアクセス
先ほどと同じインプットを今度はAPI経由で問い合わせます。
OpenAIの"Create chat completion" APIは
model
パラメータが必須ですが、
llama-cpp-pythonの方はmodel
パラメータはなくても良いみたいです。
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Give me a joke."}
]
}' | jq .
同じジョークが返ってきました。
{
"id": "chatcmpl-93596a2a-6933-4c40-82a4-2fed63bc5dda",
"object": "chat.completion",
"created": 1700207715,
"model": "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
"choices": [
{
"index": 0,
"message": {
"content": " Why don't scientists trust atoms? Because they make up everything!",
"role": "assistant"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 17,
"completion_tokens": 14,
"total_tokens": 31
}
}
OpenAIのドキュメントと同じ例を試します。
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"}
]
}' | jq .
合っていそうな回答が返りました。model
パラメータは何を入れても良いみたいです。
{
"id": "chatcmpl-94d6afac-f029-4284-a8e3-7e01e2ab569e",
"object": "chat.completion",
"created": 1700207737,
"model": "gpt-3.5-turbo",
"choices": [
{
"index": 0,
"message": {
"content": " The 2020 World Series was played in Arlington, Texas. It was the first time that the World Series has been played in October since 1988, and it is the first time it has been held in October in the American League Championship Series format.",
"role": "assistant"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 68,
"completion_tokens": 57,
"total_tokens": 125
}
}
PythonのOpenAIクライアントからアクセス
https://platform.openai.com/docs/libraries/python-library
pip install openai
cat <<EOF > demo-openai.py
from openai import OpenAI
client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy")
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"}
]
)
print(response)
EOF
$ python demo-openai.py
ChatCompletion(id='chatcmpl-6684c9c9-5432-4056-8ae4-71b8bdf7cee5', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The 2020 Major League Baseball World Series was played at Globe Life Field in Arlington, Texas. It was the first time that the series had been held in a single ballpark since 1973.', role='assistant', function_call=None, tool_calls=None))], created=1700209988, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=68, total_tokens=116))
LangChainからアクセス
https://python.langchain.com/docs/integrations/platforms/openai
pip install langchain openai
cat <<EOF > demo-langchain.py
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage
chat = ChatOpenAI(openai_api_base = "http://localhost:8000/v1", openai_api_key = "dummy")
messages = [
SystemMessage(content = "You are a helpful assistant."),
HumanMessage(content = "Who won the world series in 2020?"),
AIMessage(content = "The Los Angeles Dodgers won the World Series in 2020."),
HumanMessage(content = "Where was it played?")
]
result = chat(messages)
print(result)
EOF
$ python demo-langchain.py
content=' The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was the first time the series was played in a neutral site due to the COVID-19 pandemic.'
LlamaIndexからアクセス
https://docs.llamaindex.ai/en/stable/examples/llm/openai.html
pip install llama-index
cat <<EOF > demo-llamaindex.py
from llama_index.llms import ChatMessage, OpenAI
llm = OpenAI(api_base = "http://localhost:8000/v1", api_key = "dummy")
messages = [
ChatMessage(role="system", content="You are a helpful assistant."),
ChatMessage(role="user", content="Who won the world series in 2020?"),
ChatMessage(role="assistant", content="The Los Angeles Dodgers won the World Series in 2020."),
ChatMessage(role="user", content="Where was it played?"),
]
resp = llm.chat(messages)
print(resp)
EOF
$ python demo-llamaindex.py
assistant: The 2020 World Series was played at Globe Life Field in Arlington, Texas.
Spring AIからアクセス
https://docs.spring.io/spring-ai/reference/getting-started.html
git clone https://github.com/rd-1-2022/ai-openai-helloworld.git
cd ai-openai-helloworld
./mvnw clean package -DskipTests
java -jar target/ai-openai-helloworld-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
$ curl --get --data-urlencode 'message=Tell me a joke about a cow.' http://localhost:8080/ai/simple
{"completion":" What do you call a cow with no legs? Ground beef!"}