--- title: llama-cpp-pythonを使ってローカルLLMでテキスト生成とOpenAI互換サーバーを立てる tags: ["Python", "llama.cpp", "OpenAI", "Machine Learning", "MPS", "LangChain", "LlamaIndex", "Spring AI"] categories: ["AI", "LLM", "llama.cpp"] date: 2023-11-17T08:03:02Z updated: 2023-11-22T07:42:00Z --- [llama.cpp](https://github.com/ggerganov/llama.cpp)はMetaのLLaMAモデルのC/C++版でラップトップで。 元々はApple SiliconのMacで動かす用だったようですが、LinuxやWindowsで大丈夫みたいです。Apple SiliconのMetalが利用できます。 このllama.cppのPythonバインディングである[llama-cpp-python](https://github.com/abetlen/llama-cpp-python)を試してみます。 llama-cpp-pythonは付加機能としてOpenAI互換のサーバーを立てることができます。 試した環境はこちらです **目次** ### llama-cpp-pythonのインストール まずはvenvを作成します。 ``` mkdir -p $HOME/work/llm cd $HOME/work/llm python3 -m venv .venv source .venv/bin/activate ``` llama-cpp-pythonのインストールします。serverも一緒にインストールします。 ``` CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]' ``` > ℹ️ Apple SiliconのMac上でエラーが出る場合は、 https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md のセットアップを試してください。 ### Modelのダウンロード Hugging Face上のモデルをダウンロードします。GGUF形式である必要があります。 自分が試したのは以下の二つです。 * https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF * https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF ``` sudo mkdir -p /opt/models sudo chown -R $USER /opt/models ``` ここでは https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf を使用します。 ``` cd /opt/models/ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf cd $HOME/work/llm ``` ### 簡単なテキスト生成 llama.cppを直接使って簡単なテキストを生成するコードを書きます。 #### llama-cpp-pythonで簡単なテキスト生成 ```python cat < demo.py from llama_cpp import Llama llm = Llama(model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers = 1) output = llm(""" [INST] Give me a joke. [/INST] """) print(output['choices'][0]['text']) EOF ``` 実行します。何かジョークを言ってくれました。 ``` $ python3 demo.py 2> /dev/null Why don't scientists trust atoms? Because they make up everything! ``` #### LangChainで簡単なテキスト生成 https://python.langchain.com/docs/integrations/providers/llamacpp ``` pip install langchain llama-cpp-python ``` ```python cat < demo-langchain.py from langchain.llms import LlamaCpp llm = LlamaCpp( model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers = 1, verbose = False, ) print(llm("Give me a joke.")) EOF ``` ``` $ python demo-langchain.py A: Ah, humor! A delightful way to lighten the mood. Here's one for you: Why don't scientists trust atoms? Because they make up everything! ``` #### LamaIndexで簡単なテキスト生成 https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html ``` pip install llama-index llama-cpp-python ``` ```python cat < demo-llamaindex.py from llama_index.llms import LlamaCPP from llama_index.llms.llama_utils import ( messages_to_prompt, completion_to_prompt, ) llm = LlamaCPP( model_path = "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", model_kwargs = {"n_gpu_layers": 1}, messages_to_prompt = messages_to_prompt, completion_to_prompt = completion_to_prompt, verbose = False, ) print(llm.complete("Give me a joke.")) EOF ``` ``` $ python demo-llamaindex.py Sure! Here's a joke for you: Why don't scientists trust atoms? Because they make up everything! ``` ### OpenAI互換サーバーの起動 以下のコマンドでサーバーを立ち上げます。 ``` python3 -m llama_cpp.server --model /opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --n_gpu_layers 1 ``` 以下からAPIドキュメントを確認できます。 http://localhost:8000/docs #### curlでアクセス 先ほどと同じインプットを今度はAPI経由で問い合わせます。 > OpenAIの["Create chat completion" API](https://platform.openai.com/docs/api-reference/chat/create)は`model`パラメータが必須ですが、
> llama-cpp-pythonの方は`model`パラメータはなくても良いみたいです。 ``` curl -s http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "user", "content": "Give me a joke."} ] }' | jq . ``` 同じジョークが返ってきました。 ```json { "id": "chatcmpl-93596a2a-6933-4c40-82a4-2fed63bc5dda", "object": "chat.completion", "created": 1700207715, "model": "/opt/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", "choices": [ { "index": 0, "message": { "content": " Why don't scientists trust atoms? Because they make up everything!", "role": "assistant" }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 17, "completion_tokens": 14, "total_tokens": 31 } } ``` [OpenAIのドキュメント](https://platform.openai.com/docs/guides/text-generation/chat-completions-api)と同じ例を試します。 ``` curl -s http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "gpt-3.5-turbo", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the world series in 2020?"}, {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}, {"role": "user", "content": "Where was it played?"} ] }' | jq . ``` 合っていそうな回答が返りました。`model`パラメータは何を入れても良いみたいです。 ```json { "id": "chatcmpl-94d6afac-f029-4284-a8e3-7e01e2ab569e", "object": "chat.completion", "created": 1700207737, "model": "gpt-3.5-turbo", "choices": [ { "index": 0, "message": { "content": " The 2020 World Series was played in Arlington, Texas. It was the first time that the World Series has been played in October since 1988, and it is the first time it has been held in October in the American League Championship Series format.", "role": "assistant" }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 68, "completion_tokens": 57, "total_tokens": 125 } } ``` #### PythonのOpenAIクライアントからアクセス https://platform.openai.com/docs/libraries/python-library ``` pip install openai ``` ```python cat < demo-openai.py from openai import OpenAI client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy") response = client.chat.completions.create( model="gpt-3.5-turbo", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the world series in 2020?"}, {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}, {"role": "user", "content": "Where was it played?"} ] ) print(response) EOF ``` ``` $ python demo-openai.py ChatCompletion(id='chatcmpl-6684c9c9-5432-4056-8ae4-71b8bdf7cee5', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The 2020 Major League Baseball World Series was played at Globe Life Field in Arlington, Texas. It was the first time that the series had been held in a single ballpark since 1973.', role='assistant', function_call=None, tool_calls=None))], created=1700209988, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=68, total_tokens=116)) ``` #### LangChainからアクセス https://python.langchain.com/docs/integrations/platforms/openai ``` pip install langchain openai ``` ```python cat < demo-langchain.py from langchain.chat_models import ChatOpenAI from langchain.schema import AIMessage, HumanMessage, SystemMessage chat = ChatOpenAI(openai_api_base = "http://localhost:8000/v1", openai_api_key = "dummy") messages = [ SystemMessage(content = "You are a helpful assistant."), HumanMessage(content = "Who won the world series in 2020?"), AIMessage(content = "The Los Angeles Dodgers won the World Series in 2020."), HumanMessage(content = "Where was it played?") ] result = chat(messages) print(result) EOF ``` ``` $ python demo-langchain.py content=' The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was the first time the series was played in a neutral site due to the COVID-19 pandemic.' ``` #### LlamaIndexからアクセス https://docs.llamaindex.ai/en/stable/examples/llm/openai.html ``` pip install llama-index ``` ```python cat < demo-llamaindex.py from llama_index.llms import ChatMessage, OpenAI llm = OpenAI(api_base = "http://localhost:8000/v1", api_key = "dummy") messages = [ ChatMessage(role="system", content="You are a helpful assistant."), ChatMessage(role="user", content="Who won the world series in 2020?"), ChatMessage(role="assistant", content="The Los Angeles Dodgers won the World Series in 2020."), ChatMessage(role="user", content="Where was it played?"), ] resp = llm.chat(messages) print(resp) EOF ``` ``` $ python demo-llamaindex.py assistant: The 2020 World Series was played at Globe Life Field in Arlington, Texas. ``` #### Spring AIからアクセス https://docs.spring.io/spring-ai/reference/getting-started.html ``` git clone https://github.com/rd-1-2022/ai-openai-helloworld.git cd ai-openai-helloworld ./mvnw clean package -DskipTests java -jar target/ai-openai-helloworld-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy ``` ``` $ curl --get --data-urlencode 'message=Tell me a joke about a cow.' http://localhost:8080/ai/simple {"completion":" What do you call a cow with no legs? Ground beef!"} ```