llama.cpp

LLM inference engine in C/C++ for running quantized models

About

LLM inference in C/C++

Commands

llama-cli
llama-server
llama-convert
llama-quantize

Examples

Run inference with a quantized model interactively:
$ llama-cli -m model.gguf -p 'Hello, how are you?'
Start an OpenAI-compatible API server on port 8000:
$ llama-server -m model.gguf -ngl 33 --port 8000
Quantize a model from float32 to 4-bit for faster inference (Q4_K_M is the common medium 4-bit type; the original's "Q4_M" is not a valid type name):
$ llama-quantize model.gguf model.q4_k_m.gguf Q4_K_M
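Since llama-server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal Python sketch, assuming a server started as in the example above on localhost port 8000: it only builds and prints the chat-completions request body (the endpoint path and payload shape follow the OpenAI chat API); no network call is made.

```python
import json

# Assumed from the `--port 8000` example above; adjust to your server.
BASE_URL = "http://localhost:8000"

def chat_request(prompt: str, temperature: float = 0.8) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

body = json.dumps(chat_request("Hello, how are you?"))
print(body)

# To actually send it (requires a running llama-server):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL + "/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```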