llama.cpp

LLM inference engine in C/C++ for running quantized models

About

LLM inference in C/C++

Commands

llama-cli
llama-server
llama-convert
llama-quantize

Examples

Run inference with a quantized model interactively:
$ llama-cli -m model.gguf -p 'Hello, how are you?'
Start an OpenAI-compatible API server on port 8000:
$ llama-server -m model.gguf -ngl 33 --port 8000
Quantize a model from float32 to 4-bit for faster inference (Q4_K_M is the common medium 4-bit type; the original's "Q4_M" is not a valid type name):
$ llama-quantize model.gguf model.q4_k_m.gguf Q4_K_M
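Since llama-server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal Python sketch, assuming a server started as in the example above on localhost port 8000: it only builds and prints the chat-completions request body (the endpoint path and payload shape follow the OpenAI chat API); no network call is made.

```python
import json

# Assumed from the `--port 8000` example above; adjust to your server.
BASE_URL = "http://localhost:8000"

def chat_request(prompt: str, temperature: float = 0.8) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

body = json.dumps(chat_request("Hello, how are you?"))
print(body)

# To actually send it (requires a running llama-server):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL + "/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```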