As a programmer, I have always believed that the best way to understand something is to get your hands dirty and test it out yourself. In the past few years, since the introduction of ChatGPT, AI and associated technologies such as LLMs have become a hot topic in the tech industry, and there are now many platforms and models available for people to try out.
In this post I will walk through a step-by-step guide to running llama.cpp (an open-source LLM inference engine, originally built to run Meta's Llama models) on a macOS machine with a downloaded model. This is just a simple installation, run, and test to see how it works; it does not cover fine-tuning or training on your own data set.
Prerequisites
- Binary: The llama.cpp binaries (including llama-server) can be installed with Homebrew using the command below:
brew install llama.cpp
- Model: Download a model from Hugging Face (an LLM model repository, much like a code repository such as GitHub). Note that for llama-server to run, the model must be in GGUF format; you can search for "gguf" on Hugging Face to find compatible files. See the example download after this list.
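One way to fetch a model is with the huggingface-cli tool (installable via pip install -U "huggingface_hub[cli]"). The repository and file name below are an assumption based on the model used later in this post; check the repository's file list for the exact name, since large quantizations are sometimes split into multiple files.

huggingface-cli download Qwen/Qwen2.5-Coder-14B-Instruct-GGUF qwen2.5-coder-14b-instruct-q4_k_m.gguf --local-dir .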
Running
Once you have the binary installed and the model downloaded, run the command below:
llama-server --flash-attn --ctx-size 0 --model qwen2.5-coder-14b-instruct-q4_k_m.gguf
When this command runs, it brings up an HTTP server and loads the model. Here --flash-attn enables flash attention, and --ctx-size 0 tells llama-server to use the model's full training context length rather than the default. If you see any errors, refer to the Appendix.
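To quickly check that the server is up, recent llama.cpp builds also expose a /health endpoint (whether your build has it is an assumption; the browser test below works regardless):

curl http://127.0.0.1:8080/health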
Testing
To test the model, open a browser and go to http://127.0.0.1:8080
If the built-in web UI loads, you can reach the model and play around with it on your local machine.
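llama-server also exposes an OpenAI-compatible chat completions API, so you can test from the command line as well. A minimal sketch (since llama-server serves only the one loaded model, no model field is needed in the request):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a hello world program in Go."}]}'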
Appendix
failed to allocate buffer for kv cache
llama_new_context_with_model: freq_scale = 1
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 34359738368
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
common_init_from_params: failed to create context with model 'c:\ai\qwen2.5-coder-32b-instruct-q4_k_m.gguf'
srv load_model: failed to load model, 'c:\ai\qwen2.5-coder-32b-instruct-q4_k_m.gguf'
main: exiting due to model loading error
This indicates your hardware cannot allocate the buffer needed for the KV cache; in the log above, the failed allocation is 34359738368 bytes, i.e. 32 GiB. You can either download a smaller model, pass an explicit smaller --ctx-size, or upgrade your hardware.
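For example, capping the context at 8192 tokens (an arbitrary value for illustration; pick one that fits your prompts) shrinks the KV cache accordingly:

llama-server --flash-attn --ctx-size 8192 --model qwen2.5-coder-14b-instruct-q4_k_m.gguf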
Input prompt is too big compared to KV size. Please try increasing KV size.
llama_decode: failed to decode, ret = -3
srv update_slots: failed to decode the batch: KV cache is full - try increasing it via the context size, i = 0, n_batch = 2048, ret = -3
slot release: id 0 | task 3 | stop processing: n_past = 23, truncated = 0
srv send_error: task id = 3, error: Input prompt is too big compared to KV size. Please try increasing KV size.
srv cancel_tasks: cancel task, id_task = 3
The error description suggests what to change: the context size. We originally passed --ctx-size 0 on the command line; removing that option lets llama-server fall back to its default context size, which resolved the error.
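The resulting command is simply the original one without the --ctx-size option:

llama-server --flash-attn --model qwen2.5-coder-14b-instruct-q4_k_m.gguf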