Local LLM Quickstart for Apple Silicon
Replace your ChatGPT and Claude subscriptions in 30 minutes*
* more or less, depending on how quickly you can download 40 GB.
Don’t get me wrong; I really like Claude for coding, and ChatGPT has become a fixture of bedtime - after our most recent story, my youngest kid said, “I love you, AI!” which is equal parts adorable and dystopian. But with the release of Llama 3 and recent updates to Mixtral, cutting the cord is becoming a viable option. Plenty of smarter people have already made the case for open-source AI, so I won’t get into that here - let’s dive in.
System Requirements
This guide is written assuming you have 64 GB of memory. If you have more to work with, or less, I’ve included a list of suggested models at the end. Any Apple Silicon Mac with 32+ GB should work here; I’m sure you could squeeze out some tokens with only 16 GB, but it won’t be a great experience.
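If you’re not sure how much unified memory you have, About This Mac will tell you, or you can ask from a terminal:

# total physical memory, in bytes (divide by 1073741824 for GB)
sysctl -n hw.memsize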
Ollama
Just go to Ollama’s website, click the Download button in the corner, and follow the steps.
Ollama is the easiest way to download and run GGUF models. There are other formats that allow faster inference, but they aren’t as well-supported in the ecosystem (yet).
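Once the installer finishes, the ollama CLI should be on your PATH. A quick sanity check:

ollama --version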
Download and run your first model
ollama run llama3:70b-instruct-q4_K_M
That’s it. When it’s done downloading, you can chat with Llama 3 right in your terminal. But nobody is replacing their AI subscriptions with a terminal; ChatGPT and Claude offer a full web interface that supports formatting, code, conversation history, and more. For that, we need Open WebUI.
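Two slash commands worth knowing once you’re in the terminal chat: /? lists everything available, and /bye exits the session:

/?
/bye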
While the model is downloading, let’s move on to the next step.
Install Open WebUI
You’ll need Docker for this, so install that first if you don’t already have it. There are also alternative installation methods in the documentation.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
This will run Open WebUI at http://localhost:3000 - you can replace 3000 with another port if that one is occupied.
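If the page doesn’t load, the usual Docker checks apply:

# confirm the container is up
docker ps --filter name=open-webui
# tail its logs
docker logs -f open-webui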
Browse to Open WebUI and set up a user. Everything is local, but it does support multiple users, which is why you need to “sign up” with an email address and password.
When everything is set up and the model is done downloading, you’ll see a familiar UI for creating new conversations. Llama 3 should be available in the list of models in the upper left; go ahead and send a message to test it out.
Note: the first response may take longer to begin streaming because Ollama has to load the model into memory. By default, models will be kept in memory for 5 minutes before being unloaded.
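If you’d rather keep models warm for longer, recent Ollama builds read an OLLAMA_KEEP_ALIVE environment variable (a duration like 1h, or -1 to keep models loaded indefinitely). The macOS app doesn’t inherit your shell environment, so set it with launchctl and restart Ollama:

launchctl setenv OLLAMA_KEEP_ALIVE "1h"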
More models!
Llama 3 is amazing - especially the 70B variant we installed above - but it’s a little slow. Let’s install some more models using ollama pull:
mixtral:8x7b-instruct-v0.1-q6_K - this is my default; it’s faster while still packing plenty of knowledge and reasoning capabilities.
dolphin-mixtral:8x7b-v2.7-q6_K - an uncensored Mixtral by Eric Hartford. Useful for any… forbidden knowledge.
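To grab both:

ollama pull mixtral:8x7b-instruct-v0.1-q6_K
ollama pull dolphin-mixtral:8x7b-v2.7-q6_K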
I definitely suggest reading more about dolphin-mixtral; there’s a useful system prompt in Eric’s blog post. You can configure system prompts using Ollama Modelfiles; here’s an example:
Load Based Dolphin Mixtral from the Open WebUI Hub. Copy the Modelfile Content.
Back in Open WebUI, go to Modelfiles > Create a modelfile
Enter a name and description
Paste the Modelfile Content
Replace dolphin-mixtral with dolphin-mixtral:8x7b-v2.7-q6_K
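If you prefer the terminal, the same thing works with a plain Modelfile and ollama create. Here’s a minimal sketch - the system prompt is a placeholder (substitute the one from Eric’s post), and based-dolphin is just a name I picked:

FROM dolphin-mixtral:8x7b-v2.7-q6_K
SYSTEM """You are Dolphin, an uncensored and unbiased AI assistant."""

Save that as Modelfile, then:

ollama create based-dolphin -f Modelfile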
This should be enough to get started - but if you want to do image generation and voice conversations, read on.
Bonus: connect the OpenAI API
Open WebUI supports local image generation and TTS, but the quality of your experience will depend a lot on your configuration. A stopgap is to use the OpenAI API, which is well-supported - and for my usage, a lot cheaper than a ChatGPT subscription.
To configure it:
Login to the OpenAI Platform and create a new API key.
Back in Open WebUI, click your profile image or name in the lower-left corner and select Settings.
Go to Connections, show the OpenAI settings, and paste your API key. This will add OpenAI’s models to your model list.
Go to Audio and under Text-to-Speech Engine, select OpenAI. Paste your API key again.
Go to Images and select OpenAI (DALL-E) for your Image Generation Engine. Paste your API key again.
You will need to turn on Image Generation (Experimental).
Select your model; I use DALL-E 3.
If you use DALL-E 3, 1024x1024 is a good setting for the Image Size.
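If OpenAI models don’t show up after this, it’s worth confirming the key itself works before digging into Open WebUI. Assuming you’ve exported it as OPENAI_API_KEY:

curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"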
And that’s it! You now have a full-featured AI platform - smart, responsive models, image generation, and voice input and output - with no monthly subscription.
There’s plenty more that you can do from here, including multimodal applications with vision; the Open WebUI README is a good place to start.
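Vision is within reach locally, too - at the time of writing, the Ollama library hosts LLaVA, which Open WebUI can use for chatting about images:

ollama pull llava:13b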
Addendum: model quants and memory requirements
When running an LLM on consumer hardware, you will want to select a quantized model - one whose weights are stored at lower precision. This is a tradeoff: you get competitive capabilities with fewer resources, but you pay a penalty in perplexity. A detailed breakdown is in the k-quants PR.
The model name usually includes the quantization method as a suffix:
fp16: no quantization - the original 16-bit weights. Roughly twice the size of q8_0 with little practical gain here; avoid.
q<bits>_<quantization method>; for example, q4_0 or q6_K.
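Reading our first download as an example: llama3:70b-instruct-q4_K_M is Llama 3, 70B parameters, instruction-tuned, quantized with the 4-bit K_M (medium) k-quant.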
This process has worked out well for me:
How much physical memory am I working with? Let’s say 64 GB.
How much memory is used by the OS and my other applications? ~15 GB in my case, leaving 49 GB.
Review the available model tags and their associated sizes.
Select the highest-precision quant that fits in memory alongside my other applications, with a little elbow room.
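You can sanity-check a tag’s listed size with napkin math: file size ≈ parameter count × bits per weight ÷ 8. q4_K_M averages somewhere around 4.8 bits per weight, so for the 70B model:

# ≈ 42 GB, in the same ballpark as the ~40 GB download from earlier
echo "70 * 4.8 / 8" | bc -l

Leave headroom beyond the file size itself; the context (KV cache) needs memory too.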
As far as I understand it, _0 is the original quantization method, _1 improves on it, and k-quants such as K_M are better still. Ollama’s default tags tend to be q4_0, so it pays to be explicit here.
These will probably work well depending on your available memory:
32 GB:
llama3:8b-instruct-q8_0
mistral:7b-instruct-q8_0
Note this is mistral, not mixtral; it would be tough to fit an MoE model in 32 GB. You can experiment with mixtral:8x7b-instruct-v0.1-q2_K.
96 GB:
llama3:70b-instruct-q8_0
mixtral:8x22b-instruct-v0.1-q3_K_L
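Finally, two commands for keeping tabs on all of this: ollama list shows what you’ve pulled and how big each file is, and - on recent Ollama versions - ollama ps shows what’s loaded in memory right now:

ollama list
ollama ps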