Set Up Ollama to Run Local LLMs Completely Offline

Running AI Models Without Sending Your Data Anywhere

Every time you type a prompt into a cloud-based AI assistant, that text leaves your machine. It travels to a remote server, gets processed by someone else’s infrastructure, and may be logged, reviewed, or used to improve future models. For developers working with proprietary code, researchers handling sensitive documents, or anyone who simply prefers keeping their data local, that arrangement is a problem. Ollama solves it by letting you run large language models directly on your own hardware, no internet connection required after setup.

Ollama is an open-source tool that wraps the complexity of running LLMs – model downloading, quantization formats, inference configuration – into a clean command-line interface. It supports a growing library of models including Llama 3, Mistral, Gemma, Phi, and Qwen, among others. Once a model is pulled to your machine, it runs entirely offline. There is no account to create, no API key to manage, and no usage limit tied to a subscription tier.

Setup takes under fifteen minutes on most systems.


Installing Ollama on Your System

Ollama runs on Linux, macOS, and Windows. On Linux and macOS, the fastest installation method is a single curl command. Open a terminal and run the following:

curl -fsSL https://ollama.com/install.sh | sh

That script detects your operating system, downloads the correct binary, and installs it to /usr/local/bin on Linux or the Applications folder on macOS. On Windows, Ollama provides a standard installer available from its official site at ollama.com; a downloadable macOS app is offered there as well if you would rather not pipe a script into your shell. On Windows, download the .exe, run it, and the application installs alongside a system tray icon that manages the background service. After installation on any platform, verify that it worked by opening a terminal and typing ollama --version (two hyphens). You should see a version number printed back immediately.
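If you want to script that check, for example as the first step of a setup script, the following is a minimal Python sketch. The helper name ollama_version is hypothetical; it only assumes that the ollama binary is on your PATH and accepts the --version flag described above.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Return the text printed by `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    # Fall back to stderr in case the version text is not written to stdout.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "ollama not found on PATH")
```

A None result usually means the install did not complete or the terminal was opened before PATH was updated.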

Linux users should also check whether Ollama detected a GPU; once a model is loaded, ollama ps reports whether it is running on the GPU or the CPU. If you have an NVIDIA card, ensure the NVIDIA drivers are installed before running the install script. AMD GPU support is available on Linux through ROCm. Ollama will fall back to CPU inference automatically if no compatible GPU is found, but expect significantly slower response times without GPU acceleration. As a rough reference that varies with hardware and quantization, a 7-billion-parameter model running on CPU typically produces output at a few tokens per second, while the same model on a mid-range GPU can reach 30 to 60 tokens per second. If you want to monitor how your system handles the load during inference, Glances gives you real-time CPU, RAM, and GPU visibility from the terminal.
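To measure tokens per second on your own hardware rather than rely on typical figures, you can use the timing fields that Ollama's local HTTP API includes in a completed, non-streaming generate response: eval_count (tokens generated) and eval_duration (generation time in nanoseconds). The sketch below only does the arithmetic; the sample numbers are made up for illustration and are not a benchmark.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed computed from a finished /api/generate response."""
    # eval_count is the number of generated tokens;
    # eval_duration is the time spent generating them, in nanoseconds.
    return stats["eval_count"] * 1e9 / stats["eval_duration"]


# Illustrative values only: 120 tokens generated in 3 seconds.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Comparing this number with and without GPU acceleration gives a concrete sense of the slowdown on your machine.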


Pulling a Model and Running Your First Prompt

With Ollama installed and the service running, pulling a model is a single command. To download Llama 3.2 (the 3-billion-parameter version, which is lightweight enough for most laptops), run:

ollama pull llama3.2

The download size for a quantized 3B model sits around 2 GB. Larger models like Llama 3.1 8B come in at roughly 4.7 GB, and the 70B variant requires approximately 40 GB of disk space and substantial RAM to run. For most first-time setups, starting with a 3B or 7B model makes sense. Once the pull completes, launch an interactive chat session directly in the terminal with ollama run llama3.2. A prompt cursor appears and the model responds to anything you type. To exit, type /bye. That session runs with zero outbound network traffic – you can verify this by disconnecting from Wi-Fi entirely before running the command.
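Those download sizes follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below is a rule of thumb, not how Ollama computes anything. The 4.8 bits-per-weight figure is an assumed average for a 4-bit quantization plus overhead, and the parameter counts are rounded.

```python
def estimate_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (10^9 bytes) of a quantized model."""
    return params_billions * bits_per_weight / 8


# Assumption: about 4.8 bits per weight on average for a 4-bit quantized model.
ASSUMED_BITS_PER_WEIGHT = 4.8

for name, params in [("3B", 3.2), ("8B", 8.0), ("70B", 70.0)]:
    size = estimate_size_gb(params, ASSUMED_BITS_PER_WEIGHT)
    print(f"{name}: ~{size:.1f} GB")
```

The estimates land in the same ballpark as the sizes quoted above. The same formula with 16 bits per weight shows why unquantized models are impractical on most laptops.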

Ollama also exposes a local REST API on port 11434 by default, which means you can connect other tools to it. Applications like Open WebUI provide a browser-based chat interface that looks and behaves like ChatGPT but talks exclusively to your local Ollama instance. To test the API directly, send a POST request to http://localhost:11434/api/generate with a JSON body containing your model name and prompt; add "stream": false if you want a single JSON reply instead of a stream of partial responses. This local API is what allows Ollama to integrate with code editors, productivity tools, and custom scripts, usually with little more than pointing the tool at the local address. The service starts automatically at boot on Linux via systemd and on macOS and Windows through their respective startup mechanisms.
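As a minimal sketch of that POST request using only the Python standard library: the function names build_request and generate are hypothetical, the model name assumes you have already pulled llama3.2, and "stream": False asks for one JSON object rather than a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def generate(model: str, prompt: str) -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        print(generate("llama3.2", "Explain what a Modelfile is in one sentence."))
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Could not reach Ollama on localhost:11434: {exc}")
```

Because the URL points at localhost, the request never leaves the machine, which is the same property the interactive session has.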

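Managing models after the initial pull uses a handful of commands worth memorizing. ollama list shows every model currently stored on your machine along with its size and when it was last modified. ollama rm modelname deletes a model to free disk space. ollama show modelname prints details such as the model's parameter count, quantization format, and the prompt template it uses by default. Models are stored in ~/.ollama/models on macOS, and on Linux when the server runs under your own user account; the systemd service set up by the Linux install script runs as a dedicated ollama user and keeps models under /usr/share/ollama/.ollama/models instead. If disk space is tight on your primary drive, you can set the OLLAMA_MODELS environment variable to point to a different location before pulling anything. Set it in the environment of the Ollama server process (for the systemd service, in the unit's environment), not only in your interactive shell.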
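The same inventory is available programmatically: the local API has a GET /api/tags endpoint that returns the installed models as JSON. The sketch below formats that payload as a largest-first list; summarize_models is a hypothetical helper, and it relies only on the name and size (bytes) fields of each entry.

```python
import json
import urllib.request


def summarize_models(tags: dict) -> list[str]:
    """Format a parsed /api/tags payload as 'name  size' lines, largest first."""
    models = sorted(tags.get("models", []), key=lambda m: m["size"], reverse=True)
    return [f"{m['name']}  {m['size'] / 1e9:.1f} GB" for m in models]


if __name__ == "__main__":
    try:
        with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=10) as resp:
            print("\n".join(summarize_models(json.load(resp))))
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Could not reach Ollama on localhost:11434: {exc}")
```

Sorting by size makes it easy to spot which models to remove first when disk space runs low.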

Customizing Model Behavior with Modelfiles

Ollama supports a concept called a Modelfile – a short configuration file that lets you create a custom version of any base model. This is where Ollama moves beyond a simple inference runner. A Modelfile can define a persistent system prompt, adjust temperature and context window size, and set default parameters that apply every time you start that custom model. The syntax is straightforward: create a plain text file named Modelfile, add a FROM directive pointing to a base model, and optionally include a SYSTEM block with your instructions.

A minimal Modelfile looks like this:

FROM llama3.2
SYSTEM "You are a concise assistant that always responds in bullet points."

Save that file, then build the custom model with ollama create my-model -f Modelfile. From that point, running ollama run my-model always starts with that system prompt pre-loaded. You can push as much complexity into the system prompt as needed – persona instructions, output format requirements, domain constraints. This is particularly useful for teams building internal tools where every member needs consistent behavior from the model without manually re-entering instructions each session.
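When several custom models share a base and differ only in their instructions or settings, it can help to generate the Modelfile text rather than hand-edit it. The sketch below is one hedged way to do that. render_modelfile is a hypothetical helper, and the temperature and num_ctx values are illustrative rather than recommendations. It emits FROM, PARAMETER, and SYSTEM lines, using triple quotes so the system prompt can span multiple lines.

```python
def render_modelfile(base: str, system: str, temperature: float, num_ctx: int) -> str:
    """Return Modelfile text with a base model, two parameters, and a system prompt."""
    lines = [
        f"FROM {base}",
        f"PARAMETER temperature {temperature}",  # sampling temperature
        f"PARAMETER num_ctx {num_ctx}",          # context window size in tokens
        f'SYSTEM """{system}"""',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values; redirect the output to a file named Modelfile.
    print(
        render_modelfile(
            "llama3.2",
            "You are a concise assistant that always responds in bullet points.",
            0.3,
            4096,
        ),
        end="",
    )
```

Saving the output as Modelfile and running ollama create my-model -f Modelfile builds the custom model exactly as with a hand-written file.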


The gap between running a model and actually deploying it usefully in a workflow is mostly a configuration problem, and Modelfiles close most of it. The real test comes when you start connecting Ollama to other services – document processors, note-taking apps, terminal-based agents – and that depends entirely on what you need the model to do. Start with a 7B model, watch how much RAM it consumes at idle versus under load, and work up from there only if the output quality demands it.