Local AI Setup Guide: Ollama, Open WebUI, and AnythingLLM
Running AI on your own hardware has gotten genuinely practical. Your data stays local, inference costs nothing per token after setup, and the model runs offline once downloaded. The setup takes an hour, maybe two if you hit Docker networking issues. The tools that make it work (Ollama for inference, Open WebUI or AnythingLLM for the interface) are all actively maintained and work well on consumer hardware.
This guide covers hardware requirements by tier, the full installation process, model selection by task type, and where a personal setup stops being enough.
why running AI locally is worth the setup time#
your data never leaves your machine#
Local inference means the prompts you send and the responses you receive stay on your hardware. No terms of service, no opt-out settings, no training on your data, because your data never leaves the machine. For anyone handling client information, proprietary research, or internal documents, this is the primary reason to bother.
44% of organizations cite data privacy and security as the top barrier to LLM adoption (Kong Enterprise AI Report / Hostinger LLM Statistics, 2025). Local inference is the direct fix for that particular problem.
no API costs: run inference at volume without per-token billing#
API costs add up. Processing large document batches, running inference in a development workflow, or using a model for repeated analysis tasks can get expensive on per-token billing. With a local setup, the marginal cost per query after the initial hardware investment is zero.
works fully offline once models are downloaded#
Once you have pulled a model with Ollama, it runs without internet access. This matters for air-gapped environments, travel, and anywhere network connectivity is unreliable. No API calls required during inference.
the hardware bar has dropped: what you actually need in 2026#
Efficient quantization formats, Apple Silicon's unified memory architecture, and increasingly capable small models have made useful local AI inference possible on hardware most developers and IT leads already own. A 2022 MacBook Pro, a desktop with 32GB RAM, or a machine with a midrange GPU is each enough to run a 7B model at a speed and quality suitable for daily work.
hardware requirements#
minimum viable setup (CPU-only): 8-core CPU, 16GB RAM, 50GB SSD#
CPU-only inference is slower but works. A modern 8-core CPU with 16GB of RAM can run a 7B model at Q4 quantization, typically 5-15 tokens per second. That is readable in real time. For document analysis, one-shot tasks, and workflows that tolerate a few seconds per response, this is enough.
The 50GB SSD estimate covers the Ollama install, a couple of 7B models at Q4 (roughly 4-5GB each), and Open WebUI or AnythingLLM.
recommended setup (GPU-accelerated): 8GB+ VRAM, 32GB RAM#
A dedicated GPU with 8GB+ VRAM delivers 40-50 tokens per second on 7B models. That is a 3-8x throughput improvement over CPU-only on the same model (Arsturn Hardware Guide / LocalLLM.in, 2026). At that speed you stop waiting on the model: responses feel immediate.
With 32GB system RAM and a GPU with 8GB VRAM, you can run a 7B model with full GPU offload and still have headroom for the OS, browser, and other applications.
Apple Silicon: why M-series Macs are unusually efficient for local inference#
Apple Silicon Macs use unified memory: the same RAM pool is accessible to both CPU and GPU, with no data transfer penalty between them. An M3 MacBook Pro with 24GB of unified memory can load models that do not fit on a dedicated GPU with less VRAM, and it will often outrun such a setup once that GPU has to spill layers to the CPU.
A 7B model at Q4 quantization requires approximately 4-5GB of RAM (Ollama VRAM Guide / LocalLLM.in, 2026). An M3 Max with 64GB can run 13B or even 30B models without running out of memory, and the performance is better than you might expect from a laptop. For developers who want local AI that travels with them, Apple Silicon is the most practical option per dollar right now.
NVIDIA GPU tiers and what each unlocks#
| GPU | VRAM | What it runs well |
|---|---|---|
| RTX 3060 / 4060 | 8-12GB | 7B models, full GPU offload |
| RTX 3080 / 4070 | 10-12GB | 7B-13B models depending on quantization |
| RTX 3090 / 4090 | 24GB | 13B-34B models, fast inference |
| A100 / H100 (datacenter) | 40-80GB | 70B models at Q4 on a single 80GB card; full precision (about 140GB at FP16) needs multiple GPUs |
For consumer GPU recommendations in 2026, the RTX 4060 Ti (16GB) offers the best price-to-VRAM ratio at the midrange.
a practical rule of thumb: 4-5GB RAM per 7B model at Q4 quantization#
- 7B model at Q4: ~4-5GB
- 13B model at Q4: ~8-9GB
- 34B model at Q4: ~20GB
- 70B model at Q4: ~40GB
These numbers tell you whether a model fits in your GPU VRAM for full GPU offload (fast) or has to use partial GPU + CPU inference (slower). If the model does not fully fit in VRAM, Ollama keeps as many layers as it can on the GPU and runs the rest on the CPU, which works but is noticeably slower.
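The rule of thumb above can be turned into a quick fit check. This is a rough sketch, not how Ollama itself decides layer placement. The 0.6 GB per billion parameters figure is an assumed midpoint of the list above, and the 1.5 GB headroom for context and runtime overhead is an assumption you should tune for your own workload.

```shell
#!/usr/bin/env bash
# Rough Q4 memory estimate: ~0.6 GB per billion parameters (assumed midpoint
# of the rule of thumb above), plus assumed headroom for context and overhead.

estimate_q4_gb() {
  # $1 = parameter count in billions, e.g. 7, 13, 34, 70
  awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 0.6 }'
}

fits_in_vram() {
  # $1 = parameters in billions, $2 = VRAM in GB, $3 = headroom in GB (default 1.5)
  awk -v p="$1" -v vram="$2" -v head="${3:-1.5}" 'BEGIN {
    need = p * 0.6 + head
    if (need <= vram) print "full GPU offload"
    else              print "partial offload (GPU + CPU)"
  }'
}

estimate_q4_gb 7     # prints 4.2
fits_in_vram 7 8     # prints "full GPU offload"
fits_in_vram 13 8    # prints "partial offload (GPU + CPU)"
```

Treat the result as a first filter only. Long context windows and other applications using the GPU both eat into the headroom.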
step 1: install Ollama#
Ollama crossed 162,000 GitHub stars by early 2026, up from 28,900 in Q1 2024 (GitHub / Runa Capital ROSS Index, 2024-2026). It is the standard inference engine for local model deployment and the foundation both Open WebUI and AnythingLLM build on.
installation: macOS, Linux, and Windows (WSL2)#
macOS:
curl -fsSL https://ollama.com/install.sh | sh
Or download the macOS app from ollama.com for a GUI install. The app installs the CLI automatically.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com. WSL2 is recommended if you want to run Open WebUI or AnythingLLM via Docker alongside Ollama. The native Windows install is fine for standalone Ollama use.
pull your first model#
ollama pull llama3.2 # Meta's Llama 3.2 (3B) -- fast, general purpose
ollama pull phi4-mini # Microsoft Phi-4 Mini (3.8B) -- strong at reasoning
ollama pull gemma3 # Google Gemma 3 (4B) -- efficient multilingual support
For a first install, llama3.2 or phi4-mini are solid choices. After pulling, test:
ollama run llama3.2
You should see a prompt. Type a message, press Enter. If you get a response, the inference layer is working.
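The same check can be done over Ollama's local REST API, which is what Open WebUI and AnythingLLM talk to. The sketch below posts one prompt to the /api/generate endpoint on the default port. The helper names are made up for illustration, and the payload builder deliberately skips JSON escaping, so keep the prompt free of double quotes and backslashes.

```shell
#!/usr/bin/env bash
# Sketch: send one prompt to Ollama's REST API instead of the interactive CLI.

generate_payload() {
  # $1 = model name, $2 = prompt.
  # Simplification: no JSON escaping, so the prompt must not contain
  # double quotes or backslashes.
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

ask_ollama() {
  # Posts the payload to /api/generate on the default local port.
  curl -fsS http://localhost:11434/api/generate \
    -d "$(generate_payload "$1" "$2")"
}

# Usage (needs a running Ollama server with the model already pulled):
#   ask_ollama llama3.2 "Why is the sky blue?"
```

With "stream" set to false the server returns a single JSON object, and the generated text is in its "response" field.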
understanding quantization levels: Q4_K_M is the standard starting point#
Quantization reduces model precision to shrink the file size and memory footprint. The tradeoff is a small reduction in output quality.
Q8 is near full precision, highest quality, largest size. Use it if you have the VRAM to spare. Q4_K_M is the standard balance: 4-bit quantization using llama.cpp's block-wise "K-quant" scheme in its medium (M) variant, with small size, good quality, and default status for most use cases. The K is a naming convention for that scheme and has nothing to do with K-means. Q2 and Q3 produce noticeable quality degradation and are only worth considering on very constrained hardware.
When using ollama pull, the default download is typically Q4_K_M. Start with the default. Only experiment with Q8 if you notice quality issues and have VRAM headroom.
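To see what each level costs in memory, multiply the parameter count by the bits stored per weight. The sketch below does that for a 7B model. The bits-per-weight values are approximations: Q8_0 stores 8.5 bits per weight, while the 4.85 figure for Q4_K_M is an assumed typical value that varies slightly by model.

```shell
#!/usr/bin/env bash
# Approximate weight size in GB = parameters (billions) * bits per weight / 8.
# Weights only: context memory and runtime overhead come on top.

quant_size_gb() {
  # $1 = parameters in billions, $2 = bits per weight
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

# Compare three precision levels for a 7B model.
while read -r name bits; do
  printf '%-7s %s GB\n' "$name" "$(quant_size_gb 7 "$bits")"
done <<'EOF'
FP16 16
Q8_0 8.5
Q4_K_M 4.85
EOF
```

For a 7B model this works out to about 14 GB at FP16, 7.4 GB at Q8_0, and 4.2 GB at Q4_K_M. To check which quantization a pulled model actually uses, run ollama show followed by the model name.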
choosing your interface: Open WebUI vs AnythingLLM#
Open WebUI: best for developers, power users, and multi-model switching#
Open WebUI is a full-featured chat interface for Ollama and API-compatible model providers. It works like a polished chat application: model switching, chat history, document upload for in-context Q&A, multi-user support. Over 126,000 GitHub stars as of early 2026 (GitHub / OpenAlternative, 2026), making it the most widely adopted local AI interface.
Use Open WebUI if you:
- Want to switch between multiple locally installed models in the same interface
- Need a chat UI that works like Claude.ai or ChatGPT but against your local models
- Are the primary user and want direct control over settings and model parameters
- Want to share access with a small number of other users (multi-user auth is built in)
AnythingLLM: best for teams, document Q&A, and workspace organization#
AnythingLLM is organized around workspaces, each with its own document corpus, LLM configuration, and conversation history. This is useful when you are managing several knowledge domains and do not want them bleeding into each other. Think: one workspace for client documents, another for your codebase.
55,794 GitHub stars as of early 2026 (GitHub / OpenAlternative, 2026). Fewer than Open WebUI, but AnythingLLM has a clearer lead in team and document-focused deployments.
Use AnythingLLM if you:
- Need to upload PDFs and ask questions against their content
- Are setting up for a small team that needs separate document collections per user or project
- Want workspace-level isolation between different knowledge domains
- Prefer a desktop app over a web interface
when to use both: different jobs, different tools#
Honestly, most people do not need both. But running Open WebUI for general chat and AnythingLLM for a specific document corpus is a reasonable split. They connect to the same Ollama backend, so there is no duplication on the inference side.
which one to install first#
If your primary need is general-purpose local AI chat: start with Open WebUI.
If your primary need is document Q&A against specific files: start with AnythingLLM's desktop app.
If you are not sure, Open WebUI has the shorter setup path and covers more ground.
step 2a: install Open WebUI#
Docker install (recommended for most users)#
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
This runs Open WebUI on port 3000. Access it at http://localhost:3000.
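On Docker Desktop (macOS and Windows), Docker runs inside a VM and host.docker.internal normally resolves to the host without extra configuration. The --add-host=host.docker.internal:host-gateway flag mainly matters on Linux, where that name is not defined by default. If the container still cannot reach your Ollama instance, substituting your host's actual IP address is a reasonable fallback.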
connecting Open WebUI to your local Ollama instance#
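On first launch, Open WebUI will prompt for an Ollama base URL. If Open WebUI runs directly on the host, enter http://localhost:11434 (the default Ollama port). Open WebUI fetches the list of available models automatically.
If Open WebUI is running in Docker, localhost refers to the container itself, so use http://host.docker.internal:11434 (macOS/Windows Docker Desktop, or Linux with the --add-host flag from the install command) or http://172.17.0.1:11434 (the default Linux Docker bridge). On Linux, Ollama listens only on 127.0.0.1 by default, so the bridge address works only if Ollama is started with OLLAMA_HOST set to an address the container can reach, such as 0.0.0.0.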
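When it is unclear which of these addresses works, probing them in order is quicker than guessing. The sketch below prints the first candidate that answers on Ollama's /api/version endpoint. The function name is made up for illustration, and it assumes curl is available in whatever environment needs to reach Ollama.

```shell
#!/usr/bin/env bash
# Sketch: print the first candidate URL where an Ollama server answers.
# Run it from the environment that has to reach Ollama.

first_reachable_ollama() {
  for url in "$@"; do
    # -f fails on HTTP errors; --max-time keeps dead addresses from hanging.
    if curl -fsS --max-time 2 "$url/api/version" >/dev/null 2>&1; then
      echo "$url"
      return 0
    fi
  done
  return 1
}

# Usage:
#   first_reachable_ollama \
#     http://localhost:11434 \
#     http://host.docker.internal:11434 \
#     http://172.17.0.1:11434
```

Once you know which URL responds, you can also hand it to the container at start-up by adding -e OLLAMA_BASE_URL=<url> to the docker run command. Open WebUI reads that environment variable.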
enabling document upload and RAG for local files#
Open WebUI includes a built-in RAG (retrieval-augmented generation) system. Upload documents via the paperclip icon in the chat interface and ask questions against them. It is fine for quick Q&A against a single document. For production document workflows where you need organized collections across multiple files, AnythingLLM's workspace model is more structured.
setting up multi-user access and basic auth#
On first launch, Open WebUI prompts you to create an admin account. Subsequent users can sign up via the same URL. To restrict registration so only invited users can access, go to Admin Panel > Settings > General > disable new user signups.
step 2b: install AnythingLLM#
desktop app vs server: which to use#
The desktop app is a standalone Electron app. Install it like any desktop application. Best for individual use: one person, one machine, no Docker, no configuration files.
The server version is a Docker-based deployment, better for small teams where multiple people need access from different machines. Requires a server or always-on machine.
For a first install, the desktop app is the fastest path.
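If you do go the server route, the start-up looks roughly like the sketch below. It is adapted from the project's published Docker quick start as best recalled, so verify the image name and flags against the current AnythingLLM documentation before relying on it. The function names, the container name, and the default storage path are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a Docker start for the AnythingLLM server.
# Check the flags against the current AnythingLLM docs before use.

prepare_anythingllm_storage() {
  # $1 = host directory for persistent data; creates it plus an empty .env
  mkdir -p "$1" && touch "$1/.env"
}

start_anythingllm() {
  # $1 = storage directory (placeholder default: ~/anythingllm)
  local storage="${1:-$HOME/anythingllm}"
  prepare_anythingllm_storage "$storage" || return 1
  docker run -d -p 3001:3001 \
    --cap-add SYS_ADMIN \
    -v "$storage:/app/server/storage" \
    -v "$storage/.env:/app/server/.env" \
    -e STORAGE_DIR="/app/server/storage" \
    --name anythingllm \
    mintplexlabs/anythingllm
}

# Usage: start_anythingllm   # then open http://localhost:3001
```

Because the server then runs inside a container, the same networking caveat applies as for Open WebUI: localhost inside the container is not your host, so the Ollama base URL usually needs to be http://host.docker.internal:11434 or your host's IP address.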
connecting AnythingLLM to Ollama as the LLM provider#
On first launch, AnythingLLM asks for an LLM provider. Select "Ollama," enter http://localhost:11434 as the base URL, and pick your model from the dropdown. AnythingLLM will use Ollama for all inference.
creating workspaces and uploading documents for Q&A#
Create a workspace (click "+ New Workspace"), give it a name, then upload documents via the drag-and-drop interface. AnythingLLM chunks the documents, embeds them, and stores the vectors in its local database. Ask questions in the chat interface: the agent retrieves relevant chunks and constructs an answer from your documents.
For large document sets, the workspace concept is a lot cleaner than Open WebUI's per-chat document uploads.
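To make the chunking step less abstract, here is a minimal sketch of fixed-size chunking with overlap. It is not AnythingLLM's actual chunker, which has its own splitter and settings. The function name and the word-based window are assumptions chosen to keep the example short.

```shell
#!/usr/bin/env bash
# Sketch: split text on stdin into overlapping word windows, one chunk per line.
# Overlap means a sentence cut at a boundary still appears whole in one chunk.

chunk_words() {
  # $1 = words per chunk, $2 = words shared between neighbouring chunks
  awk -v size="$1" -v overlap="$2" '
    { for (i = 1; i <= NF; i++) words[++n] = $i }
    END {
      step = size - overlap
      if (step < 1) step = 1            # guard against overlap >= size
      for (start = 1; start <= n; start += step) {
        line = ""
        for (j = start; j < start + size && j <= n; j++)
          line = line (j > start ? " " : "") words[j]
        print line
        if (start + size > n) break     # last window reached the end
      }
    }'
}

printf 'a b c d e f g' | chunk_words 4 2
# prints:
#   a b c d
#   c d e f
#   e f g
```

In a real pipeline each chunk is then turned into an embedding vector and stored. At question time, the chunks whose vectors are closest to the question's vector are the ones passed to the model.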
setting up vector storage: LanceDB (default), ChromaDB, or Weaviate#
AnythingLLM uses LanceDB by default, an embedded vector database that requires no additional setup. For most users, the default is fine.
For larger document corpora or teams that need shared vector storage, AnythingLLM supports ChromaDB and Weaviate as alternative backends. Configure in the Settings > Vector Database section.
model selection: what to run on your hardware#
under 16GB RAM: Phi-4 Mini (3.8B), Gemma 3 4B, Llama 3.2 3B#
These run on constrained hardware at usable speeds. Phi-4 Mini punches above its weight on reasoning tasks. Gemma 3 4B handles multilingual tasks well. Llama 3.2 3B is the fastest of the three and the most tested.
16-32GB RAM: Llama 3.1 8B, Mistral 7B, DeepSeek-R1 7B#
This is where most developer and IT setups land. Llama 3.1 8B is a safe general-purpose default. Mistral 7B is noticeably better on coding. DeepSeek-R1 7B is the one to try if you are working on reasoning or logic tasks.
32GB+ RAM or 16GB+ VRAM: Llama 3.3 70B (Q4), Qwen2.5 32B#
At this tier you can run models that are competitive with frontier API models on many benchmarks. Llama 3.3 70B at Q4 quantization needs roughly 40GB for the weights alone (see the rule of thumb in the hardware section), so plan on 48GB or more of RAM or unified memory rather than 32GB. It is slow without a GPU with 24GB+ VRAM, but the output quality is in a different league from the 7B models. Qwen2.5 32B is the stronger pick for multilingual or coding work.
model recommendations by task type#
| Task | Recommended model |
|---|---|
| General chat and Q&A | Llama 3.1 8B or Llama 3.3 70B |
| Code generation | Mistral 7B or DeepSeek-R1 7B |
| Document analysis / RAG | Llama 3.1 8B or Phi-4 Mini |
| Reasoning / logic | DeepSeek-R1 7B or Phi-4 |
| Multilingual | Gemma 3 4B or Qwen2.5 |
how to benchmark: tokens per second as your practical signal#
Tokens per second (tok/s) is the number that matters for day-to-day use. Below 5 tok/s feels sluggish. 15-25 tok/s is comfortable. Above 40 tok/s and you stop noticing the generation.
Test a model after pulling it:
ollama run --verbose llama3.1 "Summarize the history of the Roman Empire in 200 words."
The --verbose flag prints timing statistics after the response, including the eval rate in tokens per second. If it is below your usable threshold, try a smaller model or a lower quantization.
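If you benchmark through the REST API instead of the CLI, the non-streaming /api/generate response includes an eval_count field (tokens generated) and an eval_duration field (in nanoseconds). The sketch below converts those two numbers into tokens per second. The function name is made up for illustration, and the sample values are invented to show the arithmetic, not measured results.

```shell
#!/usr/bin/env bash
# Sketch: tokens per second from Ollama's eval_count and eval_duration fields.

tok_per_sec() {
  # $1 = eval_count (tokens generated), $2 = eval_duration in nanoseconds
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1                 # avoid dividing by zero
    printf "%.1f\n", n * 1e9 / ns
  }'
}

tok_per_sec 300 10000000000   # 300 tokens in 10 seconds, prints 30.0
```

Run the same prompt a few times and compare. The first request after loading a model includes load time in its total duration, which is why eval_duration is the better basis than wall-clock time.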
where DIY local AI hits its limits#
team access: sharing a local model across multiple users requires infrastructure#
A local Ollama instance on one machine works for one person. Sharing it across a team means configuring network access, managing concurrent connections, setting up authentication, and handling resource contention when multiple users run inference at the same time. That is manageable, but it is not a personal setup anymore.
55% of enterprise AI inference is now performed on-premises or at the edge, up from 12% in 2023 (dasroot.net / IDC data, 2026). The infrastructure required for that is meaningfully more complex than a personal Ollama install.
compliance: HIPAA and GDPR require documented controls beyond a personal install#
A personal Ollama setup on your laptop is not a HIPAA-compliant deployment. It does not matter how local it is. Compliance requires documented access controls, audit logging, data-at-rest encryption configuration, and evidence of policy enforcement. A personal install has none of that.
RAG at scale: large document corpora need proper vector infrastructure#
For personal document Q&A with tens of documents, AnythingLLM's built-in LanceDB works fine. For an organization with thousands of documents, or where search recall quality actually matters for decisions, you need dedicated vector infrastructure: proper chunking pipelines, embedding model selection, indexing strategies, and retrieval tuning. The default setup is not built for that.
reliability: local hardware lacks the uptime and redundancy of a managed deployment#
A personal setup on a workstation goes down when the machine restarts, when the power cuts, when someone closes the laptop. For workflows that need consistent availability, you need infrastructure, not a personal machine.
when to move to a managed deployment#
A personal Ollama setup is the right starting point: for individuals evaluating local AI, developers building against local models, or teams doing early exploration before they know what they need.
When the requirements grow (multiple users, compliance, integration with business tools, reliable uptime), a single-machine setup starts creating more problems than it solves.
Silverthread Labs builds managed deployments on infrastructure you control: GPU servers, model serving, access management, RAG pipelines, and integration with your existing tools, without the data going anywhere you do not want it. View the self-hosted AI service page or contact us directly to discuss your team's requirements.
frequently asked questions#
How do I run an AI model locally on my own computer?
Install Ollama from ollama.com, run ollama pull llama3.2 to download a model, then run ollama run llama3.2 to start a session. For a web interface, install Open WebUI via Docker and connect it to your Ollama instance at http://localhost:11434.
How much RAM do I need to run a 7B model locally?
16GB is the practical minimum. At Q4 quantization, a 7B model uses approximately 4-5GB of RAM, which leaves headroom on a 16GB system for the OS and other applications. Under 16GB, use a smaller model: Phi-4 Mini (3.8B) or Llama 3.2 (3B).
Is it free to run AI locally with Ollama?
The Ollama software and all the open-weight models available through it are free. You pay for the hardware, a machine that meets the RAM and optionally GPU requirements. No API costs, no subscription fees, no per-query charges.
What is the difference between Open WebUI and AnythingLLM?
Open WebUI is a general-purpose chat interface: polished, model-flexible, good for everyday chat and Q&A. AnythingLLM is organized around document workspaces, making it the better fit for document Q&A and teams that need to keep different knowledge domains separate. Both use Ollama as the inference backend.
Can I use local AI without an internet connection?
Yes. Once Ollama is installed and models are downloaded, the stack runs completely offline. No API calls during inference. The only time internet is required is when pulling new models.
What is Q4 quantization and should I use it?
Q4_K_M is a 4-bit "K-quant": the medium variant of llama.cpp's block-wise quantization scheme. It shrinks a model to roughly 30% of its 16-bit size, or about half the size of Q8, with a small quality tradeoff. For most tasks in everyday use, the quality difference is not noticeable. Start with Q4_K_M. Move to Q8 only if you notice specific quality issues and have the VRAM to spare.
