Running large language models (LLMs) locally in a home lab environment offers unique advantages beyond just cost savings. It provides an invaluable learning opportunity to understand AI infrastructure from the ground up, experiment freely without API rate limits, maintain complete privacy over your data, and gain deep technical knowledge about model deployment, resource optimization, and system architecture.
In this guide, I'll walk you through my home lab setup for running Ollama across two Lenovo ThinkCentre M720q machines, each with different GPU configurations. This dual-node setup demonstrates how commodity hardware can power a practical, educational AI infrastructure.
Why Run LLMs Locally?
Before diving into the technical details, let's explore the benefits of self-hosting:
- Learning and Education: Hands-on experience with model deployment, GPU utilization, system optimization, and troubleshooting builds deep technical understanding
- Privacy and Control: Your data never leaves your infrastructure - no third-party services, no data retention policies to worry about
- Cost Efficiency: After initial hardware investment, no per-token costs or monthly subscriptions
- Experimentation Freedom: Test different models, configurations, and use cases without worrying about API costs
- Network Independence: Once models are downloaded, you can operate without internet connectivity
- Home Lab Integration: Integrate AI capabilities into your existing home automation, monitoring, and development workflows
Hardware Setup
My setup consists of two Lenovo ThinkCentre M720q tiny desktops - compact, efficient machines that pack surprising power:
Node 1 - Intel Arc Setup
- Model: Lenovo ThinkCentre M720q
- CPU: Intel i5-8600T (6 cores / 6 threads)
- RAM: 64GB DDR4
- GPU: Intel Arc A310 (4GB VRAM)
- OS: Ubuntu 24.04 LTS
- Purpose: Embedding generation and smaller models
Node 2 - NVIDIA Setup
- Model: Lenovo ThinkCentre M720q
- CPU: Intel i5-8600T (6 cores / 6 threads)
- RAM: 32GB DDR4
- GPU: NVIDIA Quadro T1000 (8GB VRAM)
- OS: Ubuntu 24.04 LTS
- Purpose: Text generation and larger models
Node 1: Intel Arc A310 Setup
The Intel Arc A310 is an excellent budget GPU for AI workloads. While its 4GB of VRAM limits model size, it's perfect for embedding models and smaller LLMs. On Intel Arc, Ollama uses the Vulkan backend for acceleration, which performs well for this class of hardware.
Step 1: System Preparation
Start with a fully updated system:
sudo apt update
sudo apt upgrade -y
Step 2: Verify GPU Detection
Check if the Intel Arc GPU is detected:
lspci | grep -i vga
You should see output similar to:
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A310] (rev 05)
Step 3: Install Intel GPU Drivers
Install the Intel GPU drivers and required dependencies:
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:kobuk-team/intel-graphics
sudo apt update
# Install Intel GPU drivers and OpenCL support
sudo apt install -y \
libze-intel-gpu1 \
libze1 \
intel-metrics-discovery \
intel-opencl-icd \
clinfo \
intel-gsc
# Install media acceleration drivers
sudo apt install -y \
intel-media-va-driver-non-free \
libmfx-gen1 \
libvpl2 \
libvpl-tools \
libva-glx2 \
va-driver-all \
vainfo
# Install development libraries
sudo apt install -y \
libze-dev \
intel-ocloc \
libze-intel-gpu-raytracing
The kobuk-team/intel-graphics PPA provides up-to-date Intel GPU drivers for Ubuntu, including support for Intel Arc discrete GPUs.
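With the drivers installed, a quick sanity check confirms that the OpenCL runtime can see the card (clinfo was installed above):
# List detected OpenCL platforms and devices - the Arc A310 should appear
clinfo -l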
Step 4: Install Ollama
Install Ollama using the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
sudo systemctl status ollama.service
You should see output indicating the service is running:
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
Active: active (running) since Wed 2025-12-31 09:30:48 UTC; 41s ago
Main PID: 1274 (ollama)
Tasks: 9 (limit: 76930)
Memory: 9.8M (peak: 21.1M)
CPU: 67ms
CGroup: /system.slice/ollama.service
└─1274 /usr/local/bin/ollama serve
Step 5: Configure Ollama for Intel Arc with Vulkan
Edit the Ollama systemd service configuration:
sudo systemctl edit --full ollama.service
Replace the contents with the following configuration:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="GGML_VK_VISIBLE_DEVICES=1"
Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_NEW_ENGINE=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
[Install]
WantedBy=default.target
Understanding the Environment Variables
Let's break down each environment variable and its purpose:
- PATH: Standard system PATH ensuring all required binaries are accessible
- GGML_VK_VISIBLE_DEVICES=1: Specifies which Vulkan device to use (device index 1). This tells the GGML library (used by Ollama) to run on the Intel Arc GPU via Vulkan. On systems with integrated graphics the iGPU usually enumerates as device 0, so index 1 selects the Arc A310 here; a quick way to confirm the index is shown after this list
- OLLAMA_VULKAN=1: Enables Vulkan backend for GPU acceleration. Essential for Intel Arc GPUs as they work best with Vulkan
- OLLAMA_NEW_ENGINE=1: Enables the new inference engine with improved performance and features
- OLLAMA_DEBUG=1: Enables debug logging for troubleshooting and monitoring GPU utilization
- OLLAMA_HOST=0.0.0.0:11434: Makes Ollama accessible from other machines on the network (binds to all interfaces). Change to 127.0.0.1:11434 if you only want local access
Of these, OLLAMA_VULKAN=1 is the crucial one: without it, Ollama won't use the Arc GPU for acceleration.
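If you're unsure which index your Arc card gets, vulkaninfo (from the vulkan-tools package, not installed above) shows how Vulkan enumerates the devices on your system - a quick check before committing a value to GGML_VK_VISIBLE_DEVICES:
# vulkaninfo ships in the vulkan-tools package
sudo apt install -y vulkan-tools
# Devices are listed as GPU0, GPU1, ... - use the Arc A310's number
# as the value for GGML_VK_VISIBLE_DEVICES
vulkaninfo --summary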
Step 6: Apply Changes and Restart
Reload systemd and restart Ollama:
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
sudo systemctl status ollama.service
Check the logs to verify GPU detection:
sudo journalctl -u ollama.service -n 50
Look for lines indicating Vulkan is enabled and the GPU is detected.
Node 2: NVIDIA Quadro T1000 Setup
The NVIDIA Quadro T1000 with 8GB of VRAM is a solid professional GPU, well suited to 7B-parameter models and, with 4-bit quantization, 13B-class models. NVIDIA has excellent Linux driver support, making the setup straightforward.
Step 1: System Preparation
Update the system:
sudo apt update
sudo apt upgrade -y
Step 2: Verify GPU Detection
Check if the NVIDIA GPU is detected:
lspci | grep -i vga
You should see output similar to:
01:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T1000 8GB] (rev a1)
Step 3: Install NVIDIA Drivers
Add the graphics drivers PPA and install the NVIDIA driver:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install -y nvidia-driver-580
Tip: run ubuntu-drivers devices to see which driver is recommended for your GPU, or use sudo ubuntu-drivers autoinstall to install it automatically.
Reboot the system to load the new drivers:
sudo reboot
After reboot, verify the driver installation:
nvidia-smi
This should display information about your GPU, including temperature, memory usage, and driver version.
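For a terser, script-friendly check, nvidia-smi can report just the fields of interest:
# Print GPU name, VRAM usage, and driver version as CSV
nvidia-smi --query-gpu=name,memory.total,memory.used,driver_version --format=csv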
Step 4: Install Ollama
Install Ollama using the official script:
curl -fsSL https://ollama.com/install.sh | sh
Step 5: Configure Ollama for NVIDIA
Edit the Ollama service configuration:
sudo systemctl edit --full ollama.service
Replace the contents with:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_NEW_ENGINE=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
[Install]
WantedBy=default.target
Understanding the Configuration
The NVIDIA configuration is simpler than Intel Arc because NVIDIA GPU support is native to Ollama:
- No GPU-specific variables needed: Ollama automatically detects NVIDIA GPUs and uses CUDA for acceleration
- OLLAMA_NEW_ENGINE=1: Enables the improved inference engine
- OLLAMA_DEBUG=1: Enables debug logging for monitoring
- OLLAMA_HOST=0.0.0.0:11434: Exposes Ollama on all network interfaces for remote access
The simplicity of this configuration reflects NVIDIA's mature CUDA support - no special flags needed, it just works.
Step 6: Apply Changes and Restart
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
sudo systemctl status ollama.service
Monitor GPU usage while running models:
watch -n 1 nvidia-smi
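Beyond live utilization, the service logs confirm whether Ollama detected the card at startup rather than silently falling back to CPU:
# Look for CUDA/GPU detection messages after the restart
sudo journalctl -u ollama.service | grep -iE "cuda|gpu"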
Testing Your Setup
Now that both nodes are configured, let's test them!
Download Models
On Node 1 (Intel Arc - 4GB VRAM), download smaller models:
ollama pull llama3.2:3b
ollama pull nomic-embed-text
On Node 2 (NVIDIA - 8GB VRAM), download larger models:
ollama pull llama3.1:8b
ollama pull mistral:7b
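Once the pulls finish, confirm what's available on each node:
# List locally available models and their sizes
ollama list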
Run Interactive Sessions
Test each node with an interactive session:
# On Node 1
ollama run llama3.2:3b
# On Node 2
ollama run llama3.1:8b
API Access from Other Machines
Since both services are bound to 0.0.0.0:11434, you can access them from other machines:
# From your workstation - access Node 1 (assuming 192.168.1.100)
curl http://192.168.1.100:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Access Node 2 (assuming 192.168.1.101)
curl http://192.168.1.101:11434/api/generate -d '{
"model": "llama3.2:8b",
"prompt": "Explain quantum computing.",
"stream": false
}'
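Since Node 1's stated role is embedding generation, it's worth exercising that path too. Here's a minimal call against Ollama's embeddings endpoint, using the same hypothetical 192.168.1.100 address as above:
# Generate an embedding with nomic-embed-text on Node 1
curl http://192.168.1.100:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'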
Performance Expectations
Based on real-world usage, here's what to expect from each node:
Node 1 - Intel Arc A310 Performance
- Llama 3.2 3B: 40-50 tokens/second
- Phi 3 3.8B: 45-55 tokens/second
- Embedding models: 200-300 embeddings/second
- Best for: Embedding generation, smaller models, high-throughput tasks
Node 2 - NVIDIA Quadro T1000 Performance
- Llama 3.1 8B: 30-35 tokens/second
- Mistral 7B: 35-40 tokens/second
- 13B-class models (Q4): 15-20 tokens/second
- Best for: Text generation, conversational AI, larger models
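These figures will vary with quantization, context length, and prompt size. To measure throughput on your own hardware, Ollama's verbose mode prints timing statistics - including the eval rate in tokens per second - after each response:
# Run a one-off prompt and print timing statistics
ollama run llama3.2:3b --verbose "Explain DNS in one paragraph."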
Monitoring and Maintenance
Systemd Service Management
# Check service status
sudo systemctl status ollama.service
# View recent logs
sudo journalctl -u ollama.service -n 100
# Follow logs in real-time
sudo journalctl -u ollama.service -f
# Restart service
sudo systemctl restart ollama.service
GPU Monitoring
On Node 1 (Intel Arc):
# Install GPU monitoring tool
sudo apt install intel-gpu-tools
# Monitor GPU usage
sudo intel_gpu_top
On Node 2 (NVIDIA):
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Detailed monitoring
nvidia-smi dmon
Troubleshooting
Intel Arc GPU Not Utilized
If the Intel Arc GPU isn't being used:
# Check Vulkan support
vulkaninfo --summary
# Verify GPU is visible
lspci | grep -i vga
# Check service logs
sudo journalctl -u ollama.service | grep -i vulkan
# Ensure OLLAMA_VULKAN=1 is set
sudo systemctl show ollama.service | grep OLLAMA_VULKAN
NVIDIA GPU Not Detected
If NVIDIA GPU isn't working:
# Check driver status
nvidia-smi
# If command not found, reinstall drivers
sudo apt install --reinstall nvidia-driver-580
# Check if GPU is visible
lspci | grep -i nvidia
# Reboot if needed
sudo reboot
Network Access Issues
If you can't access Ollama from other machines:
# Check if service is listening on correct interface
sudo ss -tlnp | grep 11434
# Test locally first
curl http://localhost:11434/api/tags
# Check firewall
sudo ufw status
sudo ufw allow 11434/tcp
Lessons Learned
Running this dual-node setup has taught me valuable lessons about AI infrastructure:
- GPU Selection Matters: The 4GB vs 8GB VRAM difference significantly impacts model choice and performance
- Vulkan vs CUDA: Intel Arc requires explicit Vulkan configuration, while NVIDIA "just works" with CUDA
- Debug Logging is Essential: OLLAMA_DEBUG=1 provides crucial insights into GPU utilization and model loading
- Network Flexibility: Exposing services on 0.0.0.0 enables flexible deployment patterns and remote access
- Resource Allocation: 64GB RAM on Node 1 allows for larger embedding batches, while 32GB on Node 2 is sufficient for generation
- Home Lab Integration: These nodes integrate seamlessly with other home lab services via simple HTTP APIs
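As a concrete example of that integration point, here's a minimal shell helper - a sketch using the same hypothetical IPs and models from earlier - that routes embedding requests to Node 1 and generation requests to Node 2:
#!/usr/bin/env bash
# Route requests to the appropriate node via Ollama's HTTP API
EMBED_NODE="http://192.168.1.100:11434"   # Node 1 - Intel Arc, embeddings
GEN_NODE="http://192.168.1.101:11434"     # Node 2 - NVIDIA, generation

embed() {
  # Naive JSON quoting - fine for a sketch, not for untrusted input
  curl -s "$EMBED_NODE/api/embeddings" \
    -d "{\"model\": \"nomic-embed-text\", \"prompt\": \"$1\"}"
}

generate() {
  curl -s "$GEN_NODE/api/generate" \
    -d "{\"model\": \"llama3.1:8b\", \"prompt\": \"$1\", \"stream\": false}"
}

# Example usage:
# embed "A sentence to turn into a vector"
# generate "Summarize the benefits of self-hosting LLMs"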
Conclusion
Building a dual-node Ollama setup demonstrates that running production-grade AI infrastructure doesn't require expensive cloud services or enterprise hardware. Two compact ThinkCentre M720q machines provide a capable, educational, and private AI platform.
The home lab approach offers unmatched learning opportunities - from driver installation and systemd configuration to GPU optimization and model selection. Every challenge solved builds deeper understanding of how AI systems work at a fundamental level.
Whether you're a student learning AI deployment, a developer building AI-powered applications, or a privacy-conscious user wanting control over your data, self-hosting Ollama provides a practical, cost-effective solution that grows with your needs.