
Embeddings Generation on Low VRAM GPUs: A Practical Guide with Intel Arc A310

Published on January 17, 2026

Running AI workloads doesn't always require expensive, high-end GPUs. Embeddings generation—one of the most practical AI use cases—is perfectly suited for budget hardware. In this post, I'll show you how to generate embeddings using an Intel Arc A310 with just 4GB of VRAM.

What Are Embeddings?

Embeddings are dense vector representations of data—text, images, audio, or any other content—produced by neural networks. Instead of working with raw text like "The cat sat on the mat," we transform it into a fixed-length array of floating-point numbers that capture the semantic meaning.

"The cat sat on the mat" → [Embedding Model] → [0.12, -0.45, 0.78, 0.33, -0.91, ...]

These vectors typically have 384, 768, or 1024 dimensions. The key insight is that semantically similar content produces similar vectors. "A feline rested on the rug" would generate a vector very close to our cat example, even though the words are completely different.

How Similarity Works

Once we have embeddings, we can measure how similar two pieces of content are using simple mathematical operations. The most common approach is cosine similarity, which measures the cosine of the angle between two vectors: the dot product of the vectors divided by the product of their magnitudes. Scores range from -1 (opposite) through 0 (unrelated) to 1 (pointing in the same direction, i.e. very similar meaning).
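
As a quick illustration, here is a minimal cosine similarity helper in NumPy (the example vectors are made-up placeholders, not real model output):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for real embeddings
print(cosine_similarity([0.12, -0.45, 0.78], [0.10, -0.40, 0.80]))   # close vectors -> near 1.0
print(cosine_similarity([0.12, -0.45, 0.78], [-0.70, 0.52, -0.15]))  # dissimilar vectors -> much lower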

Why Embeddings Matter

Embeddings unlock capabilities that traditional keyword matching simply can't achieve: semantic search that matches meaning rather than exact words, content-based recommendations, and retrieval-augmented generation (RAG).

Why Low VRAM GPUs Work for Embeddings

Here's the good news: embedding models are significantly smaller than generative LLMs. While running a 7B parameter chat model requires 8-14GB of VRAM, most embedding models fit comfortably in 1-2GB. This makes them perfect for budget GPUs.

Embedding inference is also less demanding than text generation. You're doing a single forward pass through the network rather than generating tokens one by one. This means even modest hardware can achieve excellent throughput.

Intel Arc A310 Specifications

The Intel Arc A310 is an excellent choice for embeddings because of its price-to-performance ratio and growing software ecosystem. It's one of the most affordable discrete GPUs that can accelerate AI workloads.

Embedding Models That Fit in 4GB VRAM

Both models used in this guide fit comfortably: nomic-embed-text (768-dimensional vectors), served through Ollama, and sentence-transformers/all-MiniLM-L6-v2 (384-dimensional vectors), run through OpenVINO. Each is well under 1GB on disk, leaving plenty of the A310's 4GB of VRAM free for batching.

Option 1: Using Ollama for Embeddings

If you already have Ollama running (see my previous post on self-hosting Ollama), it's the quickest way to start generating embeddings. Ollama provides a simple API for embedding generation.

Pulling an Embedding Model

ollama pull nomic-embed-text

Generating Embeddings via API

import requests
import json

def get_embedding_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Generate embeddings using Ollama's API."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": model,
            "prompt": text
        }
    )
    response.raise_for_status()
    return response.json()["embedding"]

# Example usage
text = "The quick brown fox jumps over the lazy dog"
embedding = get_embedding_ollama(text)

print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Batch Processing with Ollama

def get_embeddings_batch(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for text in texts:
        embedding = get_embedding_ollama(text, model)
        embeddings.append(embedding)
    return embeddings

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
]

embeddings = get_embeddings_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
Note: Ollama's Intel Arc support is still maturing. As of early 2026, Ollama may fall back to CPU for some operations on Intel GPUs. However, Ollama does support Vulkan, which provides GPU acceleration for Intel Arc cards. Ensure your Vulkan drivers are up to date and check the Ollama documentation for current compatibility information.

Option 2: OpenVINO for Intel Arc

Intel's OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit is a comprehensive inference optimization framework specifically designed for Intel hardware. Unlike general-purpose ML frameworks, OpenVINO focuses solely on inference performance, using compiler-level optimizations, model quantization, and hardware-specific instruction sets to squeeze maximum performance from Intel CPUs and GPUs.

For Intel Arc GPUs, OpenVINO provides native support and can deliver significantly better performance than generic PyTorch or TensorFlow implementations. The toolkit automatically handles model conversion from popular formats (PyTorch, ONNX, TensorFlow) and optimizes them for your specific hardware at runtime. This means you get production-ready performance without manual model optimization.

OpenVINO shines when you need maximum inference performance on Intel hardware, production-grade deployment, or INT8 quantization to shrink models even further.

The setup is slightly more involved than Ollama, but the performance gains make it worthwhile for serious workloads.

Installing OpenVINO

Set up a Python virtual environment and install OpenVINO:

# Create a virtual environment
python -m venv openvino-env
source openvino-env/bin/activate

# Install OpenVINO and dependencies
pip install openvino openvino-tokenizers
pip install optimum[openvino]
pip install transformers torch

Verifying GPU Detection

from openvino import Core  # the older `from openvino.runtime import Core` still works but is deprecated

# Check available devices
core = Core()
devices = core.available_devices

print("Available OpenVINO devices:")
for device in devices:
    print(f"  - {device}: {core.get_property(device, 'FULL_DEVICE_NAME')}")

Example output:

Available OpenVINO devices:
  - CPU: Intel(R) Core(TM) i5-8600T CPU @ 2.30GHz
  - GPU: Intel(R) Arc(TM) A310 LP Graphics (dGPU)

Converting and Running Models with OpenVINO

OpenVINO automatically converts Hugging Face models to its optimized format on first use. The conversion happens in the background and the model is cached, so subsequent runs are faster. The OVModelForFeatureExtraction class handles this conversion and provides a drop-in replacement for standard PyTorch models.

When you specify device="GPU", OpenVINO will compile the model specifically for your Intel Arc GPU, taking advantage of hardware-specific optimizations. The first model load may take 30-60 seconds as it converts and optimizes, but this is a one-time cost.

Converted models are cached in ~/.cache/huggingface/hub/. You can change this location by setting the HF_HOME environment variable.

from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import numpy as np

def load_openvino_model(model_name: str):
    """Load a model optimized for OpenVINO."""
    # This will convert and cache the model automatically
    model = OVModelForFeatureExtraction.from_pretrained(
        model_name,
        export=True,
        device="GPU"  # Use Intel GPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def get_embedding_openvino(text: str, model, tokenizer) -> np.ndarray:
    """Generate embedding using OpenVINO model."""
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    
    outputs = model(**inputs)
    
    # Mean pooling over token embeddings
    attention_mask = inputs["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embedding = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
    
    return embedding.numpy()

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Generate embedding
text = "OpenVINO optimizes inference on Intel hardware."
embedding = get_embedding_openvino(text, model, tokenizer)

print(f"Embedding shape: {embedding.shape}")

Batch Processing with OpenVINO

def get_embeddings_batch_openvino(
    texts: list[str], 
    model, 
    tokenizer,
    batch_size: int = 32
) -> np.ndarray:
    """Generate embeddings for multiple texts in batches."""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        
        # Tokenize batch
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        
        # Get model outputs
        outputs = model(**inputs)
        
        # Mean pooling
        attention_mask = inputs["attention_mask"]
        token_embeddings = outputs.last_hidden_state
        
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(
            token_embeddings.size()
        ).float()
        
        batch_embeddings = torch.sum(
            token_embeddings * input_mask_expanded, 1
        ) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        
        all_embeddings.append(batch_embeddings.numpy())
    
    return np.vstack(all_embeddings)

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
] * 250  # 1000 documents

embeddings = get_embeddings_batch_openvino(documents, model, tokenizer, batch_size=32)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings.shape}")
When to use OpenVINO: Consider OpenVINO if you need maximum inference performance on Intel hardware, plan to deploy in production, or want INT8 quantization for even smaller models.

Troubleshooting: If OpenVINO doesn't detect your GPU, ensure the Intel GPU drivers are installed correctly and try setting device="CPU" as a fallback. You can also check GPU visibility with the clinfo command (after installing the clinfo package).
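
One way to make that fallback automatic is to reuse the device check from earlier and only request the GPU when OpenVINO actually reports one; a small sketch building on the code above:

from openvino import Core
from optimum.intel import OVModelForFeatureExtraction

def pick_device() -> str:
    """Return "GPU" if OpenVINO sees an Intel GPU, otherwise fall back to "CPU"."""
    return "GPU" if any(d.startswith("GPU") for d in Core().available_devices) else "CPU"

device = pick_device()
model = OVModelForFeatureExtraction.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    export=True,
    device=device
)
print(f"Model compiled for: {device}")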

Performance Considerations

Batch Size Tuning

With 4GB of VRAM, you need to balance batch size against memory usage. On the Intel Arc A310, a batch size of 32, as used in the examples above, is a reasonable starting point for short texts; reduce it for longer documents that approach the 512-token limit, and increase it only while VRAM usage stays comfortably under 4GB.
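
If you want to tune this empirically, here is a rough sketch that times the batch helper defined earlier at a few candidate sizes (run intel_gpu_top alongside it to confirm VRAM stays within budget):

import time

for batch_size in (8, 16, 32, 64):
    start = time.perf_counter()
    get_embeddings_batch_openvino(documents, model, tokenizer, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(documents) / elapsed:.1f} texts/sec")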

Monitoring VRAM Usage

# Monitor Intel GPU usage in real-time
intel_gpu_top

# Or use Python
import torch
# Requires a PyTorch build with Intel XPU support (PyTorch 2.4+ or intel_extension_for_pytorch);
# this reports only memory allocated by PyTorch itself, not by OpenVINO or Ollama
if torch.xpu.is_available():
    print(f"VRAM used: {torch.xpu.memory_allocated() / 1024**2:.1f} MB")
    print(f"VRAM cached: {torch.xpu.memory_reserved() / 1024**2:.1f} MB")

Expected Throughput

On the Intel Arc A310, you can expect roughly:

Actual performance depends on text length, batch size, and system configuration.

Conclusion

Embeddings generation is an excellent AI use case for low VRAM GPUs like the Intel Arc A310. With just 4GB of VRAM, you can run production-quality embedding models that power semantic search, recommendations, and RAG applications.

We covered two viable approaches: Ollama, the quickest way to get an embeddings API running if you already have it installed, and OpenVINO, which takes more setup but extracts the best performance from Intel hardware.

Related: Check out my post on Self-Hosting AI with Ollama for setting up a complete local AI environment.