Running AI workloads doesn't always require expensive, high-end GPUs. Embeddings generation—one of the most practical AI use cases—is perfectly suited for budget hardware. In this post, I'll show you how to generate embeddings using an Intel Arc A310 with just 4GB of VRAM.
What Are Embeddings?
Embeddings are dense vector representations of data—text, images, audio, or any other content—produced by neural networks. Instead of working with raw text like "The cat sat on the mat," we transform it into a fixed-length array of floating-point numbers that capture the semantic meaning.
These vectors typically have 384, 768, or 1024 dimensions. The key insight is that semantically similar content produces similar vectors. "A feline rested on the rug" would generate a vector very close to our cat example, even though the words are completely different.
How Similarity Works
Once we have embeddings, we can measure how similar two pieces of content are using mathematical operations. The most common approach is cosine similarity, which measures the angle between two vectors:
- A similarity of 1.0 means the vectors point in the same direction (very similar)
- A similarity of 0.0 means they're perpendicular (unrelated)
- A similarity of -1.0 means they point in opposite directions (semantically opposite)
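To make that concrete, here's cosine similarity in a few lines of NumPy. The three toy vectors below just reproduce the three cases above; real embeddings have hundreds of dimensions, but the math is identical.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors illustrating the three cases
print(cosine_similarity([1, 0, 0], [2, 0, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0  (perpendicular)
print(cosine_similarity([1, 0, 0], [-1, 0, 0]))  # -1.0 (opposite directions)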
Why Embeddings Matter
Embeddings unlock powerful capabilities that traditional keyword matching simply can't achieve:
- Semantic Search: Find documents by meaning, not just keywords. Search for "transportation vehicles" and find results about cars, bikes, and trains.
- Recommendation Systems: Suggest similar products, articles, or content based on vector proximity.
- Clustering & Classification: Group similar items together automatically without manual labeling.
- RAG (Retrieval-Augmented Generation): Power context retrieval for LLMs by finding the most relevant documents to include in prompts.
Why Low VRAM GPUs Work for Embeddings
Here's the good news: embedding models are significantly smaller than generative LLMs. While running a 7B parameter chat model requires 8-14GB of VRAM, most embedding models fit comfortably in 1-2GB. This makes them perfect for budget GPUs.
Embedding inference is also less demanding than text generation. You're doing a single forward pass through the network rather than generating tokens one by one. This means even modest hardware can achieve excellent throughput.
Intel Arc A310 Specifications
- VRAM: 4GB GDDR6
- Architecture: Intel Xe HPG (Alchemist)
- Xe Cores: 6
- Ray Tracing Units: 6
- TDP: 75W (no external power required)
The Intel Arc A310 is an excellent choice for embeddings because of its price-to-performance ratio and growing software ecosystem. It's one of the most affordable discrete GPUs that can accelerate AI workloads.
Embedding Models That Fit in 4GB VRAM
- all-MiniLM-L6-v2: 384 dimensions, ~90MB (excellent for getting started)
- bge-small-en-v1.5: 384 dimensions, ~130MB (strong performance)
- nomic-embed-text-v1.5: 768 dimensions, ~275MB (higher quality)
- mxbai-embed-large-v1: 1024 dimensions, ~670MB (best quality that fits)
Option 1: Using Ollama for Embeddings
If you already have Ollama running (see my previous post on self-hosting Ollama), it's the quickest way to start generating embeddings. Ollama provides a simple API for embedding generation.
Pulling an Embedding Model
ollama pull nomic-embed-text
Generating Embeddings via API
import requests

def get_embedding_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Generate embeddings using Ollama's API."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": model,
            "prompt": text
        }
    )
    response.raise_for_status()
    return response.json()["embedding"]

# Example usage
text = "The quick brown fox jumps over the lazy dog"
embedding = get_embedding_ollama(text)
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
Batch Processing with Ollama
def get_embeddings_batch(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for text in texts:
        embedding = get_embedding_ollama(text, model)
        embeddings.append(embedding)
    return embeddings

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
]
embeddings = get_embeddings_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
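With those document embeddings in hand, the semantic search use case from earlier is just cosine similarity between a query embedding and each document embedding. Here's a minimal sketch that reuses the documents and embeddings variables from the batch example above; the query string is only an illustration.

import numpy as np

# Embed a search query with the same model used for the documents
query = "How do computers understand human language?"
query_embedding = np.array(get_embedding_ollama(query))

# Cosine similarity of the query against every document embedding
doc_matrix = np.array(embeddings)
scores = doc_matrix @ query_embedding / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_embedding)
)

# Print documents from most to least similar
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")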
Option 2: OpenVINO for Intel Arc
Intel's OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit is a comprehensive inference optimization framework specifically designed for Intel hardware. Unlike general-purpose ML frameworks, OpenVINO focuses solely on inference performance, using compiler-level optimizations, model quantization, and hardware-specific instruction sets to squeeze maximum performance from Intel CPUs and GPUs.
For Intel Arc GPUs, OpenVINO provides native support and can deliver significantly better performance than generic PyTorch or TensorFlow implementations. The toolkit automatically handles model conversion from popular formats (PyTorch, ONNX, TensorFlow) and optimizes them for your specific hardware at runtime. This means you get production-ready performance without manual model optimization.
OpenVINO shines when you need:
- Maximum throughput: Optimized inference paths can be 2-3x faster than generic implementations
- Production deployment: Stable, well-tested runtime suitable for long-running applications
- Model quantization: Built-in INT8 quantization for even smaller models and faster inference
- Cross-hardware compatibility: Same code runs on Intel CPUs, GPUs, and VPUs
The setup is slightly more involved than Ollama, but the performance gains make it worthwhile for serious workloads.
Installing OpenVINO
Set up a Python virtual environment and install OpenVINO:
# Create a virtual environment
python -m venv openvino-env
source openvino-env/bin/activate
# Install OpenVINO and dependencies
pip install openvino openvino-tokenizers
pip install optimum[openvino]
pip install transformers torch
Verifying GPU Detection
from openvino.runtime import Core
# Check available devices
core = Core()
devices = core.available_devices
print("Available OpenVINO devices:")
for device in devices:
    print(f" - {device}: {core.get_property(device, 'FULL_DEVICE_NAME')}")
Example output:
Available OpenVINO devices:
- CPU: Intel(R) Core(TM) i5-8600T CPU @ 2.30GHz
- GPU: Intel(R) Arc(TM) A310 LP Graphics (dGPU)
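If the GPU doesn't appear in that list, you can still proceed on the CPU. Here's a small convenience snippet of my own (not part of OpenVINO itself) that prefers the Arc GPU when OpenVINO reports one and otherwise falls back to the CPU:

from openvino.runtime import Core

core = Core()

# Pick the first GPU device OpenVINO reports (e.g. "GPU" or "GPU.0"), else fall back to CPU
device = next((d for d in core.available_devices if d.startswith("GPU")), "CPU")
print(f"Using OpenVINO device: {device}")

The resulting device string can be passed to the model-loading code in the next section.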
Converting and Running Models with OpenVINO
OpenVINO automatically converts Hugging Face models to its optimized format on first use. The conversion happens in the background and the model is cached, so subsequent runs are faster. The OVModelForFeatureExtraction class handles this conversion and provides a drop-in replacement for standard PyTorch models.
When you specify device="GPU", OpenVINO will compile the model specifically for your Intel Arc GPU, taking advantage of hardware-specific optimizations. The first model load may take 30-60 seconds as it converts and optimizes, but this is a one-time cost.
Converted models are cached in ~/.cache/huggingface/hub/. You can change this location by setting the HF_HOME environment variable.
from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import numpy as np

def load_openvino_model(model_name: str):
    """Load a model optimized for OpenVINO."""
    # This will convert and cache the model automatically
    model = OVModelForFeatureExtraction.from_pretrained(
        model_name,
        export=True,
        device="GPU"  # Use Intel GPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def get_embedding_openvino(text: str, model, tokenizer) -> np.ndarray:
    """Generate embedding using OpenVINO model."""
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    outputs = model(**inputs)
    # Mean pooling over token embeddings
    attention_mask = inputs["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embedding = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
    return embedding.numpy()

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Generate embedding
text = "OpenVINO optimizes inference on Intel hardware."
embedding = get_embedding_openvino(text, model, tokenizer)
print(f"Embedding shape: {embedding.shape}")
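If you want full control over where the converted model lives, or to skip the export step entirely on later runs, you can save the converted model to a local directory and load it from there. A sketch, assuming your optimum-intel version supports the usual save_pretrained/from_pretrained round-trip; the directory name is just an example:

# Save the converted model and tokenizer to a local directory (example path)
export_dir = "./all-MiniLM-L6-v2-openvino"
model.save_pretrained(export_dir)
tokenizer.save_pretrained(export_dir)

# On later runs, load the converted model directly instead of exporting again
model = OVModelForFeatureExtraction.from_pretrained(export_dir, device="GPU")
tokenizer = AutoTokenizer.from_pretrained(export_dir)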
Batch Processing with OpenVINO
def get_embeddings_batch_openvino(
    texts: list[str],
    model,
    tokenizer,
    batch_size: int = 32
) -> np.ndarray:
    """Generate embeddings for multiple texts in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Tokenize batch
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        # Get model outputs
        outputs = model(**inputs)
        # Mean pooling
        attention_mask = inputs["attention_mask"]
        token_embeddings = outputs.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(
            token_embeddings.size()
        ).float()
        batch_embeddings = torch.sum(
            token_embeddings * input_mask_expanded, 1
        ) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        all_embeddings.append(batch_embeddings.numpy())
    return np.vstack(all_embeddings)

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
] * 250  # 1000 documents

embeddings = get_embeddings_batch_openvino(documents, model, tokenizer, batch_size=32)
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings.shape}")
If the GPU isn't detected, you can use device="CPU" as a fallback. You can also check GPU status with the clinfo command after installing the clinfo package.
Performance Considerations
Batch Size Tuning
With 4GB of VRAM, you need to balance batch size against memory usage. Here are starting points for the Intel Arc A310:
- all-MiniLM-L6-v2: batch size 32-64
- bge-small-en-v1.5: batch size 24-32
- nomic-embed-text-v1.5: batch size 8-16
- mxbai-embed-large-v1: batch size 4-8
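These numbers are only starting points, so it's worth measuring on your own hardware. Here's a rough sketch that times the get_embeddings_batch_openvino function from earlier at a few batch sizes and reports embeddings per second; the test text and batch sizes are arbitrary, and results vary with text length and model.

import time

test_docs = ["Machine learning is a subset of artificial intelligence."] * 512

for batch_size in (8, 16, 32, 64):
    start = time.perf_counter()
    get_embeddings_batch_openvino(test_docs, model, tokenizer, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(test_docs) / elapsed:.1f} embeddings/sec")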
Monitoring VRAM Usage
# Monitor Intel GPU usage in real-time
intel_gpu_top
# Or use Python
import torch

# Requires a PyTorch build with XPU (Intel GPU) support
if torch.xpu.is_available():
    print(f"VRAM used: {torch.xpu.memory_allocated() / 1024**2:.1f} MB")
    print(f"VRAM cached: {torch.xpu.memory_reserved() / 1024**2:.1f} MB")
Expected Throughput
On the Intel Arc A310, you can expect roughly:
- 100-200 embeddings/second with all-MiniLM-L6-v2
- 50-100 embeddings/second with bge-small-en-v1.5
- 20-50 embeddings/second with larger models
Actual performance depends on text length, batch size, and system configuration.
Conclusion
Embeddings generation is an excellent AI use case for low VRAM GPUs like the Intel Arc A310. With just 4GB of VRAM, you can run production-quality embedding models that power semantic search, recommendations, and RAG applications.
We covered two viable approaches:
- Ollama for the quickest start with minimal setup
- OpenVINO for maximum performance on Intel hardware