Running AI workloads doesn't always require expensive, high-end GPUs. Embeddings generation—one of the most practical AI use cases—is perfectly suited for budget hardware. In this post, I'll show you how to generate embeddings using an Intel Arc A310 with just 4GB of VRAM.
What Are Embeddings?
Embeddings are dense vector representations of data—text, images, audio, or any other content—produced by neural networks. Instead of working with raw text like "The cat sat on the mat," we transform it into a fixed-length array of floating-point numbers that capture the semantic meaning.
These vectors typically have 384, 768, or 1024 dimensions. The key insight is that semantically similar content produces similar vectors. "A feline rested on the rug" would generate a vector very close to our cat example, even though the words are completely different.
How Similarity Works
Once we have embeddings, we can measure how similar two pieces of content are using mathematical operations. The most common approach is cosine similarity, which measures the angle between two vectors:
- A similarity of 1.0 means the vectors point in the same direction (very similar)
- A similarity of 0.0 means they're perpendicular (unrelated)
- A similarity of -1.0 means they point in opposite directions (semantically opposite)
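To make that concrete, here's cosine similarity in a few lines of NumPy. The three toy vectors below just reproduce the three cases above; real embeddings have hundreds of dimensions, but the math is identical.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors illustrating the three cases
print(cosine_similarity([1, 0, 0], [2, 0, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0  (perpendicular)
print(cosine_similarity([1, 0, 0], [-1, 0, 0]))  # -1.0 (opposite directions)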
Why Embeddings Matter
Embeddings unlock powerful capabilities that traditional keyword matching simply can't achieve:
- Semantic Search: Find documents by meaning, not just keywords. Search for "transportation vehicles" and find results about cars, bikes, and trains.
- Recommendation Systems: Suggest similar products, articles, or content based on vector proximity.
- Clustering & Classification: Group similar items together automatically without manual labeling.
- RAG (Retrieval-Augmented Generation): Power context retrieval for LLMs by finding the most relevant documents to include in prompts.
Why Low VRAM GPUs Work for Embeddings
Here's the good news: embedding models are significantly smaller than generative LLMs. While running a 7B parameter chat model requires 8-14GB of VRAM, most embedding models fit comfortably in 1-2GB. This makes them perfect for budget GPUs.
Embedding inference is also less demanding than text generation. You're doing a single forward pass through the network rather than generating tokens one by one. This means even modest hardware can achieve excellent throughput.
Intel Arc A310 Specifications
- VRAM: 4GB GDDR6
- Architecture: Intel Xe HPG (Alchemist)
- Xe Cores: 6
- Ray Tracing Units: 6
- TDP: 75W (no external power required)
The Intel Arc A310 is an excellent choice for embeddings because of its price-to-performance ratio and growing software ecosystem. It's one of the most affordable discrete GPUs that can accelerate AI workloads.
Embedding Models That Fit in 4GB VRAM
- all-MiniLM-L6-v2: 384 dimensions, ~90MB (excellent for getting started)
- bge-small-en-v1.5: 384 dimensions, ~130MB (strong performance)
- nomic-embed-text-v1.5: 768 dimensions, ~275MB (higher quality)
- mxbai-embed-large-v1: 1024 dimensions, ~670MB (best quality that fits)
Option 1: Using Ollama for Embeddings
If you already have Ollama running (see my previous post on self-hosting Ollama), it's the quickest way to start generating embeddings. Ollama provides a simple API for embedding generation.
Pulling an Embedding Model
ollama pull nomic-embed-text
Generating Embeddings via API
import requests

def get_embedding_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Generate embeddings using Ollama's API."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": model,
            "prompt": text
        }
    )
    response.raise_for_status()
    return response.json()["embedding"]

# Example usage
text = "The quick brown fox jumps over the lazy dog"
embedding = get_embedding_ollama(text)
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
Batch Processing with Ollama
def get_embeddings_batch(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generate embeddings for multiple texts."""
    embeddings = []
    for text in texts:
        embedding = get_embedding_ollama(text, model)
        embeddings.append(embedding)
    return embeddings

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
]
embeddings = get_embeddings_batch(documents)
print(f"Generated {len(embeddings)} embeddings")
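With those document embeddings in hand, the semantic search use case from earlier is just cosine similarity between a query embedding and each document embedding. Here's a minimal sketch that reuses the documents and embeddings variables from the batch example above; the query string is only an illustration.

import numpy as np

# Embed a search query with the same model used for the documents
query = "How do computers understand human language?"
query_embedding = np.array(get_embedding_ollama(query))

# Cosine similarity of the query against every document embedding
doc_matrix = np.array(embeddings)
scores = doc_matrix @ query_embedding / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_embedding)
)

# Print documents from most to least similar
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")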
Option 2: OpenVINO for Intel Arc
Intel's OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit is a comprehensive inference optimization framework specifically designed for Intel hardware. Unlike general-purpose ML frameworks, OpenVINO focuses solely on inference performance, using compiler-level optimizations, model quantization, and hardware-specific instruction sets to squeeze maximum performance from Intel CPUs and GPUs.
For Intel Arc GPUs, OpenVINO provides native support and can deliver significantly better performance than generic PyTorch or TensorFlow implementations. The toolkit automatically handles model conversion from popular formats (PyTorch, ONNX, TensorFlow) and optimizes them for your specific hardware at runtime. This means you get production-ready performance without manual model optimization.
OpenVINO shines when you need:
- Maximum throughput: Optimized inference paths can be 2-3x faster than generic implementations
- Production deployment: Stable, well-tested runtime suitable for long-running applications
- Model quantization: Built-in INT8 quantization for even smaller models and faster inference
- Cross-hardware compatibility: Same code runs on Intel CPUs, GPUs, and VPUs
The setup is slightly more involved than Ollama, but the performance gains make it worthwhile for serious workloads.
Installing OpenVINO
Set up a Python virtual environment and install OpenVINO:
# Create a virtual environment
python -m venv openvino-env
source openvino-env/bin/activate
# Install OpenVINO and dependencies
pip install openvino openvino-tokenizers
pip install optimum[openvino]
pip install transformers torch
Verifying GPU Detection
from openvino.runtime import Core
# Check available devices
core = Core()
devices = core.available_devices
print("Available OpenVINO devices:")
for device in devices:
    print(f" - {device}: {core.get_property(device, 'FULL_DEVICE_NAME')}")
Example output:
Available OpenVINO devices:
- CPU: Intel(R) Core(TM) i5-8600T CPU @ 2.30GHz
- GPU: Intel(R) Arc(TM) A310 LP Graphics (dGPU)
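If the GPU doesn't appear in that list, you can still proceed on the CPU. Here's a small convenience snippet of my own (not part of OpenVINO itself) that prefers the Arc GPU when OpenVINO reports one and otherwise falls back to the CPU:

from openvino.runtime import Core

core = Core()

# Pick the first GPU device OpenVINO reports (e.g. "GPU" or "GPU.0"), else fall back to CPU
device = next((d for d in core.available_devices if d.startswith("GPU")), "CPU")
print(f"Using OpenVINO device: {device}")

The resulting device string can be passed to the model-loading code in the next section.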
Converting and Running Models with OpenVINO
OpenVINO automatically converts Hugging Face models to its optimized format on first use. The conversion happens in the background and the model is cached, so subsequent runs are faster. The OVModelForFeatureExtraction class handles this conversion and provides a drop-in replacement for standard PyTorch models.
When you specify device="GPU", OpenVINO will compile the model specifically for your Intel Arc GPU, taking advantage of hardware-specific optimizations. The first model load may take 30-60 seconds as it converts and optimizes, but this is a one-time cost.
Converted models are cached in ~/.cache/huggingface/hub/. You can change this location by setting the HF_HOME environment variable.
from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import numpy as np

def load_openvino_model(model_name: str):
    """Load a model optimized for OpenVINO."""
    # This will convert and cache the model automatically
    model = OVModelForFeatureExtraction.from_pretrained(
        model_name,
        export=True,
        device="GPU"  # Use Intel GPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def get_embedding_openvino(text: str, model, tokenizer) -> np.ndarray:
    """Generate embedding using OpenVINO model."""
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    outputs = model(**inputs)
    # Mean pooling over token embeddings
    attention_mask = inputs["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embedding = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
    return embedding.numpy()

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Generate embedding
text = "OpenVINO optimizes inference on Intel hardware."
embedding = get_embedding_openvino(text, model, tokenizer)
print(f"Embedding shape: {embedding.shape}")
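If you want full control over where the converted model lives, or to skip the export step entirely on later runs, you can save the converted model to a local directory and load it from there. A sketch, assuming your optimum-intel version supports the usual save_pretrained/from_pretrained round-trip; the directory name is just an example:

# Save the converted model and tokenizer to a local directory (example path)
export_dir = "./all-MiniLM-L6-v2-openvino"
model.save_pretrained(export_dir)
tokenizer.save_pretrained(export_dir)

# On later runs, load the converted model directly instead of exporting again
model = OVModelForFeatureExtraction.from_pretrained(export_dir, device="GPU")
tokenizer = AutoTokenizer.from_pretrained(export_dir)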
Batch Processing with OpenVINO
def get_embeddings_batch_openvino(
    texts: list[str],
    model,
    tokenizer,
    batch_size: int = 32
) -> np.ndarray:
    """Generate embeddings for multiple texts in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Tokenize batch
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        # Get model outputs
        outputs = model(**inputs)
        # Mean pooling
        attention_mask = inputs["attention_mask"]
        token_embeddings = outputs.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(
            token_embeddings.size()
        ).float()
        batch_embeddings = torch.sum(
            token_embeddings * input_mask_expanded, 1
        ) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        all_embeddings.append(batch_embeddings.numpy())
    return np.vstack(all_embeddings)

# Load model
model, tokenizer = load_openvino_model("sentence-transformers/all-MiniLM-L6-v2")

# Process multiple documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret images."
] * 250  # 1000 documents

embeddings = get_embeddings_batch_openvino(documents, model, tokenizer, batch_size=32)
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings.shape}")
If the GPU isn't detected, you can use device="CPU" as a fallback. You can also check GPU status with the clinfo command after installing the clinfo package.
Performance Considerations
Batch Size Tuning
With 4GB of VRAM, you need to balance batch size against memory usage. Here are starting points for the Intel Arc A310:
- all-MiniLM-L6-v2: batch size 32-64
- bge-small-en-v1.5: batch size 24-32
- nomic-embed-text-v1.5: batch size 8-16
- mxbai-embed-large-v1: batch size 4-8
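These numbers are only starting points, so it's worth measuring on your own hardware. Here's a rough sketch that times the get_embeddings_batch_openvino function from earlier at a few batch sizes and reports embeddings per second; the test text and batch sizes are arbitrary, and results vary with text length and model.

import time

test_docs = ["Machine learning is a subset of artificial intelligence."] * 512

for batch_size in (8, 16, 32, 64):
    start = time.perf_counter()
    get_embeddings_batch_openvino(test_docs, model, tokenizer, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(test_docs) / elapsed:.1f} embeddings/sec")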
Monitoring VRAM Usage
# Monitor Intel GPU usage in real-time
intel_gpu_top
# Or use Python
import torch

# Requires a PyTorch build with XPU (Intel GPU) support
if torch.xpu.is_available():
    print(f"VRAM used: {torch.xpu.memory_allocated() / 1024**2:.1f} MB")
    print(f"VRAM cached: {torch.xpu.memory_reserved() / 1024**2:.1f} MB")
Expected Throughput
On the Intel Arc A310, you can expect roughly:
- 100-200 embeddings/second with all-MiniLM-L6-v2
- 50-100 embeddings/second with bge-small-en-v1.5
- 20-50 embeddings/second with larger models
Actual performance depends on text length, batch size, and system configuration.
Conclusion
Embeddings generation is an excellent AI use case for low VRAM GPUs like the Intel Arc A310. With just 4GB of VRAM, you can run production-quality embedding models that power semantic search, recommendations, and RAG applications.
We covered two viable approaches:
- Ollama for the quickest start with minimal setup
- OpenVINO for maximum performance on Intel hardware