Comparing Texts with Embedding Similarity: Validate LLM Output

Published on April 30, 2026

When you need to check whether two pieces of text mean the same thing — whether an LLM-generated answer matches an expected reference, or whether two phrasings of the same idea are equivalent — exact string comparison fails immediately. Embeddings do not. This post walks through building a practical text comparison toolkit in Python using Ollama for embedding generation. It assumes Ollama is already running with nomic-embed-text and qwen3.6 available.

The Problem with Exact Matching

Consider a reference answer and three candidate answers generated by an LLM:

Here is how exact match, fuzzy match (difflib.SequenceMatcher), and embedding cosine similarity score each candidate against the reference:

| Candidate | Exact match (==) | difflib ratio | Cosine similarity |
| --- | --- | --- | --- |
| Candidate A: same meaning, different wording | False | 0.47 | 0.93 |
| Candidate B: same meaning, different structure | False | 0.41 | 0.91 |
| Candidate C: different meaning, shares keywords | False | 0.52 | 0.61 |

Exact match rejects all three. Fuzzy matching actually ranks Candidate C higher than A and B because it shares more surface-level tokens with the reference ("Ollama", "tool", "that"). Only cosine similarity correctly identifies A and B as near-equivalent and C as substantially different.
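To make the two surface-level columns concrete, here is a minimal sketch of how exact match and the difflib ratio can be computed. The sentences are placeholders invented for illustration, not the ones behind the table above, so the printed ratios will differ from the table.

```python
from difflib import SequenceMatcher

def surface_scores(reference: str, candidate: str) -> dict:
    """Score a candidate with the two string-level methods from the table."""
    return {
        "exact": reference == candidate,
        # ratio() is 2*M/T: M = matched characters, T = total characters in both strings
        "difflib": round(SequenceMatcher(None, reference, candidate).ratio(), 2),
    }

# Hypothetical sentences, chosen only to illustrate the idea.
reference = "Ollama is a tool that runs large language models locally."
paraphrase = "You can run LLMs on your own machine with Ollama."
keyword_overlap = "Ollama is a tool that runs only in the cloud."

for name, text in [("paraphrase", paraphrase), ("keyword overlap", keyword_overlap)]:
    print(name, surface_scores(reference, text))
```

With these placeholder sentences, the candidate that shares a long prefix with the reference gets the higher difflib ratio even though its meaning differs, which is the failure mode described above.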

How Embedding Similarity Works

An embedding model converts a piece of text into a dense vector of floating-point numbers, typically 384, 768, or 1024 dimensions. Texts with similar meaning produce vectors that point in similar directions in that high-dimensional space. Cosine similarity measures the angle between two vectors and ranges from -1.0 to 1.0: a score of 1.0 means identical direction (treated as semantically equivalent), 0.0 means orthogonal (unrelated), and negative values mean the vectors point in opposing directions, which text embedding models rarely produce in practice.
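The angle-based definition can be checked by hand on tiny vectors. The following is a minimal, dependency-free sketch; the function name is illustrative, and the class later in this post performs the same calculation with NumPy.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The first pair shows that only direction matters: doubling a vector's length leaves the score at (approximately, up to floating-point rounding) 1.0.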

Already familiar with embeddings? If you have read the Embeddings Generation on Low VRAM GPUs post, you can skip this section — the concepts are the same. The rest of the post is self-contained either way.

Practical thresholds for general-purpose text comparison with nomic-embed-text:

| Score range | Interpretation |
| --- | --- |
| ≥ 0.90 | Very high similarity, likely equivalent meaning |
| 0.75 – 0.89 | Related, partially overlapping meaning |
| 0.50 – 0.74 | Topically related but different content |
| < 0.50 | Substantially different |

These thresholds are a starting point for general English text. The Practical Threshold Calibration section near the end of this post covers how to calibrate them for a specific domain.
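If you want these bands in code, a small lookup is enough. This is a sketch only: the function name and the short labels are made up for illustration, and the bands are treated as contiguous, so a score such as 0.895 falls into the "related" band.

```python
def interpret_score(score: float) -> str:
    """Map a cosine similarity score to the interpretation bands listed above."""
    if score >= 0.90:
        return "very high similarity"
    if score >= 0.75:
        return "related"
    if score >= 0.50:
        return "topically related"
    return "substantially different"

for s in (0.93, 0.80, 0.61, 0.30):
    print(f"{s:.2f} -> {interpret_score(s)}")
```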

Configuring the Ollama Host

All code in this post reads the Ollama host from an environment variable with a localhost fallback:

import os

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
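One caveat with this pattern: OLLAMA_HOST is often set to a bare address such as 127.0.0.1 or 127.0.0.1:11434, with no scheme, and requests rejects URLs without one. The helper below is a hypothetical sketch (resolve_ollama_url is not part of Ollama or of the classes in this post) that normalizes such values. It assumes plain HTTP and port 11434 when they are not given, and it does not handle IPv6 literals.

```python
import os

def resolve_ollama_url(raw: str = "") -> str:
    """Turn an OLLAMA_HOST-style value into a base URL usable with requests.

    Assumption: a value without a scheme is plain HTTP, and a value
    without a port uses 11434.
    """
    value = (raw or os.environ.get("OLLAMA_HOST", "")).strip().rstrip("/")
    if not value:
        return "http://localhost:11434"
    if "://" in value:
        return value  # already a full URL; leave it alone
    if ":" not in value:
        value = f"{value}:11434"  # bare hostname or IPv4 address
    return f"http://{value}"

print(resolve_ollama_url("127.0.0.1"))
print(resolve_ollama_url("http://localhost:11434/"))
```

The result can be passed as the ollama_host argument of the classes below.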

The TextComparator Class

TextComparator wraps embedding generation and cosine similarity into a reusable interface. It reads the Ollama host from the environment but accepts explicit overrides for cases where you need to target a specific instance.

import os
import requests
import numpy as np
from typing import List, Optional

class TextComparator:
    def __init__(
        self,
        ollama_host: Optional[str] = None,
        model: str = "nomic-embed-text"
    ):
        self.host = ollama_host or os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        self.model = model

    def _get_embedding(self, text: str) -> List[float]:
        response = requests.post(
            f"{self.host}/api/embeddings",
            json={"model": self.model, "prompt": text}
        )
        response.raise_for_status()
        return response.json()["embedding"]

    def _cosine_similarity(self, vec_a: List[float], vec_b: List[float]) -> float:
        a = np.array(vec_a)
        b = np.array(vec_b)
        norm_a = np.linalg.norm(a)
        norm_b = np.linalg.norm(b)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return float(np.dot(a, b) / (norm_a * norm_b))

    def compare(self, text_a: str, text_b: str) -> float:
        """Return the cosine similarity between two texts (range -1.0 to 1.0)."""
        emb_a = self._get_embedding(text_a)
        emb_b = self._get_embedding(text_b)
        return self._cosine_similarity(emb_a, emb_b)

    def is_equivalent(
        self,
        text_a: str,
        text_b: str,
        threshold: float = 0.90
    ) -> bool:
        """Return True if the two texts are semantically equivalent."""
        return self.compare(text_a, text_b) >= threshold

    def compare_many(
        self,
        reference: str,
        candidates: List[str]
    ) -> List[dict]:
        """
        Compare a reference text against multiple candidates.
        Returns candidates sorted by similarity score descending.
        """
        ref_emb = self._get_embedding(reference)
        results = []
        for candidate in candidates:
            cand_emb = self._get_embedding(candidate)
            score = self._cosine_similarity(ref_emb, cand_emb)
            results.append({"text": candidate, "score": round(score, 4)})
        return sorted(results, key=lambda x: x["score"], reverse=True)

Usage examples:

comparator = TextComparator()

# Single comparison
score = comparator.compare(
    "Ollama runs LLMs locally on your own hardware.",
    "You can self-host language models with Ollama without cloud dependencies."
)
print(f"Score: {score:.4f}")

# Boolean equivalence check
is_same = comparator.is_equivalent(
    "The model failed to load due to insufficient VRAM.",
    "Not enough GPU memory to load the model.",
    threshold=0.88
)
print(f"Equivalent: {is_same}")

Validating LLM Output Against Expected Answers

LLMValidator runs a test suite of (question, expected answer) pairs against a live LLM, scores each generated response against the expected answer using TextComparator, and returns a structured validation report.

import os
import requests
from typing import List, Optional, Tuple

class LLMValidator:
    def __init__(
        self,
        ollama_host: Optional[str] = None,
        embedding_model: str = "nomic-embed-text",
        generation_model: str = "qwen3.6",
        threshold: float = 0.85
    ):
        host = ollama_host or os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        self.host = host
        self.generation_model = generation_model
        self.threshold = threshold
        self.comparator = TextComparator(ollama_host=host, model=embedding_model)

    def _generate(self, question: str) -> str:
        response = requests.post(
            f"{self.host}/api/generate",
            json={
                "model": self.generation_model,
                "system": "You are a helpful and precise assistant for validating LLM outputs. Keep your answers short, preferably one sentence, and concise.",
                "prompt": question,
                "stream": False
            }
        )
        response.raise_for_status()
        return response.json()["response"].strip()

    def run(
        self,
        test_suite: List[Tuple[str, str]]
    ) -> List[dict]:
        """
        Run a list of (question, expected_answer) pairs through the LLM
        and score each response. Returns a validation report.
        """
        report = []
        for question, expected in test_suite:
            generated = self._generate(question)
            score = self.comparator.compare(expected, generated)
            report.append({
                "question": question,
                "expected": expected,
                "generated": generated,
                "similarity_score": round(score, 4),
                "passed": score >= self.threshold
            })
        return report

Running a 5-case test suite:

test_suite = [
    # 1. Factual question
    (
        "What does the ollama pull command do?",
        "The ollama pull command downloads a large language model from the Ollama registry and saves it locally so you can run it offline."
    ),
    # 2. Summarisation task
    (
        "Summarise in one sentence why embeddings are useful for semantic search.",
        "Embeddings convert text into dense numerical vectors that capture contextual meaning, enabling searches to match concepts and intent rather than just exact keywords."
    ),
    # 3. Yes/no question with a qualified answer
    (
        "When running LLMs locally with Ollama, does it query the Internet?",
        "No, once a model is downloaded, Ollama runs entirely offline and never queries the internet; it only connects online to download models or check for updates."
    ),
    # 4. Paraphrasing — two valid phrasings of the same answer
    (
        "What is cosine similarity used for in NLP?",
        "In NLP, cosine similarity is primarily used to measure the directional or semantic similarity between text embeddings, documents, or word vectors, enabling tasks like document matching, clustering, and semantic search."
    ),
    # 5. Intentionally wrong expected answer (regression simulation)
    (
        "What programming language is Ollama written in?",
        "Ollama is written in Rust and uses the Tokio async runtime."
    ),
]

validator = LLMValidator()
report = validator.run(test_suite)

# Print results table
print(f"{'Question':<42} {'Score':>6}  {'Result'}")
print("-" * 62)
for row in report:
    q = row["question"][:40] + ".." if len(row["question"]) > 40 else row["question"]
    status = "PASS" if row["passed"] else "FAIL"
    print(f"{q:<42} {row['similarity_score']:>6.4f}  {status}")

Expected output (scores are approximate and will vary by model version):

| Question | Similarity score | Result |
| --- | --- | --- |
| What does the ollama pull command do? | 0.8795 | PASS |
| Summarise in one sentence why embeddings... | 0.9809 | PASS |
| List exactly three benefits of running... | 0.9034 | PASS |
| What is cosine similarity used for in NLP? | 0.9858 | PASS |
| What programming language is Ollama written... | 0.8111 | FAIL |

The validator correctly flags case 5: the expected answer mentions Rust, but the model correctly answers Go, so the score against the wrong reference (0.8111) falls below the 0.85 threshold. The margin is narrow, though: a wrong reference on the same topic still scores above 0.80. Cases 2 and 4 pass comfortably, while case 1 passes with less headroom at 0.8795. Case 3 illustrates an important nuance: the model likely returned a full sentence rather than a bare three-item list, yet the semantic content is close enough to pass at 0.85.
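For larger suites, reading the table row by row gets tedious. Here is a minimal sketch of a summary step over the report structure returned by LLMValidator.run. The summarize function is a hypothetical helper, and the two sample rows reuse scores from the table above.

```python
from typing import Dict, List

def summarize(report: List[Dict]) -> Dict:
    """Aggregate a validation report into counts, a pass rate, and failed questions."""
    total = len(report)
    failed = [row["question"] for row in report if not row["passed"]]
    passed = total - len(failed)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failed_questions": failed,
    }

# Rows shaped like the dicts produced by LLMValidator.run.
sample_report = [
    {"question": "What does the ollama pull command do?",
     "similarity_score": 0.8795, "passed": True},
    {"question": "What programming language is Ollama written in?",
     "similarity_score": 0.8111, "passed": False},
]

summary = summarize(sample_report)
print(summary)
```

In a CI job, one option is to exit with a nonzero status whenever failed_questions is not empty.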

Interpreting Results

Embedding similarity scores are signals, not verdicts. Keep one caveat in particular in mind when using them in practice:

Similarity ≠ correctness. A hallucinated but fluent answer on the same topic as the reference can score 0.85 or higher. This tool is best used to detect semantic drift and phrasing equivalence — not as a factual accuracy checker.

Batch Comparison with compare_many

Some questions have multiple equally valid correct answers. Rather than picking a single reference, you can compare a generated response against all valid references and accept the result if any score clears the threshold.

comparator = TextComparator()

# Three equally valid ways to describe the same concept
valid_references = [
    "Cosine similarity returns a value between -1 and 1 indicating how similar two vectors are.",
    "It measures the angle between two vectors in a high-dimensional space — 1.0 means identical direction.",
    "Cosine similarity is a metric for comparing the orientation of two embedding vectors, not their magnitude."
]

generated = (
    "Cosine similarity computes a score from -1 to 1 that reflects "
    "how closely two vectors point in the same direction."
)

results = comparator.compare_many(generated, valid_references)

for r in results:
    print(f"{r['score']:.4f}  {r['text'][:70]}")

compare_many returns candidates sorted by score descending. To accept the generated answer if any reference matches:

best_score = results[0]["score"]
threshold = 0.85

if best_score >= threshold:
    print(f"Valid — best match score: {best_score:.4f}")
else:
    print(f"No reference matched — best score was only {best_score:.4f}")

Practical Threshold Calibration

The default thresholds (0.90 for is_equivalent, 0.85 for LLMValidator) are reasonable starting points for general English text. For a specific domain, calibrate them against a small labeled dataset.

Collect 20–30 text pairs and label each as equivalent or different. Run TextComparator.compare on each pair and inspect where the scores cluster. The threshold should sit in the gap between the two clusters.

comparator = TextComparator()

# Labeled pairs: (text_a, text_b, label)
# Domain: Ollama configuration Q&A
labeled_pairs = [
    # Equivalent pairs
    ("Set OLLAMA_HOST to 0.0.0.0 to allow remote connections.",
     "To accept connections from other machines, bind Ollama to 0.0.0.0.",
     "equivalent"),
    ("Use ollama list to see which models are downloaded.",
     "The ollama list command shows all models available locally.",
     "equivalent"),
    ("Ollama stores models in ~/.ollama/models by default.",
     "By default, downloaded models are kept in the .ollama/models directory in your home folder.",
     "equivalent"),
    ("Set OLLAMA_NUM_PARALLEL to control concurrent request handling.",
     "OLLAMA_NUM_PARALLEL determines how many requests Ollama processes simultaneously.",
     "equivalent"),
    ("Run ollama serve to start the Ollama API server.",
     "Starting the API server is done with the ollama serve command.",
     "equivalent"),
    # Different pairs
    ("Ollama supports streaming responses via the API.",
     "The ollama pull command downloads a model from the registry.",
     "different"),
    ("GPU acceleration is automatic when a compatible GPU is detected.",
     "Ollama was first released in 2023.",
     "different"),
    ("Use ollama rm to delete a model from local storage.",
     "The context window size is set with the num_ctx parameter.",
     "different"),
    ("OLLAMA_MAX_LOADED_MODELS limits how many models stay in memory.",
     "Cosine similarity measures the angle between two vectors.",
     "different"),
    ("Set the temperature to 0.1 for deterministic outputs.",
     "Ollama exposes a REST API on port 11434 by default.",
     "different"),
]

print(f"{'Label':<12} {'Score':>6}  Text A (truncated)")
print("-" * 65)
for text_a, text_b, label in labeled_pairs:
    score = comparator.compare(text_a, text_b)
    print(f"{label:<12} {score:>6.4f}  {text_a[:45]}...")

Inspect the output: equivalent pairs typically cluster above 0.88; different pairs typically fall below 0.65. Pick your domain-specific threshold inside that gap. For this kind of technical Q&A it often lands around 0.80–0.85, closer to the equivalent cluster than the exact midpoint, which keeps false positives down.
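Finding the gap can also be automated. The sketch below takes (score, label) pairs, such as those produced by the loop above, and returns the midpoint between the highest "different" score and the lowest "equivalent" score. The suggest_threshold name is hypothetical and the example scores are made up; treat the result as a starting point to adjust, not a final value.

```python
from typing import List, Optional, Tuple

def suggest_threshold(scored: List[Tuple[float, str]]) -> Optional[float]:
    """Return the midpoint of the gap between the two label clusters.

    `scored` holds (score, label) pairs with labels "equivalent" or "different".
    Returns None when a cluster is empty or the clusters overlap, since no
    single cut separates them in that case.
    """
    equivalent = [s for s, label in scored if label == "equivalent"]
    different = [s for s, label in scored if label == "different"]
    if not equivalent or not different:
        return None
    low, high = max(different), min(equivalent)
    if low >= high:
        return None
    return round((low + high) / 2, 4)

# Made-up scores for illustration only.
example = [(0.91, "equivalent"), (0.88, "equivalent"),
           (0.62, "different"), (0.55, "different")]
print(suggest_threshold(example))
```

A None result is itself useful information: it suggests the embedding model does not separate your two classes cleanly, and a single threshold will misclassify some pairs.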

Limitations

Token limit. nomic-embed-text has a 512-token context limit. nomic-embed-text-v1.5 extends this to 8192 tokens. For documents longer than ~400 words, split the text into chunks before embedding — comparing a 2000-word document as a single embedding silently truncates most of the content and produces unreliable scores.
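A minimal sketch of the chunking step, assuming a simple whitespace split: word counts are only a rough proxy for tokens, and the max_words and overlap defaults here are illustrative rather than tuned values.

```python
from typing import List

def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 700-word document, used only to show the window sizes.
document = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be embedded on its own. How to combine per-chunk scores (for example, taking the best-matching chunk or averaging) depends on the use case and is worth checking against your own data.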