Comparing Texts with Embedding Similarity: Validate LLM Output

When you need to check whether two pieces of text mean the same thing — whether an LLM-generated answer matches an expected reference, or whether two phrasings of the same idea are equivalent — exact string comparison fails immediately. Embeddings do not. This post walks through building a practical text comparison toolkit in Python using Ollama for embedding generation. It assumes Ollama is already running with nomic-embed-text and qwen3.6 available.

The Problem with Exact Matching

Consider a reference answer and three candidate answers generated by an LLM:

Reference: "Ollama is a tool for running large language models locally on your own hardware."
Candidate A: "Ollama lets you run LLMs on your own machine without sending data to the cloud." (semantically identical, different wording)
Candidate B: "You can use Ollama to self-host language models and keep your data private." (semantically identical, different structure)
Candidate C: "Ollama is a command-line tool written in Go that was released in 2023." (factually different, shares topic keywords)

Here is how exact match, fuzzy match (difflib.SequenceMatcher), and embedding cosine similarity score each candidate against the reference:

Candidate	Exact match (==)	difflib ratio	Cosine similarity
Candidate A — same meaning, different wording	False	0.47	0.93
Candidate B — same meaning, different structure	False	0.41	0.91
Candidate C — different meaning, shares keywords	False	0.52	0.61

Exact match rejects all three. Fuzzy matching ranks Candidate C above A and B because difflib compares character sequences, and C shares the opening "Ollama is a" and the word "tool" with the reference. Only cosine similarity identifies A and B as near-equivalent and C as related in topic but different in content.
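The two surface-level columns can be reproduced with the standard library alone. The sketch below is an illustration, not the script that produced the table: it assumes difflib.SequenceMatcher is applied to the raw strings with no normalisation, so the ratios you get may differ from the figures above. The helper name surface_scores is made up for this example.

```python
from difflib import SequenceMatcher

REFERENCE = "Ollama is a tool for running large language models locally on your own hardware."

CANDIDATES = {
    "A": "Ollama lets you run LLMs on your own machine without sending data to the cloud.",
    "B": "You can use Ollama to self-host language models and keep your data private.",
    "C": "Ollama is a command-line tool written in Go that was released in 2023.",
}


def surface_scores(reference: str, candidate: str) -> dict:
    """Score a candidate with the two surface-level methods from the table."""
    return {
        "exact": reference == candidate,
        # ratio() is 2*M/T, where M is the number of matched characters
        # and T is the total number of characters in both strings.
        "difflib": round(SequenceMatcher(None, reference, candidate).ratio(), 2),
    }


if __name__ == "__main__":
    for name, text in CANDIDATES.items():
        print(name, surface_scores(REFERENCE, text))
```

Both methods only see characters, which is why a sentence that reuses the reference's opening words can outscore a faithful paraphrase.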

How Embedding Similarity Works

An embedding model converts a piece of text into a dense vector of floating-point numbers — typically 384, 768, or 1024 dimensions. Texts with similar meaning produce vectors that point in similar directions in that high-dimensional space. Cosine similarity measures the angle between two vectors: a score of 1.0 means identical direction (semantically equivalent), 0.0 means orthogonal (unrelated), and negative values indicate opposing meaning.
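To make those three reference points concrete, here is a small worked example in plain Python. The two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction, different length -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```

The first case scores 1.0 even though one vector is twice as long as the other: cosine similarity looks only at direction, not magnitude.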

Already familiar with embeddings? If you have read the Embeddings Generation on Low VRAM GPUs post, you can skip this section — the concepts are the same. The rest of the post is self-contained either way.

Practical thresholds for general-purpose text comparison with nomic-embed-text:

Score range	Interpretation
≥ 0.90	Very high similarity — likely equivalent meaning
0.75 – 0.89	Related — partially overlapping meaning
0.50 – 0.74	Topically related but different content
< 0.50	Substantially different

These thresholds are a starting point for general English text. The Practical Threshold Calibration section below covers how to calibrate them for a specific domain.
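If you want the bands from the table as labels in a report, a lookup like the following is enough. This is a minimal sketch: the function name interpret and the short labels are my own, and a score that falls between two printed bands (0.895, for example) is assigned to the lower band.

```python
from typing import List, Tuple

# (lower bound, label) pairs taken from the table above, highest band first
BANDS: List[Tuple[float, str]] = [
    (0.90, "very high similarity"),
    (0.75, "related"),
    (0.50, "topically related"),
]


def interpret(score: float) -> str:
    """Return the label of the first band whose lower bound the score reaches."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "substantially different"


print(interpret(0.93))  # very high similarity
print(interpret(0.61))  # topically related
```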

Configuring the Ollama Host

All code in this post reads the Ollama host from an environment variable with a localhost fallback:

import os

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
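One thing to watch for: OLLAMA_HOST is also the variable the Ollama server reads for its bind address, so it is often set to a bare value such as 0.0.0.0 or 0.0.0.0:11434. A value without a scheme cannot be used directly as a base URL for requests. The helper below is a hypothetical addition, not part of the classes in this post, that normalises such values under the assumption that a bare host means plain HTTP on the default port.

```python
import os
from typing import Optional

DEFAULT_HOST = "http://localhost:11434"


def resolve_ollama_host(raw: Optional[str] = None) -> str:
    """Hypothetical helper: turn an OLLAMA_HOST-style value into a base URL."""
    value = os.environ.get("OLLAMA_HOST", "") if raw is None else raw
    value = value.strip().rstrip("/")
    if not value:
        return DEFAULT_HOST
    if "://" in value:
        return value  # already a full URL: keep it as given
    # Bare "host" or "host:port" (IPv6 literals are not handled in this sketch)
    host, _, port = value.partition(":")
    return f"http://{host or 'localhost'}:{port or '11434'}"


print(resolve_ollama_host("0.0.0.0"))       # http://0.0.0.0:11434
print(resolve_ollama_host("gpu-box:8080"))  # http://gpu-box:8080
```

Note that 0.0.0.0 is a bind address; a client on another machine still needs the server's real hostname or IP.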

The TextComparator Class

TextComparator wraps embedding generation and cosine similarity into a reusable interface. It reads the Ollama host from the environment but accepts explicit overrides for cases where you need to target a specific instance.

import os
import requests
import numpy as np
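from typing import List, Optional

class TextComparator:
    def __init__(
        self,
        ollama_host: Optional[str] = None,
        model: str = "nomic-embed-text"
    ):
        # An explicit argument wins; otherwise fall back to the environment, then localhost
        self.host = ollama_host or os.environ.get("OLLAMA_HOST", "http://localhost:11434")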
        self.model = model

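    def _get_embedding(self, text: str) -> List[float]:
        # One request per text: /api/embeddings takes a single "prompt"
        # and returns a single "embedding" vector.
        response = requests.post(
            f"{self.host}/api/embeddings",
            json={"model": self.model, "prompt": text},
            timeout=60,  # fail instead of hanging if the server is unreachable
        )
        response.raise_for_status()
        return response.json()["embedding"]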

    def _cosine_similarity(self, vec_a: List[float], vec_b: List[float]) -> float:
        a = np.array(vec_a)
        b = np.array(vec_b)
        norm_a = np.linalg.norm(a)
        norm_b = np.linalg.norm(b)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return float(np.dot(a, b) / (norm_a * norm_b))

    def compare(self, text_a: str, text_b: str) -> float:
        """Return the cosine similarity between two texts (-1.0 to 1.0; natural-language pairs usually score above 0)."""
        emb_a = self._get_embedding(text_a)
        emb_b = self._get_embedding(text_b)
        return self._cosine_similarity(emb_a, emb_b)

    def is_equivalent(
        self,
        text_a: str,
        text_b: str,
        threshold: float = 0.90
    ) -> bool:
        """Return True if the two texts are semantically equivalent."""
        return self.compare(text_a, text_b) >= threshold

    def compare_many(
        self,
        reference: str,
        candidates: List[str]
    ) -> List[dict]:
        """
        Compare a reference text against multiple candidates.
        Returns candidates sorted by similarity score descending.
        """
        ref_emb = self._get_embedding(reference)
        results = []
        for candidate in candidates:
            cand_emb = self._get_embedding(candidate)
            score = self._cosine_similarity(ref_emb, cand_emb)
            results.append({"text": candidate, "score": round(score, 4)})
        return sorted(results, key=lambda x: x["score"], reverse=True)

Usage examples:

comparator = TextComparator()

# Single comparison
score = comparator.compare(
    "Ollama runs LLMs locally on your own hardware.",
    "You can self-host language models with Ollama without cloud dependencies."
)
print(f"Score: {score:.4f}")

# Boolean equivalence check
is_same = comparator.is_equivalent(
    "The model failed to load due to insufficient VRAM.",
    "Not enough GPU memory to load the model.",
    threshold=0.88
)
print(f"Equivalent: {is_same}")

Validating LLM Output Against Expected Answers

LLMValidator runs a test suite of (question, expected answer) pairs against a live LLM, scores each generated response against the expected answer using TextComparator, and returns a structured validation report.

import os
import requests
from typing import List, Optional, Tuple

class LLMValidator:
    def __init__(
        self,
        ollama_host: Optional[str] = None,
        embedding_model: str = "nomic-embed-text",
        generation_model: str = "qwen3.6",
        threshold: float = 0.85
    ):
        host = ollama_host or os.environ.get("OLLAMA_HOST", "http://localhost:11434")
        self.host = host
        self.generation_model = generation_model
        self.threshold = threshold
        self.comparator = TextComparator(ollama_host=host, model=embedding_model)

    def _generate(self, question: str) -> str:
        response = requests.post(
            f"{self.host}/api/generate",
            json={
                "model": self.generation_model,
                "system": "You are a helpful and precise assistant for validating LLM outputs. Keep your answers short, preferably one sentence, and concise.",
                "prompt": question,
                "stream": False  # return one JSON object instead of a token stream
            },
            timeout=300,  # generation can be slow on small GPUs; adjust as needed
        )
        response.raise_for_status()
        return response.json()["response"].strip()

    def run(
        self,
        test_suite: List[Tuple[str, str]]
    ) -> List[dict]:
        """
        Run a list of (question, expected_answer) pairs through the LLM
        and score each response. Returns a validation report.
        """
        report = []
        for question, expected in test_suite:
            generated = self._generate(question)
            score = self.comparator.compare(expected, generated)
            report.append({
                "question": question,
                "expected": expected,
                "generated": generated,
                "similarity_score": round(score, 4),
                "passed": score >= self.threshold
            })
        return report

Running a 5-case test suite:

test_suite = [
    # 1. Factual question
    (
        "What does the ollama pull command do?",
        "The ollama pull command downloads a large language model from the Ollama registry and saves it locally so you can run it offline."
    ),
    # 2. Summarisation task
    (
        "Summarise in one sentence why embeddings are useful for semantic search.",
        "Embeddings convert text into dense numerical vectors that capture contextual meaning, enabling searches to match concepts and intent rather than just exact keywords."
    ),
    # 3. Yes/no question whose answer needs a qualification
    (
        "When running LLMs locally with Ollama, does it query the Internet?",
        "No, once a model is downloaded, Ollama runs entirely offline and never queries the internet; it only connects online to download models or check for updates."
    ),
    # 4. Paraphrasing — two valid phrasings of the same answer
    (
        "What is cosine similarity used for in NLP?",
        "In NLP, cosine similarity is primarily used to measure the directional or semantic similarity between text embeddings, documents, or word vectors, enabling tasks like document matching, clustering, and semantic search."
    ),
    # 5. Intentionally wrong expected answer (regression simulation)
    (
        "What programming language is Ollama written in?",
        "Ollama is written in Rust and uses the Tokio async runtime."
    ),
]

validator = LLMValidator()
report = validator.run(test_suite)

# Print results table
print(f"{'Question':<42} {'Score':>6}  {'Result'}")
print("-" * 62)
for row in report:
    q = row["question"][:40] + ".." if len(row["question"]) > 40 else row["question"]
    status = "PASS" if row["passed"] else "FAIL"
    print(f"{q:<42} {row['similarity_score']:>6.4f}  {status}")

Expected output (scores are approximate and will vary by model version):

Question	Similarity score	Result
What does the ollama pull command do?	0.8795	PASS
Summarise in one sentence why embeddings...	0.9809	PASS
When running LLMs locally with Ollama...	0.9034	PASS
What is cosine similarity used for in NLP?	0.9858	PASS
What programming language is Ollama written...	0.8111	FAIL

The validator flags case 5: the expected answer mentions Rust, but the model correctly answers Go, so the response scores 0.8111 against the deliberately wrong reference and falls below the 0.85 threshold. The margin is narrow, because a contradictory answer on the same topic still shares most of its wording and structure with the reference. Cases 2 and 4 pass comfortably, and case 1 passes with less headroom at 0.8795. Case 3 also passes: the expected answer is a qualified "no", and a response that makes the same point in different words still clears the threshold.
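Because several scores in the table sit within a few hundredths of the 0.85 threshold, it can help to summarise a report and flag the borderline rows for manual review. The sketch below works on the report structure returned by LLMValidator.run; the function name summarise_report and the 0.05 margin are assumptions of mine, not part of the original toolkit.

```python
from typing import Dict, List


def summarise_report(report: List[Dict], threshold: float = 0.85, margin: float = 0.05) -> Dict:
    """Aggregate a validation report and flag scores close to the threshold."""
    failed = [row for row in report if not row["passed"]]
    # Rows within `margin` of the threshold, on either side, deserve a human look
    borderline = [
        row for row in report
        if abs(row["similarity_score"] - threshold) <= margin
    ]
    return {
        "total": len(report),
        "passed": len(report) - len(failed),
        "failed_questions": [row["question"] for row in failed],
        "borderline_questions": [row["question"] for row in borderline],
    }


# Scores copied from the table above; questions shortened to ids for the example
sample_report = [
    {"question": "case 1", "similarity_score": 0.8795, "passed": True},
    {"question": "case 2", "similarity_score": 0.9809, "passed": True},
    {"question": "case 3", "similarity_score": 0.9034, "passed": True},
    {"question": "case 4", "similarity_score": 0.9858, "passed": True},
    {"question": "case 5", "similarity_score": 0.8111, "passed": False},
]
print(summarise_report(sample_report))
```

With these figures, case 1 (a pass) and case 5 (a fail) both land in the borderline list, which is a useful reminder that a small shift in model output could flip either result.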

Interpreting Results

Embedding similarity scores are signals, not verdicts. A few important caveats to keep in mind when using them in practice:

High similarity does not mean factually correct. Two confident but wrong answers on the same topic will score high against each other. Similarity measures phrasing equivalence, not truth.
Low similarity does not always mean wrong. A valid answer phrased from a very different angle — or using domain jargon vs plain language — may score lower than expected even when it is correct.
Use similarity as a signal, not a binary gate for open-ended generation tasks. A hard pass/fail threshold works well for regression testing but poorly for evaluating free-form responses.
For structured output (JSON, numbered lists, code), combine similarity scoring with format validation. A response can be semantically similar to the expected output and still be structurally invalid.
Combine with keyword checks for factual assertions where specific terms must appear — for example, verifying that a response about CUDA configuration actually contains the word "CUDA".

Similarity ≠ correctness. A hallucinated but fluent answer on the same topic as the reference can score 0.85 or higher. This tool is best used to detect semantic drift and phrasing equivalence — not as a factual accuracy checker.
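The keyword and format checks suggested above can be layered on top of a similarity score with very little code. The following is a sketch under stated assumptions: passes_checks is a hypothetical helper, keyword matching is a case-insensitive substring test, and format validation is limited to "does it parse as JSON".

```python
import json
from typing import Dict, Iterable


def passes_checks(
    generated: str,
    score: float,
    threshold: float = 0.85,
    required_terms: Iterable[str] = (),
    expect_json: bool = False,
) -> Dict:
    """Combine a similarity score with keyword and (optional) JSON format checks."""
    lowered = generated.lower()
    missing = [term for term in required_terms if term.lower() not in lowered]

    valid_format = True
    if expect_json:
        try:
            json.loads(generated)
        except json.JSONDecodeError:
            valid_format = False

    similar = score >= threshold
    return {
        "similar": similar,
        "missing_terms": missing,
        "valid_format": valid_format,
        "passed": similar and not missing and valid_format,
    }


# The similarity score would normally come from TextComparator.compare
result = passes_checks(
    "Pick the GPU with an environment variable.",
    score=0.91,
    required_terms=["CUDA"],
)
print(result["passed"], result["missing_terms"])
```

Returning the individual checks alongside the overall verdict makes it clear why a response failed, which a single boolean cannot do.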

Batch Comparison with compare_many

Some questions have multiple equally valid correct answers. Rather than picking a single reference, you can compare a generated response against all valid references and accept the result if any score clears the threshold.

comparator = TextComparator()

# Three equally valid ways to describe the same concept
valid_references = [
    "Cosine similarity returns a value between -1 and 1 indicating how similar two vectors are.",
    "It measures the angle between two vectors in a high-dimensional space — 1.0 means identical direction.",
    "Cosine similarity is a metric for comparing the orientation of two embedding vectors, not their magnitude."
]

generated = (
    "Cosine similarity computes a score from -1 to 1 that reflects "
    "how closely two vectors point in the same direction."
)

results = comparator.compare_many(generated, valid_references)

for r in results:
    print(f"{r['score']:.4f}  {r['text'][:70]}")

compare_many returns candidates sorted by score descending. To accept the generated answer if any reference matches:

best_score = results[0]["score"]
threshold = 0.85

if best_score >= threshold:
    print(f"Valid — best match score: {best_score:.4f}")
else:
    print(f"No reference matched — best score was only {best_score:.4f}")

Practical Threshold Calibration

The default thresholds (0.90 for is_equivalent, 0.85 for LLMValidator) are reasonable starting points for general English text. For a specific domain, calibrate them against a small labeled dataset.

Collect 20–30 text pairs and label each as equivalent or different. Run TextComparator.compare on each pair and inspect where the scores cluster. The threshold should sit in the gap between the two clusters.

comparator = TextComparator()

# Labeled pairs: (text_a, text_b, label)
# Domain: Ollama configuration Q&A
labeled_pairs = [
    # Equivalent pairs
    ("Set OLLAMA_HOST to 0.0.0.0 to allow remote connections.",
     "To accept connections from other machines, bind Ollama to 0.0.0.0.",
     "equivalent"),
    ("Use ollama list to see which models are downloaded.",
     "The ollama list command shows all models available locally.",
     "equivalent"),
    ("Ollama stores models in ~/.ollama/models by default.",
     "By default, downloaded models are kept in the .ollama/models directory in your home folder.",
     "equivalent"),
    ("Set OLLAMA_NUM_PARALLEL to control concurrent request handling.",
     "OLLAMA_NUM_PARALLEL determines how many requests Ollama processes simultaneously.",
     "equivalent"),
    ("Run ollama serve to start the Ollama API server.",
     "Starting the API server is done with the ollama serve command.",
     "equivalent"),
    # Different pairs
    ("Ollama supports streaming responses via the API.",
     "The ollama pull command downloads a model from the registry.",
     "different"),
    ("GPU acceleration is automatic when a compatible GPU is detected.",
     "Ollama was first released in 2023.",
     "different"),
    ("Use ollama rm to delete a model from local storage.",
     "The context window size is set with the num_ctx parameter.",
     "different"),
    ("OLLAMA_MAX_LOADED_MODELS limits how many models stay in memory.",
     "Cosine similarity measures the angle between two vectors.",
     "different"),
    ("Set the temperature to 0.1 for deterministic outputs.",
     "Ollama exposes a REST API on port 11434 by default.",
     "different"),
]

print(f"{'Label':<12} {'Score':>6}  Text A (truncated)")
print("-" * 65)
for text_a, text_b, label in labeled_pairs:
    score = comparator.compare(text_a, text_b)
    print(f"{label:<12} {score:>6.4f}  {text_a[:45]}...")

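Inspect the output: equivalent pairs typically cluster above 0.88; different pairs typically fall below 0.65. Any value inside that gap separates the two groups. The midpoint of those example figures is about 0.77; choosing a value nearer the equivalent cluster, often around 0.80–0.85 for this kind of technical Q&A, accepts a few missed matches in exchange for fewer false passes. Whichever value you pick becomes your domain-specific threshold.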
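Finding the gap can also be done programmatically once the pairs are scored. The sketch below assumes you have collected (score, label) tuples from the loop above; suggest_threshold is a hypothetical helper that returns the midpoint between the lowest equivalent score and the highest different score, and gives up if the two groups overlap. The scores in the example are made up.

```python
from typing import List, Optional, Tuple


def suggest_threshold(scored: List[Tuple[float, str]]) -> Optional[float]:
    """
    scored: (score, label) pairs with label "equivalent" or "different".
    Returns the midpoint of the gap between the two groups,
    or None if the groups overlap and no clean gap exists.
    """
    equivalent = [score for score, label in scored if label == "equivalent"]
    different = [score for score, label in scored if label == "different"]
    if not equivalent or not different:
        raise ValueError("need at least one pair of each label")

    low_equivalent = min(equivalent)
    high_different = max(different)
    if low_equivalent <= high_different:
        return None  # overlapping clusters: inspect the pairs by hand
    return round((low_equivalent + high_different) / 2, 4)


# Made-up scores for illustration only
example = [
    (0.91, "equivalent"), (0.88, "equivalent"),
    (0.65, "different"), (0.40, "different"),
]
print(suggest_threshold(example))
```

A None result is informative in itself: it usually means either some labels are wrong or the embedding model does not separate the two classes well in your domain.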

Limitations

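Token limit. nomic-embed-text has a 512-token context limit. nomic-embed-text-v1.5 extends this to 8192 tokens. The effective limit also depends on the context length your Ollama install applies to the model, so confirm it with ollama show nomic-embed-text rather than relying on these figures alone. For documents longer than ~400 words, split the text into chunks before embedding: comparing a 2000-word document as a single embedding silently truncates most of the content and produces unreliable scores.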

General-purpose model limitations. nomic-embed-text is trained primarily on general English text. For specialised domains — legal contracts, medical notes, source code — a domain-specific embedding model will produce more accurate similarity scores.
Semantic similarity is not factual correctness. A hallucinated but fluent answer on the same topic as the reference may score 0.85+ against it. Use this approach to check phrasing equivalence and detect semantic drift, not to verify factual accuracy.
Performance at scale. Each comparison requires two embedding inference calls. For large test suites — hundreds of pairs — use batch embedding to amortise the overhead.
Language sensitivity. nomic-embed-text is primarily an English model. Cross-language similarity scores are less reliable. For multilingual use cases, consider a multilingual embedding model such as multilingual-e5-large; check the Ollama model library for a build you can pull, since availability varies.
Structured output. Semantic similarity is a poor validator for structured formats — JSON schemas, numbered lists, code. A response can be semantically close to the expected output and still be structurally malformed. Always combine similarity scoring with format validation for structured tasks.
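On the performance point above: as far as I know, Ollama also exposes an /api/embed endpoint that accepts a list of texts under "input" and returns a list under "embeddings", which avoids one HTTP round trip per text. The sketch below assumes that request and response shape, so check the API documentation for your Ollama version before relying on it. The send parameter is a hypothetical injection point so the function can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def _post_json(url: str, payload: Dict) -> Dict:
    """POST a JSON payload with the standard library and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_batch(
    texts: List[str],
    host: str = "http://localhost:11434",
    model: str = "nomic-embed-text",
    send: Callable[[str, Dict], Dict] = _post_json,
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed many texts with one request per batch instead of one per text."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Assumed shape: {"model": ..., "input": [...]} -> {"embeddings": [[...], ...]}
        data = send(f"{host}/api/embed", {"model": model, "input": chunk})
        embeddings = data["embeddings"]
        if len(embeddings) != len(chunk):
            raise ValueError("server returned a different number of embeddings than inputs")
        vectors.extend(embeddings)
    return vectors
```

Called as embed_batch([reference] + candidates), the first vector is the reference embedding and the rest line up with the candidates, so the cosine step from TextComparator can be reused unchanged.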