[Dz] Automated Translation Quality Verification with AI

What is Back-Translation?

On this blog, articles written in Japanese are translated into English by AI. But I didn't stop at translation — the blog also includes a system that automatically verifies translation quality.

The idea is simple. A Japanese article is translated into English, then that English is translated back into Japanese. The original Japanese and the round-tripped Japanese are then compared. If the translation distorted the meaning, the round-tripped version will differ significantly from the original. It's essentially a "double-check" for translation — like verifying your math by working backwards.

A Concrete Example

Consider translating the following sentence:

Original: "この映画は心に響いた" (This movie really moved me)

Translated by ChatGPT:

English: "This movie touched my heart"
Back to Japanese: "この映画は私の心に触れた" (This movie touched my heart)
→ Original meaning largely preserved ✓

Translated by a different AI model:

English: "This film resonated with me"
Back to Japanese: "この映画は私の共感を呼んだ" (This movie evoked my empathy)
→ Meaning gets through, but the nuance has shifted △

Quantifying this "closeness to the original Japanese" is the job of embeddings and cosine similarity, both explained below.

Basic Flow

Translate the original (Japanese) to English using an AI translation model
Translate the English back to Japanese using the same model (back-translation)
Generate embeddings for both the original and back-translated Japanese (Gemini Embedding)
Calculate cosine similarity using PostgreSQL's pgvector
Higher score = better meaning preservation = better translation
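
The flow above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the blog's actual implementation: `translate` and `embed` are hypothetical callables standing in for the translation model and Gemini Embedding, and the cosine similarity is computed in plain Python here, whereas the blog computes it with pgvector.

```python
import math
from typing import Callable, Sequence

# Hypothetical signatures: the post does not show its real API wrappers.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Embed = Callable[[str], Sequence[float]]    # text -> embedding vector


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def round_trip_score(original_ja: str, translate: Translate, embed: Embed) -> dict:
    """Translate ja -> en -> ja, embed both Japanese texts, and score them."""
    english = translate(original_ja, "ja", "en")   # forward translation
    back_ja = translate(english, "en", "ja")       # back-translation, same model
    vec_original = embed(original_ja)              # same embedding model for both
    vec_back = embed(back_ja)
    score = cosine_similarity(vec_original, vec_back)
    return {"english": english, "back_translation": back_ja, "score": score}
```

Passing the model calls in as arguments keeps the scoring logic identical for every translation model, which is what makes the scores comparable.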

What is Embedding?

Embedding is the process of converting text into a numerical vector, typically with hundreds or even thousands of dimensions. Sentences with similar meanings are placed close together in the vector space.

"この映画は心に響いた" and "この映画は私の心に触れた" are close in meaning, so their vectors are also close. On the other hand, "この映画は私の共感を呼んだ" has a slightly different nuance, so its vector is a bit further away.

This distance will be quantified using cosine similarity and automatically calculated in the database using PostgreSQL's pgvector extension.
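
For reference, here is the standard definition of cosine similarity between the original text's vector and the back-translation's vector. The symbols a, b, n, and θ are notation introduced here, and the three-dimensional numbers are a toy example, not real embeddings.

```latex
\[
\cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
\]

% Toy example with n = 3
\[
\mathbf{a} = (1, 2, 2), \quad \mathbf{b} = (2, 1, 2): \qquad
\cos\theta = \frac{2 + 2 + 4}{3 \cdot 3} = \frac{8}{9} \approx 0.889
\]
```

The value ranges from -1 to 1, and 1 means the two vectors point in exactly the same direction. One practical detail: pgvector's `<=>` operator returns cosine distance, which is 1 minus this value, so the similarity is `1 - (a <=> b)`.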

Why No Reference Translation is Needed

The key insight is that no reference translation is needed. Traditional translation evaluation methods required comparison against a human-made reference translation. This approach, however, assesses translation quality through the round-trip alone. Since there's zero cost for preparing reference translations, it can be easily adopted even for a personal blog.

Comparison with Traditional Evaluation Methods

Method	Introduced	Overview	Reference translation
BLEU	2002	Scores n-gram overlap with the reference	Required
METEOR	2005	Extends BLEU-style matching with stems and synonyms	Required
BERTScore	2019	Scores semantic similarity via BERT embeddings	Required
COMET	2020	Neural translation evaluation model	Required (reference-free variants exist)
LLM-as-Judge	2023	Asks GPT-4 or a similar model to score the translation	Not required
This blog	-	Back-translation + embedding cosine similarity	Not required

This blog's approach is similar in concept to BERTScore (measuring semantic closeness via embeddings), but differs in being self-contained without requiring a reference translation.

Why Not Just Use LLM-as-Judge?

LLM-as-Judge — asking GPT-4 to "rate this translation out of 10" — also doesn't require a reference translation. So it's natural to wonder: "Why not just have AI score the translations and call it a day?" However, there are some concerns:

Black-box criteria — It's unclear what the model considers a "good translation"
Model bias — Reports suggest GPT-4 tends to rate its own translations higher
Low reproducibility — Scores can vary slightly even with identical inputs
Double cost — Requires two API calls: one for translation, one for evaluation

The back-translation + cosine similarity approach, on the other hand:

Mathematically explicit: Evaluation is based on an objective metric, the cosine similarity between two vectors
Highly reproducible: The same pair of embeddings always produces the same score
Model-independent evaluation: The embedding model (the "Scouter" in the Dragon Ball Z analogy below) is independent of the translation model

Of course, there are still unknowns until the system is actually in production. Whether back-translation scores truly correlate with what humans perceive as translation quality is something I will verify through experimentation. LLM-as-Judge wouldn't be free, but it may turn out to be the right answer after all.

Multi-Model Comparison

The same source text is translated by Gemini, Llama (via Groq), DeepSeek, and others, and the resulting scores are ranked. Free-form blog writing should reveal each model's strengths and weaknesses clearly. DeepSeek in particular seems promising for Chinese translation. With three models available for free (Gemini + Llama + DeepSeek), there's plenty of room to experiment.
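
Ranking the models for one article could look something like this. It is only a sketch: `ModelScore` and `rank_models` are hypothetical names, and the numbers in the demo are placeholders for illustration, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    model: str    # label for the translation model, e.g. "gemini"
    score: float  # cosine similarity between original and back-translation


def rank_models(scores: list[ModelScore]) -> list[tuple[int, ModelScore]]:
    """Return (rank, entry) pairs, best score first; ties keep input order."""
    ordered = sorted(scores, key=lambda s: s.score, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    # Placeholder scores, made up for this demo.
    demo = [
        ModelScore("gemini", 0.91),
        ModelScore("llama", 0.88),
        ModelScore("deepseek", 0.93),
    ]
    for rank, entry in rank_models(demo):
        print(f"{rank}. {entry.model}: {entry.score:.2f}")
```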

Unified Embedding Model

For now, every embedding is generated with a single model, Gemini Embedding 1 (embedding-001).
I use multiple translation models, but the embedding model that measures them has to stay the same — otherwise the comparison isn't fair.
Different embedding models generate different vector spaces. Comparing vectors from OpenAI's embedding models with vectors from Gemini's would produce meaningless results. Unifying the vector space is the only way to fairly evaluate "which model's translation best preserves the original meaning."
In Dragon Ball Z terms, a translation model's ability is the "power level" and the embedding model that measures it is the "Scouter." ...right? (Okay, maybe not exactly.)
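
One way to enforce the "same Scouter" rule in code is to store the embedding model's name next to each vector and refuse to compare vectors from different models. The sketch below assumes that design; `Embedding` and `assert_comparable` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Embedding:
    model: str               # which embedding model produced the vector
    vector: Sequence[float]  # the embedding itself


def assert_comparable(a: Embedding, b: Embedding) -> None:
    """Raise ValueError if the two embeddings must not be compared."""
    if a.model != b.model:
        raise ValueError(f"different embedding models: {a.model!r} vs {b.model!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError(f"dimension mismatch: {len(a.vector)} vs {len(b.vector)}")
```

Calling `assert_comparable` before every similarity calculation turns a silent apples-to-oranges comparison into an immediate error.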

Gemini Embedding 1 is based on Gemini 1.0/1.5-generation technology and is a text-only embedding model: it produces vector spaces for text alone.
It reportedly performs very well with Japanese, capable of understanding and vectorizing subtle Japanese nuances and context. Designed as a multilingual model that includes Japanese, it's apparently a go-to choice for handling Japanese documents in systems like RAG (Retrieval-Augmented Generation).

Translation Result Management

Translation results are managed in a hybrid approach, separating production and research use.

Supabase (research) — Stores all translation results and scores from every model. Viewable on a dashboard for comparison and analysis
Git (production) — The best-scoring translation is auto-committed to posts/en/ as Markdown. This is what readers see

Behind the scenes, translations from multiple models accumulate, enabling tracking of quality trends and cross-model comparisons.
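
The production half of this split might be sketched as below. Only the `posts/en/` directory comes from this post. `TranslationResult`, `publish_best`, and the `<slug>.md` file naming are assumptions for illustration, and the Supabase write and the Git commit are left out.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TranslationResult:
    slug: str        # article identifier, used as the file name (assumed)
    model: str       # translation model label
    english_md: str  # translated article as Markdown
    score: float     # back-translation cosine similarity


def publish_best(results: list[TranslationResult], repo_root: Path) -> Path:
    """Write the highest-scoring translation to posts/en/<slug>.md."""
    if not results:
        raise ValueError("no translation results to publish")
    if len({r.slug for r in results}) != 1:
        raise ValueError("all results must belong to the same article")
    best = max(results, key=lambda r: r.score)  # first entry wins on ties
    target = repo_root / "posts" / "en" / f"{best.slug}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(best.english_md, encoding="utf-8")
    return target
```

Every result would still be kept on the research side, so a translation that loses today remains available for later comparison.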

Future Plans

Automated prompt tuning (automatically adding feedback like "use more idiomatic expressions" when scores are low)
Genre-specific analysis (do model strengths differ between movie reviews and tech articles?)
Expansion to other languages such as Chinese (DeepSeek might excel here)
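
As a rough idea of what the automated prompt tuning could look like, here is a small sketch. Nothing in it is implemented yet: the base prompt wording, the 0.85 threshold, and the `build_prompt` function are all hypothetical, and only the feedback sentence is taken from the plan above.

```python
from typing import Optional

BASE_PROMPT = "Translate the following Japanese blog post into natural English."
FEEDBACK = "Use more idiomatic expressions."  # feedback example from the plan
SCORE_THRESHOLD = 0.85                        # hypothetical cutoff, not a tuned value


def build_prompt(previous_score: Optional[float]) -> str:
    """Return the translation prompt, adding feedback after a low-scoring attempt."""
    if previous_score is not None and previous_score < SCORE_THRESHOLD:
        return f"{BASE_PROMPT}\n{FEEDBACK}"
    return BASE_PROMPT
```

A first attempt would call `build_prompt(None)`, and a retry would pass in the score of the previous attempt.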

For details on this blog's tech stack and design, see this post.