
Automated Translation Quality Verification with AI

What is Back-Translation?

On this blog, articles written in Japanese are translated into English by AI. Rather than just translating, though, the pipeline includes a mechanism that automatically verifies translation quality.

The idea is simple. A Japanese article is translated into English, then that English is translated back into Japanese. The original Japanese and the round-tripped Japanese are then compared. If the translation distorted the meaning, the round-tripped version will differ significantly from the original. It's essentially a "sanity check" for translation.

A Concrete Example

Consider translating the following sentence:

Original: "この映画は心に響いた" (This movie resonated with my heart)

Translated by GPT:

  • English: "This movie touched my heart"
  • Back to Japanese: "この映画は私の心に触れた"
  • → Original meaning largely preserved ✅

Translated by a different model:

  • English: "This film resonated with me"
  • Back to Japanese: "この映画は私の共感を呼んだ"
  • → Meaning gets through, but the nuance has shifted ⚠️

Quantifying this "closeness to the original Japanese" is what embedding and cosine similarity — explained next — are for.

Basic Flow

  1. Translate the original (Japanese) to English using an AI translation model
  2. Translate the English back to Japanese using the same model (back-translation)
  3. Generate embeddings for both the original and back-translated Japanese (Gemini Embedding)
  4. Calculate cosine similarity using PostgreSQL's pgvector
  5. Higher score = better meaning preservation = better translation
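The five steps above can be sketched as a single function. This is a minimal outline, not the blog's actual implementation: the translation and embedding calls are passed in as callables (stand-ins for the real AI model APIs and for pgvector's similarity computation), and all names here are hypothetical.

```python
from typing import Callable


def back_translation_score(
    original: str,
    translate: Callable[[str, str, str], str],       # (text, src, dst) -> text
    embed: Callable[[str], list[float]],             # text -> embedding vector
    similarity: Callable[[list[float], list[float]], float],
) -> float:
    """Score a translation by how well a round trip preserves meaning."""
    english = translate(original, "ja", "en")        # step 1: ja -> en
    round_trip = translate(english, "en", "ja")      # step 2: en -> ja (back-translation)
    v_orig = embed(original)                         # step 3: embed both Japanese texts
    v_back = embed(round_trip)
    return similarity(v_orig, v_back)                # steps 4-5: higher = better
```

In production, `embed` would call Gemini Embedding and `similarity` would be computed inside PostgreSQL by pgvector rather than in application code.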

What is Embedding?

Embedding is the process of converting text into a numerical vector of several hundred dimensions. Sentences with similar meanings are placed close together in the vector space.

"この映画は心に響いた" and "この映画は私の心に触れた" are close in meaning, so their vectors are also close. On the other hand, "この映画は私の共感を呼んだ" has a slightly different nuance, so its vector is a bit further away.

This distance is quantified using cosine similarity and will be automatically calculated in the database using PostgreSQL's pgvector extension.
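To make the idea concrete, here is cosine similarity in plain Python, applied to toy three-dimensional vectors standing in for real embeddings (which have hundreds of dimensions); the specific numbers are invented for illustration. Note that pgvector's `<=>` operator returns cosine *distance*, so the SQL equivalent of the similarity below is `1 - (a <=> b)`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Invented stand-ins for real embedding vectors:
original   = [0.8, 0.5, 0.1]   # "この映画は心に響いた"
close_back = [0.7, 0.6, 0.1]   # "この映画は私の心に触れた" (meaning preserved)
far_back   = [0.3, 0.4, 0.8]   # "この映画は私の共感を呼んだ" (nuance shifted)
```

With these vectors, `cosine_similarity(original, close_back)` comes out higher than `cosine_similarity(original, far_back)`, mirroring the example above.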

Why No Reference Translation Is Needed

The key insight is that no reference translation is needed. Traditional translation evaluation methods required comparison against a human-made reference translation. This approach, however, can assess translation quality through the round-trip alone. Since there's zero cost for preparing reference translations, it can be easily adopted even for a personal blog.

Comparison with Traditional Evaluation Methods

| Method | Era | Overview | Reference |
|---|---|---|---|
| BLEU | 2002– | Evaluates n-gram match rates | Required |
| METEOR | 2005– | Improved BLEU with synonyms and stems | Required |
| BERTScore | 2019– | Evaluates semantic similarity via BERT embeddings | Required |
| COMET | 2020– | Neural-based translation evaluation model | Required |
| LLM-as-Judge | 2023– | Ask GPT-4 etc. to score the translation | Not required |
| This blog | – | Back-translation + embedding cosine similarity | Not required |

This blog's approach is similar in concept to BERTScore (measuring semantic closeness via embeddings), but differs in being self-contained without requiring a reference translation.

Why Not Just Use LLM-as-Judge?

LLM-as-Judge — asking GPT-4 to "rate this translation out of 10" — also doesn't require a reference translation. So it's natural to wonder: "Why not just have AI score the translations and call it a day?" However, there are some concerns:

  • Black-box criteria — It's unclear what the model considers a "good translation"
  • Model bias — Reports suggest GPT-4 tends to rate its own translations higher
  • Low reproducibility — Scores can vary slightly even with identical inputs
  • Double cost — Requires two API calls: one for translation, one for evaluation

The back-translation + cosine similarity approach, on the other hand:

  • Mathematically explicit — Evaluation is based on an objective metric: vector distance
  • Highly reproducible — The same embeddings always produce the same score
  • Model-independent evaluation — The embedding model (ruler) is independent of the translation model

Of course, there are still unknowns until the system is actually in production. Whether back-translation scores truly correlate with "how good a translation feels to a human" will be verified through experimentation. It's also possible that the conclusion might be: just use LLM-as-Judge — even if it's not free.
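One way to run that verification is a rank correlation between back-translation scores and human ratings. The sketch below uses Spearman correlation in pure stdlib Python (no tie handling); the score and rating values are invented for illustration.

```python
def rank(xs: list[float]) -> list[float]:
    """Rank of each element (0 = smallest). Ties are not handled."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)


back_scores  = [0.95, 0.82, 0.91, 0.70]   # illustrative back-translation scores
human_scores = [9.0, 6.0, 8.0, 5.0]       # hypothetical human ratings (1-10)
```

A correlation near 1.0 would support the back-translation metric; a weak correlation would be a point for LLM-as-Judge.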

Multi-Model Comparison

The same source text is translated by Gemini, Llama (Groq), DeepSeek, and others, with scores ranked. Casual blog writing with its free-form style should reveal each model's strengths and weaknesses quite clearly. DeepSeek in particular seems promising for Chinese translation.

With three models (Gemini, Llama, and DeepSeek) available on free tiers, the plan is to start experimenting right away.

Unified Embedding Model

The embedding model used for quantification is unified to Gemini. Multiple translation models (the contestants) compete, but using a different embedding model (the ruler) for each would make fair comparison impossible.

Different embedding models generate different vector spaces. Comparing a vector from GPT's embeddings with one from Gemini's embeddings would produce meaningless results. Only by fixing the ruler can "which model's translation best preserves the original meaning" be evaluated fairly.

Translation Result Management

Translation results are managed in a hybrid approach, separating production and research use.

  • Supabase (research) — Stores all translation results and scores from every model. Viewable on a dashboard for comparison and analysis
  • Git (production) — The best-scoring translation is auto-committed to posts/en/ as Markdown. This is what readers see

Behind the scenes, translations from multiple models accumulate, enabling tracking of quality trends and cross-model comparisons.
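The Git half of this hybrid reduces to one decision: given each model's score, pick the winner. A minimal sketch with hypothetical names and invented scores:

```python
def pick_best(scores: dict[str, float]) -> str:
    """Return the model with the highest back-translation score.
    Its Markdown output is what would be committed to posts/en/
    (hypothetical workflow; all other results stay in Supabase)."""
    return max(scores, key=lambda model: scores[model])


# Illustrative scores, not real measurements:
scores = {"gemini": 0.93, "llama": 0.89, "deepseek": 0.91}
```

Calling `pick_best(scores)` here selects `"gemini"`; the losing translations remain queryable for the cross-model analysis described above.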

Future Plans

  • Automated prompt tuning (automatically adding feedback like "use more idiomatic expressions" when scores are low)
  • Genre-specific analysis (do model strengths differ between movie reviews and tech articles?)
  • Expansion to other languages such as Chinese (DeepSeek might excel here)

For details on this blog's tech stack and design, see this post.
