[Dz] AI Translation Model Comparison: Back-Translation Scores Across the Gemini Family

Using the back-translation mechanism introduced in the previous article, I compared multiple translation models head-to-head.

This interim report computes the back-translation score (cosine similarity) for three posts from this blog. Each post was translated by four models in the Google Gemini family, and existing English translations by Claude and by hand were scored for comparison.

Target Models

Model	Characteristics	Free Tier RPD (requests per day)
gemini-2.5-flash	Main model. Balanced	20
gemini-2.5-flash-lite	Lightweight version of 2.5	20
gemini-3-flash-preview	Next-gen model (preview)	20
gemini-3.1-flash-lite-preview	Lightweight version of 3.1 (preview)	500
claude-opus-4.6	Anthropic's top-tier model (evaluate existing English translation)	-
manual	Manual translation (evaluate existing English translation)	-

In "translate" mode, the same model handles both the forward translation and the back-translation. In "evaluate" mode, an existing English translation (written by Claude or a human) is back-translated by Gemini to calculate the score. More on this difference below.

Model Score Summary

All scores in this section use the same embedding model, gemini-embedding-001 (768 dimensions).
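
For reference, the score is plain cosine similarity between the embedding of the original text and the embedding of its back-translation. The following is a minimal sketch in pure Python. The function name and the toy 3-dimensional vectors are illustrative assumptions (the real vectors have 768 dimensions), not this blog's actual code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embedding of the original text and the
# embedding of its back-translation (assumed values, for illustration only).
original = [0.2, 0.7, 0.1]
round_trip = [0.25, 0.65, 0.12]
score = cosine_similarity(original, round_trip)
```

Because the score depends only on the angle between the vectors, rescaling either vector leaves it unchanged. Scores are comparable only when both texts are embedded by the same model at the same dimensionality.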

hello-world (Short text, 188 characters)

Rank	Model	Score
1	gemini-3-flash-preview	0.9836
2	gemini-3.1-flash-lite-preview	0.9823
3	gemini-2.5-flash	0.9795
4	manual (evaluate)	0.9712
5	gemini-2.5-flash-lite	0.9701

nextjs-vercel (Technical article, 2,530 characters)

Rank	Model	Score
1	gemini-3-flash-preview	0.9915
2	gemini-2.5-flash-lite	0.9902
3	gemini-2.5-flash	0.9893
4	gemini-3.1-flash-lite-preview	0.9886
5	claude-opus-4.6 (evaluate)	0.9785

back-translation (Translation explanation article, 3,141 characters)

Rank	Model	Score
1	gemini-2.5-flash-lite	0.9870
2	gemini-2.5-flash	0.9851
3	gemini-3.1-flash-lite-preview	0.9838
4	claude-opus-4.6 (evaluate)	0.9834
5	gemini-3-flash-preview	0.9791

Observation: Rankings Shift Depending on the Article

Interestingly, the model rankings shifted depending on the article.

gemini-3-flash-preview topped the technical article (nextjs-vercel) with 0.9915, but ranked last on the translation explanation article (back-translation) with 0.9791.
gemini-2.5-flash-lite topped the explanation article with 0.9870, but ranked last on the short text (hello-world) with 0.9701.
gemini-2.5-flash is the "stable model": it placed 2nd or 3rd on every article.

In short, there's no single "best model" across the board. The winner depends on the nature of the text — technical, casual, or short.
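
The rank shifts can be tabulated directly from the three tables above. This is a small sketch, not the blog's tooling: the function names are made up for illustration, and ranks are computed among the four Gemini models only, so a rank here can differ from the table rank where an evaluate-mode row sits in between.

```python
# Translate-mode scores copied from the three tables above. The evaluate-mode
# rows (manual, claude-opus-4.6) are left out because they do not cover every
# article, so ranks here are among the four Gemini models only.
SCORES = {
    "hello-world": {
        "gemini-3-flash-preview": 0.9836,
        "gemini-3.1-flash-lite-preview": 0.9823,
        "gemini-2.5-flash": 0.9795,
        "gemini-2.5-flash-lite": 0.9701,
    },
    "nextjs-vercel": {
        "gemini-3-flash-preview": 0.9915,
        "gemini-2.5-flash-lite": 0.9902,
        "gemini-2.5-flash": 0.9893,
        "gemini-3.1-flash-lite-preview": 0.9886,
    },
    "back-translation": {
        "gemini-2.5-flash-lite": 0.9870,
        "gemini-2.5-flash": 0.9851,
        "gemini-3.1-flash-lite-preview": 0.9838,
        "gemini-3-flash-preview": 0.9791,
    },
}


def rank_within_article(article_scores):
    """Map each model to its rank within one article (1 = highest score)."""
    ordered = sorted(article_scores, key=article_scores.get, reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


def rank_profile(scores):
    """Collect each model's ranks across articles, in article order."""
    profile = {}
    for article_scores in scores.values():
        for model, rank in rank_within_article(article_scores).items():
            profile.setdefault(model, []).append(rank)
    return profile


if __name__ == "__main__":
    for model, model_ranks in rank_profile(SCORES).items():
        spread = max(model_ranks) - min(model_ranks)
        print(f"{model}: ranks={model_ranks} spread={spread}")
```

On this data gemini-2.5-flash has ranks [3, 3, 2], a spread of 1, while gemini-3-flash-preview has [1, 1, 4], a spread of 3. That is the "stable" versus "peaky" contrast in numbers.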

Hypothesis on Trends by Article Type

Article Characteristics	Strongest Model	Hypothesis
Technical article (clear terminology)	gemini-3-flash-preview	Good at 1:1 technical term mapping?
Explanatory article (long logic)	gemini-2.5-flash-lite	Good at simple translation while maintaining context?
Short text (low information volume)	gemini-3-flash-preview	Accurate translation even with little context?

With only three articles, this is still just a hypothesis, but as the number of articles increases, it might be possible to match models to content types — "Model A for movie reviews, Model B for technical articles."

Score Differences Between Translate and Evaluate Modes

One thing to watch out for: scores from these two modes aren't directly comparable.

Translate mode: The same model handles both the forward translation and the back-translation. Because the model's own translation quirks tend to cancel out in the round trip, scores tend to be higher.
Evaluate mode: Gemini back-translates an existing English translation (written by Claude or a human). Since the translator and the back-translator are different, scores tend to be lower.

For example, the evaluate scores for claude-opus-4.6 (0.9834, 0.9785) might look low when compared directly with Gemini's translate scores, but this does not necessarily mean Claude's translation is "inferior." In a round trip between different models, differences in word choice between the translator and the back-translator can lower the score even when the meaning is preserved.

For a fair comparison, you'd probably want to run all models in translate mode (each translating and back-translating itself) or all in evaluate mode (a single model back-translating everyone's output).
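
The two modes, and the "all in evaluate mode" option, can be sketched as follows. Every name here is hypothetical: the translators and the scorer are injected as plain callables, and the toy stand-ins at the bottom exist only so the sketch runs offline. In real use the scorer would embed both texts and take their cosine similarity.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]     # text in, translated text out
Scorer = Callable[[str, str], float]  # (original, round trip) -> similarity


def translate_mode_score(source: str, to_english: Translator,
                         to_source: Translator, score: Scorer) -> float:
    """Translate mode: both legs are meant to come from the same model."""
    return score(source, to_source(to_english(source)))


def evaluate_mode_score(source: str, existing_english: str,
                        to_source: Translator, score: Scorer) -> float:
    """Evaluate mode: back-translate an existing English translation."""
    return score(source, to_source(existing_english))


def compare_in_evaluate_mode(source: str, candidates: Dict[str, str],
                             to_source: Translator,
                             score: Scorer) -> Dict[str, float]:
    """Score every candidate translation with one shared back-translator."""
    return {name: evaluate_mode_score(source, english, to_source, score)
            for name, english in candidates.items()}


# Toy stand-ins so the sketch runs offline: "translation" is just a case
# change, and the scorer is word overlap rather than embedding similarity.
def toy_to_english(text: str) -> str:
    return text.upper()


def toy_to_source(text: str) -> str:
    return text.lower()


def toy_score(a: str, b: str) -> float:
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


source = "hello world"
candidates = {"model-a": "HELLO WORLD", "model-b": "HELLO THERE"}
results = compare_in_evaluate_mode(source, candidates, toy_to_source, toy_score)
```

Holding the back-translator and scorer fixed across all candidates means that any score difference comes from the candidate translations themselves, not from which model did the return leg.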

Embedding Model Comparison: gemini-embedding-001 vs gemini-embedding-2-preview

Borrowing the Dragon Ball analogy from the previous article: translation models are the Saiyans, their translation quality is the "power level," and the embedding model that measures it is the "Scouter." No matter how strong the Saiyans are, a Scouter that can't properly quantify that strength produces no useful numbers. For a fair comparison you need to measure every fighter with the same Scouter, and if the power level is too high, an outdated Scouter will just break.

Now let's compare the old and new Scouter models. Google currently provides two embedding models: gemini-embedding-001 (emb-1 in the tables below) and gemini-embedding-2-preview (emb-2).
emb-1 is specialized for text-only vector spaces, while emb-2 is based on Gemini 3 and is described as "natively multimodal": it can vectorize not just text but also images, audio, video, and PDFs in the same space.
The newer base model is likely the bigger factor in any score difference, but it's also intriguing to see how multimodal capability affects results for what are essentially text-only blog posts.

Embedding Model	Dimension (after reduction)	RPD
gemini-embedding-001	768	1,000
gemini-embedding-2-preview	768	1,000

I compared the scores by changing only the embedding model while keeping the translation model × article combination the same.

Score Comparison Table

Article	Translation Model	emb-1	emb-2	Difference
hello-world	gemini-2.5-flash	0.9795	0.9531	-0.026
hello-world	gemini-2.5-flash-lite	0.9701	0.9698	-0.000
hello-world	gemini-3-flash-preview	0.9836	0.9818	-0.002
hello-world	gemini-3.1-flash-lite-preview	0.9823	0.9711	-0.011
nextjs-vercel	gemini-2.5-flash	0.9893	0.9646	-0.025
nextjs-vercel	gemini-2.5-flash-lite	0.9902	0.9457	-0.045
nextjs-vercel	gemini-3-flash-preview	0.9915	0.9667	-0.025
nextjs-vercel	gemini-3.1-flash-lite-preview	0.9886	0.9595	-0.029
back-translation	gemini-2.5-flash	0.9851	0.9498	-0.035
back-translation	gemini-2.5-flash-lite	0.9870	0.9376	-0.049
back-translation	gemini-3-flash-preview	0.9791	0.9511	-0.028
back-translation	gemini-3.1-flash-lite-preview	0.9838	0.9312	-0.053

Trends

emb-2 scores are lower in every row, by up to about 0.05. Because only the embedding model changed, this reflects the characteristics of the embedding model's vector space, not translation quality.
A plausible explanation is that emb-2 is more sensitive to subtle differences between texts (its vector space is more spread out).
Relative rankings are only partly preserved. gemini-3-flash-preview leads hello-world and nextjs-vercel under both embedding models, but on back-translation it moves from last (emb-1) to first (emb-2), and on nextjs-vercel gemini-2.5-flash-lite drops from 2nd to last. The gaps between models are often only a few thousandths under emb-1, so small shifts are enough to reorder them.
The size of the drop varies widely: it is almost zero for gemini-2.5-flash-lite × hello-world (-0.000) and gemini-3-flash-preview × hello-world (-0.002), but reaches -0.053 for gemini-3.1-flash-lite-preview × back-translation. How much the embedding model's characteristics show up seems to depend on article length and model compatibility.
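
To put numbers on these trends, the sketch below takes the comparison table and computes, per article, the score spread (highest minus lowest) under each embedding model and the Spearman rank correlation between the two. The helper names are illustrative assumptions, and the Spearman formula used is the simple no-ties version, which is valid here because no two scores in a column are equal.

```python
# (emb-1, emb-2) scores copied from the comparison table above.
TABLE = {
    "hello-world": {
        "gemini-2.5-flash": (0.9795, 0.9531),
        "gemini-2.5-flash-lite": (0.9701, 0.9698),
        "gemini-3-flash-preview": (0.9836, 0.9818),
        "gemini-3.1-flash-lite-preview": (0.9823, 0.9711),
    },
    "nextjs-vercel": {
        "gemini-2.5-flash": (0.9893, 0.9646),
        "gemini-2.5-flash-lite": (0.9902, 0.9457),
        "gemini-3-flash-preview": (0.9915, 0.9667),
        "gemini-3.1-flash-lite-preview": (0.9886, 0.9595),
    },
    "back-translation": {
        "gemini-2.5-flash": (0.9851, 0.9498),
        "gemini-2.5-flash-lite": (0.9870, 0.9376),
        "gemini-3-flash-preview": (0.9791, 0.9511),
        "gemini-3.1-flash-lite-preview": (0.9838, 0.9312),
    },
}


def ranks(values):
    """Rank positions for a list of scores (1 = highest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman rank correlation, no-ties formula: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def compare_embeddings(table):
    """Per article: score spread under each embedding model and rank agreement."""
    summary = {}
    for article, rows in table.items():
        emb1 = [pair[0] for pair in rows.values()]
        emb2 = [pair[1] for pair in rows.values()]
        summary[article] = {
            "spread_emb1": max(emb1) - min(emb1),
            "spread_emb2": max(emb2) - min(emb2),
            "spearman": spearman(emb1, emb2),
        }
    return summary


if __name__ == "__main__":
    for article, stats in compare_embeddings(TABLE).items():
        print(article, {key: round(value, 4) for key, value in stats.items()})
```

On the table's data the spread is wider under emb-2 for all three articles, and the rank correlations come out at 0.8 (hello-world), 0.4 (nextjs-vercel), and -0.4 (back-translation). With only four models per article these correlations are coarse, so they are best read as a prompt to check ranking agreement per article rather than assume it.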

Which Scouter to Use?

For the time being, I plan to continue using gemini-embedding-001 as the main Scouter. The reasons are:

Compatibility with existing data (comparable to scores accumulated with emb-1).
Where the two embedding models rank the translation models similarly, the higher absolute values of emb-1 are easier to read intuitively.
emb-2 is still in preview, and API specifications might change.

However, since emb-2 might be a "next-gen Scouter capable of reading higher power levels," I'll keep collecting data for both.

Summary and Future Outlook

Key Findings

There is no "ultimate translation model" — The best model depends on the content of the article.
gemini-2.5-flash is the "stable type" — It ranks highly in all articles. It is a reasonable choice as a main model.
Switching the embedding model lowers absolute scores (emb-2 is up to about 0.05 lower) and can reorder closely ranked models, so scores from different embedding models should not be mixed.
Scores from translate and evaluate modes cannot be directly compared — modes must be consistent.

Scouters Explode — The Fate of AI Benchmarks, Dragon Ball Style

Scouters have limits.
When Vegeta read Goku's power level ("It's over 8,000!", or 9,000 if you grew up with the English dub), he crushed his Scouter in sheer disbelief. All models clustering between 0.97 and 0.99 under emb-1 may be exactly that: the Scouter's measurement capability approaching its ceiling.

On Planet Namek, Dodoria's and Zarbon's Scouters exploded trying to read Vegeta's power level, and higher-spec models were deployed. Testing emb-2 (the next-gen model) was driven by the same motivation. The results did show more differentiation — but even next-gen Scouters might eventually go "boom."

Ultimately, you need to sense ki without relying on a Scouter.
That is, complement the numbers with mechanisms that capture translation naturalness and nuance, such as critic models or human review. That's the next step.

Future Plans

Grow the article pool to verify trends — three articles aren't enough. See if genre-based patterns emerge.
Add Groq (Llama 3) and DeepSeek — compare against free models from other providers.
Build a translation quality dashboard — visualize scores in graphs so anyone can explore them.
Auto-select models by genre — pick the best model automatically based on article tags.
Introduce a critic model — evaluate translation naturalness from perspectives that embedding models can't measure.

The data in this article is as of 2026-04-04. As more articles and models are added, results will change — I plan to continue writing about how things evolve.

The back-translation mechanism and the blog's tech stack are each introduced in separate articles on this blog.