AI Translation Model Comparison: Back-Translation Scores Across the Gemini Family
Using the back-translation mechanism introduced in the previous article, I compared multiple translation models head-to-head.
This is an interim report on the back-translation score (cosine similarity) for three posts from this blog, comparing translations by four models in the Google Gemini family, by Claude, and by hand.
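As a reminder of what the score measures, here is a minimal sketch of cosine similarity between two embedding vectors. The tiny 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper name, not code taken from the actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the embeddings of the original text
# and of its back-translation (assumed values, for illustration only).
original = [0.2, 0.7, 0.1]
round_trip = [0.25, 0.65, 0.12]
score = cosine_similarity(original, round_trip)
print(f"back-translation score: {score:.4f}")
```

The closer the back-translated text lands to the original in the embedding space, the closer the score gets to 1.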
Target Models
| Model | Characteristics | Free Tier RPD (requests per day) |
|---|---|---|
| gemini-2.5-flash | Main model. Balanced | 20 |
| gemini-2.5-flash-lite | Lightweight version of 2.5 | 20 |
| gemini-3-flash-preview | Next-gen model (preview) | 20 |
| gemini-3.1-flash-lite-preview | Lightweight version of 3.1 (preview) | 500 |
| claude-opus-4.6 | Anthropic's top-tier model (evaluate existing English translation) | - |
| manual | Manual translation (evaluate existing English translation) | - |
In "translate" mode, the same model handles both the forward translation and the back-translation. In "evaluate" mode, an existing English translation (written by Claude or a human) is back-translated by Gemini to calculate the score. More on this difference below.
Model Score Summary
All scores in this section use the same embedding model, gemini-embedding-001 (768 dimensions).
hello-world (Short text, 188 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9836 |
| 2 | gemini-3.1-flash-lite-preview | 0.9823 |
| 3 | gemini-2.5-flash | 0.9795 |
| 4 | manual (evaluate) | 0.9712 |
| 5 | gemini-2.5-flash-lite | 0.9701 |
nextjs-vercel (Technical article, 2,530 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9915 |
| 2 | gemini-2.5-flash-lite | 0.9902 |
| 3 | gemini-2.5-flash | 0.9893 |
| 4 | gemini-3.1-flash-lite-preview | 0.9886 |
| 5 | claude-opus-4.6 (evaluate) | 0.9785 |
back-translation (Translation explanation article, 3,141 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-2.5-flash-lite | 0.9870 |
| 2 | gemini-2.5-flash | 0.9851 |
| 3 | gemini-3.1-flash-lite-preview | 0.9838 |
| 4 | claude-opus-4.6 (evaluate) | 0.9834 |
| 5 | gemini-3-flash-preview | 0.9791 |
Observation: Rankings Shift Depending on the Article
Interestingly, the model rankings shifted depending on the article.
- gemini-3-flash-preview achieved the highest score of 0.9915 in the technical article (nextjs-vercel), but ranked last with 0.9791 in the translation explanation article (back-translation)
- gemini-2.5-flash-lite topped the explanation article with 0.9870, but ranked last in the short text (hello-world) with 0.9701
- gemini-2.5-flash is a "stable model," consistently ranking in the top tier (2nd–3rd) across all articles
In short, there's no single "best model" across the board. The winner depends on the nature of the text — technical, casual, or short.
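The rank shifts can be checked mechanically. The sketch below ranks the four Gemini models per article using the scores from the tables above and reports each model's best and worst rank. The evaluate-mode rows (manual, claude-opus-4.6) are left out so that only translate-mode scores are ranked against each other, which means ranks here run from 1 to 4. The function names are illustrative.

```python
# Translate-mode scores copied from the three tables above.
SCORES = {
    "hello-world": {
        "gemini-3-flash-preview": 0.9836,
        "gemini-3.1-flash-lite-preview": 0.9823,
        "gemini-2.5-flash": 0.9795,
        "gemini-2.5-flash-lite": 0.9701,
    },
    "nextjs-vercel": {
        "gemini-3-flash-preview": 0.9915,
        "gemini-2.5-flash-lite": 0.9902,
        "gemini-2.5-flash": 0.9893,
        "gemini-3.1-flash-lite-preview": 0.9886,
    },
    "back-translation": {
        "gemini-2.5-flash-lite": 0.9870,
        "gemini-2.5-flash": 0.9851,
        "gemini-3.1-flash-lite-preview": 0.9838,
        "gemini-3-flash-preview": 0.9791,
    },
}

def rank_models(scores):
    """Return {model: rank}, with rank 1 for the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def rank_ranges(all_scores):
    """Return {model: (best_rank, worst_rank)} across all articles."""
    ranks = {}
    for scores in all_scores.values():
        for model, rank in rank_models(scores).items():
            ranks.setdefault(model, []).append(rank)
    return {model: (min(r), max(r)) for model, r in ranks.items()}

for model, (best, worst) in sorted(rank_ranges(SCORES).items()):
    print(f"{model}: best rank {best}, worst rank {worst}")
```

A model with a narrow range, such as gemini-2.5-flash, is what the list above calls "stable"; a wide range means the model's standing depends heavily on the article.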
Hypothesis on Trends by Article Type
| Article Characteristics | Strongest Model | Hypothesis |
|---|---|---|
| Technical article (clear terminology) | gemini-3-flash-preview | Good at 1:1 technical term mapping? |
| Explanatory article (long logic) | gemini-2.5-flash-lite | Good at simple translation while maintaining context? |
| Short text (low information volume) | gemini-3-flash-preview | Accurate translation even with little context? |
With only three articles, this is still just a hypothesis, but as the number of articles increases, it might be possible to match models to content types — "Model A for movie reviews, Model B for technical articles."
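If the hypothesis holds up, matching models to content could start as a plain lookup. This is only a sketch: the article-type labels (`"technical"`, `"explanatory"`, `"short"`) and the `pick_model` helper are hypothetical, and the mapping simply restates the hypothesis table above, with gemini-2.5-flash as an assumed fallback because it was the most stable model so far.

```python
# Hypothetical article-type labels mapped to the strongest model
# from the hypothesis table above.
BEST_BY_TYPE = {
    "technical": "gemini-3-flash-preview",
    "explanatory": "gemini-2.5-flash-lite",
    "short": "gemini-3-flash-preview",
}

# Assumed fallback: the model that ranked 2nd or 3rd on every article.
DEFAULT_MODEL = "gemini-2.5-flash"

def pick_model(article_type):
    """Return the model to try first for a given article type."""
    return BEST_BY_TYPE.get(article_type, DEFAULT_MODEL)

print(pick_model("technical"))
print(pick_model("movie-review"))  # unknown type, falls back to the default
```

With only three data points behind it, a table like this would need to be rebuilt from fresh scores as more articles are added.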
Score Differences Between Translate and Evaluate Modes
One thing to watch out for: scores from these two modes aren't directly comparable.
- Translate mode: The same model handles both the forward translation and the back-translation. Because the model's own translation quirks tend to cancel out over the round trip, scores tend to be higher.
- Evaluate mode: Gemini back-translates an existing English translation (written by Claude or a human). Since the translator and the back-translator are different, scores tend to be lower.
For example, the evaluate scores for claude-opus-4.6 (0.9834, 0.9785) might look low when compared directly with Gemini's translate scores, but this does not necessarily mean Claude's translation is "inferior." In round-trips between different models, differences in expression choice affect the score.
For a fair comparison, you'd probably want to run all models in translate mode (each translating and back-translating itself) or all in evaluate mode (a single model back-translating everyone's output).
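To make the difference between the two modes concrete, here is a rough sketch with the translation and embedding steps passed in as plain functions. Everything in it is an assumption for illustration: `toy_embed` is a character-count stand-in for a real embedding API, and the `translate` / `back_translate` callables stand in for real model calls. It is not the blog's actual implementation.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding (character counts). A real run would call an embedding API."""
    return Counter(text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def translate_mode_score(source, translate, back_translate, embed=toy_embed):
    """Translate mode: one model translates forward, then back-translates its own output."""
    english = translate(source)
    round_trip = back_translate(english)
    return cosine(embed(source), embed(round_trip))

def evaluate_mode_score(source, existing_english, back_translate, embed=toy_embed):
    """Evaluate mode: an existing English translation is back-translated by a fixed model."""
    round_trip = back_translate(existing_english)
    return cosine(embed(source), embed(round_trip))

# Dummy "translators" so the sketch runs without any API access.
print(translate_mode_score("hello", str.upper, str.lower))
print(evaluate_mode_score("hello", "HELLO THERE", str.lower))
```

For a like-for-like comparison, every candidate would go through the same function with the same `back_translate` and `embed` arguments.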
Embedding Model Comparison: gemini-embedding-001 vs gemini-embedding-2-preview
Borrowing the Dragon Ball analogy from the previous article: translation models are "power levels" and the embedding model that measures them is the "Scouter." No matter how strong the Saiyans (translation models) are, a Scouter can't produce useful numbers if it can't properly quantify that strength. You need the same Scouter to measure every fighter for a fair comparison — and if the power level is too high, an outdated Scouter will just break.
Now let's compare the old and new Scouter models. Google currently provides two embedding models.
gemini-embedding-001 is a text-only embedding model, while gemini-embedding-2-preview is based on Gemini 3 and is described as "natively multimodal": it can vectorize not just text but also images, audio, video, and PDFs in the same space.
The newer base model is likely the bigger factor, but it's also intriguing to see how multimodal capability affects results for what are essentially text-only blog posts.
| Embedding Model | Dimension (after reduction) | RPD |
|---|---|---|
| gemini-embedding-001 | 768 | 1,000 |
| gemini-embedding-2-preview | 768 | 1,000 |
I compared the scores by changing only the embedding model while keeping each translation model × article combination the same. In the table, emb-1 is gemini-embedding-001 and emb-2 is gemini-embedding-2-preview.
Score Comparison Table
| Article | Translation Model | emb-1 | emb-2 | Difference |
|---|---|---|---|---|
| hello-world | gemini-2.5-flash | 0.9795 | 0.9531 | -0.026 |
| hello-world | gemini-2.5-flash-lite | 0.9701 | 0.9698 | -0.000 |
| hello-world | gemini-3-flash-preview | 0.9836 | 0.9818 | -0.002 |
| hello-world | gemini-3.1-flash-lite-preview | 0.9823 | 0.9711 | -0.011 |
| nextjs-vercel | gemini-2.5-flash | 0.9893 | 0.9646 | -0.025 |
| nextjs-vercel | gemini-2.5-flash-lite | 0.9902 | 0.9457 | -0.045 |
| nextjs-vercel | gemini-3-flash-preview | 0.9915 | 0.9667 | -0.025 |
| nextjs-vercel | gemini-3.1-flash-lite-preview | 0.9886 | 0.9595 | -0.029 |
| back-translation | gemini-2.5-flash | 0.9851 | 0.9498 | -0.035 |
| back-translation | gemini-2.5-flash-lite | 0.9870 | 0.9376 | -0.049 |
| back-translation | gemini-3-flash-preview | 0.9791 | 0.9511 | -0.028 |
| back-translation | gemini-3.1-flash-lite-preview | 0.9838 | 0.9312 | -0.053 |
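The Difference column is simply the emb-2 score minus the emb-1 score. As a quick sanity check, the sketch below recomputes it from the table and also averages it per article. The helper name is illustrative.

```python
# (article, translation model, emb-1 score, emb-2 score), copied from the table above.
ROWS = [
    ("hello-world", "gemini-2.5-flash", 0.9795, 0.9531),
    ("hello-world", "gemini-2.5-flash-lite", 0.9701, 0.9698),
    ("hello-world", "gemini-3-flash-preview", 0.9836, 0.9818),
    ("hello-world", "gemini-3.1-flash-lite-preview", 0.9823, 0.9711),
    ("nextjs-vercel", "gemini-2.5-flash", 0.9893, 0.9646),
    ("nextjs-vercel", "gemini-2.5-flash-lite", 0.9902, 0.9457),
    ("nextjs-vercel", "gemini-3-flash-preview", 0.9915, 0.9667),
    ("nextjs-vercel", "gemini-3.1-flash-lite-preview", 0.9886, 0.9595),
    ("back-translation", "gemini-2.5-flash", 0.9851, 0.9498),
    ("back-translation", "gemini-2.5-flash-lite", 0.9870, 0.9376),
    ("back-translation", "gemini-3-flash-preview", 0.9791, 0.9511),
    ("back-translation", "gemini-3.1-flash-lite-preview", 0.9838, 0.9312),
]

def mean_difference_by_article(rows):
    """Average (emb-2 minus emb-1) per article."""
    diffs = {}
    for article, _model, emb1, emb2 in rows:
        diffs.setdefault(article, []).append(emb2 - emb1)
    return {article: sum(d) / len(d) for article, d in diffs.items()}

for article, model, emb1, emb2 in ROWS:
    print(f"{article:17} {model:30} {emb2 - emb1:+.3f}")

MEANS = mean_difference_by_article(ROWS)
for article, mean in MEANS.items():
    print(f"{article}: mean difference {mean:+.3f}")
```

On these 12 rows the mean difference comes out at roughly -0.010 for hello-world, -0.031 for nextjs-vercel, and -0.041 for back-translation, so the gap grows with article length in this small sample.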
Trends
- emb-2 scores are generally 0.00 to 0.05 lower. This is not due to translation quality, but rather the characteristics of the embedding model's vector space.
- emb-2 may be more sensitive to subtle differences between texts (a more spread-out vector space), though this is a guess from these scores rather than something measured directly.
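- The relative rankings between models are only partly preserved. gemini-3-flash-preview stays first on hello-world and nextjs-vercel under both embedding models, but on back-translation it moves from last place with emb-1 to first with emb-2, and gemini-2.5-flash-lite drops from second to last on nextjs-vercel.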
- The large variance in the difference is interesting. The difference for gemini-3-flash-preview × hello-world is almost zero (-0.002), whereas it is quite wide for gemini-3.1-flash-lite-preview × back-translation (-0.053). How much the embedding model's characteristics show up seems to depend on article length and model compatibility.
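One way to quantify how well the model order carries over from one embedding model to the other is Spearman's rank correlation, computed per article. The sketch below uses the textbook formula for untied ranks on the scores from the comparison table. The function names are illustrative, and with only four models per article the result is a rough indicator at best.

```python
# {article: {model: (emb-1 score, emb-2 score)}}, copied from the comparison table.
PAIRS = {
    "hello-world": {
        "gemini-2.5-flash": (0.9795, 0.9531),
        "gemini-2.5-flash-lite": (0.9701, 0.9698),
        "gemini-3-flash-preview": (0.9836, 0.9818),
        "gemini-3.1-flash-lite-preview": (0.9823, 0.9711),
    },
    "nextjs-vercel": {
        "gemini-2.5-flash": (0.9893, 0.9646),
        "gemini-2.5-flash-lite": (0.9902, 0.9457),
        "gemini-3-flash-preview": (0.9915, 0.9667),
        "gemini-3.1-flash-lite-preview": (0.9886, 0.9595),
    },
    "back-translation": {
        "gemini-2.5-flash": (0.9851, 0.9498),
        "gemini-2.5-flash-lite": (0.9870, 0.9376),
        "gemini-3-flash-preview": (0.9791, 0.9511),
        "gemini-3.1-flash-lite-preview": (0.9838, 0.9312),
    },
}

def ranks(scores):
    """Return {model: rank}, with rank 1 for the highest score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(scores_a, scores_b):
    """Spearman's rho for two score dicts over the same models (no ties)."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

RHO = {}
for article, models in PAIRS.items():
    emb1 = {m: s[0] for m, s in models.items()}
    emb2 = {m: s[1] for m, s in models.items()}
    RHO[article] = spearman(emb1, emb2)
    print(f"{article}: rho = {RHO[article]:+.1f}")
```

A value of +1 means both embedding models order the translation models identically, 0 means no relationship, and a negative value means the order is partly reversed.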
Which Scouter to Use?
For the time being, I plan to continue using gemini-embedding-001 as the main Scouter. The reasons are:
- Compatibility with existing data (comparable to scores accumulated with emb-1).
- If the relative ranking remains the same, a higher absolute value is more intuitively understandable.
- emb-2 is still in preview, and API specifications might change.
However, since emb-2 might be a "next-gen Scouter capable of reading higher power levels," I'll keep collecting data for both.
Summary and Future Outlook
Key Findings
- There is no "ultimate translation model" — The best model depends on the content of the article.
- gemini-2.5-flash is the "stable type" — It ranks highly in all articles. It is a reasonable choice as a main model.
- Embedding model differences shift the absolute score values, and in this data they also reorder some models, so rankings should only be compared within a single embedding model.
- Scores from translate and evaluate modes cannot be directly compared — modes must be consistent.
Scouters Explode — The Fate of AI Benchmarks, Dragon Ball Style
Scouters have limits.
When Vegeta read Goku's power level — "It's over 8,000!" (or 9,000, if you grew up with the English dub) — he crushed his Scouter in sheer disbelief. All models clustering in the 0.98 range may be exactly that: the Scouter's measurement capability approaching its ceiling.
On Planet Namek, Dodoria's and Zarbon's Scouters exploded trying to read Vegeta's power level, and higher-spec models were deployed. Testing emb-2 (the next-gen model) was driven by the same motivation. The results did show more differentiation — but even next-gen Scouters might eventually go "boom."
Ultimately, you need to sense ki without relying on a Scouter.
That is, complement the numbers with mechanisms that capture translation naturalness and nuance, such as critic models or human review. That's the next step.
Future Plans
- Grow the article pool to verify trends — three articles aren't enough. See if genre-based patterns emerge.
- Add Groq (Llama 3) and DeepSeek — compare against free models from other providers.
- Build a translation quality dashboard — visualize scores in graphs so anyone can explore them.
- Auto-select models by genre — pick the best model automatically based on article tags.
- Introduce a critic model — evaluate translation naturalness from perspectives that embedding models can't measure.
The data in this article is as of 2026-04-04. As more articles and models are added, results will change — I plan to continue writing about how things evolve.
The mechanism for back-translation is introduced in this article, and the blog's tech stack is introduced in this article.