AI Translation Model Comparison: The Power of the Gemini Family Through Back-Translation Scores
Introduction
Using the back-translation mechanism introduced in the previous article, I ran a practical comparison of several translation models.
This interim report calculates the back-translation score (cosine similarity) for three blog posts across four models in the Google Gemini family, plus Claude and a manual translation.
Target Models
| Model | Characteristics | Free Tier RPD |
|---|---|---|
| gemini-2.5-flash | Main model. Balanced | 20 |
| gemini-2.5-flash-lite | Lightweight version of 2.5 | 20 |
| gemini-3-flash-preview | Next-gen model (preview) | 20 |
| gemini-3.1-flash-lite-preview | Lightweight version of 3.1 (preview) | 500 |
| claude-opus-4 | Anthropic's top-tier model (evaluate existing English translation) | - |
| manual | Manual translation (evaluate existing English translation) | - |
In "translate" mode, the specified model performs both the translation and the re-translation itself. In "evaluate" mode, an existing English translation (produced by Claude or by hand) is re-translated into Japanese by Gemini to calculate the score. The difference between the two modes is discussed later.
Model Score Summary
All scores below use the same embedding model, gemini-embedding-001 (768 dimensions).
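As a reminder of what the score measures: it is the cosine similarity between the embedding of the original Japanese text and the embedding of the back-translated text. A minimal sketch in plain Python (the 768-dimensional gemini-embedding-001 vectors are replaced by tiny placeholder vectors here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In the real pipeline these are 768-dimensional vectors returned by
# gemini-embedding-001; tiny placeholder vectors are used for illustration.
original_vec = [0.8, 0.1, 0.5]        # embedding of the original Japanese
backtranslated_vec = [0.7, 0.2, 0.5]  # embedding of the back-translation
score = cosine_similarity(original_vec, backtranslated_vec)
```

A score of 1.0 means the two embeddings point in exactly the same direction; values near 0.98 (as in the tables below) indicate the back-translation stayed semantically very close to the original.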
hello-world (Short text, 188 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9836 |
| 2 | gemini-3.1-flash-lite-preview | 0.9823 |
| 3 | gemini-2.5-flash | 0.9795 |
| 4 | manual (evaluate) | 0.9712 |
| 5 | gemini-2.5-flash-lite | 0.9701 |
nextjs-vercel (Technical article, 2,530 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9915 |
| 2 | gemini-2.5-flash-lite | 0.9902 |
| 3 | gemini-2.5-flash | 0.9893 |
| 4 | gemini-3.1-flash-lite-preview | 0.9886 |
| 5 | claude-opus-4 (evaluate) | 0.9785 |
back-translation (Translation explanation article, 3,141 characters)
| Rank | Model | Score |
|---|---|---|
| 1 | gemini-2.5-flash-lite | 0.9870 |
| 2 | gemini-2.5-flash | 0.9851 |
| 3 | gemini-3.1-flash-lite-preview | 0.9838 |
| 4 | claude-opus-4 (evaluate) | 0.9834 |
| 5 | gemini-3-flash-preview | 0.9791 |
Observation: Rankings Shift Depending on the Article
The most interesting finding is that the model rankings change depending on the article.
- gemini-3-flash-preview achieved the highest score of 0.9915 in the technical article (nextjs-vercel), but ranked last with 0.9791 in the translation explanation article (back-translation).
- gemini-2.5-flash-lite topped the explanation article with 0.9870, but ranked last in the short text (hello-world) with 0.9701.
- gemini-2.5-flash is a "stable model," consistently ranking in the top tier (2nd–3rd) across all articles.
In other words, no single model is "the strongest" across the board. The best model appears to depend on the nature of the text (technical, explanatory, or short).
Hypothesis on Trends by Article Type
| Article Characteristics | Strongest Model | Hypothesis |
|---|---|---|
| Technical article (clear terminology) | gemini-3-flash-preview | Good at 1:1 technical term mapping? |
| Explanatory article (long logic) | gemini-2.5-flash-lite | Good at simple translation while maintaining context? |
| Short text (low information volume) | gemini-3-flash-preview | Accurate translation even with little context? |
With only three articles, this is still just a hypothesis, but as the number of articles increases, it might be possible to use specific models for specific purposes, such as "Model A for movie reviews, Model B for technical articles."
Score Differences Between Translate and Evaluate Modes
One point of caution is that the nature of the scores differs between "translate" mode and "evaluate" mode.
- Translate mode: The same model handles both "translation" and "re-translation." Because the model's own translation quirks cancel each other out in the round-trip, the scores tend to be higher.
- Evaluate mode: Gemini re-translates an existing English translation (written by Claude or a human). Since the translator and the re-translator are different, scores tend to be lower.
For example, the evaluate scores for claude-opus-4 (0.9834, 0.9785) might look low when compared directly with Gemini's translate scores, but this does not necessarily mean Claude's translation is "inferior." In round-trips between different models, differences in expression choice affect the score.
For a fair comparison, it is ideal to either have all models in translate mode (translate and re-translate themselves) or all models in evaluate mode (re-translated by the same model).
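The structural difference between the two modes can be sketched as follows. The translator is injected as a callable so the sketch stays self-contained; the function names and signatures are assumptions for illustration, not the actual pipeline code:

```python
from typing import Callable

Translate = Callable[[str, str], str]    # (text, target_lang) -> translated text
Similarity = Callable[[str, str], float]  # embedding cosine similarity

def score_translate_mode(source_ja: str, model: Translate,
                         similarity: Similarity) -> float:
    """Translate mode: the SAME model does both JA->EN and EN->JA."""
    english = model(source_ja, "en")
    round_trip = model(english, "ja")
    return similarity(source_ja, round_trip)

def score_evaluate_mode(source_ja: str, existing_english: str,
                        gemini: Translate, similarity: Similarity) -> float:
    """Evaluate mode: an existing English translation (by Claude or a human)
    is back-translated to Japanese by Gemini; only EN->JA is performed."""
    round_trip = gemini(existing_english, "ja")
    return similarity(source_ja, round_trip)
```

Even with a toy identity stub for the translator, translate mode's round trip reproduces its own input, while evaluate mode's score depends on how the externally produced English maps back — a small illustration of why scores from the two modes are not directly comparable.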
Embedding Model Comparison: gemini-embedding-001 vs gemini-embedding-2-preview
I also compared the "yardstick" used to calculate the scores, not just the translation models. Google currently provides two embedding models.
| Embedding Model | Dimension (after reduction) | RPD |
|---|---|---|
| gemini-embedding-001 | 768 | 1,000 |
| gemini-embedding-2-preview | 768 | 1,000 |
I compared the scores by changing only the embedding model while keeping the translation model × article combination the same.
Score Comparison Table
| Article | Translation Model | emb-1 | emb-2 | Difference |
|---|---|---|---|---|
| hello-world | gemini-2.5-flash | 0.9795 | 0.9531 | -0.026 |
| hello-world | gemini-2.5-flash-lite | 0.9701 | 0.9698 | -0.000 |
| hello-world | gemini-3-flash-preview | 0.9836 | 0.9818 | -0.002 |
| hello-world | gemini-3.1-flash-lite-preview | 0.9823 | 0.9711 | -0.011 |
| nextjs-vercel | gemini-2.5-flash | 0.9893 | 0.9646 | -0.025 |
| nextjs-vercel | gemini-2.5-flash-lite | 0.9902 | 0.9457 | -0.045 |
| nextjs-vercel | gemini-3-flash-preview | 0.9915 | 0.9667 | -0.025 |
| nextjs-vercel | gemini-3.1-flash-lite-preview | 0.9886 | 0.9595 | -0.029 |
| back-translation | gemini-2.5-flash | 0.9851 | 0.9498 | -0.035 |
| back-translation | gemini-2.5-flash-lite | 0.9870 | 0.9376 | -0.049 |
| back-translation | gemini-3-flash-preview | 0.9791 | 0.9511 | -0.028 |
| back-translation | gemini-3.1-flash-lite-preview | 0.9838 | 0.9312 | -0.053 |
Trends
- emb-2 scores are generally 0.00 to 0.05 lower. This is not due to translation quality, but rather the characteristics of the embedding model's vector space.
- emb-2 is considered more sensitive to subtle differences between texts (the vector space is more spread out).
- However, the relative rankings between models remain mostly consistent. Models that scored high with emb-1 also tend to score high with emb-2.
- The large variance in the differences is interesting. The gap for gemini-2.5-flash-lite × hello-world is essentially zero (-0.000), whereas gemini-3.1-flash-lite-preview × back-translation drops by a full -0.053. How strongly the embedding model's characteristics show up seems to depend on article length and on the compatibility between translation model and embedding model.
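The spread described in the last bullet can be recomputed directly from the comparison table; a quick sketch with the score pairs copied from the table:

```python
# Score pairs (emb-1, emb-2) copied from the comparison table above.
scores = {
    ("hello-world", "gemini-2.5-flash"): (0.9795, 0.9531),
    ("hello-world", "gemini-2.5-flash-lite"): (0.9701, 0.9698),
    ("hello-world", "gemini-3-flash-preview"): (0.9836, 0.9818),
    ("hello-world", "gemini-3.1-flash-lite-preview"): (0.9823, 0.9711),
    ("nextjs-vercel", "gemini-2.5-flash"): (0.9893, 0.9646),
    ("nextjs-vercel", "gemini-2.5-flash-lite"): (0.9902, 0.9457),
    ("nextjs-vercel", "gemini-3-flash-preview"): (0.9915, 0.9667),
    ("nextjs-vercel", "gemini-3.1-flash-lite-preview"): (0.9886, 0.9595),
    ("back-translation", "gemini-2.5-flash"): (0.9851, 0.9498),
    ("back-translation", "gemini-2.5-flash-lite"): (0.9870, 0.9376),
    ("back-translation", "gemini-3-flash-preview"): (0.9791, 0.9511),
    ("back-translation", "gemini-3.1-flash-lite-preview"): (0.9838, 0.9312),
}

# emb-2 minus emb-1: a negative value means emb-2 scored lower.
diffs = {pair: emb2 - emb1 for pair, (emb1, emb2) in scores.items()}

narrowest = max(diffs, key=diffs.get)  # smallest drop across the 12 pairs
widest = min(diffs, key=diffs.get)     # largest drop across the 12 pairs
```

Across the twelve pairs the drop ranges from roughly 0.000 up to about 0.053, all in the same direction (emb-2 lower).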
Which should be the "yardstick"?
For the time being, I plan to continue using gemini-embedding-001 as the main yardstick. The reasons are:
- Compatibility with existing data (comparable to scores accumulated with emb-1).
- If the relative ranking remains the same, a higher absolute value is more intuitively understandable.
- emb-2 is still in preview, and API specifications might change.
However, since emb-2 might be a "more rigorous yardstick," I plan to continue accumulating data for both.
Summary and Future Outlook
Key Findings
- There is no "ultimate translation model" — The best model depends on the content of the article.
- gemini-2.5-flash is the "stable type" — It ranks highly in all articles. It is a reasonable choice as a main model.
- Embedding model differences affect the absolute score value but preserve the relative ranking.
- Scores from translate and evaluate modes cannot be directly compared — modes must be consistent.
Future Plans
- Increase the number of articles to verify trends — three articles are too small a sample; see whether genre-based trends emerge.
- Addition of Groq (Llama 3) and DeepSeek — Comparison with free models from other companies.
- Construction of a translation quality dashboard — Visualize scores in graphs so anyone can view them.
- Automatic model selection by genre — A system that automatically selects the best model based on article tags.
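The last item could start as a simple tag-to-model lookup. A minimal sketch; the tag names and the mapping merely mirror the per-article winners observed so far and are assumptions, not settled recommendations:

```python
# Hypothetical genre -> model routing table. The entries mirror the
# per-article winners from this experiment and would be revised as data grows.
GENRE_MODEL = {
    "tech": "gemini-3-flash-preview",      # strongest on nextjs-vercel
    "explainer": "gemini-2.5-flash-lite",  # strongest on back-translation
}
DEFAULT_MODEL = "gemini-2.5-flash"         # the "stable type" fallback

def pick_model(tags: list[str]) -> str:
    """Pick a translation model from article tags, falling back to the default."""
    for tag in tags:
        if tag in GENRE_MODEL:
            return GENRE_MODEL[tag]
    return DEFAULT_MODEL
```

With more articles, the hard-coded table could be replaced by one regenerated automatically from accumulated back-translation scores.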
The data in this article is as of 2026-04-04. As more articles and models are added, results may change, so I plan to update this periodically.
The back-translation mechanism is introduced in this article, and the blog's tech stack in this article.