
AI Translation Model Comparison: The Power of the Gemini Family Through Back-Translation Scores

Introduction

Using the back-translation mechanism introduced in the previous article, I have conducted a practical comparison of multiple translation models.

This interim report presents back-translation scores (cosine similarity) for three blog posts, each translated by four models in the Google Gemini family, by Claude, and by hand.

Target Models

| Model | Characteristics | Free-Tier RPD |
|---|---|---|
| gemini-2.5-flash | Main model; balanced | 20 |
| gemini-2.5-flash-lite | Lightweight version of 2.5 | 20 |
| gemini-3-flash-preview | Next-gen model (preview) | 20 |
| gemini-3.1-flash-lite-preview | Lightweight version of 3.1 (preview) | 500 |
| claude-opus-4 | Anthropic's top-tier model (evaluate existing English translation) | - |
| manual | Manual translation (evaluate existing English translation) | - |

In "translate" mode, the specified model performs both the English translation and the re-translation back into Japanese. In "evaluate" mode, an existing English translation (created by Claude or by hand) is re-translated into Japanese by Gemini and then scored. The difference between the two modes is discussed later.
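The two modes can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Translator` callables stand in for real model API calls, and the `score` helper stands in for the embedding-based cosine similarity.

```python
from typing import Callable

# (text, target_language) -> translated text; stand-in for a real model API call
Translator = Callable[[str, str], str]

def translate_mode_score(text_ja: str, model: Translator,
                         score: Callable[[str, str], float]) -> float:
    """translate mode: the SAME model handles both directions of the round-trip."""
    text_en = model(text_ja, "en")
    back_ja = model(text_en, "ja")
    return score(text_ja, back_ja)

def evaluate_mode_score(text_ja: str, existing_en: str, re_translator: Translator,
                        score: Callable[[str, str], float]) -> float:
    """evaluate mode: an existing English translation is back-translated by a
    different model (Gemini), then scored against the Japanese original."""
    back_ja = re_translator(existing_en, "ja")
    return score(text_ja, back_ja)
```

Because translate mode lets one model's quirks cancel out over the round-trip while evaluate mode mixes two translators, the two modes produce scores on slightly different footings, as discussed below.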

Model Score Summary

All scores below use the same embedding model, gemini-embedding-001 (768 dimensions).
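The score itself is the cosine similarity between the embedding of the original Japanese text and the embedding of the back-translated text. A minimal sketch of that calculation (the 768-dimensional vectors would come from gemini-embedding-001 in practice; here the function just takes any two equal-length vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (e.g. 768-dim embeddings).

    Returns 1.0 for parallel vectors, 0.0 for orthogonal ones.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```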

hello-world (Short text, 188 characters)

| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9836 |
| 2 | gemini-3.1-flash-lite-preview | 0.9823 |
| 3 | gemini-2.5-flash | 0.9795 |
| 4 | manual (evaluate) | 0.9712 |
| 5 | gemini-2.5-flash-lite | 0.9701 |

nextjs-vercel (Technical article, 2,530 characters)

| Rank | Model | Score |
|---|---|---|
| 1 | gemini-3-flash-preview | 0.9915 |
| 2 | gemini-2.5-flash-lite | 0.9902 |
| 3 | gemini-2.5-flash | 0.9893 |
| 4 | gemini-3.1-flash-lite-preview | 0.9886 |
| 5 | claude-opus-4 (evaluate) | 0.9785 |

back-translation (Translation explanation article, 3,141 characters)

| Rank | Model | Score |
|---|---|---|
| 1 | gemini-2.5-flash-lite | 0.9870 |
| 2 | gemini-2.5-flash | 0.9851 |
| 3 | gemini-3.1-flash-lite-preview | 0.9838 |
| 4 | claude-opus-4 (evaluate) | 0.9834 |
| 5 | gemini-3-flash-preview | 0.9791 |

Observation: Rankings Shift Depending on the Article

The most interesting finding is that the model rankings change depending on the article.

  • gemini-3-flash-preview achieved the highest score of 0.9915 in the technical article (nextjs-vercel), but ranked last with 0.9791 in the translation explanation article (back-translation).
  • gemini-2.5-flash-lite topped the explanation article with 0.9870, but ranked last in the short text (hello-world) with 0.9701.
  • gemini-2.5-flash is a "stable model," consistently ranking in the top tier (2nd–3rd) across all articles.

In other words, it is impossible to say "this model is the strongest" across the board. It appears that the best model depends on the nature of the text (technical vs. casual vs. short).

Hypothesis on Trends by Article Type

| Article Type (characteristics) | Strongest Model | Hypothesis |
|---|---|---|
| Technical article (clear terminology) | gemini-3-flash-preview | Good at 1:1 technical term mapping? |
| Explanatory article (long chains of logic) | gemini-2.5-flash-lite | Good at simple translation while maintaining context? |
| Short text (low information volume) | gemini-3-flash-preview | Accurate translation even with little context? |

With only three articles, this is still just a hypothesis, but as the number of articles increases, it might be possible to use specific models for specific purposes, such as "Model A for movie reviews, Model B for technical articles."

Score Differences Between Translate and Evaluate Modes

One point of caution is that the nature of the scores differs between "translate" mode and "evaluate" mode.

  • Translate mode: The same model handles both "translation" and "re-translation." Because the model's own translation quirks cancel each other out in the round-trip, the scores tend to be higher.
  • Evaluate mode: Gemini re-translates an existing English translation (written by Claude or a human). Since the translator and the re-translator are different, scores tend to be lower.

For example, the evaluate scores for claude-opus-4 (0.9834, 0.9785) might look low when compared directly with Gemini's translate scores, but this does not necessarily mean Claude's translation is "inferior." In round-trips between different models, differences in expression choice affect the score.

For a fair comparison, it is ideal to either have all models in translate mode (translate and re-translate themselves) or all models in evaluate mode (re-translated by the same model).

Embedding Model Comparison: gemini-embedding-001 vs 002

I also compared the "yardstick" used to calculate the scores, not just the translation models. Google currently provides two embedding models.

| Embedding Model | Dimensions (after reduction) | RPD |
|---|---|---|
| gemini-embedding-001 | 768 | 1,000 |
| gemini-embedding-2-preview | 768 | 1,000 |

I compared the scores by changing only the embedding model while keeping the translation model × article combination the same.
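The comparison boils down to scoring every (article, model) text pair twice, once per embedder, and taking the difference. A hedged sketch of that loop, where `embedders` maps a name to any text-to-vector function (the real code would call the Gemini embedding API; the stub names here are assumptions for illustration):

```python
import math
from typing import Callable, Dict, List

def _cos(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def compare_embedders(original: str, back_translated: str,
                      embedders: Dict[str, Callable[[str], List[float]]]) -> Dict[str, float]:
    """Score the same (original, back-translation) pair with each embedding model.

    Only the yardstick changes between entries; the texts stay fixed, so any
    score difference reflects the embedder, not translation quality.
    """
    return {name: _cos(embed(original), embed(back_translated))
            for name, embed in embedders.items()}
```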

Score Comparison Table

| Article | Translation Model | emb-1 | emb-2 | Difference |
|---|---|---|---|---|
| hello-world | gemini-2.5-flash | 0.9795 | 0.9531 | -0.026 |
| hello-world | gemini-2.5-flash-lite | 0.9701 | 0.9698 | -0.000 |
| hello-world | gemini-3-flash-preview | 0.9836 | 0.9818 | -0.002 |
| hello-world | gemini-3.1-flash-lite-preview | 0.9823 | 0.9711 | -0.011 |
| nextjs-vercel | gemini-2.5-flash | 0.9893 | 0.9646 | -0.025 |
| nextjs-vercel | gemini-2.5-flash-lite | 0.9902 | 0.9457 | -0.045 |
| nextjs-vercel | gemini-3-flash-preview | 0.9915 | 0.9667 | -0.025 |
| nextjs-vercel | gemini-3.1-flash-lite-preview | 0.9886 | 0.9595 | -0.029 |
| back-translation | gemini-2.5-flash | 0.9851 | 0.9498 | -0.035 |
| back-translation | gemini-2.5-flash-lite | 0.9870 | 0.9376 | -0.049 |
| back-translation | gemini-3-flash-preview | 0.9791 | 0.9511 | -0.028 |
| back-translation | gemini-3.1-flash-lite-preview | 0.9838 | 0.9312 | -0.053 |

Trends

  • emb-2 scores are generally 0.00 to 0.05 lower. This is not due to translation quality, but rather the characteristics of the embedding model's vector space.
  • emb-2 is considered more sensitive to subtle differences between texts (the vector space is more spread out).
  • However, the relative rankings between models remain mostly consistent. Models that scored high with emb-1 also tend to score high with emb-2.
  • The large variance in the difference is interesting. The difference for gemini-2.5-flash-lite × hello-world is almost zero (-0.000), whereas it is quite wide for gemini-3.1-flash-lite-preview × back-translation (-0.053). The degree to which the embedding model's characteristics manifest seems to depend on article length and model compatibility.

Which should be the "yardstick"?

For the time being, I plan to continue using gemini-embedding-001 as the main yardstick. The reasons are:

  1. Compatibility with existing data (comparable to scores accumulated with emb-1).
  2. If the relative ranking remains the same, a higher absolute value is more intuitively understandable.
  3. emb-2 is still in preview, and API specifications might change.

However, since emb-2 might be a "more rigorous yardstick," I plan to continue accumulating data for both.

Summary and Future Outlook

Key Findings

  1. There is no "ultimate translation model" — The best model depends on the content of the article.
  2. gemini-2.5-flash is the "stable type" — It ranks highly in all articles. It is a reasonable choice as a main model.
  3. Embedding model differences affect the absolute score value but preserve the relative ranking.
  4. Scores from translate and evaluate modes cannot be directly compared — modes must be consistent.

Future Plans

  • Increase the number of articles to verify trends — Three articles are not enough for a sample size. See if genre-based trends emerge.
  • Addition of Groq (Llama 3) and DeepSeek — Comparison with free models from other companies.
  • Construction of a translation quality dashboard — Visualize scores in graphs so anyone can view them.
  • Automatic model selection by genre — A system that automatically selects the best model based on article tags.
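The genre-based selection in the last plan could start as a simple tag-to-model lookup with a stable fallback. The tag names and mapping below are purely illustrative assumptions based on this article's three data points, not a measured policy:

```python
# Illustrative mapping; the tags and per-genre choices are hypothetical.
MODEL_BY_TAG = {
    "tech": "gemini-3-flash-preview",      # topped the technical article
    "explainer": "gemini-2.5-flash-lite",  # topped the explanation article
}
DEFAULT_MODEL = "gemini-2.5-flash"         # the "stable type" fallback

def pick_model(tags: list[str]) -> str:
    """Return the model for the first recognized tag, else the stable default."""
    for tag in tags:
        if tag in MODEL_BY_TAG:
            return MODEL_BY_TAG[tag]
    return DEFAULT_MODEL
```

With more articles per genre, the mapping itself could be derived from accumulated scores rather than hard-coded.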

The data in this article is as of 2026-04-04. As more articles and models are added, results may change, so I plan to update this periodically.

The mechanism for back-translation is introduced in this article, and the blog's tech stack is introduced in this article.
