[Dz] My Translation Scored 0.96, But the English Version Fell Apart

Back-translation scores aren't a silver bullet

On this blog, I have an AI translate my Japanese posts into English, and I auto-grade the quality using "back-translation."

The setup is simple: take the Japanese, translate it to English, then translate it back to Japanese. Embed both the original and the round-tripped version, and compare them with cosine similarity. A high score is supposed to mean the meaning came through intact.

I wrote up the full mechanism in this post.

The other day, though, I hit a case where the score came in at a healthy 0.9635, and yet the English was completely broken as a piece of writing.

What the Project Hail Mary post was actually about

The post in question is the one I wrote about the movie Project Hail Mary — here.

It's structured around two main parts.

A 4DX experience report — comparing it to Godzilla Minus One, and how the energy in the theater spiked once Rocky showed up
A look at the Japanese title — the origin of "Hail Mary," and the problem with how Japan localizes movie titles

I ran the whole thing through gemini-3.1-flash-lite-preview and measured the back-translation score: 0.9635. By the numbers, a "high-quality translation."
But then I actually read the English…

So what was wrong with it?

The back half of the article was a mess.

1. Explaining American football to Americans

The original carefully walks Japanese readers through the origin of "Hail Mary pass," the phrase the title plays on. For a Japanese audience that's a necessary piece of background.
But translate that straight into English and you end up with an article asking English readers — Americans in particular — "Hey, do you know the Hail Mary pass in football?"

In American football, a "Hail Mary" is that desperate, last-second, long-shot pass teams throw when they're about to lose. It's a literal prayer to the Virgin Mary — a "help me out here" move when you have no other options.

My guess is that less than 1% of the Japanese audience catches that cultural nuance.

Half the country watches the Super Bowl. The phrase is basically idiomatic at this point. And here's a Japanese guy, in slightly off English, proudly explaining it to them. Charitably, it's pretty ridiculous.
You can't ask an American reader "what percentage of Japanese people do you think get this?" — they're just going to say, "I mean, I know what it is, so…"

2. Translating a localization problem to a world that doesn't have one

"Does the Japanese title Project Hail Mary actually land with Japanese readers?" is a question that only really resonates if you're reading it in Japanese.

English-speaking audiences don't really have a concept of "movie title localization" to begin with. Tell them "Project Hail Mary was released as Project Hail Mary," and the reaction is "…okay?" Without the context that in Japan, books and films routinely get released under completely different titles for cultural or marketing reasons, the gripe — "fine, then go rename The Martian to The Man on Mars while you're at it" — just doesn't compute for an English reader.

Still, if they wanted to stay true to the book, I wish they'd stuck with the book's title for The Martian (which was released as Odyssey in Japan). I'm curious to see what they do with the title for Nolan's upcoming The Odyssey.

It's just dryly reporting that "in Japan it came out as Odyssey." For an English reader that's a "huh, neat" and they move on. All the "what are you even doing??" energy of the original is gone.

3. Grace's name is not exactly a riddle

In the original I explain that the protagonist's name, Grace, comes from "Hail Mary, full of grace" — a naming choice that's basically a dad joke.

But for English speakers, especially Christian ones, the connection between Grace and "full of grace" is obvious. The "dad joke" line still works, but everything leading up to it becomes pointless padding.

Also, fun fact: the protagonist's name, Grace, is a nod to the "Hail Mary" prayer itself ("Hail Mary, full of grace"). It's basically a massive dad joke of a naming convention.

Walking a Christian who goes to church every Sunday through the opening line of the Hail Mary prayer like it's a fun trivia tidbit is, well… a look.

How did this happen?

So why was the score so high?
That's the interesting part.

What the back-translation score measures is information loss and drift. It quantifies how well the meaning of the original survives the round trip.

And every one of these problems is one where the information was preserved just fine.

The Hail Mary pass explanation → accurately conveyed in English
The localization complaint → accurately stated in English
Grace's name origin → accurately explained in English

The meaning came through. That's why the score is high.
It just doesn't mean anything to a native English reader.

Two orthogonal signals

What this experiment made clear is that translation quality has at least two independent axes.

Axis	What it measures	Result this time
Meaning Retention (back-translation score)	Does the information survive without loss or distortion?	0.9635 (high)
Native Suitability	Does it read naturally and meaningfully to a reader of the target language?	Broken (low)

These two are orthogonal signals. One can be high while the other is low — it happens all the time.

The reverse is also possible. A bold, liberal translation that fills in cultural context might read beautifully to a native, but when you round-trip it back to the original language, it'll drift from the source — meaning the score drops.

Literal version vs. localized version

I actually have two English versions of this post: the AI's literal translation, and one I manually rewrote for an English-speaking audience. Let's look at the same passages from above in the localized version.

The Hail Mary pass explanation

Literal version (gemini-3.1-flash-lite-preview, score 0.9635):

In American football, a "Hail Mary" is that desperate, last-second, long-shot pass teams throw when they're about to lose. It's a literal prayer to the Virgin Mary — a "help me out here" move when you have no other options.

My guess is that less than 1% of the Japanese audience catches that cultural nuance.

Localized version (manual, score 0.9396):

Here's something you probably never think about: the title Project Hail Mary is a masterpiece of naming. It tells you everything — the desperation, the prayer, the impossibly long odds. You hear "Hail Mary" and you instantly get the football metaphor, the religious undertone, the whole vibe of humanity's last-ditch effort.

Now imagine you've never watched a single football game in your life. You've never been to church. "Mary" is just a foreign name, and "Hail" means nothing to you.

The literal version explains football to Americans and asks "what percentage of Japanese people do you think get this?"
The localized version turns it around: "this feeling that's second nature to you — in Japan, it just vanishes."

Grace's name

Literal version:

Also, fun fact: the protagonist's name, Grace, is a nod to the "Hail Mary" prayer itself ("Hail Mary, full of grace"). It's basically a massive dad joke of a naming convention.

Localized version:

You've probably already noticed this, but the protagonist's name is itself a wink — "Hail Mary, full of grace." Andy Weir is not above a dad joke, and honestly, I respect that. In Japanese, this wordplay is completely invisible. "Grace" is just another foreign name, and the Ave Maria connection flies over everyone's head.

The literal version goes "did you know about the Ave Maria?"
The localized version opens with "you've already noticed, right?" and pivots into "most Japanese readers will completely miss this."

The Japanese-title rant

Literal version:

Still, if they wanted to stay true to the book, I wish they'd stuck with the book's title for The Martian (which was released as Odyssey in Japan). I'm curious to see what they do with the title for Nolan's upcoming The Odyssey.

Localized version:

Speaking of Japanese titles — while Project Hail Mary kept the English title as-is, Andy Weir's other adaptation The Martian was released as Odyssey in Japan. Not The Martian. Not the Japanese novel title The Man on Mars. Just... Odyssey. Imagine if your local theater renamed Interstellar to Space Journey for no apparent reason. That's the level of inconsistency we're dealing with.

The literal version just states that "in Japan it came out as Odyssey."
The localized version reaches for an analogy — "what if your local theater renamed Interstellar to Space Journey?" — so an English reader can actually feel that this is weird.

The higher-scoring version (literal) reads as a faintly ridiculous article to a native English speaker.
The lower-scoring version (localized) actually works as a piece of writing.
That's what "orthogonal signals" means in practice.

So this post won't use the AI translation

In the end, I set skip_translation: true on the Project Hail Mary post. The auto-translation pipeline skips it entirely. The localized version I quoted above is something I wrote by hand, separate from the pipeline.

There's no value in mechanically translating a post that only works because it's being read in Japanese. That's not translation — that's just language conversion.

If you actually want an English version, you need localization, not translation. You shift the perspective and rewrite for an English-speaking reader. Skip the Hail Mary explainer; instead, write from the outside looking in at how this gets received in Japan. It takes that level of structural rework.

What comes next

After all this back-and-forth with one article, the next step — really, the necessary next step — became obvious.
The plan is to add a critic model to the translation pipeline.

Take the English the translator (a Gemini Flash-class model) produced and hand it to a different model (Claude or GPT-4o) with a prompt like: "point out the problems from a native reader's perspective." The goal is to surface the "doesn't feel native" signal that the back-translation score can't see, as a second, independent axis.

Move toward multi-faceted quality evaluation, instead of leaning on one score. It's still at the planning stage, but this experiment gave me real-data justification for why it's needed.