[Dz] Testing 3 Different Translation Prompt Patterns

The trigger: auto-translation reads stiff

I write this blog's articles in Japanese, in Markdown, and a translation pipeline uses Gemini to translate them into English and write the results to posts/en/. The other day I edited the very first post, hello-world.md, which kicked off the auto-translation pipeline. Here's what came out.

This is the first post of Dazy Dayz Diary.

In this blog, I plan to write tech notes and share my thoughts on movies and anime I have watched.

### Features of this blog

- Articles written in Japanese are automatically translated by AI.
- Translation quality is automatically verified using back-translation.
- Multiple translation models are compared to select the optimal translation.

The meaning is perfectly accurate, but as English it reads stiff. "In this blog, I plan to", "Features of this blog", "I have watched": these are textbook "translated English" tells. Here's the hand-written version I had before, for comparison.

This is the very first post on Dazy Dayz Diary.

I'll be writing about movies, anime, and tech notes on this blog.

### What Makes This Blog Unique

- Blog posts written in Japanese are automatically translated by AI
- Translation quality is verified through back-translation
- Multiple translation models are compared to select the best output

The second one clearly sounds like someone actually writing an English blog. Short sentences, a catchier heading, a contraction (I'll), direct voice.

But when you score these two with back-translation (cosine similarity between source_embedding and back_translation_embedding), the AI version actually scores higher (0.9858). "High score but stiff English" — exactly the kind of score-vs-human-judgment gap I dug into in a separate post.
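For reference, the score is plain cosine similarity between two embedding vectors. Here is a minimal sketch, assuming the embeddings arrive as equal-length number arrays. The function name `cosineSimilarity` is mine and is not taken from the pipeline.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Hypothetical helper: the pipeline's actual implementation isn't shown in this post.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A value near 1 only says the two vectors point in almost the same direction. It says nothing about which property of the text made them similar.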

Hypothesis: maybe the prompt never asks for naturalness

When I checked translate.ts, the prompt looked like this.

Translate the following Japanese text to English.
Output ONLY the translated text, no explanations or metadata.
Keep markdown formatting intact.

CRITICAL: Preserve all URLs in markdown links EXACTLY as they appear in the source.
- Do NOT modify, shorten, or "clean up" any URL path...

It tells the model not to break URLs and to keep markdown intact, but there's nothing about style, naturalness, or tone. No wonder it falls back to literal translation.

So I tried three prompt variants, dialing up the "naturalness" pressure step by step, and compared the English output against the back-translation scores.

Three-pattern experiment

Pattern A: light — one extra line

  Translate the following Japanese text to English.
+ Write as a native speaker would write a personal blog post — natural, conversational, idiomatic.
  Output ONLY the translated text...

Pattern B: medium — call out the anti-patterns

Pattern A plus a concrete list of things to avoid.

Avoid translation-ese patterns:
- JA→EN: Don't start sentences with "In this blog," or "I plan to" — use direct, conversational voice.
- JA→EN: Use contractions where natural (I'll, I've, don't, it's).
- JA→EN: Headings should be punchy and engaging, not literal restatements of the Japanese.
- EN→JA: 不自然な「〜について」「〜することができます」「この記事では」を避ける。
- EN→JA: 原文にない装飾語（「記念すべき」「ようこそ」「気ままに」など）を勝手に追加しない。

Pattern C: heavy — "not translation, rewriting"

Tell the model to act as a "bilingual blog editor" instead of a translator, and explicitly allow it to restructure sentences.

You are a bilingual blog editor, not a literal translator.
Rewrite the following Japanese text in English so it reads as if it were originally written by a native speaker for a personal blog.
Style requirements:
- Native, idiomatic, conversational. Restructure sentences freely if it helps naturalness.
...
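One way to keep variants like A and B comparable is to build them from a shared base, so they differ only in the added style lines. This is a rough sketch that uses only prompt lines quoted in this post. The `buildPrompt` helper is hypothetical, and the URL-preservation block is left out because it is truncated above. Pattern C also replaces the opening line, so it would not fit this helper unchanged.

```typescript
// Shared instruction lines, copied from the baseline prompt quoted earlier.
// The URL-preservation block is omitted because the post only shows part of it.
const BASE_LINES: string[] = [
  "Translate the following Japanese text to English.",
  "Output ONLY the translated text, no explanations or metadata.",
  "Keep markdown formatting intact.",
];

// Hypothetical helper: insert variant-specific style lines right after the
// first base line, then append a blank line and the source text.
function buildPrompt(styleLines: string[], sourceText: string): string {
  const head = BASE_LINES[0];
  const rest = BASE_LINES.slice(1);
  return [head].concat(styleLines, rest, ["", sourceText]).join("\n");
}

// Baseline: no style lines at all.
const baselinePrompt = buildPrompt([], "(source text here)");

// A variant carrying one of the Pattern B lines quoted above.
const contractionsPrompt = buildPrompt(
  ["Use contractions where natural (I'll, I've, don't, it's)."],
  "(source text here)",
);
```

Keeping the style lines as data makes it easy to log exactly which instructions produced which score.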

Results

Pattern	Score	Opening line of the translation
Baseline	0.9858	`This is the first post of Dazy Dayz Diary.`
A: light	0.9798	`This is the first post on Dazy Dayz Diary.`
B: medium	0.9725	`This is the first post on the Dazy Dayz Diary.`
C: heavy	0.9383	`This is my very first post on Dazy Dayz Diary.`
Hand-written (reference)	—	`This is the very first post on Dazy Dayz Diary.`

For naturalness, C lands closest to the hand-written version: "my very first post" has the same flavor as the hand-written "the very first post". Pattern B also produced clearly blog-flavored phrasing like "I'll be using this space to jot down tech notes".

But the score moves in exactly the opposite direction — the higher I push naturalness, the lower the score drops.

Unpacking the score drop

My first thought was, "Sure, freer translation loses some meaning fidelity, that's the tradeoff." But when I read the actual back-translated Japanese, I realized the problem went deeper.

Here's Pattern C's back-translation.

# はじめまして！

## ついにスタート

Dazy Dayz Diaryへようこそ。記念すべき初投稿です。

ここでは、日々の技術的な備忘録や、最近観た映画やアニメについての感想を気ままに綴っていこうと思っています。

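The Japanese is loaded with flourishes that aren't in the original: 「はじめまして！」 ("Nice to meet you!"), 「ついにスタート」 ("Finally getting started"), 「ようこそ」 ("Welcome"), 「記念すべき」 ("memorable"), 「気ままに」 ("as the mood takes me"). I asked for natural blog voice, and the LLM freely invented decorative phrases the source never had.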

That's when I noticed the structural issue in translate.ts.

translateText(sourceText, "Japanese", "English", model)   // forward: natural English
translateText(translated,  "English",  "Japanese", model) // back:    natural Japanese too!

Both directions of the round trip use the same naturalness-focused prompt. So on the back-translation side too, I'm asking the LLM for "native blog voice in Japanese" — which is exactly why it adds embellishments. Those embellishments widen the cosine distance from the original, and the score drops.

So it's not that "translation quality got worse" — it's that the evaluation mechanism itself got contaminated.

Evaluation-mechanism contamination

For back-translation scores to function as a fair metric, the back-translation step needs to run under identical conditions every time. Specifically:

The back-translation prompt should be fixed (strict, literal)
The forward translation can be as free and natural as it wants
The back-translation runs in "pick up meaning only, no embellishment" mode

That way, no matter how natural the forward translation gets, the back-translation stays in the same literal mode, and the score actually measures "how well does the forward translation preserve the meaning of the source."

In this experiment I used the same prompt in both directions, so Pattern C's 0.9383 says less about "the meaning drifted" than about "the back-translation embellished and pulled the embeddings apart."
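The fixed-back-prompt idea can be sketched roughly as follows. Everything here is an assumption for illustration: the `Translate` and `Embed` types stand in for whatever calls the model and the embedding API, the back-prompt wording is mine rather than anything in translate.ts, and the calls are synchronous only to keep the sketch short (real ones would be async).

```typescript
// Stand-ins for the real model and embedding calls (both hypothetical).
type Translate = (prompt: string, text: string) => string;
type Embed = (text: string) => number[];

// Evaluation-only prompt: fixed and literal, never tuned for style.
// Illustrative wording, not the prompt used on this blog.
const BACK_PROMPT =
  "Translate the following English text to Japanese as literally as possible. " +
  "Do not add, omit, or embellish anything.";

// Cosine similarity; assumes equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}

interface RoundTrip {
  translated: string;
  backTranslated: string;
  score: number;
}

// The forward prompt is a parameter and free to vary between experiments.
// The back prompt is a constant, so every run is measured the same way.
function roundTripScore(
  source: string,
  forwardPrompt: string,
  translate: Translate,
  embed: Embed,
): RoundTrip {
  const translated = translate(forwardPrompt, source);
  const backTranslated = translate(BACK_PROMPT, translated);
  const score = cosine(embed(source), embed(backTranslated));
  return { translated, backTranslated, score };
}
```

Making the back prompt a constant rather than an argument is deliberate: it removes the temptation to tune the evaluator along with the thing being evaluated.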

What we've learned so far

Prompt instructions reliably shift style. Even just adding "Write as a native speaker" visibly softens the stiffness.
LLMs don't reliably obey prohibitions. Tell it "don't add embellishment" and the urge to produce native-sounding Japanese can win anyway.
Treat the forward and back translations as separate things. Running both with the same prompt contaminates the evaluation.
Scores aren't all-knowing. High score doesn't equal good English. The score is a proxy for "meaning preservation" — it can't directly measure stylistic naturalness or readability.

Sequel: I split the prompts, and a new kind of contamination appeared

I split translateText into forwardTranslate (naturalness-focused, roughly Pattern B) and backTranslate (evaluation-only, fixed literal). Now forward can paraphrase freely, and back always measures under the same conditions — in theory.

Here's the result, the fifth data point.

Pattern	Score	Back-translation prompt and character
Baseline	0.9858	Same prompt as forward
A: light	0.9798	Same prompt as forward
B: medium	0.9725	Same prompt as forward
C: heavy	0.9383	Same prompt as forward; heavily embellished
forward natural / back literal	0.9537	Fixed literal prompt; mechanical, word-for-word back-translation

The English output is the best yet.

This is the very first post on the Dazy Dayz Diary.

I'll be using this space to jot down tech notes, along with my thoughts on the movies and anime I've been watching lately.

### What makes this blog tick

- All Japanese posts get automatically translated by AI.
- I'm using back-translation to keep the quality in check.
- I'm putting multiple translation models to the test to pick the best results.

"This is the very first post" is nearly identical to the hand-written version. "What makes this blog tick" is a natural, colloquial heading. And on the back side, embellishments like 「ようこそ」 and 「気ままに」 are gone, so it's properly running in literal mode. Exactly what I wanted.

Except the score dropped below Pattern B (0.9725) to 0.9537. Why?

Look at the back-translation.

私はこのスペースを、技術的なメモを書き留めるため、そして最近見ている映画やアニメについての私の考えを書き留めるために使用します。

### このブログを動かしているもの

- すべての日本語の投稿は、AIによって自動的に翻訳されます。
- 私は品質を確認するために、逆翻訳を使用しています。
- 私は最良の結果を選ぶために、複数の翻訳モデルをテストにかけています。

By going literal, it produced mechanical, stiff written-register Japanese. 「私は…使用します」「私は…使用しています」「私は…テストにかけています」 — nobody talks like that.

Meanwhile, the original passage looked like this.

このブログでは、テック系メモや鑑賞した映画・アニメの感想などを書いていく予定です。

### このブログの特徴

- 日本語で書いた記事をAIによって多言語に自動翻訳する
- 往復翻訳（Back-Translation）で翻訳品質を自動検証
- 複数の翻訳モデルを比較して、最適な翻訳を選定

This is casual, colloquial です・ます-style Japanese with noun-ending phrases.

So embeddings pick up not just meaning but also style and register. When the original is casual and conversational and the back-translation is stiff and literal, the meaning may match but the embedding distance widens anyway.

The tradeoff trap

This experiment exposes a structural problem at the core of back-translation scores.

Strategy	Problem
Forward natural + back natural	Back invents embellishments and contaminates
Forward natural + back literal	Style/register distance leaks into the embedding
Forward literal + back literal	The public English translation is stiff (unusable)

Escaping all three requires the back-translation to do something delicate: preserve the original register while mechanically converting the meaning and nothing else. Controlling that purely through prompting is probably impossible.

Another structural layer I noticed

At this point you might think, "Okay, then fix the prompt and just compare multiple models side by side." Kill the prompt as a variable, and only the model is left — that gives you a fair comparison.

That's mostly right, but there's another pitfall.

Looking at the current code:

forwardTranslate(text, ja, en, translationModel)  // forward
backTranslate(text,    en, ja, translationModel)  // back ← same model!

Forward and back use the same model. So when you run a multi-model comparison, for each model:

The forward translation quality varies
The back translation's strictness (how faithfully it obeys the literal instruction) also varies

Since the score blends both effects, you can't tell whether "high score" means "forward was good" or "back happened to produce Japanese closer to the source register."

The fix is to pin the back-translation to a single model.

forward (under evaluation): gemini-2.5-flash, gemini-3.1-flash-lite, claude-opus, ...
back (the fixed judge):     claude-opus-4.6  ← shared across all models

This is the same philosophy as BLEU or BERTScore, where you prepare a single human golden translation and compare every model against it. We don't have a "golden translation," but a "golden back-translator" can stand in.

So to keep the evaluation mechanism honest, you need two orthogonal separations.

Prompt separation: forward natural, back strict
Model separation: back is a fixed strong model, forward is whatever's being evaluated
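Combining both separations, a comparison loop might look like the sketch below. All names are hypothetical (`ModelCall`, `Score`, `compareForwardModels`, the `judgeModel` parameter). The model calls are synchronous stand-ins for real async API calls, and the scoring function is passed in rather than assumed.

```typescript
// Hypothetical signatures; real model and embedding calls would be async.
type ModelCall = (model: string, prompt: string, text: string) => string;
type Score = (source: string, backTranslated: string) => number;

interface ComparisonResult {
  forwardModel: string;
  translated: string;
  score: number;
}

function compareForwardModels(
  source: string,
  forwardModels: string[], // the models under evaluation
  judgeModel: string,      // the single fixed back-translator
  forwardPrompt: string,   // naturalness-focused
  backPrompt: string,      // fixed and literal
  call: ModelCall,
  score: Score,
): ComparisonResult[] {
  const results = forwardModels.map((forwardModel) => {
    const translated = call(forwardModel, forwardPrompt, source);
    // Every candidate is back-translated by the same judge with the same
    // prompt, so score differences should come mainly from the forward step.
    const backTranslated = call(judgeModel, backPrompt, translated);
    return { forwardModel, translated, score: score(source, backTranslated) };
  });
  // Highest score first.
  return results.sort((a, b) => b.score - a.score);
}
```

The judge can still be nondeterministic between runs, so this narrows the comparison to the forward step without making it perfectly clean.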

Lessons

Prompt instructions reliably shift style. Even just adding "Write as a native speaker" visibly softens the stiffness.
LLMs don't reliably obey prohibitions. Tell it "don't add embellishment" and the urge to produce native-sounding Japanese can win anyway.
Treat the forward and back translations as separate things. Running both with the same prompt contaminates the evaluation.
Embeddings pick up style and register too. They don't measure meaning in isolation. Stretch the stylistic distance between source and back-translation and the score will drop.
The back-translation model should be fixed too. Only after eliminating both prompt and model as variables can you fairly measure forward translation quality.
Scores aren't all-knowing. High score doesn't equal good English. The score is the sum of several factors (meaning preservation, stylistic match, how obediently the evaluator follows the literal prompt) — it doesn't directly measure naturalness or readability.

What's next (and what's not next)

To build a proper evaluation mechanism, I'll need to:

Add a --back-model option to runTranslate so forward and back models can be specified separately
Add a back_translation_model column to the Supabase translations table (or rename the existing model_name to forward_model)
Finalize the forward prompt
Finalize the back prompt (rethink it in the register-match direction)
Run a side-by-side comparison across multiple models
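As a starting point for the first item, the two models could be parsed from the command line roughly like this. The `--back-model` flag comes from the list above. The `--model` flag name, the `parseModelArgs` function, and the fallback (back model defaults to the forward model) are my assumptions, not the current runTranslate interface.

```typescript
interface ModelOptions {
  forwardModel: string;
  backModel: string;
}

// Hypothetical parser for arguments such as:
//   --model gemini-2.5-flash --back-model claude-opus-4.6
function parseModelArgs(argv: string[]): ModelOptions {
  let forwardModel: string | undefined;
  let backModel: string | undefined;
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--model" || flag === "--back-model") {
      const value = argv[i + 1];
      if (value === undefined || value.indexOf("--") === 0) {
        throw new Error("Missing value for " + flag);
      }
      if (flag === "--model") {
        forwardModel = value;
      } else {
        backModel = value;
      }
      i++; // skip the value we just consumed
    }
  }
  if (forwardModel === undefined) {
    throw new Error("--model is required");
  }
  // Assumed default: without --back-model, keep the current behaviour
  // (same model in both directions).
  return { forwardModel, backModel: backModel ?? forwardModel };
}
```

Defaulting the back model to the forward model keeps existing invocations working while the fixed judge is phased in.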

But before that, I feel I need more experimental data. This experiment was on a single article, hello-world.md. Sample size of one. And the anti-pattern examples I wrote into the prompt ("In this blog,", "I plan to", 「気ままに」, 「記念すべき」) are tied to the specific failures from this one article, so they're far too specialized to call this a general-purpose prompt. Whether the prompt actually generalizes, I won't know without trying it on more varied articles.

This sits at the core of the blog's infrastructure, so rather than pivoting in a hurry, I think the right move is to keep producing articles, accumulate diverse data with the current setup, and only start the serious refactor once I have a real sample to work from.

Closing

The "back-translation score as translation quality metric" approach itself is alive and well; as a concept it works. What this experiment uncovered is that the devil is in the implementation details. Starting from a tiny one-article, one-model experiment and watching the structural problems of the whole evaluation mechanism surface one after another, like links in a chain, was genuinely thrilling.

I imagine BLEU and BERTScore went through plenty of these same detail-level traps on their way to becoming what they are. Even at the scale of a personal blog, tackling a research topic head-on lets you relive the struggles behind established methods, and that's a pretty rare experience to get.

The research on this blog continues.