Pagefind Japanese Search Deep Dive: Tokenizers and IME Fixes

Searching for "Cyber," Getting "Sidebar"?

I added a search feature to this blog using Pagefind in my previous post. The setup was straightforward, and for the most part, standard Japanese searches work fine.

But as I've been using it, I've noticed some cases where the accuracy feels a bit off.

When I search for "サイバー" (cyber), articles about "サイドバー" (sidebar) show up too. In English, "cyber" and "sidebar" look nothing alike. But in Japanese katakana — a phonetic script used for foreign words — they're surprisingly similar: "サイバー" is sa-i-baa and "サイドバー" is sa-i-do-baa. They share the fragments sa-i and baa, with only do in between. If a tokenizer splits them at the wrong boundaries, those shared fragments cause false matches.

Meanwhile, searches for kanji compounds like "対応" (response) and "対策" (countermeasure) work perfectly. No cross-contamination at all.

Something in the Japanese word segmentation is going wrong. When Intl.Segmenter splits "サイバー" (cyber) into "サイ" + "バー", those fragments happen to be real Japanese words: "サイ" means "rhino" and "バー" means "bar." So the tokenizer essentially turned "cyber" into a bar with a rhino hostess. Sure.

To figure out what was really going on, I decided to dig into Pagefind's source code.

How Pagefind Handles Japanese Search: A Two-Step Process

Pagefind's Extended build (which is what you get automatically when installing via npm) uses different tokenizers for Japanese processing — one for when it builds the index and another for when you actually search.

Phase | Tool | Description
--- | --- | ---
Index Building (Rust) | charabia + lindera (IPAdic dictionary) | Splits article content into words using morphological analysis
Search Query Analysis (Browser) | Intl.Segmenter | Splits user input into words

Lindera is a Rust-based morphological analyzer that uses IPAdic, the same standard dictionary used by MeCab (the most widely used open-source Japanese text segmentation tool). Intl.Segmenter, on the other hand, is the word segmentation API built into browsers and Node.js, typically backed by ICU.

Both tools share the same goal: "splitting Japanese text into words." However, their segmentation results don't always match.

Comparing Segmentation: MeCab (IPAdic) vs. Intl.Segmenter

I actually ran the same words through MeCab (with the IPAdic dictionary) and Node.js's Intl.Segmenter to compare their results.

Input | MeCab / IPAdic (Index Side) | Intl.Segmenter (Query Side) | Match
--- | --- | --- | ---
サイバー (saibaa, "cyber") | サイバー | サイ + バー | Mismatch
サイドバー (saidobaa, "sidebar") | サイド + バー | サイド + バー | Match
ステミング (sutemingu, "stemming") | ステミング | ステミング | Match
ドラゴンボール (doragonbooru, "Dragon Ball") | ドラゴン + ボール | ドラゴン + ボール | Match
翻訳 (honyaku, "translation") | 翻訳 | 翻訳 | Match
対応 (taiou, "response") | 対応 | 対応 | Match
対策 (taisaku, "countermeasure") | 対策 | 対策 | Match

Most cases show a match, but "サイバー" (saibaa) is where we see a discrepancy. (Yep, it was the rhino's bar after all...)
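If you want to reproduce the query-side column yourself, here is a minimal sketch using Intl.Segmenter in Node.js or a browser console. The helper name segmentWords is mine, not something from Pagefind. The exact splits depend on the ICU version bundled with your runtime, so your output may not match the table above.

```typescript
// Split text into word-like segments with the runtime's built-in segmenter.
// segmentWords is a hypothetical helper name used only for this sketch.
function segmentWords(text: string, locale = "ja"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation and whitespace segments
    .map((s) => s.segment);
}

// Print the segmentation for a few of the words compared in this post.
// Results vary with the ICU data shipped in your browser or Node.js build.
for (const word of ["サイバー", "サイドバー", "対応", "対策"]) {
  console.log(word, "->", segmentWords(word).join(" + "));
}
```

Running this in several browsers is a quick way to check whether a given katakana word is likely to be split differently from the index side.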

Why "Sidebar" Shows Up

Here's how the flow works:

During Index Building (lindera/IPAdic):

  • "サイバー" → 1 token ("サイバー" is registered in IPAdic)
  • "サイドバー" → "サイド" + "バー", 2 tokens

During Search (Intl.Segmenter):

  • User types "サイバー" → Split into "サイ" + "バー", 2 tokens

Because the query is searched as two separate words, "サイ" and "バー," it matches the "バー" token that "サイドバー" contributed to the index. That's why "サイドバー" popped up when I searched for "サイバー."

For common kanji compounds like "対応" and "対策," both tokenizers produce the same segmentation, so they're searched correctly.
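The flow above can be mimicked with a toy index. This is a rough sketch, not Pagefind's actual matching or ranking logic: the page names are made up, the token lists are copied from the comparison table, and the rule "a page matches if any indexed token starts with any query token" is my own simplifying assumption.

```typescript
// Index-side tokens per page, copied from the comparison in this post.
// The page names are hypothetical.
const indexTokens: Record<string, string[]> = {
  "cyber-post": ["サイバー"],
  "sidebar-post": ["サイド", "バー"],
  "response-post": ["対応"],
  "countermeasure-post": ["対策"],
};

// ASSUMPTION (toy model, not Pagefind's real algorithm):
// a page matches when any indexed token starts with any query token.
function search(queryTokens: string[]): string[] {
  return Object.entries(indexTokens)
    .filter(([, tokens]) =>
      tokens.some((t) => queryTokens.some((q) => t.startsWith(q))))
    .map(([page]) => page);
}

// Query split the way Intl.Segmenter split it: the sidebar post leaks in.
console.log(search(["サイ", "バー"])); // ["cyber-post", "sidebar-post"]
// Query kept whole, as the index side would tokenize it: no leak.
console.log(search(["サイバー"])); // ["cyber-post"]
// Kanji compounds segment identically on both sides, so no cross-match.
console.log(search(["対応"])); // ["response-post"]
```

Even in this simplified model, the false hit disappears as soon as the query is tokenized the same way as the index.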

Why Can't the Query Side Use the Same Tokenizer?

I dug into Pagefind's source code. While there's a wasm field for each language in pagefind-entry.json, this is actually for the Snowball stemmer (root word extraction) WASM and has nothing to do with the query-side tokenizer.

Since a Snowball stemmer doesn't exist for Japanese, "wasm": null is simply indicating that stemming isn't supported. There's currently no mechanism in Pagefind to swap out the query-side tokenizer.

{
  "languages": {
    "ja": {
      "hash": "ja_798d52173d",
      "wasm": null,
      "page_count": 18
    }
  }
}
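To check which languages on your own site lack a stemmer, something like the following sketch could work. The types only model the fields visible in the excerpt above (the real file may contain more), and languagesWithoutStemmer is a name I made up for illustration.

```typescript
// Shape of the fields shown in the excerpt above; the real file may have more.
interface LanguageEntry {
  hash: string;
  wasm: string | null; // null means no Snowball stemmer WASM for this language
  page_count: number;
}

interface PagefindEntry {
  languages: Record<string, LanguageEntry>;
}

// Hypothetical helper: list language codes whose "wasm" field is null.
function languagesWithoutStemmer(entry: PagefindEntry): string[] {
  return Object.entries(entry.languages)
    .filter(([, meta]) => meta.wasm === null)
    .map(([code]) => code);
}

// The sample from this post, parsed as it would be after reading the file.
const sample: PagefindEntry = JSON.parse(`{
  "languages": {
    "ja": { "hash": "ja_798d52173d", "wasm": null, "page_count": 18 }
  }
}`);

console.log(languagesWithoutStemmer(sample)); // ["ja"]
```

Note that this only tells you about stemming. It says nothing about which tokenizer runs on the query side.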

OSS Dependencies

Here's how the library dependencies stack up for Pagefind's Japanese processing.

Pagefind (CloudCannon)
  └── charabia (Meilisearch)
       └── lindera (lindera org)
            └── IPAdic dictionary

  • Pagefind — The static site search engine developed by CloudCannon, and the main subject of this post.
  • charabia — A multilingual tokenizer library developed by Meilisearch (a search engine company). It handles language detection and segmentation.
  • lindera — A Rust-based morphological analysis library maintained by lindera org. This is where the real Japanese word segmentation happens.
  • IPAdic — The Japanese dictionary lindera uses, also familiar from MeCab.

Each of these is an independent open-source project with separate maintainers. Pagefind doesn't directly interact with lindera; it uses it indirectly via charabia.

Path to a Solution

If we could use lindera on the query side too, the segmentation results for both the index and the query would match, and the problem would be solved. But getting there involves at least two major hurdles:

  1. Lindera WASM Build — The entire lindera library, plus the IPAdic dictionary, would need to be compiled to WASM, which would result in a pretty large binary size.
  2. Pagefind's Query Tokenizer Replacement Mechanism — Right now, the query side is hardcoded to Intl.Segmenter, and there's no way to swap it out.

It looks like a long road ahead, but if it ever happens, Japanese search accuracy would improve dramatically. Rooting for lindera and CloudCannon.

A Structural Limitation — But It's Almost Fine

This is a structural limitation of Pagefind's current design. Shipping a build-time tool like lindera (Rust) together with its dictionary to the browser isn't practical today, so the query side has to rely on the browser's built-in Intl.Segmenter.

That said, for general Japanese (like kanji compounds or common katakana words), both tokenizers produce mostly consistent segmentation. So, it's not a huge practical issue. Stemming (root word extraction) isn't supported either, but for a personal blog's search, it's usually sufficient.

If you need perfect Japanese search, you'd need a server-side search engine like Algolia or Meilisearch. But given the constraints (free, zero-config, and everything finished at build time), I think Pagefind is still the best choice out there.

IME (Japanese Input) and Search: Compatibility Issues

Tokenizer issues aside, there's another problem unique to Japanese input.

Pagefind's incremental search runs a search with every character you type. It works pretty smoothly, but when it comes to Japanese, IME (Input Method Editor) conversion complicates things.

A quick primer for non-Japanese readers: when typing Japanese, you don't type characters directly. You type phonetically on a standard keyboard (e.g., s-u-t-e-m-i-n-g-u), which first appears as hiragana — a cursive phonetic script: すてみんぐ. You then press a key to convert it into the intended form, in this case katakana: ステミング. Until that conversion is confirmed, the text is provisional.

The problem is that Pagefind searches on every keystroke, including during this provisional phase. So when typing "ステミング" (stemming), the search fires on "す," "すて," "すてみ" — none of which match the katakana "ステミング" in the index. You only get correct results after conversion is confirmed, which makes for somewhat finicky behavior.
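A tiny sketch makes the mismatch concrete. It uses plain prefix comparison as a stand-in for the search, which is my simplification rather than how Pagefind actually matches, and matchingInputs is a made-up helper name.

```typescript
// Hypothetical helper: which typed strings are a prefix of the indexed token?
// Plain prefix comparison stands in for the real search here (an assumption).
function matchingInputs(indexedToken: string, typed: string[]): string[] {
  return typed.filter((t) => indexedToken.startsWith(t));
}

// Provisional hiragana typed before conversion: nothing matches the katakana token.
console.log(matchingInputs("ステミング", ["す", "すて", "すてみ"])); // []

// After conversion the input is katakana, so it matches.
console.log(matchingInputs("ステミング", ["ステミング"])); // ["ステミング"]
```

Hiragana and katakana are distinct code points, so a string comparison treats "すて" and "ステ" as unrelated text.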

This isn't just a Pagefind issue; it's a common problem with Japanese input in incremental searches in general.

The Fix: compositionend + Debounce

By using the browser's compositionstart and compositionend events, we can detect if an IME conversion is in progress.

  • When compositionstart fires, pause the search.
  • When compositionend fires (i.e., conversion confirmed), execute the search with the current string.

This way, the search won't run while you're typing "すてみんぐ," and instead, it'll execute the search the moment "ステミング" is confirmed.

Additionally, I've added a 200ms debounce (delayed execution) for alphanumeric input. Instead of hitting the search API on every single keystroke, this mechanism waits for input to stop for 200ms before executing the search. It feels almost instant but helps cut down on wasteful search requests during fast typing.

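import { useRef } from "react";

// True while an IME conversion is in progress
const composingRef = useRef(false);

// handleInput and handleCompositionEnd are defined elsewhere in the component
<input
  onChange={(e) => handleInput(e.target.value)}
  onCompositionStart={() => { composingRef.current = true; }}
  onCompositionEnd={handleCompositionEnd}
/>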

A simple implementation that covers both IME handling and performance optimization.
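For anyone not using React, the same idea can be sketched without a framework. This is not the code running on this blog, just one way to combine the composition flag with a 200ms debounce. createSearchController, Scheduler, and runSearch are names I invented for the sketch, and the timer functions are injectable only so the logic can be exercised without waiting on real timers.

```typescript
type Timer = ReturnType<typeof setTimeout>;

// Minimal timer abstraction (hypothetical) so the logic is easy to test.
interface Scheduler {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (h) => clearTimeout(h as Timer),
};

// runSearch is whatever function actually performs the search (an assumption).
function createSearchController(
  runSearch: (query: string) => void,
  delayMs = 200,
  timers: Scheduler = realTimers,
) {
  let composing = false; // true while an IME conversion is in progress
  let pending: unknown = null; // handle of the scheduled debounced search

  const cancelPending = () => {
    if (pending !== null) {
      timers.clear(pending);
      pending = null;
    }
  };

  return {
    // compositionstart: pause searching and drop any queued search.
    onCompositionStart() {
      composing = true;
      cancelPending();
    },
    // compositionend: conversion confirmed, search immediately.
    onCompositionEnd(value: string) {
      composing = false;
      cancelPending();
      runSearch(value);
    },
    // Regular input: ignored during composition, debounced otherwise.
    onInput(value: string) {
      if (composing) return;
      cancelPending();
      pending = timers.set(() => {
        pending = null;
        runSearch(value);
      }, delayMs);
    },
  };
}
```

To wire it up, call onCompositionStart and onCompositionEnd from the input element's compositionstart and compositionend listeners, and onInput from its input listener, passing the element's current value.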

Wrapping Up

  • Pagefind's Japanese search uses a two-tiered structure with different tokenizers for the index side (lindera/IPAdic) and the query side (Intl.Segmenter).
  • If their segmentation results differ, unexpected search results can occur (e.g., "サイバー" → "サイドバー").
  • Solving this requires both WASM-ification of lindera and a mechanism in Pagefind to replace the query tokenizer.
  • For general Japanese text, the two tokenizers mostly agree, so search is good enough for a personal blog.
  • The problem of search running during IME conversion can be addressed with compositionend + debounce.