Automating Related Posts with Embeddings

What About Related Posts?

It's pretty common to see a "Related Posts" section at the end of blog articles. If you're on WordPress, popular plugins like YARPP (Yet Another Related Posts Plugin) handle all that for you, so there's no need to manually manage links.

While YARPP's source code is public, the specifics of its scoring aren't officially documented. Roughly, it seems to work by scoring articles based on common words in titles and body text, plus matching categories and tags. If the total score goes above a certain threshold, the article is considered "related." You can even weight each element (like "don't consider," "consider," or "consider [important]"), which is a simple but well-established approach.

For this blog, I've built an automatic related posts detection system using a different approach. I simply repurposed the embedding (text vectorization) mechanism I'm already using for verifying translation quality.

YARPP (Word-Based) vs. My Blog (Embedding-Based)

| | YARPP (Word-Based) | My Blog (Embedding-Based) |
| --- | --- | --- |
| Criteria | Word matches in title/body, category/tag matches | Cosine similarity of semantic vectors for the entire text |
| Strengths | Connects articles with shared keywords | Should be able to find articles with similar meaning even if keywords don't match |
| Tunable Settings | Weighting of each element + threshold | Threshold only |
| Track Record | WordPress classic, years of proven operation | Experimental, needs validation once more articles are published |

The big advantage of the embedding-based approach is its ability to determine relatedness based on "semantic closeness." For example, "translation pipeline" and "back-translation" are conceptually similar, even if the words are different — embeddings should be able to pick them up as related. On the flip side, it's tough to verify its effectiveness with just a few articles, and it's still unknown how much of an edge it has over a well-established method like YARPP.

How It Works

Measuring Article Proximity with Embeddings

My blog already had an embedding system in place to calculate quality scores for back-translations (covered in a separate post). It vectorizes the original Japanese text and stores the vector in Supabase using pgvector.

I'm using this same embedding for related post searches too. If the cosine similarity between Article A's and Article B's embeddings is high, I can determine their content is "close" or similar.
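
To make "cosine similarity" concrete, here is a minimal sketch of the calculation in TypeScript. The function name and the choice to return 0 for a zero vector are my own assumptions for illustration, not code from the actual pipeline.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; values near 1 mean the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    // Assumption for this sketch: a zero vector is treated as similar to nothing.
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If the comparison runs inside Postgres instead, pgvector's `<=>` operator returns cosine distance, so the similarity is 1 minus that value.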

post_embeddings Table

I created a post_embeddings table to store embeddings on an article-by-article basis, separate from the translations table that holds translation results.

create table post_embeddings (
  post_id text primary key,
  embedding vector(768),
  embedding_model text not null default 'gemini-embedding-001',
  updated_at timestamptz default now()
);

When the translation pipeline (translate.ts) runs a translation, it also saves the embedding to this table. Since the source_embedding is already being calculated anyway, there's zero additional cost for API calls.
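
As a rough illustration of that save step, the sketch below builds a row matching the table definition above. `toPostEmbeddingRow` is a hypothetical helper invented for this example; how translate.ts actually writes to the table is not shown in this post.

```typescript
// Row shape mirroring the post_embeddings table definition.
interface PostEmbeddingRow {
  post_id: string;
  embedding: number[];
  embedding_model: string;
  updated_at: string; // ISO 8601 timestamp
}

// Matches vector(768) in the table definition.
const EMBEDDING_DIMENSIONS = 768;

// Hypothetical helper: validates the already-computed source embedding and
// builds the row to upsert. Nothing here calls the embedding API again.
function toPostEmbeddingRow(
  postId: string,
  sourceEmbedding: number[],
  model: string = "gemini-embedding-001",
  now: Date = new Date(),
): PostEmbeddingRow {
  if (sourceEmbedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(
      `Expected ${EMBEDDING_DIMENSIONS} dimensions, got ${sourceEmbedding.length}`,
    );
  }
  if (!sourceEmbedding.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return {
    post_id: postId,
    embedding: sourceEmbedding,
    embedding_model: model,
    updated_at: now.toISOString(),
  };
}
```

Because post_id is the primary key, writing this row as an upsert keyed on post_id would overwrite the previous embedding whenever an article is re-translated, rather than piling up duplicates.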

A Good Fit for Static Site Generation

With Next.js's Static Site Generation (SSG), all pages are generated at build time. This means:

  1. You push a new article.
  2. Vercel rebuilds all pages.
  3. New related article links automatically appear, even on old article pages.

This is a crucial difference compared to managing links manually. Once you've set up the system, all you have to do is write your articles.
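
The three steps above can be sketched as a single build-time function that recomputes related links for every post at once. This is an illustration under assumptions: `PostVector` and `buildRelatedIndex` are hypothetical names, and the defaults reuse the provisional threshold (0.8) and limit (3) described later in this post.

```typescript
interface PostVector {
  postId: string;
  embedding: number[];
}

// Same cosine calculation as described earlier, kept compact here.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// For every post, list up to `limit` other posts scoring at or above
// `threshold`, best match first. Intended to run once per build.
function buildRelatedIndex(
  posts: PostVector[],
  threshold = 0.8,
  limit = 3,
): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const post of posts) {
    const related = posts
      .filter((other) => other.postId !== post.postId)
      .map((other) => ({
        postId: other.postId,
        score: cosine(post.embedding, other.embedding),
      }))
      .filter((candidate) => candidate.score >= threshold)
      .sort((x, y) => y.score - x.score)
      .slice(0, limit)
      .map((candidate) => candidate.postId);
    index.set(post.postId, related);
  }
  return index;
}
```

Because the whole index is rebuilt from scratch, a newly pushed article can show up in an old article's list without anyone touching the old page. In a Next.js project, a function like this could be called from whichever build-time data function the pages already use.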

You could achieve the same with YARPP on WordPress, but the static approach should be lighter on the server: similarity is computed once at build time, so serving a page involves no related-post queries at all. YARPP has improved noticeably in recent versions, with built-in caching and database optimizations, but it was once notorious as one of WordPress's "heavy plugins," and some performance-focused hosts (WP Engine, for one) banned it at one point.

Learning from Other Matching Services

So far, I've been talking about measuring "closeness between articles" using embeddings. But this problem of "finding and presenting similar items" is a theme that matching services worldwide have been tackling for years.

Think about "Recommended jobs for you" on a job search site, or "People who viewed this item also viewed" on an e-commerce site. The logic behind the scenes can broadly be categorized into three types.

| Method | What It Does | Example |
| --- | --- | --- |
| Attribute Matching (Rule-Based) | Checks for matches in discrete attributes like tags, categories, skills | YARPP, job site skill filters |
| Collaborative Filtering | Recommends "what people with similar behavior liked" | Amazon's "Customers who viewed this item also viewed" |
| Embedding Similarity | Vectorizes text or images to measure semantic closeness | Matching resumes vs. job postings |

Most major services today use a hybrid approach, combining all three of these methods. LinkedIn, for example, mixes "skill match (attribute) + jobs applied for by people with similar backgrounds (collaborative) + semantic closeness of resumes (embedding)" to generate its scores.

Where My Blog Stands and Where It's Going

Of those three methods, my blog's related posts currently rely on embedding similarity alone. Collaborative filtering is out because the blog has no behavioral data (who read what). Between the remaining two, I chose embeddings on the hypothesis that they should pick up "semantic closeness" better than attribute matching (tags).

However, I can imagine scenarios where embeddings alone might hit their limits. For example, an article about AI and a movie review that both happen to discuss embeddings might appear as related despite being completely different in genre. The meaning is close, but the recommendation feels off. Combining embedding similarity with attribute matching (tags) should help mitigate that.

As for realistic next steps:

  • Hybridizing embeddings + tags — I'll incorporate tag matching into the embedding score to achieve both semantic closeness and genre alignment. Since I have both the data and the mechanism ready, I can try this immediately.
  • Putting collaborative filtering on hold — While technically possible if I gather data from access analytics on "what people read after this article," a personal blog of this size simply won't have enough data for the foreseeable future.
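
One possible shape for the embeddings + tags hybrid is a weighted blend of the two signals. Everything specific in this sketch is an assumption: the post has not settled on a formula, so the Jaccard tag overlap, the `tagWeight` value of 0.2, and both function names are placeholders to experiment with.

```typescript
// Jaccard overlap between two tag lists: shared tags / all distinct tags.
function tagOverlap(tagsA: string[], tagsB: string[]): number {
  const uniqueA = new Set(tagsA);
  const uniqueB = new Set(tagsB);
  if (uniqueA.size === 0 && uniqueB.size === 0) {
    return 0; // No tags on either side means no evidence of a shared genre.
  }
  let shared = 0;
  uniqueA.forEach((tag) => {
    if (uniqueB.has(tag)) shared += 1;
  });
  return shared / (uniqueA.size + uniqueB.size - shared);
}

// Weighted blend of embedding similarity and tag overlap.
// tagWeight is a hypothetical tuning knob: 0 ignores tags, 1 ignores embeddings.
function hybridScore(
  embeddingSimilarity: number,
  tagsA: string[],
  tagsB: string[],
  tagWeight = 0.2,
): number {
  if (tagWeight < 0 || tagWeight > 1) {
    throw new Error("tagWeight must be between 0 and 1");
  }
  return (
    (1 - tagWeight) * embeddingSimilarity + tagWeight * tagOverlap(tagsA, tagsB)
  );
}
```

With these placeholder values, a pair with an embedding similarity of 0.9 and no shared tags comes out at roughly 0.72 and would drop below a 0.8 threshold, while the same pair with identical tags would rise to about 0.92. That is the behavior wanted for the AI article versus movie review case, though the right weight can only be found by testing on real articles.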

Since the embedding calculations are already running in my translation pipeline, there's zero additional cost for related articles. I think it's pretty neat that a personal blog can try, on a small scale and within Supabase's free tier, the kind of matching that commercial services run on massive infrastructure.

Current Status & Future Plans

Since there are only a few articles published at this stage, I honestly can't say if the embedding-based related posts are functioning effectively yet. It's currently running with a provisional threshold of cosine similarity 0.8 or higher, showing a maximum of 3 articles, but the low article count means I can't properly validate it. I'll only be able to assess its superiority (or inferiority) compared to word-matching methods like YARPP once more articles are added and genres diversify.

Things I want to verify in the future:

  • Can it accurately pick up articles that are semantically close but use different keywords, once the number of posts grows?
  • Will the optimal threshold value (currently 0.8) change depending on the number of articles and their genres?
  • If embedding alone starts showing its limits, how much will a hybrid approach with tag matching improve things?

Once I've gathered enough real-world data, I plan to analyze the score distribution and hit rate.
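
For the score distribution part, a first pass could be as simple as bucketing all pairwise similarity scores and checking what share clears the threshold. This is only a sketch of one way to look at the data: the function names and the ten-bucket default are assumptions, and it does not attempt to define "hit rate," which would need human judgment about which links are actually good.

```typescript
// Counts scores per equal-width bucket over [0, 1].
// Scores outside that range are clamped into the first or last bucket.
function scoreHistogram(scores: number[], bucketCount = 10): number[] {
  if (!Number.isInteger(bucketCount) || bucketCount < 1) {
    throw new Error("bucketCount must be a positive integer");
  }
  const buckets = new Array<number>(bucketCount).fill(0);
  for (const score of scores) {
    const clamped = Math.min(Math.max(score, 0), 1);
    // A score of exactly 1 belongs in the last bucket, not one past it.
    const index = Math.min(Math.floor(clamped * bucketCount), bucketCount - 1);
    buckets[index] += 1;
  }
  return buckets;
}

// Fraction of scores at or above the threshold (0 when there are no scores).
function shareAtOrAbove(scores: number[], threshold: number): number {
  if (scores.length === 0) return 0;
  const passing = scores.filter((score) => score >= threshold).length;
  return passing / scores.length;
}
```

If nearly every pair lands in the top buckets, the 0.8 threshold is probably too loose for this embedding model; if almost none do, it is too strict. Either way, the histogram gives a starting point for adjusting it.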