Show HN: Context-aware Japanese furigana using Sudachi and ModernBERT

(ezfurigana.com)

11 points | by epitrochoid413 3 hours ago

7 comments

epitrochoid413 3 hours ago
I built a context-aware furigana converter for Japanese text, files, and web pages.
The main problem I wanted to solve was that simple dictionary-based furigana works well for common cases, but breaks on words where the reading depends on context:
* 市場: いちば or しじょう
* 大分: おおいた or だいぶ
* 人気: にんき or ひとけ
* 最中: さいちゅう or さなか or もなか
* 方: かた or ほう
The engine is a hybrid system:
* Sudachi for tokenization, base forms, POS, and candidate readings
* Expanded dictionary coverage for compounds and fixed expressions
* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides
* ModernBERT fallback for 144 especially context-dependent target words
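To make the layering concrete, here is a minimal sketch of one plausible ordering: single-candidate dictionary lookup first, then phrase overrides, then a context-based fallback. It is not the actual engine. The tables, the function name `resolve_reading`, and the cue-word scorer are all hypothetical, the input is assumed to be pre-tokenized, and the cue-word scorer is only a stand-in for a learned model such as ModernBERT.

```python
# Illustrative tables only; none of this is the author's data or code.
CANDIDATES = {
    "市場": ["しじょう", "いちば"],
    "最中": ["さいちゅう", "さなか", "もなか"],
    "東京": ["とうきょう"],
}

# Hypothetical phrase overrides keyed by (previous token, token).
PHRASE_OVERRIDES = {
    ("真っ", "最中"): "さいちゅう",
}

# Hypothetical cue words standing in for a learned context classifier.
CONTEXT_CUES = {
    "市場": {"しじょう": {"株式", "金融"}, "いちば": {"魚", "野菜"}},
}


def resolve_reading(tokens, i):
    """Pick a reading for tokens[i], or None if the word is unknown."""
    surface = tokens[i]
    candidates = CANDIDATES.get(surface)
    if not candidates:
        return None
    # 1. Unambiguous words need no context at all.
    if len(candidates) == 1:
        return candidates[0]
    # 2. Fixed expressions win over any statistical guess.
    if i > 0:
        override = PHRASE_OVERRIDES.get((tokens[i - 1], surface))
        if override is not None:
            return override
    # 3. Fallback: score each candidate against the surrounding tokens.
    cues = CONTEXT_CUES.get(surface, {})
    context = set(tokens[:i] + tokens[i + 1:])

    def score(reading):
        return len(cues.get(reading, set()) & context)

    # max() keeps the first candidate on ties, so list order is the default.
    return max(candidates, key=score)


print(resolve_reading(["株式", "市場", "が", "下落", "した"], 1))
print(resolve_reading(["魚", "市場", "で", "買う"], 1))
print(resolve_reading(["真っ", "最中"], 1))
```

Putting the cheap deterministic steps first means the expensive fallback only runs on the words that are ambiguous and not covered by a rule.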
I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens. I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.
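For reference, 12 wrong readings per 1,000 tokens is a 1.2% token-level error rate. A rough sketch of how such a figure could be computed, assuming the predicted and reference readings are already aligned token for token (the function name and the toy data are hypothetical, not the actual benchmark harness):

```python
def errors_per_thousand(predicted, gold):
    """Return the number of mismatched readings per 1,000 tokens."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token for token")
    if not gold:
        raise ValueError("benchmark is empty")
    wrong = sum(1 for p, g in zip(predicted, gold) if p != g)
    return 1000 * wrong / len(gold)


# Toy data: 1,000 reference readings, 12 of which are predicted wrongly.
gold = ["しじょう"] * 1000
predicted = ["いちば"] * 12 + ["しじょう"] * 988

rate = errors_per_thousand(predicted, gold)
print(rate)
```

Tracking this one number per version is enough to flag a regression, even if the reference readings themselves are imperfect.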
The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms.
I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.
- tkgally 12 minutes ago
  Very, very nice.
  Since January, I’ve been having Claude build a static Japanese-English dictionary in which all of the kanji and jukugo can be displayed either with or without furigana:
  https://www.tkgje.jp/index.html
  I haven’t spotted any mistakes in the furigana myself, though there must be some. I have a scheduled routine running multiple times a day to have Claude check and polish existing entries; it should be correcting most of whatever furigana mistakes might be in the data. At some point, I will set up an agent to use a different LLM to run a similar set of checks to try to reduce the error rate even more.
  As you note, the readings of Japanese words depend on the context, so producing accurate furigana cannot be done programmatically. Sentences must be interpreted semantically, which is a lot slower.
  I am releasing all of the dictionary data into the public domain, and anyone is free to fork it or adapt it however they like:
  https://github.com/tkgally/je-dict-1
- fenomas an hour ago
  Nice work! I only gave it a quick pass, but it seems to work well.
  (Also: vouched, your comment was dead FYI)
  - epitrochoid413 39 minutes ago
    Thanks, that’s great to hear. Thanks for the vouch too, I didn’t realize the comment was dead.
bluechair 25 minutes ago
Fantastic tool, and I love the delivery: no sign-up required. I'm interested to hear how you pulled that off.
I'm also interested to hear whether you plan to eventually add an option for pitch accent. I've never seen what training material exists for that, or how it is supported in Unicode.
altilunium an hour ago
It really works. Very cool. I’ve been looking for this kind of service for a long time since I started learning Japanese, and I’ve rarely been satisfied with the available services.
- epitrochoid413 40 minutes ago
  Thank you, that means a lot. I’ve been working on it for about a year now, so it’s really encouraging to hear it’s useful. I’m hoping to keep pushing the accuracy further, especially on the remaining hard cases like rendaku and person/place names.