japanese-nlp is a six-stage NLP pipeline built for language education. Starting from a photo of handwritten Japanese hiragana, it produces romanization, an English translation, a per-token vocabulary breakdown with dictionary forms, and a grammar analysis — all from a single command.
The character recognition stage uses a custom CNN trained through five iterations in PyTorch, culminating in a Multi-Style model that achieves 97.42% test accuracy on 49 hiragana classes by combining real handwriting data (K49) with synthetically rendered font samples.
python main.py image/demo_sentence1.png
Validates the input file, converts to grayscale, reduces noise with a Gaussian blur, and resizes to 64×64 with aspect-ratio-preserving white padding.
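The resize-with-padding step comes down to a small piece of geometry, sketched below. This is an illustration rather than the project's code: `fit_with_padding` is a hypothetical helper. It assumes the image is scaled so that its longer side equals the target size and is then centered, with the leftover border filled in white. In an OpenCV implementation these numbers would typically be passed to `cv2.resize` and `cv2.copyMakeBorder`.

```python
def fit_with_padding(width, height, target=64):
    """Return (new_w, new_h, left, top, right, bottom) for a centered fit.

    Hypothetical helper: scales the longer side to `target`, keeps the
    aspect ratio, and splits the leftover space into border padding.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    # Clamp to [1, target] so rounding can never overflow the canvas.
    new_w = min(target, max(1, round(width * scale)))
    new_h = min(target, max(1, round(height * scale)))
    pad_x = target - new_w
    pad_y = target - new_h
    left, top = pad_x // 2, pad_y // 2
    return new_w, new_h, left, top, pad_x - left, pad_y - top


# A 200x100 image becomes 64x32 with 16 px of padding above and below.
print(fit_with_padding(200, 100))
```

When the leftover space is odd, the extra pixel goes to the right or bottom border, so the padded result is always exactly `target` pixels on each side.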
Binarizes using Otsu's threshold, detects character bounding boxes via contour analysis, sorts them into reading order, and extracts padded character crops.
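To make the segmentation steps concrete, here is a dependency-free sketch of Otsu's threshold, box sorting, and padded crops. It is not the project's implementation, which uses OpenCV. The function names are hypothetical, boxes are assumed to be `(x, y, w, h)` tuples as `cv2.boundingRect` returns them, and "reading order" is assumed to mean a single horizontal line read left to right.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with intensity <= t
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def sort_reading_order(boxes):
    """Assumption: one horizontal line, so sorting by x gives reading order."""
    return sorted(boxes, key=lambda b: b[0])


def pad_box(box, pad, img_w, img_h):
    """Grow a box by `pad` pixels on every side, clamped to the image."""
    x, y, w, h = box
    x0, y0 = max(0, x - pad), max(0, y - pad)
    x1, y1 = min(img_w, x + w + pad), min(img_h, y + h + pad)
    return x0, y0, x1 - x0, y1 - y0


# Dark ink (10) on a light page (200): the threshold falls between the two.
threshold = otsu_threshold([10] * 50 + [200] * 50)
boxes = sort_reading_order([(120, 8, 30, 40), (10, 10, 28, 42), (60, 12, 31, 38)])
crops = [pad_box(b, 4, 200, 64) for b in boxes]
```

Multi-line or vertical text would need a different sort key, for example grouping boxes into rows before sorting within each row.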
A custom CNN trained in PyTorch over five iterations. The final Multi-Style CNN trains on a 50/50 blend of K49 real handwriting (~270k samples) and synthetically rendered font images.
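One way to realize a 50/50 blend is to build each batch from equal halves of the two sources. The sketch below shows that idea with the standard library only. It assumes the blend is enforced per batch, which the description above does not specify, and `blended_batch` is a hypothetical name. Sampling is done with replacement so that the smaller pool can be oversampled.

```python
import random


def blended_batch(real, synthetic, batch_size, rng):
    """Draw half of the batch from each source, then shuffle.

    Assumption: per-batch balancing. The project may instead balance
    the dataset as a whole.
    """
    half = batch_size // 2
    batch = [(x, "real") for x in rng.choices(real, k=half)]
    batch += [(x, "synthetic") for x in rng.choices(synthetic, k=batch_size - half)]
    rng.shuffle(batch)
    return batch


# Stand-in sample IDs; real code would hold image tensors and labels.
rng = random.Random(0)
batch = blended_batch(list(range(100)), list(range(100, 120)), 32, rng)
n_real = sum(1 for _, source in batch if source == "real")
```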
Uses SudachiPy with SplitMode.C (longest-match) to break the recognized string into morphemes annotated with surface form, dictionary form, reading, and English POS label.
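A minimal sketch of this stage with SudachiPy follows. The SudachiPy calls (`Dictionary().create()`, `tokenize`, `surface`, `dictionary_form`, `reading_form`, `part_of_speech`) are the library's public API. The `POS_EN` mapping, the function names, and the output dictionary layout are assumptions made for illustration and may differ from the project's own labels.

```python
# Illustrative mapping from Sudachi's major POS class to an English label.
POS_EN = {
    "名詞": "Noun",
    "動詞": "Verb",
    "形容詞": "Adjective",
    "助詞": "Particle",
    "助動詞": "Aux. Verb",
    "副詞": "Adverb",
}


def english_pos(pos):
    """Map a SudachiPy POS tuple (first field is the major class) to English."""
    return POS_EN.get(pos[0], "Other")


def analyze(text):
    """Tokenize `text` in SplitMode.C and annotate each morpheme."""
    # Imported lazily so the mapping above works without SudachiPy installed.
    from sudachipy import dictionary, tokenizer

    tok = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading": m.reading_form(),
            "pos": english_pos(m.part_of_speech()),
        }
        for m in tok.tokenize(text, mode)
    ]
```

Calling `analyze("ねこがすきです")` requires both `sudachipy` and a dictionary package such as `sudachidict-core` to be installed.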
Identifies nouns, verbs, adjectives, and particles. Detects tense (past / non-past) and politeness (polite / plain) from the token stream.
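Tense and politeness detection can be approximated with two rules over the auxiliary verbs in the token stream, as sketched below. This is a simplified heuristic, not the project's actual rule set. It assumes tokens are dictionaries with `dictionary_form` and `pos` keys, that the past marker appears as a separate auxiliary with dictionary form た, and that です or ます signals polite style.

```python
def detect_grammar(tokens):
    """Classify tense and politeness from a list of token dicts.

    Hypothetical helper; each token needs 'dictionary_form' and 'pos'.
    """
    aux_forms = [t["dictionary_form"] for t in tokens if t["pos"] == "Aux. Verb"]
    polite = any(form in ("です", "ます") for form in aux_forms)
    past = "た" in aux_forms
    return {
        "tense": "past" if past else "non-past",
        "politeness": "polite" if polite else "plain",
    }


# ねこがすきです
present_polite = detect_grammar([
    {"dictionary_form": "猫", "pos": "Noun"},
    {"dictionary_form": "が", "pos": "Particle"},
    {"dictionary_form": "好き", "pos": "Noun"},
    {"dictionary_form": "です", "pos": "Aux. Verb"},
])

# たべました, assumed to split into たべ + まし + た
past_polite = detect_grammar([
    {"dictionary_form": "食べる", "pos": "Verb"},
    {"dictionary_form": "ます", "pos": "Aux. Verb"},
    {"dictionary_form": "た", "pos": "Aux. Verb"},
])
```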
Generates hiragana → Hepburn romaji → English translation → vocabulary table with dictionary forms → grammar summary, printed to the terminal.
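The hiragana-to-romaji step can be illustrated with a small lookup table. The sketch below is a simplified converter, not the project's code. It handles the basic and voiced kana, the small ゃ/ゅ/ょ digraphs, and the doubling mark っ, but it does not write long vowels with macrons or treat particles such as は specially. All names are hypothetical.

```python
_VOWELS = "aiueo"
_ROWS = [
    ("", "あいうえお"), ("k", "かきくけこ"), ("s", "さしすせそ"),
    ("t", "たちつてと"), ("n", "なにぬねの"), ("h", "はひふへほ"),
    ("m", "まみむめも"), ("r", "らりるれろ"), ("g", "がぎぐげご"),
    ("z", "ざじずぜぞ"), ("d", "だぢづでど"), ("b", "ばびぶべぼ"),
    ("p", "ぱぴぷぺぽ"),
]
KANA = {kana: cons + v for cons, row in _ROWS for kana, v in zip(row, _VOWELS)}
# Hepburn irregulars and the kana outside the five-vowel grid.
KANA.update({
    "し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu", "じ": "ji",
    "ぢ": "ji", "づ": "zu", "や": "ya", "ゆ": "yu", "よ": "yo",
    "わ": "wa", "を": "o", "ん": "n",
})
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}


def to_romaji(text):
    """Convert hiragana to a simplified Hepburn romanization."""
    out = []
    double_next = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "っ":  # small tsu doubles the next consonant
            double_next = True
            i += 1
            continue
        syllable = KANA.get(ch, ch)  # unknown characters pass through
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt in SMALL_Y and ch in KANA and len(syllable) > 1 and syllable.endswith("i"):
            stem = syllable[:-1]
            glide = SMALL_Y[nxt]
            # sh/ch/j absorb the "y": しゃ -> sha, but きゃ -> kya
            syllable = stem + (glide[1:] if stem in ("sh", "ch", "j") else glide)
            i += 1
        if double_next and ch in KANA and syllable[0] not in _VOWELS:
            syllable = ("t" if syllable.startswith("ch") else syllable[0]) + syllable
        double_next = False
        out.append(syllable)
        i += 1
    return "".join(out)


print(to_romaji("ねこがすきです"))  # nekogasukidesu
```

Applying the converter per token rather than to the whole string gives the spaced form used in a vocabulary table (neko, ga, suki, desu).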
The final deployed model was reached over five training iterations.
| # | Model | Dataset | Test Accuracy | Note |
|---|---|---|---|---|
| 1 | Scratch CNN | ETL9B (71 classes) | 80.27% | Baseline |
| 2 | Hyperparameter tuning | ETL9B | 81.97% | Val/test overfitting |
| 3a | ResNet-18 (fine-tuned) | ETL9B | 80.27% | Equal to scratch |
| 3b | EfficientNet-B0 (fine-tuned) | ETL9B | 76.85% | ImageNet bias harmful |
| 4 | K49 CNN | K49 (49 cls, 270k) | 98.02% | +17.75 pp over iteration 1; data switch only |
| 5 | Multi-Style CNN | K49 + synthetic fonts | 97.42% ✓ | Final · generalizes across writing styles |
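The gains in the table are differences in percentage points, and they can be recomputed directly from the reported accuracies. The snippet below is only that arithmetic, with the dictionary keys chosen for illustration. Note that iterations 1 to 3 are evaluated on ETL9B with 71 classes while iterations 4 and 5 use 49 classes, so the comparison across the dataset switch is not like for like.

```python
# Test accuracies (%) as reported in the table above.
results = {
    "1 Scratch CNN": 80.27,
    "2 Hyperparameter tuning": 81.97,
    "3a ResNet-18": 80.27,
    "3b EfficientNet-B0": 76.85,
    "4 K49 CNN": 98.02,
    "5 Multi-Style CNN": 97.42,
}

baseline = results["1 Scratch CNN"]
# Round to two decimals to hide floating-point noise in the subtraction.
deltas = {name: round(acc - baseline, 2) for name, acc in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} pp vs. iteration 1")
```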
Full pipeline output for ねこがすきです:
| Token | Romaji | Part of Speech | Dictionary Form |
|---|---|---|---|
| ねこ | neko | Noun | 猫 (cat) |
| が | ga | Particle | subject marker |
| すき | suki | Noun | 好き (like) |
| です | desu | Aux. Verb | polite copula |
| Stage | File | Tests | Status |
|---|---|---|---|
| Stage 1 · Preprocessing | test_preprocessing.py | 32 | passing |
| Stage 2 · Segmentation | test_segmentation.py | 38 | passing |
| Stage 4 · Tokenization | test_tokenization.py | 24 | passing |
| Stage 5 · Grammar | test_grammar.py | 31 | passing |
| Total | | 125 | all passing |
CS 5624 semester project — Virginia Tech, Spring 2025.
The pipeline is a linear six-stage DAG. Each stage is an independent module under src/; main.py wires them together end-to-end.
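Because the stages form a straight chain, the wiring can be expressed as a fold over a list of callables. The sketch below shows that pattern with stand-in stages. The stage names and `run_pipeline` are hypothetical and do not reflect the actual signatures in `src/` or `main.py`.

```python
from functools import reduce


def run_pipeline(stages, data):
    """Pass `data` through each stage in order; each output feeds the next stage."""
    return reduce(lambda value, stage: stage(value), stages, data)


def make_stage(name):
    """Build a stand-in stage that records its name in a trace list."""
    def stage(trace):
        return trace + [name]
    return stage


# Hypothetical names mirroring the six stages described above.
STAGE_NAMES = (
    "preprocess", "segment", "recognize",
    "tokenize", "analyze_grammar", "render_output",
)
STAGES = [make_stage(name) for name in STAGE_NAMES]

trace = run_pipeline(STAGES, [])
print(trace)
```

Keeping each stage a plain function of its predecessor's output is also what lets the stages be unit-tested in isolation.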
python main.py <image>
OpenCV · NumPy
OpenCV
model_multistyle/hiragana_multistyle.pth
PyTorch · torchvision
SplitMode.C (longest-match).
Returns surface form, dictionary form, reading, and POS tag per morpheme.
SudachiPy · sudachidict-core
googletrans · Pillow
japanese-nlp/
├── main.py                                      # pipeline entry point
├── requirements.txt
├── README.md
│
├── src/
│   ├── stage1/
│   │   └── preprocessing.py                     # file validation, grayscale, noise, resize
│   ├── stage2/
│   │   └── segmentation.py                      # binarize, contour detection, crops
│   ├── stage3/
│   │   └── recognition.py                       # CNN inference, character prediction
│   ├── stage4/
│   │   └── tokenization.py                      # SudachiPy morphological analysis
│   ├── stage5/
│   │   └── grammar.py                           # tense, politeness, POS detection
│   ├── stage6/
│   │   └── output.py                            # romaji, translation, vocab display
│   └── cnn/
│       ├── cnn_multistyle.ipynb                 # final model: Multi-Style CNN training
│       ├── cnn.ipynb                            # iteration 1: scratch CNN (ETL9B)
│       ├── cnn_k49.ipynb                        # iteration 4: K49 dataset switch
│       ├── cnn_resnet18_pretrained.ipynb        # iteration 3a: ResNet-18 fine-tune
│       ├── cnn_efficientnet_b0_pretrained.ipynb # iteration 3b: EfficientNet-B0
│       ├── model_comparison.ipynb
│       ├── model_multistyle/                    # deployed weights (97.42%)
│       │   ├── hiragana_multistyle.pth
│       │   └── label_map_multistyle.npy
│       ├── model_k49/
│       ├── model_resnet18/
│       └── model_efficientnet/
│
├── tests/
│   ├── test_preprocessing.py                    # 32 tests
│   ├── test_segmentation.py                     # 38 tests
│   ├── test_tokenization.py                     # 24 tests
│   └── test_grammar.py                          # 31 tests
│
├── image/                                       # sample handwritten input images
│   ├── demo_sentence1.png
│   └── demo_sentence2.png …
│
└── documentations/
    ├── Final_report.pdf
    ├── report.tex                               # LaTeX source (ACL format)
    └── CS 5624 - Project Proposal (1).pdf
All project documents live in the documentations/ directory of the repository.