Handwritten Japanese Sentence Translation for Language Education
japanese-nlp  v1.9.0
Multi-stage NLP pipeline for handwritten Japanese hiragana recognition and educational output generation — CS 5624 · Virginia Tech.

Installation

$ git clone https://github.com/SRIKANTH284/japanese-nlp.git
$ cd japanese-nlp
$ pip install -r requirements.txt

Overview

japanese-nlp is a six-stage NLP pipeline built for language education. Starting from a photo of handwritten Japanese hiragana, it produces romanization, an English translation, a per-token vocabulary breakdown with dictionary forms, and a grammar analysis — all from a single command.

The character recognition stage uses a custom CNN trained through five iterations in PyTorch, culminating in a Multi-Style model that achieves 97.42% test accuracy on 49 hiragana classes by combining real handwriting data (K49) with synthetically rendered font samples.

Quick start:

$ python main.py image/demo_sentence1.png

Pipeline

Stage 1 · Preprocess
Stage 2 · Segment
Stage 3 · CNN OCR
Stage 4 · Tokenize
Stage 5 · Grammar
Stage 6 · Output

Stage 1 · Image Preprocessing

Validates the input file, converts to grayscale, reduces noise with a Gaussian blur, and resizes to 64×64 with aspect-ratio-preserving white padding.
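
The resize step keeps the aspect ratio and fills the remainder with white. A minimal sketch of the arithmetic behind it, assuming a square 64-pixel target and centered padding; the function name letterbox_geometry is hypothetical and not taken from preprocessing.py:

```python
def letterbox_geometry(height, width, target=64):
    """Return the scaled size and the padding needed to reach target x target.

    Hypothetical helper: it only computes numbers, it does not touch pixels.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")

    # Scale so that the longer side exactly fills the target.
    scale = target / max(height, width)
    new_h = min(target, max(1, round(height * scale)))
    new_w = min(target, max(1, round(width * scale)))

    # Split the leftover space evenly; any odd pixel goes to bottom/right.
    pad_h, pad_w = target - new_h, target - new_w
    top, left = pad_h // 2, pad_w // 2
    return {
        "size": (new_h, new_w),
        "top": top,
        "bottom": pad_h - top,
        "left": left,
        "right": pad_w - left,
    }
```

For a 100×200 image this gives a 32×64 resize with 16 rows of padding above and below. With OpenCV, one option is to pass these values to cv2.resize and then cv2.copyMakeBorder with a white constant border.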

Stage 2 · Sentence Segmentation

Binarizes using Otsu's threshold, detects character bounding boxes via contour analysis, sorts them into reading order, and extracts padded character crops.
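
Otsu's method picks the gray level that best separates ink from paper by maximizing the between-class variance of the histogram. The sketch below is a pure-Python illustration of that idea on a flat list of 8-bit pixel values; it is not the project's code, which relies on OpenCV (where the same threshold is typically requested through the cv2.THRESH_OTSU flag of cv2.threshold).

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance.

    pixels: iterable of integers in 0..255. Values <= threshold form one
    class (dark ink on a scanned page), values above it form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of intensities at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # nothing in the dark class yet
        if w1 == 0:
            break     # everything is in the dark class; no further split possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t
```

On a clean bimodal image (for example, half the pixels at 10 and half at 200) any threshold between the two modes scores equally, and this version returns the lowest one.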

Stage 3 · Handwritten Character Recognition

A custom CNN trained in PyTorch over five iterations. The final Multi-Style CNN trains on a 50/50 blend of K49 real handwriting (~270k samples) and synthetically rendered font images.
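
The source does not say whether the 50/50 blend is enforced per batch or across the whole training set. As one possible reading, the sketch below mixes at batch level; mixed_batch and its arguments are hypothetical names, and the samples are plain Python objects rather than tensors.

```python
import random

def mixed_batch(real, synthetic, batch_size, rng=random):
    """Draw half of a batch from real samples and half from synthetic ones.

    Assumption: an exact per-batch split. A dataset-level blend would also
    match the description in the text.
    """
    if batch_size % 2:
        raise ValueError("batch_size must be even for an exact 50/50 split")
    half = batch_size // 2
    batch = rng.sample(real, half) + rng.sample(synthetic, half)
    rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
    return batch
```

Passing a seeded random.Random instance as rng makes the draw reproducible.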

Key finding: Switching from ETL9B to K49 raised test accuracy by 17.75 percentage points (80.27% to 98.02%) with no architectural changes. Data quality was the single most impactful decision in this project.
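
The gain can be read either as an absolute difference or as a reduction in error rate. The short calculation below uses only the two accuracies reported in the Model Comparison table; note that they come from different test sets (ETL9B with 71 classes, K49 with 49), so the comparison is indicative rather than strictly like-for-like.

```python
baseline_acc = 80.27  # iteration 1: scratch CNN on ETL9B
k49_acc = 98.02       # iteration 4: same architecture on K49

# Absolute gain in percentage points.
gain_points = round(k49_acc - baseline_acc, 2)

# Error rates before and after, and the relative drop between them.
error_before = round(100 - baseline_acc, 2)
error_after = round(100 - k49_acc, 2)
relative_error_reduction = round(100 * (error_before - error_after) / error_before, 1)

print(gain_points, error_before, error_after, relative_error_reduction)
```

The error rate falls from 19.73% to 1.98%, roughly a 90% relative reduction.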

Stage 4 · Tokenization

Uses SudachiPy with SplitMode.C (longest-match) to break the recognized string into morphemes annotated with surface form, dictionary form, reading, and English POS label.
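
A minimal sketch of this stage, assuming SudachiPy and sudachidict-core are installed. The SudachiPy calls (Dictionary().create(), tokenize, surface, dictionary_form, reading_form, part_of_speech) are the library's documented API; the POS_LABELS table and the function names are illustrative assumptions, not the contents of tokenization.py.

```python
# Top-level Sudachi POS tags mapped to English labels (illustrative subset).
POS_LABELS = {
    "名詞": "Noun",
    "動詞": "Verb",
    "形容詞": "Adjective",
    "助詞": "Particle",
    "助動詞": "Aux. Verb",
}

def english_pos(pos):
    """Map the first element of a Sudachi part-of-speech tuple to English."""
    return POS_LABELS.get(pos[0], "Other")

def tokenize(text):
    """Split text with SplitMode.C and return one dict per morpheme."""
    # Imported lazily so english_pos can be used without SudachiPy installed.
    from sudachipy import dictionary, tokenizer

    sudachi = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading": m.reading_form(),
            "pos": english_pos(m.part_of_speech()),
        }
        for m in sudachi.tokenize(text, mode)
    ]
```

Calling tokenize("ねこがすきです") would return one dictionary per morpheme; the exact split and tags depend on the installed dictionary version.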

Stage 5 · Grammar Analysis

Identifies nouns, verbs, adjectives, and particles. Detects tense (past / non-past) and politeness (polite / plain) from the token stream.
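
One simple way to derive both labels is to look only at the auxiliary verbs in the token stream. The rule below is a deliberately small sketch, assuming each token is a (surface, dictionary form, POS label) tuple and that the past auxiliary appears with dictionary form た; it does not handle the voiced variant だ after certain verb stems, and it is not the logic in grammar.py.

```python
POLITE_FORMS = {"です", "ます"}  # dictionary forms of the polite auxiliaries
PAST_FORM = "た"                 # dictionary form of the past auxiliary

def analyze(tokens):
    """Return (tense, politeness) for a list of token tuples.

    tokens: list of (surface, dictionary_form, pos_label) tuples, where
    pos_label uses English labels such as "Aux. Verb".
    """
    aux = [dict_form for _, dict_form, pos in tokens if pos == "Aux. Verb"]
    tense = "Past" if PAST_FORM in aux else "Non-past"
    politeness = "Polite" if POLITE_FORMS.intersection(aux) else "Plain"
    return tense, politeness
```

For the demo sentence ねこがすきです the only auxiliary is です, so the result is Non-past and Polite.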

Stage 6 · Educational Output

Generates hiragana → Hepburn romaji → English translation → vocabulary table with dictionary forms → grammar summary, printed to the terminal.
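
The romanization step can be illustrated with a lookup table. This sketch covers only the basic and voiced hiragana, one character at a time; small kana (きゃ), the doubling mark っ, long vowels, and particle readings such as は as "wa" are left out. The table and function name are illustrative, not taken from output.py.

```python
# Kana rows paired with their Hepburn spellings.
_ROWS = {
    "あいうえお": "a i u e o",
    "かきくけこ": "ka ki ku ke ko",
    "さしすせそ": "sa shi su se so",
    "たちつてと": "ta chi tsu te to",
    "なにぬねの": "na ni nu ne no",
    "はひふへほ": "ha hi fu he ho",
    "まみむめも": "ma mi mu me mo",
    "やゆよ": "ya yu yo",
    "らりるれろ": "ra ri ru re ro",
    "わをん": "wa o n",  # を is written "o" in modified Hepburn
    "がぎぐげご": "ga gi gu ge go",
    "ざじずぜぞ": "za ji zu ze zo",
    "だぢづでど": "da ji zu de do",
    "ばびぶべぼ": "ba bi bu be bo",
    "ぱぴぷぺぽ": "pa pi pu pe po",
}

KANA_TO_ROMAJI = {
    kana: roma
    for kana_row, roma_row in _ROWS.items()
    for kana, roma in zip(kana_row, roma_row.split())
}

def to_romaji(kana):
    """Romanize basic hiragana; characters outside the table pass through."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in kana)
```

Romanizing token by token and joining with spaces, as in " ".join(to_romaji(t) for t in ["ねこ", "が", "すき", "です"]), reproduces the spacing shown in the demo output.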

Model Comparison

Five training iterations to arrive at the final deployed model.

#  | Model                        | Dataset                | Test Accuracy | Note
1  | Scratch CNN                  | ETL9B (71 classes)     | 80.27%        | Baseline
2  | Hyperparameter tuning        | ETL9B                  | 81.97%        | Val/test overfitting
3a | ResNet-18 (fine-tuned)       | ETL9B                  | 80.27%        | Equal to scratch
3b | EfficientNet-B0 (fine-tuned) | ETL9B                  | 76.85%        | ImageNet bias harmful
4  | K49 CNN                      | K49 (49 classes, 270k) | 98.02%        | +17.75 points, data switch only
5  | Multi-Style CNN              | K49 + synthetic fonts  | 97.42% ✓      | Final · generalizes across writing styles

Demo Output

Full pipeline output for ねこがすきです:

$ python main.py image/demo_sentence1.png
ねこがすきです
neko ga suki desu
"I like cats"
Token | Romaji | Part of Speech | Dictionary Form
ねこ  | neko   | Noun           | 猫 (cat)
が    | ga     | Particle       | subject marker
すき  | suki   | Noun           | 好き (like)
です  | desu   | Aux. Verb      | polite copula
Grammar: Non-past Polite

Test Coverage

Stage                   | File                  | Tests | Status
Stage 1 · Preprocessing | test_preprocessing.py | 32    | passing
Stage 2 · Segmentation  | test_segmentation.py  | 38    | passing
Stage 4 · Tokenization  | test_tokenization.py  | 24    | passing
Stage 5 · Grammar       | test_grammar.py       | 31    | passing
Total                   |                       | 125   | all passing

Stage 3 (CNN) has no unit tests: the model weights (9.6 MB each) make CI impractical. Stage 6 depends on an external translation API and is excluded from the test suite.

Team

CS 5624 semester project — Virginia Tech, Spring 2025.

Srikanth Badavath (@SRIKANTH284 · bsrikanth)
Stage 3 (5 iterations, Multi-Style CNN), Stage 6, pipeline integration

Travis Chan (@trav-cc · tchan89)
Stage 2, Stage 3 hyperparameter tuning

Yoonje Lee (@ylee201 · ylee201)
Stage 2, Stage 4, Stage 3→4 connection

Sanjana Ghanta (@gsanjana · gsanjana)
Stage 5 grammar analysis

System Architecture

The pipeline is a linear six-stage DAG. Each stage is an independent module under src/; main.py wires them together end-to-end.
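
Because the stages form a straight chain, the wiring can be expressed as a fold over a list of callables. The sketch below shows that pattern with placeholder stages; run_pipeline, stand_in, and STAGE_NAMES are hypothetical names, and the real main.py may be organized differently.

```python
from functools import reduce

def run_pipeline(initial, stages):
    """Pass the output of each stage to the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, initial)

def stand_in(name):
    """Build a placeholder stage that only records that it ran."""
    def stage(trace):
        return trace + [name]
    return stage

# Placeholder names mirroring the six stages described in this document.
STAGE_NAMES = ["preprocess", "segment", "recognize", "tokenize", "grammar", "output"]

if __name__ == "__main__":
    # Each stand-in appends its name, so the result lists the execution order.
    print(run_pipeline([], [stand_in(n) for n in STAGE_NAMES]))
```

Keeping each stage as a plain callable is also what lets Stages 1, 2, 4, and 5 be unit-tested in isolation.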

End-to-End Architecture
Input · Image File
PNG / JPEG of a handwritten Japanese sentence (hiragana). Accepted via CLI: python main.py <image>

Stage 1 · Preprocess (OpenCV · NumPy)
preprocessing.py: validates file type, converts to grayscale, applies Gaussian blur for noise reduction, resizes to 64×64 with white padding.

Stage 2 · Segment (OpenCV)
segmentation.py: Otsu binarization, contour detection, sorts bounding boxes into reading order, extracts padded character crops.

Stage 3 · CNN OCR (PyTorch · torchvision)
recognition.py: Multi-Style CNN trained on K49 real handwriting (≈270k samples) + synthetic font data. 97.42% accuracy, 49 hiragana classes. Model weights: model_multistyle/hiragana_multistyle.pth

Stage 4 · Tokenize (SudachiPy · sudachidict-core)
tokenization.py: SudachiPy SplitMode.C (longest-match). Returns surface form, dictionary form, reading, and POS tag per morpheme.

Stage 5 · Grammar
grammar.py: identifies nouns, verbs, adjectives, particles. Detects tense (past / non-past) and politeness (polite / plain) from the token stream.

Stage 6 · Output (googletrans · Pillow)
output.py: hiragana → Hepburn romaji → English translation (Google Translate API) → per-token vocabulary table → grammar summary.

Output · Terminal
Romanized text, English translation, vocabulary breakdown with dictionary forms, grammar tags, all printed to stdout.

Repository Structure

japanese-nlp/
├── main.py                              # pipeline entry point
├── requirements.txt
├── README.md
│
├── src/
│   ├── stage1/
│   │   └── preprocessing.py             # file validation, grayscale, noise, resize
│   ├── stage2/
│   │   └── segmentation.py              # binarize, contour detection, crops
│   ├── stage3/
│   │   └── recognition.py               # CNN inference, character prediction
│   ├── stage4/
│   │   └── tokenization.py              # SudachiPy morphological analysis
│   ├── stage5/
│   │   └── grammar.py                   # tense, politeness, POS detection
│   ├── stage6/
│   │   └── output.py                    # romaji, translation, vocab display
│   └── cnn/
│       ├── cnn_multistyle.ipynb             # final model — Multi-Style CNN training
│       ├── cnn.ipynb                        # iteration 1: scratch CNN (ETL9B)
│       ├── cnn_k49.ipynb                    # iteration 4: K49 dataset switch
│       ├── cnn_resnet18_pretrained.ipynb    # iteration 3a: ResNet-18 fine-tune
│       ├── cnn_efficientnet_b0_pretrained.ipynb  # iteration 3b: EfficientNet-B0
│       ├── model_comparison.ipynb
│       ├── model_multistyle/                # deployed weights (97.42%)
│       │   ├── hiragana_multistyle.pth
│       │   └── label_map_multistyle.npy
│       ├── model_k49/
│       ├── model_resnet18/
│       └── model_efficientnet/
│
├── tests/
│   ├── test_preprocessing.py            # 32 tests
│   ├── test_segmentation.py             # 38 tests
│   ├── test_tokenization.py             # 24 tests
│   └── test_grammar.py                  # 31 tests
│
├── image/                               # sample handwritten input images
│   ├── demo_sentence1.png
│   └── demo_sentence2.png  …
│
└── documentations/
    ├── Final_report.pdf
    ├── report.tex                       # LaTeX source (ACL format)
    └── CS 5624 - Project Proposal (1).pdf

Documentation

All project documents live in the documentations/ directory of the repository.