Handwritten Japanese Sentence Translation for Language Education

Installation

$ git clone https://github.com/SRIKANTH284/japanese-nlp.git

$ cd japanese-nlp

$ pip install -r requirements.txt

Overview

japanese-nlp is a six-stage NLP pipeline built for language education. Starting from a photo of handwritten Japanese hiragana, it produces romanization, an English translation, a per-token vocabulary breakdown with dictionary forms, and a grammar analysis — all from a single command.

The character recognition stage uses a custom CNN trained through five iterations in PyTorch, culminating in a Multi-Style model that achieves 97.42% test accuracy on 49 hiragana classes by combining real handwriting data (K49) with synthetically rendered font samples.

Quick start: python main.py image/demo_sentence1.png

Pipeline

Stage 1 Preprocess → Stage 2 Segment → Stage 3 CNN OCR → Stage 4 Tokenize → Stage 5 Grammar → Stage 6 Output

Stage 1 · Image Preprocessing

Validates the input file, converts to grayscale, reduces noise with a Gaussian blur, and resizes to 64×64 with aspect-ratio-preserving white padding.
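The aspect-ratio-preserving resize described above can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the nearest-neighbour sampling stands in for whatever interpolation the project's OpenCV code uses, and the centered placement of the image on the canvas is an assumption.

```python
def letterbox_geometry(height, width, target=64):
    """Return (new_h, new_w, top, left) for centering a scaled image on a square canvas."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(height, width)      # the longer side maps to `target`
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    top = (target - new_h) // 2              # split the leftover space evenly
    left = (target - new_w) // 2
    return new_h, new_w, top, left


def resize_with_padding(pixels, target=64, fill=255):
    """Nearest-neighbour resize of a 2D grayscale list onto a white square canvas."""
    height, width = len(pixels), len(pixels[0])
    new_h, new_w, top, left = letterbox_geometry(height, width, target)
    canvas = [[fill] * target for _ in range(target)]   # 255 = white padding
    for y in range(new_h):
        src_y = y * height // new_h
        for x in range(new_w):
            src_x = x * width // new_w
            canvas[top + y][left + x] = pixels[src_y][src_x]
    return canvas
```

For a 100×200 input, the image is scaled to 32×64 and placed with 16 rows of white padding above and below, so the strokes are not stretched.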

Stage 2 · Sentence Segmentation

Binarizes using Otsu's threshold, detects character bounding boxes via contour analysis, sorts them into reading order, and extracts padded character crops.
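To make the binarization and ordering steps concrete, here is a small pure-Python sketch of Otsu's method and a left-to-right box sort. It is not the project's code, which relies on OpenCV; the function names are hypothetical, and the sort assumes a single horizontal line of dark ink on a light background.

```python
def otsu_threshold(gray_values):
    """Return the threshold t in 0..255 that maximizes between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]                 # pixels with value <= t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg        # pixels with value > t
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Mark ink (dark) pixels as 1 and background as 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in pixels]


def sort_reading_order(boxes):
    """Sort (x, y, w, h) boxes left to right; assumes one horizontal line of text."""
    return sorted(boxes, key=lambda b: (b[0], b[1]))
```

Multi-line or vertical text would need a grouping step before the sort; that case is outside this sketch.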

Stage 3 · Handwritten Character Recognition

A custom CNN trained in PyTorch over five iterations. The final Multi-Style CNN trains on a 50/50 blend of K49 real handwriting (~270k samples) and synthetically rendered font images.

Key finding: Switching from ETL9B to K49 raised test accuracy by 17.75 percentage points (80.27% → 98.02%) with zero architectural changes. Data quality was the single most impactful decision in this project.
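One way to realize a 50/50 blend of real and synthetic samples is to balance every batch. The sketch below assumes that interpretation; the source does not say whether the blend is enforced per batch or across the whole dataset, and `blended_batch` is a hypothetical helper, not a function from the repository.

```python
import random


def blended_batch(real, synthetic, batch_size, rng):
    """Draw a batch that is half real and half synthetic (sampling with replacement)."""
    half = batch_size // 2
    batch = [rng.choice(real) for _ in range(half)]
    batch += [rng.choice(synthetic) for _ in range(batch_size - half)]
    rng.shuffle(batch)                       # avoid a fixed real/synthetic ordering
    return batch


# Placeholder samples; real code would hold (image, label) pairs.
real_samples = [("k49", i) for i in range(10)]
font_samples = [("font", i) for i in range(10)]
batch = blended_batch(real_samples, font_samples, 8, random.Random(0))
```

Balancing per batch keeps the smaller source from being drowned out when the two datasets differ in size.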

Stage 4 · Tokenization

Uses SudachiPy with SplitMode.C (longest-match) to break the recognized string into morphemes annotated with surface form, dictionary form, reading, and English POS label.
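A tokenization call along these lines might look like the sketch below. The SudachiPy calls (`Dictionary().create()`, `SplitMode.C`, and the morpheme accessors) are the library's public API; the `POS_LABELS` mapping and both function names are assumptions made for illustration, and the project's own label set may differ.

```python
# Hypothetical mapping from SudachiPy's top-level POS category to an English label.
POS_LABELS = {
    "名詞": "Noun",
    "代名詞": "Pronoun",
    "動詞": "Verb",
    "形容詞": "Adjective",
    "形状詞": "Na-adjective",
    "副詞": "Adverb",
    "助詞": "Particle",
    "助動詞": "Aux. Verb",
}


def english_pos(pos_tuple):
    """Translate the first element of a SudachiPy POS tuple; unknown tags become 'Other'."""
    return POS_LABELS.get(pos_tuple[0], "Other")


def tokenize(text):
    """Split text into morphemes with SplitMode.C (requires sudachipy and sudachidict-core)."""
    from sudachipy import dictionary, tokenizer  # imported lazily so the mapping works without it

    tok = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading": m.reading_form(),
            "pos": english_pos(m.part_of_speech()),
        }
        for m in tok.tokenize(text, mode)
    ]
```

Calling `tokenize("ねこがすきです")` would return one dictionary per morpheme; the exact split and tags depend on the installed Sudachi dictionary version.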

Stage 5 · Grammar Analysis

Identifies nouns, verbs, adjectives, and particles. Detects tense (past / non-past) and politeness (polite / plain) from the token stream.
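As an illustration of how tense and politeness could be read off the token stream, the sketch below applies two simple rules: an auxiliary whose dictionary form is た marks past tense, and です or ます marks polite speech. These rules and the `analyze_grammar` name are assumptions for this example, not the logic in grammar.py.

```python
POLITE_AUXILIARIES = {"です", "ます"}
PAST_AUXILIARY = "た"


def analyze_grammar(tokens):
    """tokens: list of (surface, dictionary_form, pos) tuples in sentence order."""
    # Collect the dictionary forms of all auxiliary verbs in the sentence.
    aux_forms = {dict_form for _, dict_form, pos in tokens if pos == "Aux. Verb"}
    tense = "Past" if PAST_AUXILIARY in aux_forms else "Non-past"
    politeness = "Polite" if aux_forms & POLITE_AUXILIARIES else "Plain"
    return {"tense": tense, "politeness": politeness}


sentence = [
    ("ねこ", "猫", "Noun"),
    ("が", "が", "Particle"),
    ("すき", "好き", "Noun"),
    ("です", "です", "Aux. Verb"),
]
summary = analyze_grammar(sentence)
```

For the demo sentence this yields non-past tense and polite register. Negation, te-forms, and other conjugations would need additional rules.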

Stage 6 · Educational Output

Generates hiragana → Hepburn romaji → English translation → vocabulary table with dictionary forms → grammar summary, printed to the terminal.

Model Comparison

Five training iterations to arrive at the final deployed model.

#	Model	Dataset	Test Accuracy	Note
1	Scratch CNN	ETL9B (71 classes)	80.27%	Baseline
2	Hyperparameter tuning	ETL9B	81.97%	Val/test overfitting
3a	ResNet-18 (fine-tuned)	ETL9B	80.27%	Equal to scratch
3b	EfficientNet-B0 (fine-tuned)	ETL9B	76.85%	ImageNet bias harmful
4	K49 CNN	K49 (49 classes, ~270k samples)	98.02%	+17.75 points over baseline, data switch only
5	Multi-Style CNN	K49 + synthetic fonts	97.42% ✓	Final · generalizes across writing styles

Demo Output

Full pipeline output for ねこがすきです:

$ python main.py image/demo_sentence1.png

ねこがすきです

neko ga suki desu

"I like cats"

Token	Romaji	Part of Speech	Dictionary Form / Role
ねこ	neko	Noun	猫 (cat)
が	ga	Particle	subject marker
すき	suki	Noun	好き (like)
です	desu	Aux. Verb	polite copula

Grammar: Non-past Polite

Test Coverage

Stage	File	Tests	Status
Stage 1 · Preprocessing	`test_preprocessing.py`	32	passing
Stage 2 · Segmentation	`test_segmentation.py`	38	passing
Stage 4 · Tokenization	`test_tokenization.py`	24	passing
Stage 5 · Grammar	`test_grammar.py`	31	passing
Total		125	all passing

Stage 3 (CNN) has no unit tests — model weights (9.6 MB each) make CI impractical. Stage 6 depends on an external translation API and is excluded from the test suite.

Team

CS 5624 semester project — Virginia Tech, Spring 2025.

Srikanth Badavath

@SRIKANTH284 · bsrikanth

Stage 3 (5 iterations, Multi-Style CNN), Stage 6, pipeline integration

Travis Chan

@trav-cc · tchan89

Stage 2, Stage 3 hyperparameter tuning

Yoonje Lee

@ylee201 · ylee201

Stage 2, Stage 4, Stage 3→4 connection

Sanjana Ghanta

@gsanjana · gsanjana

Stage 5 grammar analysis

System Architecture

The pipeline is a linear six-stage DAG. Each stage is an independent module under src/; main.py wires them together end-to-end.
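A linear pipeline of this kind can be wired as an ordered list of callables, each consuming the previous stage's output. The sketch below is a generic illustration with stand-in stages; `run_pipeline`, `StageError`, and the stage names are hypothetical and do not reflect the actual contents of main.py.

```python
class StageError(RuntimeError):
    """Raised when a stage fails, so the error message names the failing stage."""


def run_pipeline(initial_input, stages):
    """Feed the output of each (name, callable) stage into the next, in order."""
    result = initial_input
    for name, stage in stages:
        try:
            result = stage(result)
        except Exception as exc:
            raise StageError(f"{name} failed: {exc}") from exc
    return result


# Stand-in stages; the real ones would come from the modules under src/.
demo_stages = [
    ("recognize", lambda crops: "".join(crops)),
    ("tokenize", lambda text: [text[:2], text[2:3], text[3:5], text[5:]]),
    ("format", lambda tokens: " | ".join(tokens)),
]
report = run_pipeline(["ね", "こ", "が", "す", "き", "で", "す"], demo_stages)
```

Keeping each stage behind a plain function boundary is what lets Stages 1, 2, 4, and 5 be unit-tested in isolation.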

End-to-End Architecture

Input · Image File

PNG / JPEG of a handwritten Japanese sentence (hiragana). Accepted via CLI: python main.py <image>

↓

Stage 1 · Preprocess (OpenCV · NumPy)

preprocessing.py: validates file type, converts to grayscale, applies Gaussian blur for noise reduction, resizes to 64×64 with white padding.

↓

Stage 2 · Segment (OpenCV)

segmentation.py: Otsu binarization, contour detection, sorts bounding boxes into reading order, extracts padded character crops.

↓

Stage 3 · CNN OCR (PyTorch · torchvision)

recognition.py: Multi-Style CNN trained on K49 real handwriting (≈270k samples) + synthetic font data. 97.42% accuracy, 49 hiragana classes. Model weights: model_multistyle/hiragana_multistyle.pth

↓

Stage 4 · Tokenize (SudachiPy · sudachidict-core)

tokenization.py: SudachiPy SplitMode.C (longest-match). Returns surface form, dictionary form, reading, and POS tag per morpheme.

↓

Stage 5 · Grammar

grammar.py: identifies nouns, verbs, adjectives, particles. Detects tense (past / non-past) and politeness (polite / plain) from the token stream.

↓

Stage 6 · Output (googletrans · Pillow)

output.py: hiragana → Hepburn romaji → English translation (Google Translate API) → per-token vocabulary table → grammar summary.

↓

Output · Terminal

Romanized text, English translation, vocabulary breakdown with dictionary forms, grammar tags, all printed to stdout.

Repository Structure

japanese-nlp/
├── main.py                              # pipeline entry point
├── requirements.txt
├── README.md
│
├── src/
│   ├── stage1/
│   │   └── preprocessing.py             # file validation, grayscale, noise, resize
│   ├── stage2/
│   │   └── segmentation.py              # binarize, contour detection, crops
│   ├── stage3/
│   │   └── recognition.py               # CNN inference, character prediction
│   ├── stage4/
│   │   └── tokenization.py              # SudachiPy morphological analysis
│   ├── stage5/
│   │   └── grammar.py                   # tense, politeness, POS detection
│   ├── stage6/
│   │   └── output.py                    # romaji, translation, vocab display
│   └── cnn/
│       ├── cnn_multistyle.ipynb             # final model — Multi-Style CNN training
│       ├── cnn.ipynb                        # iteration 1: scratch CNN (ETL9B)
│       ├── cnn_k49.ipynb                    # iteration 4: K49 dataset switch
│       ├── cnn_resnet18_pretrained.ipynb    # iteration 3a: ResNet-18 fine-tune
│       ├── cnn_efficientnet_b0_pretrained.ipynb  # iteration 3b: EfficientNet-B0
│       ├── model_comparison.ipynb
│       ├── model_multistyle/                # deployed weights (97.42%)
│       │   ├── hiragana_multistyle.pth
│       │   └── label_map_multistyle.npy
│       ├── model_k49/
│       ├── model_resnet18/
│       └── model_efficientnet/
│
├── tests/
│   ├── test_preprocessing.py            # 32 tests
│   ├── test_segmentation.py             # 38 tests
│   ├── test_tokenization.py             # 24 tests
│   └── test_grammar.py                  # 31 tests
│
├── image/                               # sample handwritten input images
│   ├── demo_sentence1.png
│   └── demo_sentence2.png  …
│
└── documentations/
    ├── Final_report.pdf
    ├── report.tex                       # LaTeX source (ACL format)
    └── CS 5624 - Project Proposal (1).pdf

Documentation

All project documents live in the documentations/ directory of the repository.