In January 2026 I started as a Graduate Teaching Assistant for CS 5664: Social Media Analytics at Virginia Tech. The course covers NLP, text mining, network analysis, and forecasting — all applied to real social media data. I expected to spend most of my time grading. What I didn't expect was how much teaching would sharpen my own understanding of NLP fundamentals.
The NLP Pipeline (Where Most Mistakes Happen)
Before diving into the bugs I saw repeatedly, here's the end-to-end NLP pipeline students were expected to implement:

raw text → cleaning → tokenization → feature extraction → model training → evaluation

Each arrow is a place something can go wrong silently. The model will train, produce numbers, and look successful — while being completely broken.
What the Course Covers
CS 5664 is graduate-level with students ranging from ML veterans to those taking their first serious NLP course. The curriculum spans:
| Module | Topics | Tools |
|---|---|---|
| Text Preprocessing | Tokenization, stemming, lemmatization, stopword removal | NLTK, spaCy |
| Feature Extraction | TF-IDF, Word2Vec, GloVe, FastText | scikit-learn, Gensim |
| Classification | Naive Bayes, SVM, LSTM, BERT fine-tuning | PyTorch, HuggingFace |
| Network Analysis | Community detection, influence propagation, centrality | NetworkX, Gephi |
| Forecasting | ARIMA, LSTM for trending topics and engagement | statsmodels, PyTorch |
Bug #1: Tokenization Assumptions
Most students assume `text.split(' ')` is tokenization. Social media text breaks every assumption standard tokenizers make:
```python
import re
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "loving this rn 😍 #MachineLearning @vt_cs can't wait!!! http://t.co/abc"

# ❌ Naive split — terrible for social media
tokens = tweet.split(' ')
# ['loving', 'this', 'rn', '😍', '#MachineLearning', '@vt_cs',
#  "can't", 'wait!!!', 'http://t.co/abc']
# Problems: emoji kept as noise, punctuation glued onto 'wait!!!',
# URL kept as a feature, no case normalization

# ❌ Standard NLTK tokenizer — designed for news, not tweets
tokens = word_tokenize(tweet)
# ['loving', 'this', 'rn', '😍', '#', 'MachineLearning', ...]
# Problems: splits the hashtag — loses its meaning entirely

# ✓ TweetTokenizer — built for this
tokenizer = TweetTokenizer(
    preserve_case=False,  # lowercase everything
    reduce_len=True,      # "waaaaait" → "waaait" (runs of 3+ chars capped at 3)
    strip_handles=True    # removes @mentions
)
tokens = tokenizer.tokenize(tweet)
# ['loving', 'this', 'rn', '😍', '#machinelearning', "can't", 'wait',
#  '!', '!', '!', 'http://t.co/abc']
# Hashtag and contraction survive intact — but the URL is still a token

# ✓ For classification: also strip URLs and normalize
def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'[^\w\s#@!?]', '', text)  # keep useful punctuation
    return TweetTokenizer(preserve_case=False, reduce_len=True).tokenize(text)
```
This was the #1 source of silent bugs. The model trains fine — it just learns from garbage features.
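To see why this poisons features downstream, here's a minimal, dependency-free sketch. The `naive_tokens` and `normalized_tokens` helpers and their regexes are mine, not from the assignment (the course code uses `TweetTokenizer`); the point is that a naive split turns several surface forms of the same word into separate features:

```python
import re

def naive_tokens(text):
    # The "tokenizer" many first submissions used
    return text.split(' ')

def normalized_tokens(text):
    # Rough stand-in for TweetTokenizer-style normalization:
    # lowercase, strip URLs, split trailing punctuation off words
    text = re.sub(r'http\S+', '', text.lower())
    return re.findall(r"#?\w[\w']*|[!?]", text)

tweets = ["Wait!!!", "wait", "so excited to wait http://t.co/abc", "WAIT!"]

naive_vocab = {tok for t in tweets for tok in naive_tokens(t)}
norm_vocab = {tok for t in tweets for tok in normalized_tokens(t)}

print(sorted(naive_vocab))  # 'Wait!!!', 'WAIT!', 'wait' are three distinct features
print(sorted(norm_vocab))   # all collapse onto 'wait'
```

The model trained on the naive vocabulary never learns that those three features are the same word, so every one of them is rarer and noisier than it should be.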
Bug #2: Data Leakage Through the Vectorizer
This is the most common high-impact mistake, and it inflated reported accuracy by 5–15% in multiple submissions:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# ❌ WRONG — fit vectorizer on ENTIRE dataset (leakage)
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(all_texts)  # sees test vocabulary!
X_train, X_test, y_train, y_test = train_test_split(X_all, labels)
model = MultinomialNB()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # inflated — test vocab was in training

# ✓ CORRECT — split FIRST, then fit vectorizer on train only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    all_texts, labels, test_size=0.2, stratify=labels, random_state=42
)
vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(X_train_raw)  # fit on train only
X_test = vectorizer.transform(X_test_raw)        # transform (not fit_transform!)
model = MultinomialNB()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # honest evaluation

# ✓ Even better — use Pipeline to make leakage impossible
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train_raw, y_train)
score = pipeline.score(X_test_raw, y_test)
```
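What "the vectorizer sees test vocabulary" means is easiest to show with a toy version of the fitting step. This pure-Python sketch (`fit_vocab` is my stand-in, mimicking only the vocabulary-collection part of `TfidfVectorizer.fit`) shows a word that exists only in the test set sneaking into the feature space:

```python
def fit_vocab(texts):
    # Mimics what a vectorizer's fit() does: collect the feature set
    return sorted({word for text in texts for word in text.split()})

train_texts = ["great phone", "terrible battery"]
test_texts = ["unbelievable camera"]  # words never seen in training

leaky_vocab = fit_vocab(train_texts + test_texts)  # ❌ fit on everything
honest_vocab = fit_vocab(train_texts)              # ✓ fit on train only

print("unbelievable" in leaky_vocab)   # True — a feature the model could
print("unbelievable" in honest_vocab)  # False   never have at deploy time
```

With TF-IDF the leak is worse than just extra columns: document frequencies are computed over the test set too, so even shared words get weights the model wouldn't see in production.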
Bug #3: Ignoring Class Imbalance
Social media sentiment datasets are almost always heavily imbalanced. A model predicting "neutral" for everything achieves 90% accuracy on a 90/5/5 split — and is completely useless:
```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import seaborn as sns
import matplotlib.pyplot as plt

# The "90% accuracy" trap
y_pred_dummy = np.array(["neutral"] * len(y_test))
print(f"Dummy accuracy: {(y_pred_dummy == np.asarray(y_test)).mean():.2%}")  # 90%!

# ✓ Always use classification_report (y_pred comes from your trained model)
print(classification_report(y_test, y_pred,
                            target_names=["negative", "neutral", "positive"]))
#               precision    recall  f1-score   support
#     negative       0.00      0.00      0.00        50
#      neutral       0.90      1.00      0.95       900
#     positive       0.00      0.00      0.00        50
#    macro avg       0.30      0.33      0.32      1000   ← the real story

# ✓ Fix: class weights in training
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, weights))
# sklearn: class_weight=class_weight_dict
# PyTorch: CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

# ✓ Fix: stratified split (preserves class ratios)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y  # ← always for imbalanced data
)

# ✓ Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=classes, yticklabels=classes)
plt.title("Confusion Matrix — don't skip this")
plt.show()
```
After I added a mandatory confusion matrix + macro F1 requirement, the quality of submissions jumped significantly. Accuracy as the sole metric was banned.
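The macro-average arithmetic behind that report is worth doing by hand once. A quick sketch, using the same 900/50/50 test split as the example above:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; 0 when both are 0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# All-"neutral" dummy classifier on 900 neutral / 50 negative / 50 positive
f1_negative = f1(0.0, 0.0)         # never predicted → precision = recall = 0
f1_positive = f1(0.0, 0.0)
f1_neutral = f1(900 / 1000, 900 / 900)  # precision 0.90, recall 1.00

accuracy = 900 / 1000                                      # 0.90 — looks great
macro_f1 = (f1_negative + f1_neutral + f1_positive) / 3    # ≈ 0.32 — the truth
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Because macro averaging weights every class equally regardless of support, the two ignored classes drag the score down to roughly a third of the accuracy — exactly the gap the classification report exposes.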
Bug #4: BERT as a Black Box
Fine-tuning BERT and getting 94% accuracy felt like success — until students couldn't explain why the model failed on specific examples or whether it was learning sentiment vs topic correlation:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from captum.attr import IntegratedGradients  # optional: token-level attribution

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Fine-tuning on the CS 5664 dataset (abbreviated)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",        # must match eval strategy for load_best_model_at_end
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,  # restore the best checkpoint, not the last one
    metric_for_best_model="f1",   # requires a compute_metrics fn that reports "f1"
)

# ✓ Error analysis — required in CS 5664
def analyze_errors(model, tokenizer, test_data):
    model.eval()
    errors = []
    for text, true_label in test_data:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        pred_label = logits.argmax(dim=-1).item()
        if pred_label != true_label:
            errors.append({
                "text": text,
                "predicted": pred_label,
                "true": true_label,
                "confidence": logits.softmax(dim=-1).max().item(),
            })
    return errors

# Common finding: the model learned topic (sports → positive),
# not sentiment ("team lost, devastating" → predicted positive,
# because sports tweets correlate with positive in the training set)
```
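One cheap way to catch that topic-correlation failure is to slice the errors by topic. This is a sketch with made-up records (the `records` tuples and topic names are invented; the slicing idea is the point): if a topic's errors all point the same direction, the model probably learned the topic, not the sentiment.

```python
from collections import defaultdict

# Hypothetical evaluation records: (topic, true_label, predicted_label)
records = [
    ("sports", "negative", "positive"),   # "team lost, devastating" — missed
    ("sports", "positive", "positive"),
    ("sports", "negative", "positive"),   # another miss in the same direction
    ("weather", "negative", "negative"),
    ("weather", "positive", "positive"),
]

by_topic = defaultdict(lambda: [0, 0])    # topic → [correct, total]
for topic, true, pred in records:
    by_topic[topic][0] += int(true == pred)
    by_topic[topic][1] += 1

for topic, (correct, total) in sorted(by_topic.items()):
    print(f"{topic:8s} accuracy: {correct / total:.2f}")
# All sports errors land on "positive" — a strong hint the model
# is keying on the topic rather than the sentiment
```

In a real submission the topic labels can come from hashtags, a keyword list, or a separate topic classifier — anything coarse enough to slice by.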
Teaching insight: The students who struggled most weren't those who didn't know the math — they were the ones who hadn't built a mental model of what the data represents before touching the code. Print 10 examples. Always. Before writing a single model line.
The Attention Mechanism Analogy That Worked
Explaining self-attention to students without a deep linear algebra background required finding the right abstraction:
"Imagine the sentence is a committee meeting. Every word is a committee member. Self-attention is each member voting on how much they care about every other member's input — for this specific decision. The word 'not' votes very high attention toward the word that follows it, because 'not good' completely flips the meaning of 'good'."
This analogy worked because students already understood voting, committees, and context-dependent relationships. The math followed from the intuition — not the other way around.
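For students who wanted to see the committee vote as actual numbers, the smallest possible version helped. This is pure Python with toy 2-dimensional vectors I invented for illustration (real models learn separate query/key/value projections; here Q = K = V = the raw vectors): each word scores every other word, softmax turns the scores into "votes" that sum to 1, and each output is a vote-weighted blend of all the words.

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings for "not good" — invented 2-d vectors
words = ["not", "good"]
vecs = [[1.0, 0.2], [0.3, 1.0]]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = the raw vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Each word scores every word (including itself), scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)  # the committee member's "votes"
        weights.append(w)
        # Output = vote-weighted blend of all word vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

outputs, weights = self_attention(vecs)
for word, w in zip(words, weights):
    print(word, [round(x, 2) for x in w])  # each row of votes sums to 1.0
```

Twenty lines, no frameworks — and every term in the attention formula now has a committee-meeting interpretation students could point at.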
NLP Pipeline Cheat Sheet for Social Media
```python
# Complete social media NLP pipeline — production-ready template
import re
from nltk.tokenize import TweetTokenizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report

def preprocess(text):
    """Social-media-aware preprocessing."""
    text = re.sub(r'http\S+', 'URL', text)  # normalize URLs
    text = re.sub(r'@\w+', 'USER', text)    # anonymize handles
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
    return ' '.join(tokenizer.tokenize(text))

# Load & preprocess (raw_texts and labels come from your dataset)
texts_clean = [preprocess(t) for t in raw_texts]

# Stratified split — mandatory for imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    texts_clean, labels, test_size=0.2, stratify=labels, random_state=42
)

# Pipeline — prevents leakage by design
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=15000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=500))
])

# Cross-validate honestly, then fit on the full training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1_macro'))
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))  # macro F1 matters
```
Takeaways for Anyone Learning NLP
- Always inspect your preprocessed text before feeding a model — print 10 examples, every time.
- Split data first, then fit all transformers on training data only. Use `Pipeline` to enforce this.
- Use stratified splits and class weights for imbalanced datasets — which is almost all social media data.
- Macro F1 > accuracy when classes are imbalanced. Report both.
- Run error analysis before declaring victory — ask why the model fails, not just how often.
- TweetTokenizer exists for a reason. Use purpose-built tools for social media text.