In January 2026 I started as a Graduate Teaching Assistant for CS 5664: Social Media Analytics at Virginia Tech. The course covers NLP, text mining, network analysis, and forecasting — all applied to real social media data. I expected to spend most of my time grading. What I didn't expect was how much teaching would sharpen my own understanding of NLP fundamentals.
The NLP Pipeline (Where Most Mistakes Happen)
Before diving into the bugs I saw repeatedly, here's the end-to-end NLP pipeline students were expected to implement:

raw text → cleaning → tokenization → feature extraction → model training → evaluation

Each arrow is a place something can go wrong silently. The model will train, produce numbers, and look successful — while being completely broken.
What the Course Covers
CS 5664 is graduate-level with students ranging from ML veterans to those taking their first serious NLP course. The curriculum spans:
| Module | Topics | Tools |
|---|---|---|
| Text Preprocessing | Tokenization, stemming, lemmatization, stopword removal | NLTK, spaCy |
| Feature Extraction | TF-IDF, Word2Vec, GloVe, FastText | scikit-learn, Gensim |
| Classification | Naive Bayes, SVM, LSTM, BERT fine-tuning | PyTorch, HuggingFace |
| Network Analysis | Community detection, influence propagation, centrality | NetworkX, Gephi |
| Forecasting | ARIMA, LSTM for trending topics and engagement | statsmodels, PyTorch |
Bug #1: Tokenization Assumptions
Most students assume `text.split(' ')` is tokenization. Social media text breaks every assumption standard tokenizers make:
```python
import re
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "loving this rn 😍 #MachineLearning @vt_cs can't wait!!! http://t.co/abc"

# ❌ Naive split — terrible for social media
tokens = tweet.split(' ')
# ['loving', 'this', 'rn', '😍', '#MachineLearning', '@vt_cs',
#  "can't", 'wait!!!', 'http://t.co/abc']
# Problems: emoji kept as noise, punctuation glued onto 'wait!!!',
# URL kept as a feature, no case normalization

# ❌ Standard NLTK tokenizer — designed for news, not tweets
tokens = word_tokenize(tweet)
# ['loving', 'this', 'rn', '😍', '#', 'MachineLearning', ...]
# Problems: splits the hashtag — loses its meaning entirely

# ✓ TweetTokenizer — built for this
tokenizer = TweetTokenizer(
    preserve_case=False,  # lowercase everything
    reduce_len=True,      # "waaaaait" → "waaait" (runs of 3+ chars capped at 3)
    strip_handles=True    # removes @mentions
)
tokens = tokenizer.tokenize(tweet)
# ['loving', 'this', 'rn', '😍', '#machinelearning', "can't", 'wait',
#  '!', '!', '!', 'http://t.co/abc']
# Hashtag and contraction survive intact — but the URL is still a token

# ✓ For classification: also strip URLs and normalize
def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'[^\w\s#@!?]', '', text)  # keep useful punctuation
    return TweetTokenizer(preserve_case=False, reduce_len=True).tokenize(text)
```
This was the #1 source of silent bugs. The model trains fine — it just learns from garbage features.
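To see why this poisons features downstream, here's a minimal, dependency-free sketch. The `naive_tokens` and `normalized_tokens` helpers and their regexes are mine, not from the assignment (the course code uses `TweetTokenizer`); the point is that a naive split turns several surface forms of the same word into separate features:

```python
import re

def naive_tokens(text):
    # The "tokenizer" many first submissions used
    return text.split(' ')

def normalized_tokens(text):
    # Rough stand-in for TweetTokenizer-style normalization:
    # lowercase, strip URLs, split trailing punctuation off words
    text = re.sub(r'http\S+', '', text.lower())
    return re.findall(r"#?\w[\w']*|[!?]", text)

tweets = ["Wait!!!", "wait", "so excited to wait http://t.co/abc", "WAIT!"]

naive_vocab = {tok for t in tweets for tok in naive_tokens(t)}
norm_vocab = {tok for t in tweets for tok in normalized_tokens(t)}

print(sorted(naive_vocab))  # 'Wait!!!', 'WAIT!', 'wait' are three distinct features
print(sorted(norm_vocab))   # all collapse onto 'wait'
```

The model trained on the naive vocabulary never learns that those three features are the same word, so every one of them is rarer and noisier than it should be.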
Bug #2: Data Leakage Through the Vectorizer
This is the most common high-impact mistake, and it inflated reported accuracy by 5–15% in multiple submissions:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# ❌ WRONG — fit vectorizer on ENTIRE dataset (leakage)
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(all_texts)  # sees test vocabulary!
X_train, X_test, y_train, y_test = train_test_split(X_all, labels)
model = MultinomialNB()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # inflated — test vocab was in training

# ✓ CORRECT — split FIRST, then fit vectorizer on train only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    all_texts, labels, test_size=0.2, stratify=labels, random_state=42
)
vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(X_train_raw)  # fit on train only
X_test = vectorizer.transform(X_test_raw)        # transform (not fit_transform!)
model = MultinomialNB()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # honest evaluation

# ✓ Even better — use Pipeline to make leakage impossible
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train_raw, y_train)
score = pipeline.score(X_test_raw, y_test)
```
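What "the vectorizer sees test vocabulary" means is easiest to show with a toy version of the fitting step. This pure-Python sketch (`fit_vocab` is my stand-in, mimicking only the vocabulary-collection part of `TfidfVectorizer.fit`) shows a word that exists only in the test set sneaking into the feature space:

```python
def fit_vocab(texts):
    # Mimics what a vectorizer's fit() does: collect the feature set
    return sorted({word for text in texts for word in text.split()})

train_texts = ["great phone", "terrible battery"]
test_texts = ["unbelievable camera"]  # words never seen in training

leaky_vocab = fit_vocab(train_texts + test_texts)  # ❌ fit on everything
honest_vocab = fit_vocab(train_texts)              # ✓ fit on train only

print("unbelievable" in leaky_vocab)   # True — a feature the model could
print("unbelievable" in honest_vocab)  # False   never have at deploy time
```

With TF-IDF the leak is worse than just extra columns: document frequencies are computed over the test set too, so even shared words get weights the model wouldn't see in production.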
Bug #3: Ignoring Class Imbalance
Social media sentiment datasets are almost always heavily imbalanced. A model predicting "neutral" for everything achieves 90% accuracy on a 90/5/5 split — and is completely useless:
```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import seaborn as sns
import matplotlib.pyplot as plt

# The "90% accuracy" trap
y_pred_dummy = np.array(["neutral"] * len(y_test))
print(f"Dummy accuracy: {(y_pred_dummy == np.asarray(y_test)).mean():.2%}")  # 90%!

# ✓ Always use classification_report (y_pred comes from your trained model)
print(classification_report(y_test, y_pred,
                            target_names=["negative", "neutral", "positive"]))
#               precision    recall  f1-score   support
#     negative       0.00      0.00      0.00        50
#      neutral       0.90      1.00      0.95       900
#     positive       0.00      0.00      0.00        50
#    macro avg       0.30      0.33      0.32      1000   ← the real story

# ✓ Fix: class weights in training
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, weights))
# sklearn: class_weight=class_weight_dict
# PyTorch: CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

# ✓ Fix: stratified split (preserves class ratios)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y  # ← always for imbalanced data
)

# ✓ Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=classes, yticklabels=classes)
plt.title("Confusion Matrix — don't skip this")
plt.show()
```
After I added a mandatory confusion matrix + macro F1 requirement, the quality of submissions jumped significantly. Accuracy as the sole metric was banned.
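The macro-average arithmetic behind that report is worth doing by hand once. A quick sketch, using the same 900/50/50 test split as the example above:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; 0 when both are 0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# All-"neutral" dummy classifier on 900 neutral / 50 negative / 50 positive
f1_negative = f1(0.0, 0.0)         # never predicted → precision = recall = 0
f1_positive = f1(0.0, 0.0)
f1_neutral = f1(900 / 1000, 900 / 900)  # precision 0.90, recall 1.00

accuracy = 900 / 1000                                      # 0.90 — looks great
macro_f1 = (f1_negative + f1_neutral + f1_positive) / 3    # ≈ 0.32 — the truth
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Because macro averaging weights every class equally regardless of support, the two ignored classes drag the score down to roughly a third of the accuracy — exactly the gap the classification report exposes.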
Bug #4: BERT as a Black Box
Fine-tuning BERT and getting 94% accuracy felt like success — until students couldn't explain why the model failed on specific examples or whether it was learning sentiment vs topic correlation:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from captum.attr import IntegratedGradients  # optional: token-level attribution

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Fine-tuning on the CS 5664 dataset (abbreviated)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",        # must match eval strategy for load_best_model_at_end
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,  # restore the best checkpoint, not the last one
    metric_for_best_model="f1",   # requires a compute_metrics fn that reports "f1"
)

# ✓ Error analysis — required in CS 5664
def analyze_errors(model, tokenizer, test_data):
    model.eval()
    errors = []
    for text, true_label in test_data:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        pred_label = logits.argmax(dim=-1).item()
        if pred_label != true_label:
            errors.append({
                "text": text,
                "predicted": pred_label,
                "true": true_label,
                "confidence": logits.softmax(dim=-1).max().item(),
            })
    return errors

# Common finding: the model learned topic (sports → positive),
# not sentiment ("team lost, devastating" → predicted positive,
# because sports tweets correlate with positive in the training set)
```
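One cheap way to catch that topic-correlation failure is to slice the errors by topic. This is a sketch with made-up records (the `records` tuples and topic names are invented; the slicing idea is the point): if a topic's errors all point the same direction, the model probably learned the topic, not the sentiment.

```python
from collections import defaultdict

# Hypothetical evaluation records: (topic, true_label, predicted_label)
records = [
    ("sports", "negative", "positive"),   # "team lost, devastating" — missed
    ("sports", "positive", "positive"),
    ("sports", "negative", "positive"),   # another miss in the same direction
    ("weather", "negative", "negative"),
    ("weather", "positive", "positive"),
]

by_topic = defaultdict(lambda: [0, 0])    # topic → [correct, total]
for topic, true, pred in records:
    by_topic[topic][0] += int(true == pred)
    by_topic[topic][1] += 1

for topic, (correct, total) in sorted(by_topic.items()):
    print(f"{topic:8s} accuracy: {correct / total:.2f}")
# All sports errors land on "positive" — a strong hint the model
# is keying on the topic rather than the sentiment
```

In a real submission the topic labels can come from hashtags, a keyword list, or a separate topic classifier — anything coarse enough to slice by.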
Teaching insight: The students who struggled most weren't those who didn't know the math — they were the ones who hadn't built a mental model of what the data represents before touching the code. Print 10 examples. Always. Before writing a single model line.
The Attention Mechanism Analogy That Worked
Explaining self-attention to students without a deep linear algebra background required finding the right abstraction:
"Imagine the sentence is a committee meeting. Every word is a committee member. Self-attention is each member voting on how much they care about every other member's input — for this specific decision. The word 'not' votes very high attention toward the word that follows it, because 'not good' completely flips the meaning of 'good'."
This analogy worked because students already understood voting, committees, and context-dependent relationships. The math followed from the intuition — not the other way around.
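For students who wanted to see the committee vote as actual numbers, the smallest possible version helped. This is pure Python with toy 2-dimensional vectors I invented for illustration (real models learn separate query/key/value projections; here Q = K = V = the raw vectors): each word scores every other word, softmax turns the scores into "votes" that sum to 1, and each output is a vote-weighted blend of all the words.

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings for "not good" — invented 2-d vectors
words = ["not", "good"]
vecs = [[1.0, 0.2], [0.3, 1.0]]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = the raw vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Each word scores every word (including itself), scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)  # the committee member's "votes"
        weights.append(w)
        # Output = vote-weighted blend of all word vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

outputs, weights = self_attention(vecs)
for word, w in zip(words, weights):
    print(word, [round(x, 2) for x in w])  # each row of votes sums to 1.0
```

Twenty lines, no frameworks — and every term in the attention formula now has a committee-meeting interpretation students could point at.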
NLP Pipeline Cheat Sheet for Social Media
```python
# Complete social media NLP pipeline — production-ready template
import re
from nltk.tokenize import TweetTokenizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report

def preprocess(text):
    """Social-media-aware preprocessing."""
    text = re.sub(r'http\S+', 'URL', text)  # normalize URLs
    text = re.sub(r'@\w+', 'USER', text)    # anonymize handles
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
    return ' '.join(tokenizer.tokenize(text))

# Load & preprocess (raw_texts and labels come from your dataset)
texts_clean = [preprocess(t) for t in raw_texts]

# Stratified split — mandatory for imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    texts_clean, labels, test_size=0.2, stratify=labels, random_state=42
)

# Pipeline — prevents leakage by design
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=15000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=500))
])

# Cross-validate honestly, then fit on the full training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1_macro'))
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))  # macro F1 matters
```
Takeaways for Anyone Learning NLP
- Always inspect your preprocessed text before feeding a model — print 10 examples, every time.
- Split data first, then fit all transformers on training data only. Use `Pipeline` to enforce this.
- Use stratified splits and class weights for imbalanced datasets — which is almost all social media data.
- Macro F1 > accuracy when classes are imbalanced. Report both.
- Run error analysis before declaring victory — ask why the model fails, not just how often.
- TweetTokenizer exists for a reason. Use purpose-built tools for social media text.