At Blenheim Chalcot's GenAI division, I worked on AI products that needed to do more than call an LLM API and return a response. We were building tools that had to understand our company's specific growth metrics, internal vocabulary, and business context — things a general-purpose model doesn't know. This post covers what the fine-tuning process actually looked like in practice, including the dataset pipeline, LoRA training setup, and how we evaluated beyond perplexity.
## Why Fine-Tuning, Not Just Prompting
The first question we always asked: can we solve this with prompt engineering alone? Often yes — for well-scoped tasks, a good system prompt gets you 80% of the way. But we hit four specific walls:
| Problem | Prompt Engineering Limit | Fine-Tuning Solves |
|---|---|---|
| Internal KPI vocabulary | Must explain every term in context window | Model learns vocabulary during training |
| Context window | Fills up with few-shot examples | Examples baked into weights |
| Consistency | Degrades with longer prompts | Output format learned reliably |
| Latency / cost | Longer prompts = higher cost, slower | Smaller fine-tuned model, faster inference |
A smaller fine-tuned model consistently beat a large model with long prompts — faster, cheaper, more consistent.
## The Full Fine-Tuning Pipeline
### Dataset Curation: The Hard Part
Everyone focuses on training. The dataset is where the real work happens.
#### Step 1: Define the Task Precisely
We were fine-tuning for one specific task: given a growth metric time-series and business context, generate a plain-English analysis with a recommended action. Not a general assistant — a specialist.
#### Step 2: Format as Instruction-Response Pairs
```python
import json

# Each training example is a single instruction-response pair
example = {
    "instruction": "Analyze the following activation rate trend and provide a recommended action for the product team.",
    "input": """
Metric: Activation Rate (% users who complete onboarding within 7 days)
Context: B2B SaaS, SMB segment, UK market
Data: [0.42, 0.41, 0.38, 0.35, 0.33, 0.31] (last 6 weeks)
Cohort size: ~2000 new signups/week
Recent changes: Onboarding flow redesigned 5 weeks ago
""",
    "output": """
The activation rate has declined 26% over 6 weeks (0.42 → 0.31), with the drop beginning immediately after the onboarding redesign 5 weeks ago. This is a strong indicator that the redesign introduced friction rather than reducing it.

Recommended action: Conduct a funnel analysis on the new onboarding flow to identify the step with highest drop-off. A/B test reverting to the previous flow for new signups this week. If the original flow recovers activation within 2 weeks, prioritize fixing the specific friction point in the redesign rather than a full rollback.

Urgency: High — each percentage point of activation rate at 2000 signups/week represents ~20 users who don't activate, compounding over cohort lifetime.
""",
}

# Append as one JSON object per line (JSONL)
with open('training_data.jsonl', 'a') as f:
    f.write(json.dumps(example) + '\n')
```
#### Step 3: Quality Filter
We removed examples where the output was vague, didn't specify an action, or used undefined jargon. Kept 160 of 200+ examples. Quantity is not the goal.
Key lesson: 160 high-quality examples consistently outperformed 500 mediocre ones in our validation. Data quality sets your ceiling — training technique is secondary.
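The filtering itself was mostly human judgment, but parts of it can be automated as a pre-filter. A minimal sketch of that kind of check — the keyword list and thresholds here are illustrative, not our exact rules:

```python
import re

# Illustrative pre-filter for training examples. The vague-phrase list and
# the 50-word minimum are example values, not the rules we actually used.
VAGUE_PHRASES = ["it depends", "consider various", "may or may not"]

def passes_quality_filter(example: dict) -> bool:
    output = example["output"]
    lowered = output.lower()
    # Must name a concrete action
    if "recommended action" not in lowered and "action:" not in lowered:
        return False
    # Must reference at least one number (analyses should be quantitative)
    if not re.search(r"\d", output):
        return False
    # Reject hedging-only analyses
    if any(p in lowered for p in VAGUE_PHRASES):
        return False
    # Reject outputs too short to be actionable
    return len(output.split()) >= 50
```

Anything that failed a check went back to a human for a rewrite-or-discard decision rather than being dropped silently.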
#### Step 4: Synthetic Augmentation
```python
import json
import openai

def generate_synthetic_example(seed_example: dict) -> dict:
    """Use GPT-4 to generate variants, then human-review before adding."""
    prompt = f"""
Generate a new training example following the same structure as this one,
but for a different growth metric (e.g. churn rate, NPS, DAU/MAU ratio,
conversion rate). Vary the trend pattern (recovery, plateau, spike).

Seed example:
{json.dumps(seed_example, indent=2)}

Output a JSON object with keys: instruction, input, output.
Make the analysis specific and the action concrete.
"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return json.loads(response.choices[0].message.content)

# Generated ~400 synthetic examples, human-reviewed, kept 300
```
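Synthetic generation tends to drift toward near-duplicates of the seed, which wastes reviewer time. A hypothetical dedup pass that could run before human review — the 5-gram Jaccard threshold of 0.6 is illustrative, not a value from our pipeline:

```python
# Hypothetical near-duplicate filter for synthetic examples, based on
# Jaccard similarity over word 5-grams. Threshold is illustrative.
def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_near_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return False
    jaccard = len(ga & gb) / len(ga | gb)
    return jaccard >= threshold

def dedup(examples: list[dict]) -> list[dict]:
    kept: list[dict] = []
    for ex in examples:
        if not any(is_near_duplicate(ex["output"], k["output"]) for k in kept):
            kept.append(ex)
    return kept
```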
### LoRA Training Setup
We used LoRA (Low-Rank Adaptation) via HuggingFace PEFT. Full fine-tuning was too expensive and risked catastrophic forgetting. LoRA trains small adapter matrices while keeping base model weights frozen:
LoRA adds a pair of small trainable matrices per adapted weight: A (r × d) and B (d × r). At inference the effective weight is W' = W + (α/r)·BA — the base model itself is unchanged, and the adapter can later be merged into W at no inference cost.
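Back-of-envelope on what that means for Mistral-7B with r=16 on q_proj and v_proj (the config used below): the adapter parameter count can be reproduced by hand. Dimensions come from the public Mistral-7B config — this is a sanity check, not part of the training pipeline:

```python
# Reproducing the LoRA trainable-parameter count by hand.
# Mistral-7B: hidden size 4096, 32 layers; v_proj outputs 1024 dims
# because of grouped-query attention (8 KV heads x head_dim 128).
d_model, d_kv, n_layers, r = 4096, 1024, 32, 16

q_proj_lora = r * (d_model + d_model)   # A: r x 4096, B: 4096 x r
v_proj_lora = r * (d_model + d_kv)      # A: r x 4096, B: 1024 x r
total = n_layers * (q_proj_lora + v_proj_lora)
print(total)  # 6815744 — matches what PEFT reports for this config
```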
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# Load base model in 4-bit quantization (QLoRA) for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                  # rank — higher = more capacity, more params
    lora_alpha=32,         # scaling factor (alpha/r = effective LR multiplier)
    target_modules=[       # which attention matrices to adapt
        "q_proj",          # queries
        "v_proj",          # values
        # "k_proj",        # keys — often skipped
        # "o_proj",        # output — add if underfitting
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 3,752,071,168 || trainable%: 0.1816%

# Training
training_args = TrainingArguments(
    output_dir="./growth-analyst-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=training_args,
    dataset_text_field="formatted_text",
    max_seq_length=2048,
)
trainer.train()
```
Training ran ~2.5 hours on a single A100 (40GB). We monitored validation loss and stopped when it plateaued at epoch 2.6 — avoiding overfitting on our 460-example dataset.
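Why a 7B model fits comfortably on one 40 GB card: a rough memory estimate under simplifying assumptions (ignoring activations, quantization overhead, and optimizer implementation details — illustrative arithmetic, not measured numbers):

```python
# Rough QLoRA memory estimate for Mistral-7B. Illustrative only.
base_params = 7.24e9      # frozen base weights
lora_params = 6.8e6       # trainable adapter weights

base_gb = base_params * 0.5 / 1e9   # 4-bit quantization ≈ 0.5 bytes/param
# adapter weights + gradients + two Adam moments, roughly 2 bytes each in bf16
lora_gb = lora_params * 2 * 4 / 1e9

print(round(base_gb, 1))   # ≈ 3.6 GB for the quantized base model
print(round(lora_gb, 2))   # ≈ 0.05 GB — the adapter state is essentially free
```

Activations and the KV cache take the bulk of the remaining headroom, which is why a 2048-token sequence length and batch size 4 still fit.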
## Evaluation: Three Layers, Not Just Perplexity
Perplexity tells you how well the model predicts the next token — not whether the analysis is actually useful to a growth team:
```python
import re

# Layer 1: Format compliance — automated
def check_format(output: str) -> dict:
    checks = {
        "has_metric_reference": bool(re.search(r'\d+%|\d+\.\d+', output)),
        "has_recommended_action": "recommend" in output.lower() or "action:" in output.lower(),
        "appropriate_length": 150 <= len(output.split()) <= 500,
        "no_hedging_only": not re.match(r'^(it (could|might|may))', output.lower()),
    }
    return checks

# Layer 2: Factual accuracy — semi-automated.
# Check whether numbers cited in the output actually appear in the input.
def check_factual(instruction_input: str, output: str) -> float:
    input_numbers = set(re.findall(r'\d+\.?\d*', instruction_input))
    output_numbers = set(re.findall(r'\d+\.?\d*', output))
    # Penalize outputs that cite numbers not present in the input
    hallucinated = output_numbers - input_numbers
    return 1.0 - (len(hallucinated) / (len(output_numbers) + 1))

# Layer 3: Business usefulness — human review.
# Rubric used by senior analysts (0-5 scale):
#   5 — I would act on this immediately
#   4 — Useful with minor clarification
#   3 — Correct but too vague to act on
#   2 — Partially correct, misleading recommendation
#   1 — Incorrect analysis
#   0 — Nonsense or refused to analyze
```
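To make Layer 2 concrete, here is how the hallucination score behaves on a faithful versus an embellished output (the scoring function is reproduced so the snippet runs standalone; the example strings are invented for illustration):

```python
import re

# Same logic as the Layer-2 check above, reproduced to run standalone.
def check_factual(instruction_input: str, output: str) -> float:
    input_numbers = set(re.findall(r"\d+\.?\d*", instruction_input))
    output_numbers = set(re.findall(r"\d+\.?\d*", output))
    hallucinated = output_numbers - input_numbers
    return 1.0 - len(hallucinated) / (len(output_numbers) + 1)

ctx = "Data: [0.42, 0.31] (last 6 weeks)"
faithful = "Rate fell from 0.42 to 0.31 over 6 weeks."
invented = "Rate fell from 0.42 to 0.31, a 26% drop versus the 0.55 benchmark."

print(check_factual(ctx, faithful))  # 1.0 — every cited number appears in the input
print(check_factual(ctx, invented))  # 0.6 — penalized for citing 26 and 0.55
```

Note that legitimate derived figures (the 26% decline is computable from the data) also get penalized — which is why this layer was semi-automated, with flagged outputs routed to human review rather than auto-failed.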
| Metric | Base Model (GPT-3.5 few-shot) | Fine-Tuned Model |
|---|---|---|
| Format compliance | 71% | 96% |
| Factual accuracy (human, 50 examples) | 78% | 91% |
| Business usefulness (avg score 0-5) | 3.1 | 4.3 |
| p50 latency | 2.1s | 1.2s (43% faster) |
| Cost per 1000 requests | ~$4.20 | ~$0.80 (hosted) |
## Connecting LLM Output to Regression Analysis
Alongside LLM work, I ran traditional regression analyses to identify which growth levers had the highest impact on key metrics. The LLM then consumed these regression outputs as structured context:
```python
import pandas as pd
import statsmodels.api as sm

# Regression: which factors predict 30-day activation?
X = df[['onboarding_steps_completed', 'time_to_first_value_days',
        'support_tickets_week1', 'plan_tier', 'company_size_bucket']]
y = df['activated_30d']

X = sm.add_constant(X)
model = sm.Logit(y, X).fit()

# Extract significant predictors (p < 0.05) for LLM context
significant = model.summary2().tables[1]
significant = significant[significant['P>|z|'] < 0.05]

# Feed regression output to the LLM as structured context
regression_context = f"""
Regression analysis (logistic, n={len(df)}, pseudo-R²={model.prsquared:.3f}):
Significant predictors of 30-day activation:
- time_to_first_value_days: coef={significant.loc['time_to_first_value_days', 'Coef.']:.3f} (each additional day reduces activation odds by X%)
- onboarding_steps_completed: coef=... (strong positive predictor)
"""
# The LLM now grounds its recommendation in statistical evidence
```
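The last step is wiring that context into the model's input format. A hypothetical composition helper — `build_model_input` is not from our codebase, and the coefficient shown is a made-up placeholder:

```python
# Hypothetical glue code: prepend regression findings to the metric snapshot
# so the fine-tuned model can cite statistical evidence in its analysis.
def build_model_input(regression_context: str, metric_snapshot: str) -> str:
    return (
        metric_snapshot.strip()
        + "\n\nStatistical evidence:\n"
        + regression_context.strip()
    )

model_input = build_model_input(
    regression_context="time_to_first_value_days: coef=-0.21 (p<0.05)",  # placeholder value
    metric_snapshot="Metric: Activation Rate\nData: [0.42, 0.31] (last 6 weeks)",
)
print(model_input.splitlines()[0])  # Metric: Activation Rate
```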
## Takeaways
- Invest the majority of your time in dataset quality — it sets your performance ceiling.
- LoRA (especially QLoRA) is the pragmatic choice for most fine-tuning tasks — 0.18% of parameters trained, 43% latency improvement.
- Define task-specific evaluation metrics before training, not after. Perplexity is not a business metric.
- Synthetic data augmentation (GPT-4 generation + human review) is a legitimate technique — but human review is non-negotiable.
- Combine classical statistical methods with LLMs — they're complementary. Regression identifies what matters; LLM explains it in human language.