At Blenheim Chalcot's GenAI division, I worked on AI products that needed to do more than call an LLM API and return a response. We were building tools that had to understand our company's specific growth metrics, internal vocabulary, and business context — things a general-purpose model doesn't know. This post covers what the fine-tuning process actually looked like in practice, including the dataset pipeline, LoRA training setup, and how we evaluated beyond perplexity.
## Why Fine-Tuning, Not Just Prompting
The first question we always asked: can we solve this with prompt engineering alone? Often yes — for well-scoped tasks, a good system prompt gets you 80% of the way. But we hit four specific walls:
| Problem | Prompt Engineering Limit | Fine-Tuning Solves |
|---|---|---|
| Internal KPI vocabulary | Must explain every term in context window | Model learns vocabulary during training |
| Context window | Fills up with few-shot examples | Examples baked into weights |
| Consistency | Degrades with longer prompts | Output format learned reliably |
| Latency / cost | Longer prompts = higher cost, slower | Smaller fine-tuned model, faster inference |
A smaller fine-tuned model consistently beat a large model with long prompts — faster, cheaper, more consistent.
## The Full Fine-Tuning Pipeline
### Dataset Curation: The Hard Part
Everyone focuses on training. The dataset is where the real work happens.
#### Step 1: Define the Task Precisely
We were fine-tuning for one specific task: given a growth metric time-series and business context, generate a plain-English analysis with a recommended action. Not a general assistant — a specialist.
#### Step 2: Format as Instruction-Response Pairs
```python
import json

# Each training example is a single instruction-response pair
example = {
    "instruction": "Analyze the following activation rate trend and provide a recommended action for the product team.",
    "input": """
Metric: Activation Rate (% users who complete onboarding within 7 days)
Context: B2B SaaS, SMB segment, UK market
Data: [0.42, 0.41, 0.38, 0.35, 0.33, 0.31] (last 6 weeks)
Cohort size: ~2000 new signups/week
Recent changes: Onboarding flow redesigned 5 weeks ago
""",
    "output": """
The activation rate has declined 26% over 6 weeks (0.42 → 0.31), with the drop beginning immediately after the onboarding redesign 5 weeks ago. This is a strong indicator that the redesign introduced friction rather than reducing it.

Recommended action: Conduct a funnel analysis on the new onboarding flow to identify the step with highest drop-off. A/B test reverting to the previous flow for new signups this week. If the original flow recovers activation within 2 weeks, prioritize fixing the specific friction point in the redesign rather than a full rollback.

Urgency: High — each percentage point of activation rate at 2000 signups/week represents ~20 users who don't activate, compounding over cohort lifetime.
""",
}

# Append as one JSON object per line (JSONL)
with open('training_data.jsonl', 'a') as f:
    f.write(json.dumps(example) + '\n')
```
#### Step 3: Quality Filter
We removed examples where the output was vague, didn't specify an action, or used undefined jargon. Kept 160 of 200+ examples. Quantity is not the goal.
Key lesson: 160 high-quality examples consistently outperformed 500 mediocre ones in our validation. Data quality sets your ceiling — training technique is secondary.
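The filtering itself was mostly human judgment, but parts of it can be automated as a pre-filter. A minimal sketch of that kind of check — the keyword list and thresholds here are illustrative, not our exact rules:

```python
import re

# Illustrative pre-filter for training examples. The vague-phrase list and
# the 50-word minimum are example values, not the rules we actually used.
VAGUE_PHRASES = ["it depends", "consider various", "may or may not"]

def passes_quality_filter(example: dict) -> bool:
    output = example["output"]
    lowered = output.lower()
    # Must name a concrete action
    if "recommended action" not in lowered and "action:" not in lowered:
        return False
    # Must reference at least one number (analyses should be quantitative)
    if not re.search(r"\d", output):
        return False
    # Reject hedging-only analyses
    if any(p in lowered for p in VAGUE_PHRASES):
        return False
    # Reject outputs too short to be actionable
    return len(output.split()) >= 50
```

Anything that failed a check went back to a human for a rewrite-or-discard decision rather than being dropped silently.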
#### Step 4: Synthetic Augmentation
```python
import json
import openai

def generate_synthetic_example(seed_example: dict) -> dict:
    """Use GPT-4 to generate variants, then human-review before adding."""
    prompt = f"""
Generate a new training example following the same structure as this one,
but for a different growth metric (e.g. churn rate, NPS, DAU/MAU ratio,
conversion rate). Vary the trend pattern (recovery, plateau, spike).

Seed example:
{json.dumps(seed_example, indent=2)}

Output a JSON object with keys: instruction, input, output.
Make the analysis specific and the action concrete.
"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return json.loads(response.choices[0].message.content)

# Generated ~400 synthetic examples, human-reviewed, kept 300
```
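Synthetic generation tends to drift toward near-duplicates of the seed, which wastes reviewer time. A hypothetical dedup pass that could run before human review — the 5-gram Jaccard threshold of 0.6 is illustrative, not a value from our pipeline:

```python
# Hypothetical near-duplicate filter for synthetic examples, based on
# Jaccard similarity over word 5-grams. Threshold is illustrative.
def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_near_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return False
    jaccard = len(ga & gb) / len(ga | gb)
    return jaccard >= threshold

def dedup(examples: list[dict]) -> list[dict]:
    kept: list[dict] = []
    for ex in examples:
        if not any(is_near_duplicate(ex["output"], k["output"]) for k in kept):
            kept.append(ex)
    return kept
```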
### LoRA Training Setup
We used LoRA (Low-Rank Adaptation) via HuggingFace PEFT. Full fine-tuning was too expensive and risked catastrophic forgetting. LoRA trains small adapter matrices while keeping base model weights frozen:
LoRA adds a pair of small trainable matrices per adapted weight: A (r × d) and B (d × r). At inference the effective weight is W' = W + (α/r)·BA — the base model itself is unchanged, and the adapter can later be merged into W at no inference cost.
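Back-of-envelope on what that means for Mistral-7B with r=16 on q_proj and v_proj (the config used below): the adapter parameter count can be reproduced by hand. Dimensions come from the public Mistral-7B config — this is a sanity check, not part of the training pipeline:

```python
# Reproducing the LoRA trainable-parameter count by hand.
# Mistral-7B: hidden size 4096, 32 layers; v_proj outputs 1024 dims
# because of grouped-query attention (8 KV heads x head_dim 128).
d_model, d_kv, n_layers, r = 4096, 1024, 32, 16

q_proj_lora = r * (d_model + d_model)   # A: r x 4096, B: 4096 x r
v_proj_lora = r * (d_model + d_kv)      # A: r x 4096, B: 1024 x r
total = n_layers * (q_proj_lora + v_proj_lora)
print(total)  # 6815744 — matches what PEFT reports for this config
```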
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# Load base model in 4-bit quantization (QLoRA) for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                  # rank — higher = more capacity, more params
    lora_alpha=32,         # scaling factor (alpha/r = effective LR multiplier)
    target_modules=[       # which attention matrices to adapt
        "q_proj",          # queries
        "v_proj",          # values
        # "k_proj",        # keys — often skipped
        # "o_proj",        # output — add if underfitting
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 3,752,071,168 || trainable%: 0.1816%

# Training
training_args = TrainingArguments(
    output_dir="./growth-analyst-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=training_args,
    dataset_text_field="formatted_text",
    max_seq_length=2048,
)
trainer.train()
```
Training ran ~2.5 hours on a single A100 (40GB). We monitored validation loss and stopped when it plateaued at epoch 2.6 — avoiding overfitting on our 460-example dataset.
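Why a 7B model fits comfortably on one 40 GB card: a rough memory estimate under simplifying assumptions (ignoring activations, quantization overhead, and optimizer implementation details — illustrative arithmetic, not measured numbers):

```python
# Rough QLoRA memory estimate for Mistral-7B. Illustrative only.
base_params = 7.24e9      # frozen base weights
lora_params = 6.8e6       # trainable adapter weights

base_gb = base_params * 0.5 / 1e9   # 4-bit quantization ≈ 0.5 bytes/param
# adapter weights + gradients + two Adam moments, roughly 2 bytes each in bf16
lora_gb = lora_params * 2 * 4 / 1e9

print(round(base_gb, 1))   # ≈ 3.6 GB for the quantized base model
print(round(lora_gb, 2))   # ≈ 0.05 GB — the adapter state is essentially free
```

Activations and the KV cache take the bulk of the remaining headroom, which is why a 2048-token sequence length and batch size 4 still fit.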
## Evaluation: Three Layers, Not Just Perplexity
Perplexity tells you how well the model predicts the next token — not whether the analysis is actually useful to a growth team:
```python
import re

# Layer 1: Format compliance — automated
def check_format(output: str) -> dict:
    checks = {
        "has_metric_reference": bool(re.search(r'\d+%|\d+\.\d+', output)),
        "has_recommended_action": "recommend" in output.lower() or "action:" in output.lower(),
        "appropriate_length": 150 <= len(output.split()) <= 500,
        "no_hedging_only": not re.match(r'^(it (could|might|may))', output.lower()),
    }
    return checks

# Layer 2: Factual accuracy — semi-automated.
# Check whether numbers cited in the output actually appear in the input.
def check_factual(instruction_input: str, output: str) -> float:
    input_numbers = set(re.findall(r'\d+\.?\d*', instruction_input))
    output_numbers = set(re.findall(r'\d+\.?\d*', output))
    # Penalize outputs that cite numbers not present in the input
    hallucinated = output_numbers - input_numbers
    return 1.0 - (len(hallucinated) / (len(output_numbers) + 1))

# Layer 3: Business usefulness — human review.
# Rubric used by senior analysts (0-5 scale):
#   5 — I would act on this immediately
#   4 — Useful with minor clarification
#   3 — Correct but too vague to act on
#   2 — Partially correct, misleading recommendation
#   1 — Incorrect analysis
#   0 — Nonsense or refused to analyze
```
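To make Layer 2 concrete, here is how the hallucination score behaves on a faithful versus an embellished output (the scoring function is reproduced so the snippet runs standalone; the example strings are invented for illustration):

```python
import re

# Same logic as the Layer-2 check above, reproduced to run standalone.
def check_factual(instruction_input: str, output: str) -> float:
    input_numbers = set(re.findall(r"\d+\.?\d*", instruction_input))
    output_numbers = set(re.findall(r"\d+\.?\d*", output))
    hallucinated = output_numbers - input_numbers
    return 1.0 - len(hallucinated) / (len(output_numbers) + 1)

ctx = "Data: [0.42, 0.31] (last 6 weeks)"
faithful = "Rate fell from 0.42 to 0.31 over 6 weeks."
invented = "Rate fell from 0.42 to 0.31, a 26% drop versus the 0.55 benchmark."

print(check_factual(ctx, faithful))  # 1.0 — every cited number appears in the input
print(check_factual(ctx, invented))  # 0.6 — penalized for citing 26 and 0.55
```

Note that legitimate derived figures (the 26% decline is computable from the data) also get penalized — which is why this layer was semi-automated, with flagged outputs routed to human review rather than auto-failed.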
| Metric | Base Model (GPT-3.5 few-shot) | Fine-Tuned Model |
|---|---|---|
| Format compliance | 71% | 96% |
| Factual accuracy (human, 50 examples) | 78% | 91% |
| Business usefulness (avg score 0-5) | 3.1 | 4.3 |
| p50 latency | 2.1s | 1.2s (43% faster) |
| Cost per 1000 requests | ~$4.20 | ~$0.80 (hosted) |
## Connecting LLM Output to Regression Analysis
Alongside LLM work, I ran traditional regression analyses to identify which growth levers had the highest impact on key metrics. The LLM then consumed these regression outputs as structured context:
```python
import pandas as pd
import statsmodels.api as sm

# Regression: which factors predict 30-day activation?
X = df[['onboarding_steps_completed', 'time_to_first_value_days',
        'support_tickets_week1', 'plan_tier', 'company_size_bucket']]
y = df['activated_30d']

X = sm.add_constant(X)
model = sm.Logit(y, X).fit()

# Extract significant predictors (p < 0.05) for LLM context
significant = model.summary2().tables[1]
significant = significant[significant['P>|z|'] < 0.05]

# Feed regression output to the LLM as structured context
regression_context = f"""
Regression analysis (logistic, n={len(df)}, pseudo-R²={model.prsquared:.3f}):
Significant predictors of 30-day activation:
- time_to_first_value_days: coef={significant.loc['time_to_first_value_days', 'Coef.']:.3f} (each additional day reduces activation odds by X%)
- onboarding_steps_completed: coef=... (strong positive predictor)
"""
# The LLM now grounds its recommendation in statistical evidence
```
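The last step is wiring that context into the model's input format. A hypothetical composition helper — `build_model_input` is not from our codebase, and the coefficient shown is a made-up placeholder:

```python
# Hypothetical glue code: prepend regression findings to the metric snapshot
# so the fine-tuned model can cite statistical evidence in its analysis.
def build_model_input(regression_context: str, metric_snapshot: str) -> str:
    return (
        metric_snapshot.strip()
        + "\n\nStatistical evidence:\n"
        + regression_context.strip()
    )

model_input = build_model_input(
    regression_context="time_to_first_value_days: coef=-0.21 (p<0.05)",  # placeholder value
    metric_snapshot="Metric: Activation Rate\nData: [0.42, 0.31] (last 6 weeks)",
)
print(model_input.splitlines()[0])  # Metric: Activation Rate
```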
## Takeaways
- Invest the majority of your time in dataset quality — it sets your performance ceiling.
- LoRA (especially QLoRA) is the pragmatic choice for most fine-tuning tasks — 0.18% of parameters trained, 43% latency improvement.
- Define task-specific evaluation metrics before training, not after. Perplexity is not a business metric.
- Synthetic data augmentation (GPT-4 generation + human review) is a legitimate technique — but human review is non-negotiable.
- Combine classical statistical methods with LLMs — they're complementary. Regression identifies what matters; LLM explains it in human language.