October 2, 2024 · 12 min read

Fine-Tuning LLMs on a Budget: From LoRA to Production

Machine Learning · LLM · LoRA · PyTorch · Research

Fine-tuning a large language model used to require 8 A100 GPUs and a university compute cluster. Today, with parameter-efficient fine-tuning (PEFT) methods like LoRA, you can fine-tune a 7B parameter model on a single consumer GPU in a few hours.

This post documents exactly how I fine-tuned LLaMA 2 for a domain-specific code generation task, the results I got, and the lessons learned along the way.

Why Fine-Tune At All?

Prompt engineering gets you 80% of the way there. Fine-tuning takes you to 95%+. For our use case — generating structured database queries from natural language — we needed:

Consistent output format (no hallucinated column names)
Domain-specific vocabulary (our internal schema, function names)
Reduced token usage (fewer in-context examples needed)

The accuracy improvement alone justified the engineering effort.

Setting Up the Environment

pip install transformers peft datasets accelerate bitsandbytes

The key libraries:

transformers — Hugging Face's model hub and training utilities
peft — Parameter-Efficient Fine-Tuning library (LoRA lives here)
bitsandbytes — 4-bit quantization for running larger models on smaller GPUs
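To put the 4-bit point in numbers, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. It assumes about 6.74B parameters for the 7B base model and ignores activations, gradients, optimizer state, and quantization overhead, so real usage will be higher. The function name is purely illustrative.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed parameter count for the 7B base model
NUM_PARAMS = 6_738_415_616

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):6.2f} GB")
```

Under these assumptions the weights drop from roughly 13.5 GB in 16-bit to roughly 3.4 GB in 4-bit, which is what leaves room for adapters and activations on a 24 GB card.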

LoRA: The Core Concept

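LoRA (Low-Rank Adaptation) works by freezing the original model weights and injecting small trainable rank-decomposition matrices alongside selected weight matrices in each layer (in the config below, the attention query and value projections). Each adapted weight W gets a learned low-rank update BA, scaled by lora_alpha / r. Instead of updating all 7B parameters, you train only about 0.1% of them.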

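from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model that the adapter will wrap; its weights stay frozen
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,              # rank of the update matrices
    lora_alpha=32,     # scaling factor (updates are scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",       # do not train bias terms
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12%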
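The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the LLaMA 2 7B architecture has 32 decoder layers with a hidden size of 4096, and that q_proj and v_proj are both 4096 x 4096. Each adapted matrix gets two small matrices, A (r x d_in) and B (d_out x r). The function name and defaults are illustrative.

```python
def lora_trainable_params(rank: int,
                          hidden_size: int = 4096,
                          num_layers: int = 32,
                          modules_per_layer: int = 2) -> int:
    """Count LoRA parameters when adapting square hidden_size x hidden_size
    projections (assumed here: q_proj and v_proj in every decoder layer)."""
    # A is (rank x d_in), B is (d_out x rank); d_in == d_out == hidden_size
    per_module = rank * (hidden_size + hidden_size)
    return per_module * modules_per_layer * num_layers


TOTAL_PARAMS = 6_746_804_224  # "all params" figure from the printout above

for r in (4, 16):
    n = lora_trainable_params(r)
    print(f"r={r:>2}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.2f}% of all params)")
```

Because the count is linear in the rank, dropping from r=16 to r=4 cuts the adapter to a quarter of the size, which is worth keeping in mind when weighing capacity against efficiency.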

Training Results

After fine-tuning on 50K query/SQL pairs:

| Metric | Base LLaMA 2 | Fine-Tuned |
|--------|-------------|------------|
| Exact Match | 34.2% | 78.6% |
| Execution Accuracy | 51.3% | 91.4% |
| Schema Adherence | 67.8% | 98.2% |

The schema adherence improvement was the most impactful for production use — the model stopped hallucinating column names that don't exist.
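The evaluation harness isn't shown in this post, so the following is only a minimal sketch of how exact match and execution accuracy can disagree. It uses Python's built-in sqlite3 and a made-up orders table. The table, the queries, and the helper names are all illustrative assumptions, not the internal schema or the scoring code behind the numbers above.

```python
import sqlite3


def exact_match(pred: str, gold: str) -> bool:
    """Crude string comparison: ignores case, extra whitespace, trailing ';'."""
    def norm(sql: str) -> str:
        return " ".join(sql.strip().rstrip(";").lower().split())
    return norm(pred) == norm(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True if both queries run and return the same rows (order-insensitive)."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # Invalid SQL, e.g. a hallucinated column name, counts as a miss
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical schema and data, for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'acme', 120.0), (2, 'globex', 80.0), (3, 'acme', 40.0);
""")

gold = "SELECT customer FROM orders WHERE total > 50"
pred = "SELECT customer FROM orders WHERE NOT total <= 50"  # different text, same rows
bad = "SELECT client FROM orders WHERE total > 50"          # column does not exist

print(exact_match(pred, gold), execution_match(conn, pred, gold))
print(exact_match(bad, gold), execution_match(conn, bad, gold))
```

The second query is textually different but returns the same rows, so it fails exact match while passing the execution check. The third references a nonexistent column and fails both. Comparing rows order-insensitively is a simplification; queries with ORDER BY would need an order-sensitive comparison.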

Production Deployment

We deployed the fine-tuned adapter using vLLM for efficient inference:

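from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_lora_rank=16,  # must be at least the adapter's rank (r=16)
)

# Placeholder prompt; substitute real natural-language questions
prompts = ["List the ten most recent orders."]

# Attach the fine-tuned adapter per request: (name, unique integer ID, adapter path)
outputs = llm.generate(
    prompts,
    SamplingParams(temperature=0.1, max_tokens=512),
    lora_request=LoRARequest("sql_adapter", 1, "./adapter"),
)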

vLLM's LoRA support means we can serve multiple fine-tuned adapters on the same base model, which is a huge cost win.
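A rough estimate shows where the saving comes from. The sketch below compares weight memory for one full 16-bit model per task against one shared base model plus one small adapter per task. It assumes about 6.74B base parameters and the 8,388,608-parameter r=16 adapter from earlier, and it counts weights only (no KV cache or activations). The function names and the choice of five tasks are hypothetical.

```python
BYTES_PER_PARAM_FP16 = 2
BASE_PARAMS = 6_738_415_616   # assumed size of the 7B base model
ADAPTER_PARAMS = 8_388_608    # one r=16 adapter on q_proj and v_proj


def separate_models_gb(num_tasks: int) -> float:
    """One full copy of the model per task."""
    return num_tasks * BASE_PARAMS * BYTES_PER_PARAM_FP16 / 1e9


def shared_base_gb(num_tasks: int) -> float:
    """One base model shared by all tasks, plus one adapter per task."""
    return (BASE_PARAMS + num_tasks * ADAPTER_PARAMS) * BYTES_PER_PARAM_FP16 / 1e9


num_tasks = 5  # hypothetical number of fine-tuned variants
print(f"separate models: {separate_models_gb(num_tasks):.2f} GB")
print(f"shared base:     {shared_base_gb(num_tasks):.2f} GB")
```

Under these assumptions each extra adapter adds under 20 MB of weights, so the shared setup stays close to the cost of a single model no matter how many variants are served.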

What Didn't Work

Attempt 1: Full fine-tuning — Even with 4-bit quantization, full fine-tuning on a single A10G GPU caused OOM errors and took 12+ hours per epoch. LoRA solved this.

Attempt 2: Too small a rank (r=4) — We initially used r=4 for maximum efficiency. The model learned formatting but not domain vocabulary well. r=16 hit the sweet spot.

Attempt 3: Too short training — We underfit at 1 epoch. 3 epochs with early stopping on validation loss was optimal.
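Early stopping on validation loss can be sketched in a few lines. This is a generic patience-based version, not the exact logic used in these runs, and the loss values are made up purely to show the mechanics. With the Hugging Face Trainer, the built-in EarlyStoppingCallback plays a similar role.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience: int = 1, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only
illustrative_losses = [0.90, 0.60, 0.50, 0.52, 0.55]

stopper = EarlyStopping(patience=1)
stopped_at = None
for epoch, loss in enumerate(illustrative_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped after epoch {stopped_at}, best validation loss {stopper.best}")
```

In this toy run the loss stops improving after the third epoch, so training halts at epoch 4 and the epoch 3 checkpoint is the one to keep.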

Key Takeaways

Start with LoRA, not full fine-tuning. You'll get 90% of the benefit at 5% of the cost.
Data quality matters more than quantity. 10K high-quality pairs beat 100K noisy ones.
Always evaluate on execution accuracy, not just text similarity.
Keep the base model frozen. You want to add capability, not overwrite it.

The field is moving incredibly fast. QLoRA, DoRA, and other variants keep improving the efficiency/accuracy tradeoff. But the fundamentals covered here will remain relevant.
