Fine-tuning a large language model used to require 8 A100 GPUs and a university compute cluster. Today, with parameter-efficient fine-tuning (PEFT) methods like LoRA, you can fine-tune a 7B parameter model on a single consumer GPU in a few hours.
This post documents exactly how I fine-tuned LLaMA 2 for a domain-specific code generation task, the results I got, and the lessons learned along the way.
Why Fine-Tune At All?
As a rough rule of thumb, prompt engineering gets you 80% of the way there and fine-tuning takes you to 95%+. For our use case, generating structured database queries from natural language, we needed:
- Consistent output format (no hallucinated column names)
- Domain-specific vocabulary (our internal schema, function names)
- Reduced token usage (fewer in-context examples needed)
The accuracy improvement alone justified the engineering effort.
Setting Up the Environment
pip install transformers peft datasets accelerate bitsandbytes
The key libraries:
- transformers: Hugging Face's library for loading pretrained models, plus training utilities
- peft: Parameter-Efficient Fine-Tuning library (LoRA lives here)
- bitsandbytes: 4-bit quantization for running larger models on smaller GPUs
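To see why quantization matters on a single GPU, here is a back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. The parameter count is the base-model size implied by the `print_trainable_parameters()` output later in this post (all params minus trainable params). The estimate deliberately ignores activations, optimizer state, the KV cache, and the small overhead of quantization constants, so treat it as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate for a 7B-class model at different precisions.
# Weights only: no activations, optimizer state, KV cache, or quantization
# metadata, so real usage will be higher.

N_PARAMS = 6_738_415_616  # base model size implied by the numbers in this post

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gib(n_params: int, bytes_per_param: float) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return n_params * bytes_per_param / 2**30


estimates = {
    name: weight_memory_gib(N_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}

for name, gib in estimates.items():
    print(f"{name:>10}: {gib:5.1f} GiB")
```

At 16-bit precision the weights alone take roughly 12.6 GiB, while 4-bit storage brings that down to a little over 3 GiB, which is what leaves headroom for activations and adapter gradients on a consumer card.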
LoRA: The Core Concept
LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable low-rank matrices alongside selected weight matrices (in our case, the attention query and value projections). Instead of updating all 7B parameters, you train only about 0.1% of them.
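To make the idea concrete, here is a minimal sketch of the low-rank update in plain Python, with no deep learning framework involved. It assumes the standard LoRA formulation, h = Wx + (alpha / r) * B(Ax), with A initialised randomly and B initialised to zero. The tiny dimensions and helper names (`matvec`, `lora_forward`) are purely illustrative and are not part of the peft API.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x), where W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # toy sizes, chosen for readability

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero-init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter is a no-op, so training starts from exactly
# the base model's behaviour.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out        # parameters in the frozen matrix W
lora_params = r * (d_in + d_out)  # parameters in A and B combined
print(full_params, lora_params)
```

Only A and B receive gradients during training. The savings look modest at this toy size, but they grow quickly with the hidden dimension, because the adapter scales with r * (d_in + d_out) rather than d_in * d_out.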
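from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (assumed here to be the same checkpoint
# that the deployment section serves)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling factor (effective scale is lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # which projection layers to adapt
    lora_dropout=0.05,                    # dropout applied on the adapter path
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM",
)

# Wrap the base model: original weights are frozen, only the adapters train
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12%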
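The trainable-parameter count printed above can be reproduced by hand, which is a useful sanity check on the config. The sketch below assumes the Llama 2 7B architecture (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); the helper name is hypothetical.

```python
def lora_trainable_params(r, d_in=4096, d_out=4096, n_target_modules=2, n_layers=32):
    """Count LoRA parameters: each adapted matrix gets A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * n_target_modules * n_layers


trainable = lora_trainable_params(r=16)
total = 6_746_804_224  # "all params" from the printed output above
pct = 100 * trainable / total

print(f"r=16: {trainable:,} trainable params ({pct:.2f}% of total)")
print(f"r=4:  {lora_trainable_params(r=4):,} trainable params")
```

Because the count is linear in `r`, dropping to r=4 cuts the adapter to a quarter of its size, and adding more target modules scales it up proportionally.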
Training Results
After fine-tuning on 50K query/SQL pairs:
| Metric | Base LLaMA 2 | Fine-Tuned |
|--------|--------------|------------|
| Exact Match | 34.2% | 78.6% |
| Execution Accuracy | 51.3% | 91.4% |
| Schema Adherence | 67.8% | 98.2% |
The schema adherence improvement was the most impactful for production use — the model stopped hallucinating column names that don't exist.
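For readers who want to compute similar metrics, here is one way the three checks could be sketched using Python's built-in `sqlite3` module. This is not our actual evaluation harness: the metric definitions, the toy `users` table, and the function names are all assumptions made for illustration. In particular, schema adherence is approximated by asking SQLite to compile the query and looking for "no such column" or "no such table" errors.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Collapse whitespace and drop a trailing semicolon."""
    return " ".join(sql.strip().rstrip(";").split())


def exact_match(pred: str, gold: str) -> bool:
    """String equality after light normalization."""
    return normalize(pred) == normalize(gold)


def run_query(conn, sql):
    """Return the result rows in a stable order, or None if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_match(conn, pred: str, gold: str) -> bool:
    """True if the prediction runs and returns the same rows as the reference."""
    pred_rows = run_query(conn, pred)
    return pred_rows is not None and pred_rows == run_query(conn, gold)


def schema_adherent(conn, sql: str) -> bool:
    """False if SQLite rejects the query for an unknown table or column."""
    try:
        conn.execute(f"EXPLAIN {sql}")  # compiles the query without running it
    except sqlite3.Error as exc:
        message = str(exc)
        if "no such column" in message or "no such table" in message:
            return False
    return True


# Toy database standing in for a real schema (illustrative only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO users VALUES (1, 'Ada', 'UK'), (2, 'Linus', 'FI'), (3, 'Grace', 'US');
""")

gold = "SELECT name FROM users WHERE country = 'UK'"
pred_ok = "SELECT u.name FROM users AS u WHERE u.country = 'UK'"
pred_bad = "SELECT username FROM users WHERE country = 'UK'"  # hallucinated column

print(exact_match(pred_ok, gold), execution_match(conn, pred_ok, gold))
print(schema_adherent(conn, pred_bad), execution_match(conn, pred_bad, gold))
```

The first prediction differs textually from the reference but returns the same rows, which is why exact match tends to understate quality and execution accuracy is the more meaningful number.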
Production Deployment
We deployed the fine-tuned adapter using vLLM for efficient inference:
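from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_lora_rank=16,  # must be at least the adapter's rank (r=16 in our config)
)

# Illustrative prompt; replace with real natural-language questions
prompts = ["List all customers who signed up in 2023."]

# Attach the fine-tuned adapter per request: (name, unique integer ID, local path)
outputs = llm.generate(
    prompts,
    SamplingParams(temperature=0.1, max_tokens=512),
    lora_request=LoRARequest("sql_adapter", 1, "./adapter"),
)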
vLLM's LoRA support means we can serve multiple fine-tuned adapters on the same base model, which is a huge cost win.
What Didn't Work
Attempt 1: Full fine-tuning. On a single A10G GPU, full fine-tuning repeatedly hit OOM errors, even with 4-bit quantization, and when a run did fit it took 12+ hours per epoch. LoRA solved this.
Attempt 2: Too small a rank (r=4). We initially used r=4 for maximum efficiency. The model learned the output format but did not pick up the domain vocabulary well. r=16 hit the sweet spot.
Attempt 3: Too little training. We underfit at 1 epoch. Training for 3 epochs with early stopping on validation loss worked best.
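The early-stopping logic from Attempt 3 is simple enough to sketch without any framework. The class below is a hypothetical, minimal version; the `patience` and `min_delta` parameters and the validation losses in the demo are made up for illustration and are not the values from our runs.

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record a validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1   # no improvement at this evaluation
        return self.bad_evals >= self.patience


# Made-up validation losses, one per evaluation step
val_losses = [0.92, 0.61, 0.48, 0.50, 0.53]

stopper = EarlyStopping(patience=2)
stopped_at = None
for eval_idx, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = eval_idx
        break

print(f"stopped at evaluation {stopped_at}, best val loss {stopper.best}")
```

In a Hugging Face `Trainer` setup, the built-in `EarlyStoppingCallback` plays the same role. Whichever you use, save the checkpoint with the best validation loss rather than the last one.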
Key Takeaways
- Start with LoRA, not full fine-tuning. You'll get 90% of the benefit at 5% of the cost.
- Data quality matters more than quantity. 10K high-quality pairs beat 100K noisy ones.
- Always evaluate on execution accuracy, not just text similarity.
- Keep the base model frozen. You want to add capability, not overwrite it.
The field is moving incredibly fast. QLoRA, DoRA, and other variants keep improving the efficiency/accuracy tradeoff. But the fundamentals covered here will remain relevant.