Aligned’s recent partnership with GradPilot, a graduate school application copilot, gave us a close look at both the capabilities and the limits of today's AI models. Working together to improve AI-generated feedback on graduate school application essays taught us important lessons about fine-tuning models and the limits of prompt engineering.
In this post, we'll discuss:
The challenges we faced in generating high-quality feedback on student essays
Why prompt engineering alone wasn't sufficient
How fine-tuning dramatically improved model responses
Key takeaways for pushing AI models beyond their initial capabilities
The Challenge: Perfecting Generative Feedback
GradPilot was using Claude 3.5 Sonnet and sophisticated prompts to help grad students refine their Statements of Purpose. Their goal was clear: provide better, more critical advice, work faster, and reduce costs. But getting there wasn't straightforward.
Before we stepped in to help, GradPilot had been on a months-long quest to improve the quality of the generated feedback by:
Tweaking prompts
Creating style rubrics
Rewriting instructions
They made progress, but hit a wall. The model's tone wasn't quite right: it struggled to provide sufficiently critical feedback. Students needed tough love, but the model had been trained to be positive and gentle.
When Prompts Reach Their Limits
Prompt engineering can take you far and should be your first move, but it has limits. GradPilot's experience shows that sometimes, you need to go deeper.
The first challenge with fine-tuning is finding the right data. For GradPilot, that meant examples of feedback that struck the right tone: honest and critical, while still encouraging students to make changes. Collecting those examples was the breakthrough they needed.
Our Five-Step Fine-Tuning Process for GradPilot
Baseline Assessment: Our experts graded the existing AI's performance and provided feedback on accuracy, style and tone.
Expert Insight: Our writers crafted a style guide for ideal feedback based on the initial eval.
Expert Rewrites: Using the style guide, experts rewrote the feedback with the right tone, level of criticism, and encouragement.
Focused Training: We fine-tuned several models (Llama3-70B, Llama3-8B, GPT-4o, and GPT-4o-mini) on the rewritten feedback.
Rigorous Evaluation: We compared the fine-tuned models against the original service and against each other using fresh essays.
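To illustrate step 4: before fine-tuning, the expert rewrites have to be packaged as training examples. A minimal sketch in Python, assuming the chat-style JSONL format that OpenAI's fine-tuning API accepts (the system prompt and helper names here are hypothetical, not GradPilot's actual setup):

```python
import json

def to_training_example(essay: str, expert_feedback: str) -> dict:
    """Package one (essay, expert rewrite) pair as a chat-format example."""
    return {
        "messages": [
            # Hypothetical system prompt; the real one would encode the style guide.
            {"role": "system", "content": "You are a critical but encouraging essay reviewer."},
            {"role": "user", "content": essay},
            {"role": "assistant", "content": expert_feedback},
        ]
    }

def write_jsonl(pairs, path):
    """Serialize (essay, feedback) pairs to a JSONL training file, one example per line."""
    with open(path, "w") as f:
        for essay, feedback in pairs:
            f.write(json.dumps(to_training_example(essay, feedback)) + "\n")
```

The assistant turn carries the expert rewrite, so the model learns to reproduce that tone rather than its default gentleness.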
The Results: A Huge Upgrade
Using side-by-side ratings, we compared the output of all the models, including the original, against each other in a series of “battles” to compute their Elo ratings. Three of the four fine-tuned models (Llama3-70B, GPT-4o, and GPT-4o-mini) outperformed the original Claude 3.5 Sonnet responses, while the smallest fine-tuned model (Llama3-8B) had trouble producing great feedback.
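The battle-based scoring above can be sketched with the standard Elo update; the K-factor and starting rating below are illustrative defaults, not the values from our evaluation:

```python
def update_elo(r_winner: float, r_loser: float, k: float = 32.0):
    """Standard Elo update after one head-to-head battle."""
    # Expected probability that the winner would win, given current ratings.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

def rate_models(battles, models, start=1000.0, k=32.0):
    """Compute Elo ratings from a list of (winner, loser) model-name pairs."""
    ratings = {m: start for m in models}
    for winner, loser in battles:
        ratings[winner], ratings[loser] = update_elo(
            ratings[winner], ratings[loser], k
        )
    return ratings
```

Each rated comparison moves points from the loser to the winner, with upsets moving more points, so models that consistently win side-by-side comparisons converge to higher ratings.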
Fine-tuning didn't just improve the feedback; it transformed the whole service:
Quality: It delivered more critical, actionable feedback.
Tone: The fine-tuned models found a better balance between encouragement and necessary critique.
Speed/Cost: Smaller fine-tuned models like GPT-4o-mini process essays faster and at roughly a tenth of the cost of running the existing service.
This project was a showcase of what's possible with Aligned's expert-driven fine-tuning platform.
What Sets Aligned Apart?
Expert-Driven Data: We don't just collect data; we curate it, tapping into a network of domain experts for the highest quality training data.
Streamlined Process: From initial assessment to final testing, our platform and team guide you through each step of the fine-tuning journey.
Measurable Results: We provide clear, quantifiable improvements in AI performance, just like we did for GradPilot.
Is Your AI Hitting a Wall?
If GradPilot's story sounds familiar, you're not alone. Many organizations find their AI is good but not great, needing more nuanced, domain-specific responses while facing cost or speed concerns.
Don't let your service plateau when it could soar. Reach out to us at Aligned, and let's explore how we can take your models to the next level with high-quality data from our domain experts.