
Fine-tuning GradPilot: When Prompts Aren’t Enough




Aligned’s recent partnership with GradPilot, a graduate school application copilot, provided insights into the capabilities and limitations of AI models. We focused on improving AI-generated feedback for graduate school applicant essays, revealing important lessons about fine-tuning models and the limitations of prompt engineering.


In this post, we'll discuss:


  1. The challenges we faced in generating high-quality feedback on student essays

  2. Why prompt engineering alone wasn't sufficient

  3. How fine-tuning dramatically improved model responses

  4. Key takeaways for pushing AI models beyond their initial capabilities


The Challenge: Perfecting Generative Feedback


GradPilot was using Claude 3.5 Sonnet and sophisticated prompts to help grad students refine their Statements of Purpose. Their goal was clear: provide better, more critical advice, work faster, and reduce costs. But getting there wasn't straightforward.


Before we stepped in to help, GradPilot had been on a months-long quest to improve the quality of the generated feedback by:


  • Tweaking prompts

  • Creating style rubrics

  • Rewriting instructions


They made progress, but hit a wall. The model's tone wasn't quite right: it struggled to provide sufficiently critical feedback. Students needed tough love, but the model had been trained to be positive and gentle.


When Prompts Reach Their Limits


Prompt engineering can take you far and should be your first move, but it has limits. GradPilot's experience shows that sometimes, you need to go deeper.


The first challenge with fine-tuning is finding the right data. For GradPilot, the breakthrough was training data that demonstrated the right tone of feedback and appropriately encouraged students to make changes.


Our Five-Step Fine-Tuning Process for GradPilot


  1. Baseline Assessment: Our experts graded the existing AI's performance and provided feedback on accuracy, style and tone.

  2. Expert Insight: Our writers crafted a style guide for ideal feedback based on the initial eval.

  3. Expert Rewrites: Using the style guide, experts rewrote the feedback with the right tone, level of criticism, and encouragement.

  4. Focused Training: We fine-tuned several models (Llama3-70B, Llama3-8B, GPT-4o, and GPT-4o-mini) with the rewritten feedback.

  5. Rigorous Evaluation: We compared the fine-tuned models against the original service and against each other using fresh essays.
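Step 3 of the process above boils down to packaging each (essay, expert rewrite) pair as a supervised training example. A minimal sketch of that data preparation, using the chat-format JSONL common to OpenAI and open-weight fine-tuning pipelines; the system prompt, field contents, and example text are illustrative assumptions, not GradPilot's actual data:

```python
# Hypothetical sketch: turn expert rewrites into chat-format fine-tuning
# examples (JSONL). The prompt and sample pair below are invented for
# illustration only.
import json

STYLE_SYSTEM_PROMPT = (
    "You are a graduate-admissions essay coach. Be direct and critical, "
    "but pair every critique with a concrete, encouraging next step."
)

def to_training_example(essay: str, expert_feedback: str) -> dict:
    """Pack one (essay, expert rewrite) pair into the chat fine-tuning format."""
    return {
        "messages": [
            {"role": "system", "content": STYLE_SYSTEM_PROMPT},
            {"role": "user", "content": essay},
            {"role": "assistant", "content": expert_feedback},
        ]
    }

pairs = [
    ("My passion for biology began in childhood...",
     "The opening is generic; admissions readers see this anecdote constantly. "
     "Lead with a concrete research result instead, then state the question it raised."),
]

# One JSON object per line, ready to upload to a fine-tuning job.
jsonl = "\n".join(json.dumps(to_training_example(e, f)) for e, f in pairs)
```

The key design choice is that the assistant turn contains the expert's rewrite, so the model learns the target tone directly rather than from prompt instructions.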


The Results: A Huge Upgrade




Using side-by-side ratings, we compared the output of all the models, including the original, against each other in a series of “battles” to compute their Elo ratings. Three of the four fine-tuned models (Llama3-70B, GPT-4o, and GPT-4o-mini) outperformed the original Claude 3.5 Sonnet responses, while the smallest fine-tuned model (Llama3-8B) had trouble producing great feedback.
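The Elo computation from those pairwise battles can be sketched in a few lines. Everything here is a standard Elo update, but the model names, starting rating, K-factor, and battle results are illustrative assumptions, not our actual evaluation data:

```python
# Minimal Elo-rating sketch for side-by-side "battles" between model
# outputs. K-factor, initial ratings, and the battle list are invented
# for illustration.

K = 32  # how much a single battle can move a rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one battle result in place."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

# All models start at the same rating; each battle is (winner, loser).
models = ["claude-sonnet", "llama3-70b", "gpt-4o", "gpt-4o-mini", "llama3-8b"]
ratings = {m: 1000.0 for m in models}
battles = [
    ("gpt-4o", "claude-sonnet"),
    ("llama3-70b", "claude-sonnet"),
    ("claude-sonnet", "llama3-8b"),
]
for w, l in battles:
    update(ratings, w, l)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update transfers rating points from loser to winner, the total stays constant and the final ordering reflects who won against whom, not just raw win counts.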


Fine-tuning didn't just improve the feedback; it transformed the whole service:

  • Quality: It delivered more critical, actionable feedback.

  • Tone: The fine-tuned models struck a better balance between encouragement and necessary critique.

  • Speed/Cost: Smaller fine-tuned models like GPT-4o-mini process essays faster and at roughly a tenth of the cost of running the existing service.




This project was a showcase of what's possible with Aligned's expert-driven fine-tuning platform.


What Sets Aligned Apart?


  1. Expert-Driven Data: We don't just collect data; we curate it, tapping into a network of domain experts for the highest quality training data.

  2. Streamlined Process: From initial assessment to final testing, our platform and team guide you through each step of the fine-tuning journey.

  3. Measurable Results: We provide clear, quantifiable improvements in AI performance, just like we did for GradPilot.


Is Your AI Hitting a Wall?


If GradPilot's story sounds familiar, you're not alone. Many organizations find their AI is good but not great, needing more nuanced, domain-specific responses while facing cost or speed concerns.


Don't let your service plateau when it could soar. Reach out to us at Aligned, and let's explore how we can take your models to the next level with high-quality data from our domain experts.
