Optimize a simple LLM app
Introduction
In this guide, we will walk you through the process of creating and optimizing a simple LLM app using HoneyHive. Specifically, we’ll demonstrate how to build an app that generates sales emails based on user input, including variables like topic and tone. A similar workflow can be used for tasks like data transformation, writing assistants, code suggestions, etc.
Setup HoneyHive & integrate your model provider
If you haven’t already done so, then the first thing you will need to do is create a HoneyHive project.
After creating the project, you can add your model provider API key on the Integrations page.
We currently support the following model providers natively:
- OpenAI
- Anthropic
- AWS Bedrock
- Google Vertex AI
- Azure OpenAI
- HuggingFace
Once you have created a HoneyHive project and set up your model integration, you can start iterating on your prompts.
Prototyping in the Playground
The first step is to prototype your prompt in the Playground to create a simple prompt template that instructs the model to generate sales emails.
Configuring your prompt template
- First we pick our model provider and model on the left pane.
- Then we can create a prompt in the Chat interface. Use {{ }} double curly brackets to specify input variables in prompt templates.
Testing your prompt
- After creating the prompt, we can test it by providing it input values in the Input sections.
- Hit the Run button to generate the output.
As you can see from the response above, the LLM is able to generate a response that is relevant to the input, albeit with some issues.
You can continue tweaking the prompt here until you are satisfied with the results. We automatically track all unique versions so you can always roll back.
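For reference, here is a rough sketch of what the Playground does with a {{ }}-templated prompt: substitute the input variables, then call the model. The prompt text, variable values, and model name below are illustrative, not the exact configuration shown above.

```python
# Minimal sketch (outside HoneyHive) of rendering a {{ }} prompt template
# and sending it to OpenAI. Prompt, inputs, and model are illustrative.
import re
from openai import OpenAI

template = (
    "You are a sales assistant. Write a short sales email about {{topic}} "
    "in a {{tone}} tone."
)
inputs = {"topic": "our new analytics dashboard", "tone": "friendly"}

# Replace each {{variable}} with its input value.
prompt = re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: inputs[m.group(1)], template)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```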
Sharing your prompt
- Hit the Save button next to Share. This will save the prompt to your project.
- Optionally, you can share the prompt with your team by clicking on the Share button and copying the link to the template.
- If you are viewing a saved prompt and want to change it, hit the Fork button next to Save to create your own variant and iterate on it.
Evaluating your prompt
Once you have a prompt that is working well, you can evaluate and optimize it to get the best results.
Open your prompt in the Evaluation UX
- Hit the Prompts button on the header and go to the library page.
- Here you can select your prompt and hit the Evaluate button to start optimizing it.
Provide data points to evaluate your prompt over
- Hit the + Test Case button to manually enter data points to evaluate your prompt over.
- Optionally, upload a dataset of type jsonl (a sketch of such a file is shown below) or select a dataset someone has already uploaded.
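If you take the upload route, a jsonl dataset is simply one JSON object per line, with each object supplying values for your prompt’s input variables. The field names below mirror the topic and tone variables from our template and are illustrative; check the dataset documentation for the exact schema HoneyHive expects.

```python
# Sketch of building a small jsonl dataset of test cases for this prompt.
# Field names ("topic", "tone") mirror the template's input variables and
# are illustrative; confirm the exact schema in the dataset docs.
import json

test_cases = [
    {"topic": "a new CRM integration", "tone": "professional"},
    {"topic": "an upcoming webinar", "tone": "casual"},
    {"topic": "a limited-time discount", "tone": "urgent"},
]

with open("sales_email_test_cases.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```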
Hit Next to continue.
Select metrics to evaluate your prompt with
- Pick some of the out-of-the-box metrics we provide for evaluation.
- Optionally, create your own Python metrics or AI feedback functions to evaluate your prompt with.
Here I have picked some basic metrics like cost, latency, and response length, along with a GPT-4 evaluator for conciseness and a Python metric that uses the Moderation Filter to flag responses.
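As a rough sketch, a moderation-based Python metric like the one described above could be written along these lines. The function name and signature are assumptions for illustration; adapt them to the interface shown in HoneyHive’s metric editor.

```python
# Sketch of a custom Python metric that flags responses using OpenAI's
# moderation endpoint. The function name and signature are illustrative
# assumptions, not the exact interface the metric editor expects.
from openai import OpenAI

client = OpenAI()

def moderation_flag(completion: str) -> bool:
    """Return True if the model response is flagged by the moderation filter."""
    result = client.moderations.create(input=completion)
    return result.results[0].flagged
```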
Hit Next to run your evaluation!
Review your evaluation results
- Quickly review the model completions to get a sense of how well your prompt is doing.
- Optionally provide feedback on the model completions to improve the model down the line.
- Add more test cases at the bottom to evaluate your prompt over more data points if needed.
Iterate on your prompt
In this evaluation, the model responses are getting cut off abruptly. So, let’s start tweaking the prompt directly in the Evaluation UX.
- Open the Configs accordion and hit the Fork icon under Actions.
- Increase the Max Tokens inside the hyperparameters to 512.
- Hit Save.
- Now hit the re-run button next to Results to run the prompts once again.
You can continue forking and tweaking the prompt here until you are satisfied with the results.
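For context, the Max Tokens hyperparameter corresponds to the max_tokens parameter on the underlying completion call; raising it gives the model room to finish the email instead of cutting off mid-sentence. A minimal equivalent outside HoneyHive, with an illustrative model and prompt:

```python
# Equivalent of raising Max Tokens to 512 on a raw OpenAI call; abruptly
# truncated outputs are usually a sign that this limit is set too low.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # illustrative; use the model from your forked config
    messages=[{
        "role": "user",
        "content": "Write a short sales email about our new analytics "
                   "dashboard in a friendly tone.",
    }],
    max_tokens=512,  # previously lower, which caused the cut-off emails
)
print(response.choices[0].message.content)
```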
Share your evaluation
- Hit the Share button on the top right to share your evaluation with your team.
- You can drop comments in the chat section to discuss the evaluation.
Ship to production & start logging data from your app
Once you are satisfied with your prompt in evaluation, you can ship it to production. This is a critical step for learning what types of inputs your prompt is expected to handle. Refer to our LLM request logging guide to learn how to log LLM requests and capture user feedback.
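The exact SDK calls are covered in the logging guide; as a hypothetical sketch, the information worth capturing for each production request looks something like this. The log_to_honeyhive helper below is a stand-in for whatever client the guide has you use, and the model and prompt are illustrative.

```python
# Hypothetical sketch of per-request logging in production: capture the
# inputs, rendered prompt, model output, and latency, then hand the record
# to HoneyHive. log_to_honeyhive() is a placeholder, not the real SDK call.
import time
from openai import OpenAI

client = OpenAI()

def log_to_honeyhive(record: dict) -> None:
    # Placeholder: replace with the client from the LLM request logging guide.
    print(record)

def generate_sales_email(topic: str, tone: str) -> str:
    prompt = f"Write a short sales email about {topic} in a {tone} tone."
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    output = response.choices[0].message.content
    log_to_honeyhive({
        "inputs": {"topic": topic, "tone": tone},
        "prompt": prompt,
        "output": output,
        "model": "gpt-4",
        "latency_ms": int((time.time() - start) * 1000),
        # User feedback (thumbs up/down, edits) can be attached later.
    })
    return output
```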
Curate a new evaluation dataset
Once you have collected enough data, you can curate a new evaluation dataset to evaluate your prompt with.
- Go to the Discover page to explore your data, and filter it based on version, any performance metrics, or user feedback properties to find underperforming data samples.
- Click on any model completion in the table below to inspect the request and model response.
- If you find a model completion that is underperforming, you can quickly provide the Ground Truth, i.e. a correction to the model response, and add it to your evaluation dataset via the Add To Dataset button in the corner (see the example after these steps). Here I can create a new dataset or add it to an existing one.
- Hit Create new... under Evaluation datasets to create a new dataset.
- Set the dataset name & description as follows.
- Hit Create to create the dataset.
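Once a correction is added, the resulting data point pairs the original inputs with your Ground Truth response. The sketch below shows what such a record might look like if you assembled it by hand; the field names are illustrative, since the Add To Dataset button builds the record for you.

```python
# Illustrative shape of a corrected data point in the new evaluation dataset:
# the original inputs plus the Ground Truth correction. Field names are
# assumptions; the Add To Dataset button handles this for you in the UI.
import json

corrected_case = {
    "topic": "a limited-time discount",
    "tone": "urgent",
    "ground_truth": "Hi there,\n\nOur 20% launch discount ends this Friday...",
}

with open("sales_email_eval_v2.jsonl", "a") as f:
    f.write(json.dumps(corrected_case) + "\n")
```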
Re-evaluate your prompt
Now you can re-evaluate your prompt with the new dataset you have created to validate if performance is indeed improving.
Curate a fine-tuning dataset
Over time, as you keep improving your prompt and logging more user-annotated data, you can create a fine-tuning dataset to fine-tune your LLM with.
Repeat the steps outlined above in the Curate a new evaluation dataset section to create a new fine-tuning dataset, just picking the Fine-tuning option instead of Evaluation.
Export your fine-tuning dataset
- Go to the Fine-tuning datasets page and select the dataset you created.
- You can export the dataset by clicking the download icon on the right.
Fine-tune your LLM
- You can fine-tune OpenAI models via our platform or go with one of our recommended fine-tuning partners.
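If you go the OpenAI route yourself, kicking off a fine-tuning job from an exported dataset looks roughly like this. It assumes the exported jsonl has been converted to OpenAI’s chat fine-tuning format (one messages object per line); the file name and base model are placeholders.

```python
# Sketch of starting an OpenAI fine-tuning job from an exported dataset.
# Assumes the jsonl has been converted to OpenAI's chat fine-tuning format
# ({"messages": [...]} per line); file name and model are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the training file, then create the fine-tuning job.
training_file = client.files.create(
    file=open("sales_email_finetune.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```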
Conclusion
In this guide, we showed you how to prototype and optimize your LLM with HoneyHive. We also showed you how to curate a new evaluation dataset and fine-tuning dataset to improve your LLM over time.
As you keep trying new variants and logging data to HoneyHive, you can continually explore your data, compare data slices using Group by, and filter your charts based on relevant metrics, user feedback, user properties, or any other metadata fields.