Introduction

In this guide, we will walk you through the process of creating and optimizing a simple LLM app using HoneyHive. Specifically, we’ll demonstrate how to build an app that generates sales emails based on user input, including variables like topic and tone. A similar workflow can be used for tasks like data transformation, writing assistants, code suggestions, etc.

Set up HoneyHive & integrate your model provider

If you haven’t already done so, the first thing you will need to do is create a HoneyHive project.

After creating the project, you can add your model provider API key on the Integrations page.

We currently support the following model providers natively:

  1. OpenAI
  2. Anthropic
  3. AWS Bedrock
  4. Google Vertex AI
  5. Azure OpenAI
  6. HuggingFace

For enterprise users, we also support any custom model provider that is compatible with the OpenAI chat or completions API.
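
For illustration, "compatible with the OpenAI chat or completions API" means the provider can be called with the standard OpenAI client simply by pointing it at a different base URL. A minimal sketch in Python (the endpoint, key, and model name below are placeholders, not real values):

```python
from openai import OpenAI

# Any provider that speaks the OpenAI chat/completions protocol can be
# reached with the standard client by overriding the base URL.
# The URL, API key, and model name are placeholders for your own provider.
client = OpenAI(
    base_url="https://my-custom-provider.example.com/v1",
    api_key="MY_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="my-custom-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```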

Once you have created a HoneyHive project and set up your model integration, you can start iterating on your prompts.

Prototyping in the Playground

The first step is to prototype your prompt in the Playground to create a simple prompt template that instructs the model to generate sales emails.

Configuring your prompt template

  1. First we pick our model provider and model on the left pane
  2. Then we can create a prompt in the Chat interface

HoneyHive uses double curly brackets ({{ }}) to specify input variables in prompt templates.

playground-step-1
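
To make the template concrete, here is a rough sketch of what our sales-email prompt and its inputs might look like, rendered locally in Python. The wording and the variable names (topic, tone) are just an example, not the exact template shown in the screenshots:

```python
# Illustrative prompt template using HoneyHive's {{variable}} syntax.
template = (
    "You are a sales assistant. Write a short sales email about {{topic}} "
    "in a {{tone}} tone. Keep it under 150 words."
)

inputs = {"topic": "our new analytics dashboard", "tone": "friendly"}

# Simple local rendering for illustration; in the Playground, HoneyHive
# fills these variables in from the Inputs you provide.
prompt = template
for name, value in inputs.items():
    prompt = prompt.replace("{{" + name + "}}", value)

print(prompt)
```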

Testing your prompt

  1. After creating the prompt, we can test it by providing input values in the Input sections
  2. Hit the Run button to generate the output

playground-step-1-1

As the response above shows, the LLM is able to generate output that is relevant to the input, albeit with some issues.

You can continue tweaking the prompt here until you are satisfied with the results. We automatically track all unique versions so you can always roll back.

Sharing your prompt

  1. Hit the Save button next to Share. This will save the prompt to your project.
  2. Optionally you can share the prompt with your team by clicking on the Share button and copying the link to the template.
  3. If you are viewing a saved prompt and want to change it, hit the Fork button next to Save to create your own variant and iterate on it.

playground-step-3

Evaluating your prompt

Once you have a prompt that is working well, you can evaluate and optimize it to get the best results.

Open your prompt in the Evaluation UX

  1. Hit the Prompts button on the header and go to the library page.
  2. Here you can select your prompt and hit the Evaluate button to start optimizing it.

playground-step-2

Provide data points to evaluate your prompt over

  1. Hit the + Test Case button to manually enter data points to evaluate your prompt over.
  2. Optionally, upload a JSONL dataset (see the sketch below) or select a dataset someone has already uploaded.

evaluation-step-2
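
If you go the upload route, each line of the JSONL file is one test case whose keys match the input variables in your prompt template. A minimal sketch, assuming the topic and tone variables from earlier (the values are made-up examples):

```python
import json

# Each line of the JSONL file is one test case; the keys should match the
# input variables in your prompt template (topic and tone in our example).
test_cases = [
    {"topic": "our new analytics dashboard", "tone": "friendly"},
    {"topic": "a limited-time enterprise discount", "tone": "formal"},
    {"topic": "a webinar on prompt engineering", "tone": "playful"},
]

with open("sales_email_test_cases.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```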

Hit Next to continue.

Select metrics to evaluate your prompt with

  1. Pick some of the out-of-the-box metrics we provide for evaluation
  2. Optionally, create your own Python metrics or AI feedback functions to evaluate your prompt with.

evaluation-step-3

Here I have picked some basic metrics like cost, latency, and response length, along with a GPT-4 evaluator for conciseness and a Python metric that uses the Moderation Filter to flag responses.
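
As a rough sketch of what such a Python metric could look like, the function below flags a completion using OpenAI's moderation endpoint. Treat it as illustrative logic only; the exact function signature and return format HoneyHive expects for custom metrics may differ:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def moderation_flag(completion: str) -> bool:
    """Return True if the model response trips OpenAI's moderation filter.

    Illustrative only: the signature and return type HoneyHive expects
    for custom Python metrics may differ from this sketch.
    """
    result = client.moderations.create(input=completion)
    return result.results[0].flagged
```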

Hit Next to run your evaluation!

Review your evaluation results

  1. Quickly review the model completions to get a sense of how well your prompt is doing.
  2. Optionally provide feedback on the model completions to improve the model down the line.
  3. Add more test cases at the bottom to evaluate your prompt over more data points if needed.

evaluation-step-4

Iterate on your prompt

In this evaluation, the model responses are getting cut off abruptly. So, let’s start tweaking the prompt directly in the Evaluation UX.

  1. Open the Configs accordion and hit the Fork icon under Actions
  2. Increase the Max Tokens inside the hyperparameters to 512
  3. Hit Save

evaluation-step-5

  1. Now hit the re-run button next to Results to run the prompts once again.

You can continue forking and tweaking the prompt here until you are satisfied with the results.
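
As an aside on why raising Max Tokens fixes the truncation: when a completion hits the token cap, the API reports it explicitly. A quick sketch of checking this directly against OpenAI (the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# With a low max_tokens the email is likely to be cut off mid-sentence;
# the API reports this as finish_reason == "length".
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short sales email about our new analytics dashboard."}],
    max_tokens=64,
)
choice = response.choices[0]
if choice.finish_reason == "length":
    print("Response was truncated; raise max_tokens (e.g. to 512).")
print(choice.message.content)
```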

Share your evaluation

  1. Hit the Share button on the top right to share your evaluation with your team.
  2. You can drop comments in the chat section to discuss the evaluation.

evaluation-step-6

Ship to production & start logging data from your app

Once you are satisfied with your prompt in evaluation, you can ship it to production. This is a critical step for learning what types of inputs your prompt is expected to handle. Refer to our LLM request logging guide to learn how to log LLM requests and capture user feedback.
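
The logging guide covers the exact SDK calls; purely to give a sense of the kind of information you want to capture per request, here is an illustrative sketch (the field names and structure are hypothetical, not HoneyHive's actual API):

```python
import time

from openai import OpenAI

client = OpenAI()

inputs = {"topic": "our new analytics dashboard", "tone": "friendly"}
prompt = (
    f"Write a short sales email about {inputs['topic']} "
    f"in a {inputs['tone']} tone."
)

start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
latency_ms = (time.time() - start) * 1000

# Hypothetical record of what you would log for each production request:
# prompt version, inputs, completion, latency, and (later) user feedback.
# The actual logging call is described in the LLM request logging guide.
log_record = {
    "project": "sales-email-generator",
    "prompt_version": "v2",
    "inputs": inputs,
    "completion": response.choices[0].message.content,
    "latency_ms": latency_ms,
    "user_feedback": None,  # filled in once the end user rates the email
}
```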

Curate a new evaluation dataset

Once you have collected enough data, you can curate a new evaluation dataset to evaluate your prompt with.

  1. Go to the Discover page to explore your data, and filter it by version, performance metrics, or user feedback properties to find underperforming data samples.
  2. Click on any model completions in the table below to inspect the request and model response.

monitoring-step-1

  1. If you find a model completion that is underperforming, you can quickly provide the Ground Truth, i.e. a correction to the model response, and add it to your evaluation dataset via the Add To Dataset button in the corner. Here I can create a new dataset or add it to an existing one.

monitoring-step-2

  1. Hit Create new... under Evaluation datasets to create a new dataset.
  2. Set the dataset name & description as follows.

monitoring-step-3

  1. Hit Create to create the dataset.

You can also select all the completions in the table after filtering and add them to the dataset in one go.
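
Conceptually, each curated row pairs the original inputs and the underperforming completion with your ground-truth correction. A rough sketch of one such record (the field names and values are illustrative, not HoneyHive's actual schema):

```python
# Illustrative shape of a curated evaluation example: the original inputs,
# the logged completion, and your ground-truth correction.
# Field names and values here are hypothetical, not HoneyHive's schema.
curated_example = {
    "inputs": {"topic": "a limited-time enterprise discount", "tone": "formal"},
    "completion": "Hey!! HUGE discount, buy now before it's gone...",
    "ground_truth": (
        "Dear customer,\n\n"
        "For a limited time, we are offering enterprise customers a 20% "
        "discount on annual plans..."
    ),
}
```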

Re-evaluate your prompt

Now you can re-evaluate your prompt with the new dataset you have created to validate if performance is indeed improving.

Curate a fine-tuning dataset

Over time, as you keep improving your prompt and logging more user-annotated data, you can create a fine-tuning dataset to fine-tune your LLM with.

Repeat the steps outlined above in the Curate a new evaluation dataset section to create a new fine-tuning dataset, this time picking the Fine-tuning option instead of Evaluation.

Export your fine-tuning dataset

  1. Go to the Fine-tuning datasets page and select the dataset you created
  2. You can export the dataset by clicking the download icon on the right.

finetuning-step-1

Fine-tune your LLM

  1. You can fine-tune OpenAI models via our platform or go with one of our recommended fine-tuning partners.
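
If you fine-tune an OpenAI chat model yourself, the exported examples need to end up in OpenAI's chat fine-tuning JSONL format. A rough sketch of the conversion and job kickoff, assuming each exported record contains the inputs and a ground-truth completion (the exported field names are assumptions; adjust them to match the file you downloaded):

```python
import json

from openai import OpenAI

client = OpenAI()

# Convert exported records into OpenAI's chat fine-tuning format.
# The field names ("inputs", "ground_truth") are assumptions about the
# export; adjust them to match the file you downloaded from HoneyHive.
with open("sales_email_finetune_export.jsonl") as src, \
        open("openai_finetune.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        prompt = (
            f"Write a short sales email about {record['inputs']['topic']} "
            f"in a {record['inputs']['tone']} tone."
        )
        dst.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": record["ground_truth"]},
            ]
        }) + "\n")

# Upload the training file and start a fine-tuning job.
training_file = client.files.create(
    file=open("openai_finetune.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)
print(job.id)
```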

Conclusion

In this guide, we showed you how to prototype and optimize your LLM with HoneyHive. We also showed you how to curate a new evaluation dataset and fine-tuning dataset to improve your LLM over time.

As you keep trying new variants and logging data to HoneyHive, you can continually explore your data, compare data slices using Group by, and filter your charts by relevant metrics, user feedback, user properties, or any other metadata fields.

monitoring-step-4