Evaluations can be run either within the UX or logged programmatically via the SDK. To get familiar, let’s start by running an evaluation in HoneyHive’s webapp.
Running evaluations in the UX
- Accessing the Evaluations section: Navigate to the Evaluations tab in the left sidebar.
- Selecting configs: Select the versions you’d like to evaluate. For this tutorial, let’s compare two variants that use different models: claude-instant-v1.1 vs. text-davinci-003.
- Defining test cases: Select a pre-existing dataset or upload test cases. Alternatively, you can synthetically generate test cases by providing few-shot examples. In this example, we’ll use a single test case with our input variables (tone and topic) and our expected output (ground truth).
- Selecting evaluation metrics: Next, let’s select some metrics to evaluate our prompt templates against. The custom metric that we defined earlier can be found here.
- Running the evaluation: Click Run Comparison to run your evaluation and analyze the evaluation report.
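To make the steps above concrete, here is a minimal sketch of what a test case with input variables (tone, topic) and a ground truth might look like, along with a toy custom metric. The field names and the metric are illustrative assumptions for this tutorial, not HoneyHive’s exact schema:

```python
# A hypothetical test case for the tutorial's prompt template.
# Field names ("inputs", "ground_truth") are illustrative, not
# necessarily HoneyHive's exact schema.
test_case = {
    "inputs": {"tone": "witty", "topic": "remote work"},
    "ground_truth": "A witty one-liner about remote work.",
}

def fits_in_tweet(completion: str) -> float:
    """Toy custom metric: 1.0 if the completion fits in a tweet, else 0.0."""
    return 1.0 if len(completion) <= 280 else 0.0

print(fits_in_tweet("Remote work: where 'commuting' means walking to the fridge."))
```

A custom metric like this is just a function from a model output to a score; in the webapp you would select it from the metrics you defined earlier.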
Running evaluations programmatically
Evaluating simple prompt variants along with external tools like Pinecone can be done via the UX, as described in this tutorial. That said, production LLM pipelines often involve multiple steps, LLM chains, and external tools working together to deliver the final output.
To support complex pipelines, we allow developers to log evaluation runs programmatically via the SDK.
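The shape of a programmatic evaluation run can be sketched as follows. Note that `run_pipeline`, `evaluate`, and the logged record fields below are stand-ins for illustration, not HoneyHive’s actual SDK interface; in practice each record would be logged to HoneyHive via the SDK:

```python
import time

def run_pipeline(inputs: dict) -> str:
    """Stand-in for a multi-step LLM pipeline (chains + external tools)."""
    return f"A {inputs['tone']} tweet about {inputs['topic']}."

def evaluate(test_cases: list, metric) -> list:
    """Run each test case through the pipeline and record a metric score.

    Hypothetical helper: the record shape here is an assumption, not
    HoneyHive's schema.
    """
    results = []
    for case in test_cases:
        start = time.time()
        output = run_pipeline(case["inputs"])
        results.append({
            "inputs": case["inputs"],
            "output": output,
            "score": metric(output, case.get("ground_truth")),
            "latency_s": time.time() - start,
        })
    return results  # in practice, log these records to HoneyHive via the SDK

cases = [{"inputs": {"tone": "witty", "topic": "remote work"},
          "ground_truth": "A witty one-liner about remote work."}]
report = evaluate(cases, lambda out, truth: 1.0 if len(out) <= 280 else 0.0)
print(report[0]["output"])
```

The key point is that the pipeline under evaluation can be arbitrary code; the SDK only needs the inputs, outputs, and metric scores for each run.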
Logging Evaluation Runs via the SDK
How to programmatically run evaluations and log runs in HoneyHive.