Running Evaluations via SDK
Logging evaluation runs programmatically with the Python SDK
Initialize your evaluation run
Let’s start by running a quick experiment to test whether including certain instructions in the prompt actually affects outputs across two different prompt variants. For this example, we’ll compare a prompt with explicit instructions to output concise sales emails against another prompt with no such instructions. We’ll evaluate the results using the Conciseness metric, which is available out-of-the-box in HoneyHive.
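Here’s a minimal sketch of initializing the run. The import path, the `HoneyHiveEvaluation` class, its constructor arguments, and the project and run names below are assumptions based on this walkthrough — verify them against the SDK reference for your installed version.

```python
# Import path and class name are assumed; check the SDK reference.
from honeyhive.sdk.evaluations import HoneyHiveEvaluation

# Create an evaluation run that will group all logged test cases and
# configurations together under one named run in HoneyHive.
evaluation = HoneyHiveEvaluation(
    api_key="YOUR_HONEYHIVE_API_KEY",          # placeholder credential
    project="Sales Email Generator",            # illustrative project name
    name="Conciseness Instruction Test",
    description="Prompt with explicit conciseness instructions vs. a baseline prompt",
    metrics=["Conciseness"],                     # metric defined in HoneyHive
)
```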
Define test cases and pipeline configuration
Now that you’ve initialized the evaluation run, define your test cases and prompt-model configurations for this evaluation.
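For instance, test cases can be plain dictionaries of template inputs, and each configuration bundles a prompt variant with its model settings. The field names below (`prompt_template`, `hyperparameters`, and so on) are illustrative choices for this walkthrough rather than a schema required by the SDK.

```python
# Test cases: one dict of template inputs per example to evaluate.
dataset = [
    {"product": "AI sales assistant", "recipient_role": "VP of Sales"},
    {"product": "Analytics dashboard", "recipient_role": "Head of Marketing"},
]

# Two prompt-model configurations to compare: one with an explicit
# conciseness instruction, one without.
configs = [
    {
        "name": "with-conciseness-instruction",
        "model": "gpt-4o-mini",
        "prompt_template": (
            "Write a sales email about {product} to a {recipient_role}. "
            "Keep it under 75 words and avoid filler."
        ),
        "hyperparameters": {"temperature": 0.7, "max_tokens": 256},
    },
    {
        "name": "baseline",
        "model": "gpt-4o-mini",
        "prompt_template": "Write a sales email about {product} to a {recipient_role}.",
        "hyperparameters": {"temperature": 0.7, "max_tokens": 256},
    },
]
```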
You can also add fields like `chunk_size`, `chunk_overlap`, `vectorDB_index_name`, etc. to your configs when running evaluations via the SDK. This can help you compare different vector databases, indexes, chunking sizes, and other variations in your LLM pipeline.
Run your LLM pipelines
Now that you’ve defined your test cases, run your LLM pipelines to start logging evaluation results.
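The sketch below runs every test case through every configuration with the OpenAI client and logs each completion against the run. The `log_run` method and its argument names are assumptions drawn from this walkthrough, not a confirmed SDK signature — check the SDK reference for the exact logging call.

```python
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Run each test case through each prompt-model configuration and log the result.
for config in configs:
    for inputs in dataset:
        prompt = config["prompt_template"].format(**inputs)
        response = openai_client.chat.completions.create(
            model=config["model"],
            messages=[{"role": "user", "content": prompt}],
            **config["hyperparameters"],
        )
        completion = response.choices[0].message.content

        # Assumed logging method: associates the inputs, the config that
        # produced the output, and the completion with this evaluation run.
        evaluation.log_run(
            inputs=inputs,
            config=config,
            completion=completion,
        )
```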
Analyze results and share learnings
Congrats! You’ve successfully run your first evaluation with HoneyHive! You can now view and start analyzing your evaluation results.
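Once all pipelines have been logged, close out the run so the results are available in the HoneyHive dashboard. The `finish()` method name below is an assumption; the exact call for completing and sharing a run may differ in the SDK.

```python
# Mark the evaluation run as complete so results appear in the dashboard
# and can be shared with your team. Method name assumed for this sketch.
evaluation.finish()
```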
From the results, it looks like adding instructions to make the outputs more concise not only makes the outputs more concise, but also reduces hallucination.