Running evaluations is a natural extension of HoneyHive's tracing capabilities. We recommend going through the tracing quickstart before proceeding with this guide.

What is an Evaluation Run?

An evaluation run in HoneyHive is a group of related sessions that share the same metadata.run_id value.

When an evaluation run is linked to a dataset, it can be compared against other runs on the same dataset.

This flexibility lets you compare the performance of your application across any configuration dimension you want to test: models, chunking strategies, vector databases, prompts, and so on.
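
As a rough illustration of the grouping described above, the sketch below models sessions as plain objects and buckets them by metadata.run_id. The Session type and the groupByRunId helper are hypothetical and are not part of the HoneyHive SDK; they only mirror the idea that a run is the set of sessions sharing one run_id.

```typescript
// Hypothetical in-memory model of a session; not an SDK type.
interface Session {
  sessionId: string;
  metadata: { run_id?: string; datapoint_id?: string };
}

// Bucket sessions by metadata.run_id; sessions without a run_id are skipped.
function groupByRunId(sessions: Session[]): Map<string, Session[]> {
  const runs = new Map<string, Session[]>();
  for (const session of sessions) {
    const runId = session.metadata.run_id;
    if (runId === undefined) continue;
    const bucket = runs.get(runId) ?? [];
    bucket.push(session);
    runs.set(runId, bucket);
  }
  return runs;
}

const exampleSessions: Session[] = [
  { sessionId: "s1", metadata: { run_id: "run-a" } },
  { sessionId: "s2", metadata: { run_id: "run-a" } },
  { sessionId: "s3", metadata: { run_id: "run-b" } },
  { sessionId: "s4", metadata: {} }, // not part of any evaluation run
];
const exampleRuns = groupByRunId(exampleSessions);
console.log(exampleRuns.get("run-a")?.length); // prints 2
```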

Running an evaluation

Prerequisites

  • You have already created a project in HoneyHive, as explained here.
  • You have an API key for your project, as explained here.
  • You have instrumented your application with the HoneyHive SDK, as explained here.

Expected Time: 5 minutes




To install the SDK, run the following command in your shell.

npm install honeyhive

Initialize the SDK

To initialize the API client, we simply pass the API key to the SDK constructor.

import { HoneyHive } from "honeyhive";

// Create the API client, authenticating with your HoneyHive API key.
const sdk = new HoneyHive({
    bearerAuth: "MY_HONEYHIVE_API_KEY",
});
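
Rather than hard-coding the key, you may prefer to read it from configuration and fail early when it is missing. The following is a minimal sketch of that idea; the requireSetting helper and the HONEYHIVE_API_KEY variable name are assumptions made for illustration, not names defined by the SDK.

```typescript
// Hypothetical helper: look up a required setting and fail fast if it is absent or blank.
function requireSetting(
  env: Record<string, string | undefined>,
  name: string,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required setting: ${name}`);
  }
  return value;
}

// In Node.js you would typically pass process.env instead of this literal.
const exampleEnv = { HONEYHIVE_API_KEY: "hh-example-key" }; // assumed variable name
const apiKeyFromEnv = requireSetting(exampleEnv, "HONEYHIVE_API_KEY");
```

The returned string can then be passed as bearerAuth to the SDK constructor.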

Initialize the evaluation run

To initialize the evaluation run, we need to provide a name for it.

To make evaluation runs comparable, we need to pass the dataset_id to the evaluation run. The docs for creating datasets can be found here.
Keep the returned run_id for future reference.

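// Create the evaluation run; it starts with no sessions attached.
const evalRun = await sdk.runs.createRun({
    project: "MY_HONEYHIVE_PROJECT",
    name: "MY_EVAL_RUN",
    eventIds: [],
    // datasetId: "MY_DATASET_ID" // uncomment to link the run to a dataset
});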

const runId = evalRun.runId;
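
Because the dataset link is optional, it can help to build the request in one place and include datasetId only when you have one. The sketch below assumes a hypothetical CreateRunInput shape that mirrors the fields used in this guide (project, name, eventIds, datasetId); it is not the SDK's own request type.

```typescript
// Hypothetical request shape mirroring the fields used in this guide.
interface CreateRunInput {
  project: string;
  name: string;
  eventIds: string[];
  datasetId?: string;
}

// Build the request, adding datasetId only when the run should be comparable
// against other runs on the same dataset.
function buildCreateRunInput(
  project: string,
  name: string,
  datasetId?: string,
): CreateRunInput {
  const input: CreateRunInput = { project, name, eventIds: [] };
  if (datasetId !== undefined) {
    input.datasetId = datasetId;
  }
  return input;
}

const standaloneRun = buildCreateRunInput("MY_HONEYHIVE_PROJECT", "MY_EVAL_RUN");
const comparableRun = buildCreateRunInput(
  "MY_HONEYHIVE_PROJECT",
  "MY_EVAL_RUN",
  "MY_DATASET_ID",
);
```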

Set up your application tracing

Create a function which traces your application code.

At the end of the function, return the session_id for future reference.

If the relevant execution is not being auto-traced, refer to our custom spans docs to track that as well.

To initialize the tracer, we require 3 key details:

  1. Your HoneyHive API Key
  2. Name of the project to log the trace to
  3. Name for this session, like “Chatbot Session” or “Customer RAG Session”.

To make the datapoints comparable, we need to set the datapoint_id on the tracer metadata.

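import { HoneyHiveTracer } from "honeyhive";

async function my_application() {
    // place the code below at the beginning of your application
    const tracer = await HoneyHiveTracer.init({
      apiKey: MY_HONEYHIVE_API_KEY,
      project: MY_HONEYHIVE_PROJECT,
      source: MY_SOURCE, // e.g. "prod", "dev", etc.
      sessionName: mySessionName,
    });

    await tracer.trace(async () => {
        // your application logic goes here
    });

    // Attach the run and datapoint IDs to the session metadata.
    // Assumption: setMetadata is the tracer method for this; confirm against the SDK reference.
    tracer.setMetadata({
        "run_id": runId,               // required
        "datapoint_id": datapointId    // include to link to datapoints (for comparative evals)
    });
    return tracer.sessionId;
}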
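
To see why the datapoint_id matters for comparison, consider the sketch below: sessions from two runs can only be lined up side by side if each one records which datapoint produced it. The TracedSession type and pairByDatapoint helper are hypothetical and exist only to illustrate the matching; this is not how HoneyHive necessarily implements it.

```typescript
// Hypothetical record of one traced session; not an SDK type.
interface TracedSession {
  sessionId: string;
  runId: string;
  datapointId: string;
}

// Pair sessions from two runs that were produced from the same datapoint.
// Sessions whose datapoint appears in only one run are left out.
function pairByDatapoint(
  baseline: TracedSession[],
  candidate: TracedSession[],
): Array<[TracedSession, TracedSession]> {
  const byDatapoint = new Map<string, TracedSession>();
  for (const session of baseline) {
    byDatapoint.set(session.datapointId, session);
  }
  const pairs: Array<[TracedSession, TracedSession]> = [];
  for (const session of candidate) {
    const match = byDatapoint.get(session.datapointId);
    if (match !== undefined) {
      pairs.push([match, session]);
    }
  }
  return pairs;
}

const baselineRun: TracedSession[] = [
  { sessionId: "s1", runId: "run-a", datapointId: "dp-1" },
  { sessionId: "s2", runId: "run-a", datapointId: "dp-2" },
];
const candidateRun: TracedSession[] = [
  { sessionId: "s3", runId: "run-b", datapointId: "dp-2" },
  { sessionId: "s4", runId: "run-b", datapointId: "dp-3" },
];
const comparablePairs = pairByDatapoint(baselineRun, candidateRun);
```

In this example only dp-2 appears in both runs, so comparablePairs holds a single pair (s2, s3).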

Set up your harness

Now you can loop over any dataset you have and run your application code.

const eventIdsEval = [];

for (const datapoint of dataset) {
    // each call traces one session and returns its session ID
    const sessionId = await my_application();
    eventIdsEval.push(sessionId);
}
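
If a single failing datapoint should not abort the whole evaluation, the loop can be wrapped in a small harness that records failures and keeps going. This is a sketch under that assumption; runHarness, HarnessResult, and fakeApplication are hypothetical names, and fakeApplication stands in for your traced application function.

```typescript
// Hypothetical result shape for one pass over the dataset.
interface HarnessResult<T> {
  sessionIds: string[];
  failures: Array<{ datapoint: T; error: unknown }>;
}

// Run `app` once per datapoint, collecting session IDs and recording failures
// instead of stopping at the first error.
async function runHarness<T>(
  dataset: T[],
  app: (datapoint: T) => Promise<string>,
): Promise<HarnessResult<T>> {
  const result: HarnessResult<T> = { sessionIds: [], failures: [] };
  for (const datapoint of dataset) {
    try {
      result.sessionIds.push(await app(datapoint));
    } catch (error) {
      result.failures.push({ datapoint, error });
    }
  }
  return result;
}

// Stand-in for the traced application: returns a fake session ID.
async function fakeApplication(question: string): Promise<string> {
  if (question === "") {
    throw new Error("empty input");
  }
  return `session-for-${question}`;
}
```

The sessionIds array plays the same role as eventIdsEval in the loop above, while failures tells you which datapoints need a rerun.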

Push your evaluation to HoneyHive

Finally, push the evaluation to HoneyHive.

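import { UpdateRunRequestStatus } from "honeyhive/models/components";

// Assumption: the run ID from createRun is passed as a second argument; confirm the argument order in the SDK reference.
await sdk.runs.updateRun({
    eventIds: eventIdsEval,
    status: UpdateRunRequestStatus.COMPLETED,
}, runId);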

Next Steps

Congratulations! You have successfully run an evaluation in HoneyHive.

To view the evaluation results, head over to Evaluations in HoneyHive.
