A dataset in HoneyHive is a structured collection of datapoints. Think of it as a table where each row represents a specific scenario, interaction, or piece of information relevant to your AI application.
[Screenshot: HoneyHive dataset detail view showing inputs, ground truth, metadata, and related events tabs]
These datasets are fundamental building blocks used for various purposes throughout the AI development lifecycle, including:
  • Running ad-hoc experiments and evaluations to test prompts, models, or configurations.
  • Setting up automated tests within your CI/CD pipeline to catch regressions.
  • Creating curated sets for fine-tuning your language models.

Why Use HoneyHive Datasets?

Managing datasets within HoneyHive offers several advantages:
  • Centralized Management & Collaboration: Provides a single source of truth for your test cases and evaluation data, accessible via both UI and SDK. Makes it easier for teams, including domain experts (like linguists or analysts), to collaborate.
  • Continuous Curation: Refine and expand datasets by filtering and curating directly from your production logs and traces.
  • Integration: Datasets work with HoneyHive’s evaluation framework and can be exported for fine-tuning or external tools.

Use Cases

  • Evaluating specific failure modes or performance aspects of your LLM application
  • Tracking performance across different user segments or input types
  • A/B testing different prompts, models, or RAG configurations
  • Building datasets for fine-tuning models on specific domains
  • Regression testing with benchmark datasets

Dataset Structure

Datapoints and Fields

Each row in a HoneyHive dataset is called a datapoint. A datapoint is composed of multiple fields, which are essentially key-value pairs representing different aspects of that datapoint (e.g., user_query, expected_response, customer_segment).
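Concretely, a datapoint is just a mapping of field names to values. A minimal sketch, using the example field names mentioned above (the values are illustrative):

```python
# A single datapoint: field names map to values.
datapoint = {
    "user_query": "What is your refund policy?",
    "expected_response": "Refunds are available within 30 days of purchase.",
    "customer_segment": "enterprise",
}

# A dataset is simply a collection of such rows.
dataset = [datapoint]
```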

Field Groups

When creating or uploading a dataset, each field must be mapped into one of the following functional groups:
  • Input Fields: These represent the data that will be fed into your application or function during an evaluation run. Examples include user prompts, query parameters, or document snippets for RAG.
  • Ground Truth Fields: These contain the expected or ideal outputs or reference answers for a given input. They are used by evaluators to compare against the actual output of your application. Examples include reference summaries, known correct answers, or ideal classification labels.
  • Chat History Fields: This group is specifically for conversational AI use cases. It holds the sequence of previous messages in a dialogue, providing context for the current turn being evaluated.
  • Metadata Fields: Any field not explicitly mapped as Input, Ground Truth, or Chat History automatically falls into this category. Metadata fields store supplementary information that might be useful for analysis or filtering but isn’t directly used as input or ground truth during evaluation (e.g., source_log_id, timestamp, user_segment).
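The default-to-metadata rule above can be sketched as a small helper: declared fields land in their group, and anything unmapped falls through to metadata. The field names and the grouped structure here are illustrative, not HoneyHive's actual wire format:

```python
# Illustrative raw row: one input, one reference answer, a chat history,
# and two fields left unmapped on purpose.
RAW_ROW = {
    "user_query": "What does clause 4 cover?",
    "reference_answer": "Clause 4 covers liability limits.",
    "chat_history": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ],
    "source_log_id": "log-123",
    "timestamp": "2024-05-01T12:00:00Z",
}

# Declared mappings for this dataset.
INPUT_FIELDS = {"user_query"}
GROUND_TRUTH_FIELDS = {"reference_answer"}
CHAT_HISTORY_FIELDS = {"chat_history"}

def group_fields(row: dict) -> dict:
    """Sort a raw row's fields into the four functional groups."""
    grouped = {"inputs": {}, "ground_truth": {}, "chat_history": {}, "metadata": {}}
    for key, value in row.items():
        if key in INPUT_FIELDS:
            grouped["inputs"][key] = value
        elif key in GROUND_TRUTH_FIELDS:
            grouped["ground_truth"][key] = value
        elif key in CHAT_HISTORY_FIELDS:
            grouped["chat_history"][key] = value
        else:
            # Unmapped fields automatically become metadata.
            grouped["metadata"][key] = value
    return grouped

grouped = group_fields(RAW_ROW)
```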

Creating Datasets

There are several ways to create datasets in HoneyHive:
  • From Production Traces: Filter and select interesting interactions or edge cases directly from your logged production data within the HoneyHive UI to build targeted datasets.
  • Uploading Data via UI: Upload structured files (JSON, JSONL, CSV) directly through the HoneyHive web interface.
  • Uploading Data via SDK: Programmatically create and upload datasets using the HoneyHive Python SDK.
  • In-Code Datasets: Define datasets directly within your evaluation script code (primarily for quick tests or simple use cases, discussed below).
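For the file-upload path, JSONL is often the most convenient of the supported formats: one JSON object per line, one datapoint per object. A minimal sketch of preparing such a file (the field names are illustrative):

```python
import json
import os
import tempfile

# Two example datapoints to upload.
rows = [
    {"user_query": "Translate 'hello' to French", "expected_translation": "Bonjour"},
    {"user_query": "Translate 'world' to French", "expected_translation": "Monde"},
]

# Write one JSON object per line (the JSONL convention).
path = os.path.join(tempfile.gettempdir(), "dataset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Reading it back confirms the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

After uploading a file like this, you map each field (here, user_query and expected_translation) to its functional group in the UI.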

Using Datasets

Primary Use: Experiments

Datasets are most commonly used when running experiments to evaluate your AI application’s performance. You can use either datasets managed within HoneyHive or define them directly in your code.

Managed Datasets (Recommended)

These are datasets created via the UI, SDK, or from traces, and reside within your HoneyHive project. They are identified by a unique dataset_id.
  • Pros: Centralized, collaborative, reusable across experiments.
  • How to use: Create the dataset beforehand (see the Creating Datasets section). Then, pass its dataset_id to the evaluate function.
    import os
    from honeyhive import evaluate
    # Assume function_to_evaluate and evaluators are defined elsewhere
    
    if __name__ == "__main__":
        evaluate(
            function=function_to_evaluate,
            api_key=os.environ["HH_API_KEY"],
            project=os.environ["HH_PROJECT"],
            name="Sample Experiment with Managed Dataset",
            dataset_id="<your-dataset-id>",  # From HoneyHive UI
            evaluators=[...],
        )
    
In-Code Datasets

These datasets are defined as Python lists of dictionaries directly within your evaluation script.
  • Pros: Simple for quick tests, self-contained within code.
  • Cons: Harder to share, manage, version, and reuse; not suitable for large datasets.
  • How to use: Define the list with inputs and ground_truth fields, then pass it via the dataset parameter to evaluate().
    import os
    from honeyhive import evaluate
    # Assume function_to_evaluate and evaluators are defined elsewhere
    
    dataset = [
        {
            "inputs": {"prompt": "Translate 'hello' to French"},
            "ground_truth": {"expected_translation": "Bonjour"}
        },
        {
            "inputs": {"prompt": "Translate 'world' to French"},
            "ground_truth": {"expected_translation": "Monde"}
        }
    ]
    
    if __name__ == "__main__":
        evaluate(
            function=function_to_evaluate,
            api_key=os.environ["HH_API_KEY"],
            project=os.environ["HH_PROJECT"],
            name="Sample Experiment with In-Code Dataset",
            dataset=dataset,
            evaluators=[...],
        )
    
    Datasets always have an ID. In the example above, an ID is automatically generated (prefixed with EXT- followed by a hash of the content, e.g., EXT-dc089d82c986a22921e0e773). You can optionally add an id field to each datapoint to use your own identifiers:
    dataset = [
        {"id": "tp-001", "inputs": {"prompt": "..."}, "ground_truth": {"expected": "..."}},
        {"id": "tp-002", "inputs": {"prompt": "..."}, "ground_truth": {"expected": "..."}},
    ]
    
    Custom datapoint IDs appear in the UI prefixed with EXT- (e.g., EXT-tp-001), helping you trace individual rows back to your source data.
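One way to picture the auto-generated IDs: hash the datapoint's content deterministically and prefix the digest with EXT-. The exact hashing scheme HoneyHive uses is not documented here, so the sketch below is an illustration only (SHA-256 truncated to 24 hex characters is an assumption):

```python
import hashlib
import json

def external_id(datapoint: dict) -> str:
    """Illustrative content-derived ID: EXT- plus a short hex digest."""
    # Serialize deterministically so identical content yields identical IDs.
    payload = json.dumps(datapoint, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:24]
    return f"EXT-{digest}"

dp = {
    "inputs": {"prompt": "Translate 'hello' to French"},
    "ground_truth": {"expected_translation": "Bonjour"},
}
# Same content always produces the same ID, so reruns align on the same rows.
```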
When calling evaluate, provide either the dataset_id (for managed datasets) or the dataset parameter (for in-code datasets), but never both.
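The either/or rule amounts to an exclusive-or check on the two parameters. A small sketch of that validation (check_dataset_args is a hypothetical helper, not part of the SDK):

```python
def check_dataset_args(dataset_id=None, dataset=None) -> str:
    """Require exactly one of dataset_id (managed) or dataset (in-code)."""
    if (dataset_id is None) == (dataset is None):
        # Both set, or neither set: ambiguous, so reject.
        raise ValueError("Provide exactly one of dataset_id or dataset.")
    return "managed" if dataset_id is not None else "in-code"
```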

Other Uses

  • Export for fine-tuning language models
  • Regression testing - run experiments in your test suite to catch performance regressions

Exporting Datasets

You can export datasets via the SDK for use outside HoneyHive; see the Export Guide. Common uses include fine-tuning, external evaluation tools, and archiving.