A dataset in HoneyHive is a structured collection of datapoints. Think of it as a table where each row represents a specific scenario, interaction, or piece of information relevant to your AI application.

These datasets are fundamental building blocks used for various purposes throughout the AI development lifecycle, including:

  • Running ad-hoc experiments and evaluations to test prompts, models, or configurations.
  • Setting up automated tests within your CI/CD pipeline to catch regressions.
  • Creating curated sets for fine-tuning your language models.

Why Use HoneyHive Datasets?

Managing datasets within HoneyHive offers several advantages:

  • Centralized Management & Collaboration: Provides a single source of truth for your test cases and evaluation data, making it easier for teams, including domain experts (like linguists or analysts), to work together. Datasets are automatically synced between the UI and SDK, ensuring consistency.
  • Continuous Curation: You can continuously refine and expand your datasets by filtering, labeling (manually or with AI assistance), and curating directly from your production logs and traces, creating valuable proprietary datasets.
  • Seamless Integration: Datasets integrate directly with HoneyHive’s evaluation framework, CI/CD features, and can be easily exported for use in other tools or for fine-tuning.

Use Cases

  • Evaluating specific failure modes or performance aspects of your LLM application.
  • Tracking performance across different user segments or input types.
  • A/B testing different prompts, models, or RAG configurations.
  • Building high-quality datasets for fine-tuning models on specific domains or tasks.
  • Establishing benchmark datasets for regression testing in CI/CD.

Dataset Structure

Datapoints and Fields

Each row in a HoneyHive dataset is called a datapoint. A datapoint is composed of multiple fields, which are essentially key-value pairs representing different aspects of that datapoint (e.g., user_query, expected_response, customer_segment).
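
For example, a single datapoint's fields might look like the following (the field names are illustrative; your own schema can differ):

    # One datapoint: a set of key-value fields (illustrative names)
    datapoint = {
        "user_query": "What is your refund policy?",
        "expected_response": "Refunds are available within 30 days of purchase.",
        "customer_segment": "enterprise",
    }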

Field Groups

When creating or uploading a dataset, each field must be mapped into one of the following functional groups (an illustrative mapping is sketched after this list):

  • Input Fields: These represent the data that will be fed into your application or function during an evaluation run. Examples include user prompts, query parameters, or document snippets for RAG.
  • Ground Truth Fields: These contain the expected or ideal outputs (reference answers) for a given input. Evaluators compare them against the actual output of your application. Examples include reference summaries, known correct answers, or ideal classification labels.
  • Chat History Fields: This group is specifically for conversational AI use cases. It holds the sequence of previous messages in a dialogue, providing context for the current turn being evaluated.
  • Metadata Fields: Any field not explicitly mapped as Input, Ground Truth, or Chat History automatically falls into this category. Metadata fields store supplementary information that might be useful for analysis or filtering but isn’t directly used as input or ground truth during evaluation (e.g., source_log_id, timestamp, user_segment).
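
As a rough sketch, the example datapoint from above might map into these groups as follows. The nested inputs/ground_truths layout mirrors the in-code dataset format shown later on this page; the chat_history and metadata keys are illustrative assumptions, and for uploaded datasets the mapping is configured when you create the dataset:

    # Illustrative grouping of the fields from the earlier example
    datapoint = {
        # Input fields: fed into your application during an evaluation run
        "inputs": {"user_query": "What is your refund policy?"},
        # Ground truth fields: reference answers for evaluators to compare against
        "ground_truths": {"expected_response": "Refunds are available within 30 days of purchase."},
        # Chat history: prior messages for conversational use cases (key name is an assumption)
        "chat_history": [
            {"role": "user", "content": "Hi, I have a billing question."}
        ],
        # Fields not mapped to the groups above are treated as metadata (key name is an assumption)
        "metadata": {"customer_segment": "enterprise"},
    }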

Creating Datasets

There are several ways to create datasets in HoneyHive:

  • From Production Traces: Filter and select interesting interactions or edge cases directly from your logged production data within the HoneyHive UI to build targeted datasets. Learn more.
  • Uploading Data via UI: Upload structured files (JSON, JSONL, CSV) directly through the HoneyHive web interface (a sample upload file is sketched after this list). Learn more.
  • Uploading Data via SDK: Programmatically create and upload datasets using the HoneyHive Python or TypeScript SDKs. Learn more.
  • In-Code Datasets: Define datasets directly within your evaluation script code (primarily for quick tests or simple use cases, discussed below).
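
For file uploads, a minimal sketch of preparing a JSONL file is shown below. The record layout here is illustrative; flat fields like these are mapped to the Input, Ground Truth, and Metadata groups when you upload the file:

    import json

    # Illustrative records; field-to-group mapping happens at upload time
    records = [
        {
            "user_query": "What is your refund policy?",
            "expected_response": "Refunds are available within 30 days of purchase.",
            "customer_segment": "enterprise",
        },
        {
            "user_query": "How do I reset my password?",
            "expected_response": "Use the 'Forgot password' link on the sign-in page.",
            "customer_segment": "self-serve",
        },
    ]

    # Write one JSON object per line (JSONL)
    with open("dataset.jsonl", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")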

Using Datasets

Primary Use: Experiments

Datasets are most commonly used when running experiments to evaluate your AI application’s performance. You can use either datasets managed within HoneyHive or define them directly in your code.

Managed Datasets (Recommended)

These are datasets created via the UI, SDK, or from traces, and reside within your HoneyHive project. They are identified by a unique dataset_id.

  • Pros: Centralized, collaborative, reusable across experiments.

  • How to use: Create the dataset beforehand (see the Creating Datasets section above). Then, pass its dataset_id to the evaluate function.

    from honeyhive import evaluate
    # Assume function_to_evaluate and evaluators are defined elsewhere
    
    if __name__ == "__main__":
        evaluate(
            function=function_to_evaluate,
            hh_api_key='<HONEYHIVE_API_KEY>',
            hh_project='<HONEYHIVE_PROJECT>',
            name='Sample Experiment with Managed Dataset',
            # Pass the ID of your HoneyHive-managed dataset
            dataset_id='<your-dataset-id>',
            evaluators=[...],
            server_url='<HONEYHIVE_SERVER_URL>' # Optional
        )
    

In-Code Datasets

These datasets are defined as Python lists of dictionaries (or TypeScript arrays of objects) directly within your evaluation script.

  • Pros: Simple for quick tests, self-contained within code.

  • Cons: Harder to share, manage, version, and reuse; not suitable for large datasets.

  • How to use: Define the list, ensuring fields are nested under inputs, ground_truths, etc., and pass it via the dataset parameter to evaluate.

    from honeyhive import evaluate
    # Assume function_to_evaluate and evaluators are defined elsewhere

    # Each datapoint nests its fields under inputs, ground_truths, etc.
    dataset = [
        {
            "inputs": {"prompt": "Translate 'hello' to French"},
            "ground_truths": {"expected_translation": "Bonjour"}
        },
        {
            "inputs": {"prompt": "Translate 'world' to French"},
            "ground_truths": {"expected_translation": "Monde"}
        }
        # ... more datapoints
    ]
    
    if __name__ == "__main__":
        evaluate(
            function=function_to_evaluate,
            hh_api_key='<HONEYHIVE_API_KEY>',
            hh_project='<HONEYHIVE_PROJECT>',
            name='Sample Experiment with In-Code Dataset',
            # Pass the list directly
            dataset=dataset,
            evaluators=[...],
            server_url='<HONEYHIVE_SERVER_URL>' # Optional
        )
    

    Datasets always have an ID. For in-code datasets, an ID is automatically generated (prefixed with EXT- followed by a hash of the content, e.g., EXT-dc089d82c986a22921e0e773).

When calling evaluate, provide either the dataset_id (for managed datasets) or the dataset parameter (for in-code datasets), but never both.

Other Uses

While experiments are the primary application, HoneyHive datasets can also be:

  • Exported for fine-tuning language models on your specific data.
  • Used as benchmark sets in CI/CD pipelines to automate quality checks and prevent performance regressions.

Exporting Datasets

You can easily export datasets managed in HoneyHive for use in external processes:

  • How: Use the HoneyHive SDK to programmatically retrieve dataset contents. See Export Guide.
  • Why: Export data for fine-tuning models, running evaluations in custom environments, archiving, or analysis with other tools.
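
As a rough sketch, exporting a managed dataset to a local JSONL file might look like the following. The client class, method, and response attributes below are assumptions for illustration only; consult the Export Guide for the actual SDK calls:

    import json
    from honeyhive import HoneyHive  # assumption: SDK client class

    client = HoneyHive(bearer_auth="<HONEYHIVE_API_KEY>")

    # Hypothetical retrieval call; the real method and parameter names may differ
    result = client.datasets.get_datasets(
        project="<HONEYHIVE_PROJECT>",
        dataset_id="<your-dataset-id>",
    )

    # Assumption: the response exposes the dataset's datapoints as a list;
    # check the Export Guide for the real response shape
    with open("exported_dataset.jsonl", "w") as f:
        for datapoint in result.object.testcases:
            f.write(json.dumps(datapoint) + "\n")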