HoneyHive supports flexible dataset schemas, making it easy to import datasets from Hugging Face or any other data source.

Prerequisites

A HoneyHive account and an API key. The examples below read the key from the HH_API_KEY environment variable.

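The snippet below is an optional sanity check (not part of the SDK) that fails fast when the key is missing; it assumes only the HH_API_KEY variable used throughout this page:

Python
import os

# Fail early with a clear message if the HoneyHive API key is not set.
if not os.environ.get("HH_API_KEY"):
    raise RuntimeError("Set the HH_API_KEY environment variable before running the import.")
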
Import a Dataset

1. Install dependencies

pip install honeyhive datasets
2. Load and import

Python
import os
from datasets import load_dataset
from honeyhive import HoneyHive
from honeyhive.models import (
    CreateDatasetRequest,
    AddDatapointsToDatasetRequest,
    DatapointMapping,
)

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

# Load HuggingFace dataset
hf_dataset = load_dataset("lhoestq/demo1", split="train[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="HuggingFace Demo Dataset",
    description="Imported from lhoestq/demo1",
))
dataset_id = dataset.result.insertedId

# Step 2: Add datapoints in batches with field mapping
batch_size = 100
total = len(hf_dataset)

for i in range(0, total, batch_size):
    batch = hf_dataset[i:i + batch_size]
    rows = [
        {"review": review, "star": star}
        for review, star in zip(batch["review"], batch["star"])
    ]

    client.datasets.add_datapoints(
        dataset_id,
        AddDatapointsToDatasetRequest(
            data=rows,
            mapping=DatapointMapping(
                inputs=["review"],
                ground_truth=["star"],
            )
        )
    )
    print(f"Imported {min(i + batch_size, total)}/{total} datapoints")

print(f"Created dataset with {total} datapoints")

Field Mapping

Use DatapointMapping to map HuggingFace dataset columns to HoneyHive datapoint fields:
HuggingFace columns       DatapointMapping field    Use For
Input columns             inputs                    Data fed to your function
Label/answer columns      ground_truth              Expected outputs for evaluation
Chat history columns      history                   Conversational context
Any columns not listed in the mapping are automatically stored as metadata.
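
For instance, a conversational dataset could map a prior-turns column to history while leaving a tag column out of the mapping so it is kept as metadata. A minimal sketch, assuming hypothetical columns named "question", "messages", "answer", and "topic":

Python
from honeyhive.models import DatapointMapping

# Hypothetical columns: "question" feeds the function, "messages" holds prior
# chat turns, and "answer" is the expected output. "topic" is not mapped, so
# it is stored as metadata on each datapoint.
mapping = DatapointMapping(
    inputs=["question"],
    ground_truth=["answer"],
    history=["messages"],
)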

Example: Q&A Dataset

Python
# For a Q&A dataset with "question", "context", and "answers" columns
hf_dataset = load_dataset("squad", split="train[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="SQuAD Q&A Dataset",
    description="Imported from squad",
))

# Step 2: Flatten answers and add datapoints
rows = [
    {
        "question": row["question"],
        "context": row["context"],
        "answer": row["answers"]["text"][0],
        "source": "squad",
    }
    for row in hf_dataset
]

client.datasets.add_datapoints(
    dataset.result.insertedId,
    AddDatapointsToDatasetRequest(
        data=rows,
        mapping=DatapointMapping(
            inputs=["question", "context"],
            ground_truth=["answer"],
        )
    )
)
# "source" is automatically stored as metadata

Example: Classification Dataset

Python
# For a classification dataset with "text" and "label" columns
hf_dataset = load_dataset("imdb", split="test[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="IMDB Classification Dataset",
    description="Imported from imdb",
))

# Step 2: Add datapoints
rows = [
    {
        "text": row["text"],
        "label": "positive" if row["label"] == 1 else "negative",
    }
    for row in hf_dataset
]

client.datasets.add_datapoints(
    dataset.result.insertedId,
    AddDatapointsToDatasetRequest(
        data=rows,
        mapping=DatapointMapping(
            inputs=["text"],
            ground_truth=["label"],
        )
    )
)

Best Practices

Batch imports: For large datasets (1000+ rows), use the batching pattern from the main example above — split your data into chunks of 100 rows per add_datapoints call to avoid timeouts.
Recommendation         Reason
Start with a subset    Test your mapping with 100 rows before importing the full dataset
Add metadata           Include source information for traceability
Validate fields        Check that your field mapping produces valid datapoints (see the sketch below)
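
A minimal sketch of that last check, assuming the review/star rows built in the main example above; the drop_invalid_rows helper is illustrative and not part of the SDK:

Python
def drop_invalid_rows(rows, required_keys):
    """Keep only rows where every mapped field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(key) not in (None, "") for key in required_keys)
    ]

# Check the fields referenced in DatapointMapping before calling
# add_datapoints, so malformed rows never reach the dataset.
rows = drop_invalid_rows(rows, required_keys=["review", "star"])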

Next Steps

Run Experiments

Evaluate your application using the imported dataset

Export Datasets

Export datasets for external use