HoneyHive supports flexible dataset schemas, making it easy to import datasets from Hugging Face or any other data source.

Prerequisites

A HoneyHive API key, exported as the HH_API_KEY environment variable (the code below reads it from there).

Import a Dataset

1. Install dependencies

pip install "honeyhive>=1.0.0rc0" datasets
2. Load and import

Python
import os
from datasets import load_dataset
from honeyhive import HoneyHive
from honeyhive.models import CreateDatasetRequest, CreateDatapointRequest

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

# Load HuggingFace dataset
hf_dataset = load_dataset("lhoestq/demo1", split="train")

# Create datapoints in batches
datapoint_ids = []
batch_size = 100

for i in range(0, len(hf_dataset), batch_size):
    batch = hf_dataset[i:i + batch_size]
    
    for j in range(len(batch["review"])):
        response = client.datapoints.create(CreateDatapointRequest(
            inputs={"review": batch["review"][j]},
            # Add ground_truth if your HF dataset has labels
            # ground_truth={"label": batch["label"][j]},
        ))
        datapoint_ids.append(response.result["insertedId"])
    
    print(f"Imported {min(i + batch_size, len(hf_dataset))}/{len(hf_dataset)} datapoints")

# Create dataset with all datapoints
dataset = client.datasets.create(CreateDatasetRequest(
    name="HuggingFace Demo Dataset",
    description="Imported from lhoestq/demo1",
    datapoints=datapoint_ids,
))

print(f"Created dataset with {len(datapoint_ids)} datapoints")

Field Mapping

Map HuggingFace dataset columns to HoneyHive datapoint fields:
HuggingFace          | HoneyHive    | Use For
Input columns        | inputs       | Data fed to your function
Label/answer columns | ground_truth | Expected outputs for evaluation
Other columns        | metadata     | Additional context
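
If you want this mapping in one reusable place, a small helper like the sketch below can build the datapoint fields from a row. Note that map_row is a hypothetical helper for illustration, not part of the HoneyHive SDK; the column names in the usage comment come from the examples that follow.

Python
def map_row(row, input_cols, label_cols=(), meta_cols=()):
    """Split one HuggingFace row into HoneyHive datapoint fields."""
    fields = {"inputs": {col: row[col] for col in input_cols}}
    if label_cols:
        fields["ground_truth"] = {col: row[col] for col in label_cols}
    if meta_cols:
        fields["metadata"] = {col: row[col] for col in meta_cols}
    return fields

# e.g. CreateDatapointRequest(**map_row(row, input_cols=["question", "context"], label_cols=["answers"]))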

Example: Q&A Dataset

Python
# For a Q&A dataset with "question" and "answer" columns
hf_dataset = load_dataset("squad", split="train[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"question": row["question"], "context": row["context"]},
        ground_truth={"answer": row["answers"]["text"][0]},
        metadata={"source": "squad"},
    ))
    datapoint_ids.append(response.result["insertedId"])

Example: Classification Dataset

Python
# For a classification dataset with "text" and "label" columns
hf_dataset = load_dataset("imdb", split="test[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"text": row["text"]},
        ground_truth={"label": "positive" if row["label"] == 1 else "negative"},
    ))
    datapoint_ids.append(response.result["insertedId"])
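
As in the first example, finish either import by grouping the collected datapoints into a dataset. The name and description below are placeholders; adjust them to your project.

Python
dataset = client.datasets.create(CreateDatasetRequest(
    name="IMDB Sentiment Sample",
    description="Imported from the imdb test split",
    datapoints=datapoint_ids,
))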

Best Practices

Batch imports: For large datasets (1000+ rows), process in batches of 100 to avoid timeouts and memory issues.
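
The import code above already batches its requests; if you prefer to keep the chunking logic in one place, a minimal helper (hypothetical, not part of the SDK) could look like this:

Python
def batched(dataset, size=100):
    """Yield (start, end) index pairs covering the dataset in chunks of `size`."""
    for start in range(0, len(dataset), size):
        yield start, min(start + size, len(dataset))

# e.g. for start, end in batched(hf_dataset): process hf_dataset[start:end]
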
Recommendation      | Reason
Start with a subset | Test your mapping with 100 rows before importing the full dataset
Add metadata        | Include source information for traceability
Validate fields     | Check that your field mapping produces valid datapoints (see the sketch below)
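
As a pre-flight check for the "Validate fields" recommendation, a small sketch like the following (not part of the SDK) can catch malformed rows before any API calls are made. It reuses the classification example's columns; swap in your own mapping.

Python
def validate_datapoint(inputs, ground_truth=None):
    """Raise if a mapped datapoint is obviously malformed."""
    if not isinstance(inputs, dict) or not inputs:
        raise ValueError("inputs must be a non-empty dict")
    if any(value is None for value in inputs.values()):
        raise ValueError("inputs contains None values")
    if ground_truth is not None and not isinstance(ground_truth, dict):
        raise ValueError("ground_truth must be a dict when provided")

# Spot-check the first 100 rows before running the full import
for row in hf_dataset.select(range(min(100, len(hf_dataset)))):
    validate_datapoint(inputs={"text": row["text"]})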

Next Steps