HoneyHive supports flexible dataset schemas, making it easy to import datasets from Hugging Face or any other data source.

Prerequisites

A HoneyHive API key, exported as the HH_API_KEY environment variable (the code below reads it from there).

Import a Dataset

1. Install dependencies

pip install "honeyhive>=1.0.0rc0" datasets
2. Load and import

Python
import os
from datasets import load_dataset
from honeyhive import HoneyHive
from honeyhive.models import CreateDatasetRequest, CreateDatapointRequest

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

# Load HuggingFace dataset
hf_dataset = load_dataset("lhoestq/demo1", split="train")

# Create datapoints in batches
datapoint_ids = []
batch_size = 100

for i in range(0, len(hf_dataset), batch_size):
    batch = hf_dataset[i:i + batch_size]
    
    for j in range(len(batch["review"])):
        response = client.datapoints.create(CreateDatapointRequest(
            inputs={"review": batch["review"][j]},
            # Add ground_truth if your HF dataset has labels
            # ground_truth={"label": batch["label"][j]},
        ))
        datapoint_ids.append(response.result["insertedId"])
    
    print(f"Imported {min(i + batch_size, len(hf_dataset))}/{len(hf_dataset)} datapoints")

# Create dataset with all datapoints
dataset = client.datasets.create(CreateDatasetRequest(
    name="HuggingFace Demo Dataset",
    description="Imported from lhoestq/demo1",
    datapoints=datapoint_ids,
))

print(f"Created dataset with {len(datapoint_ids)} datapoints")

Field Mapping

Map HuggingFace dataset columns to HoneyHive datapoint fields:
HuggingFace          | HoneyHive    | Use For
Input columns        | inputs       | Data fed to your function
Label/answer columns | ground_truth | Expected outputs for evaluation
Other columns        | metadata     | Additional context
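
If you want this mapping in one reusable place, a small helper like the sketch below can build the datapoint fields from a row. Note that map_row is a hypothetical helper for illustration, not part of the HoneyHive SDK; the column names in the usage comment come from the examples that follow.

Python
def map_row(row, input_cols, label_cols=(), meta_cols=()):
    """Split one HuggingFace row into HoneyHive datapoint fields."""
    fields = {"inputs": {col: row[col] for col in input_cols}}
    if label_cols:
        fields["ground_truth"] = {col: row[col] for col in label_cols}
    if meta_cols:
        fields["metadata"] = {col: row[col] for col in meta_cols}
    return fields

# e.g. CreateDatapointRequest(**map_row(row, input_cols=["question", "context"], label_cols=["answers"]))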

Example: Q&A Dataset

Python
# For a Q&A dataset with "question" and "answer" columns
hf_dataset = load_dataset("squad", split="train[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"question": row["question"], "context": row["context"]},
        ground_truth={"answer": row["answers"]["text"][0]},
        metadata={"source": "squad"},
    ))
    datapoint_ids.append(response.result["insertedId"])

Example: Classification Dataset

Python
# For a classification dataset with "text" and "label" columns
hf_dataset = load_dataset("imdb", split="test[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"text": row["text"]},
        ground_truth={"label": "positive" if row["label"] == 1 else "negative"},
    ))
    datapoint_ids.append(response.result["insertedId"])
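
As in the first example, finish either import by grouping the collected datapoints into a dataset. The name and description below are placeholders; adjust them to your project.

Python
dataset = client.datasets.create(CreateDatasetRequest(
    name="IMDB Sentiment Sample",
    description="Imported from the imdb test split",
    datapoints=datapoint_ids,
))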

Best Practices

Batch imports: For large datasets (1000+ rows), process in batches of 100 to avoid timeouts and memory issues.
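
The import code above already batches its requests; if you prefer to keep the chunking logic in one place, a minimal helper (hypothetical, not part of the SDK) could look like this:

Python
def batched(dataset, size=100):
    """Yield (start, end) index pairs covering the dataset in chunks of `size`."""
    for start in range(0, len(dataset), size):
        yield start, min(start + size, len(dataset))

# e.g. for start, end in batched(hf_dataset): process hf_dataset[start:end]
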
Recommendation      | Reason
Start with a subset | Test your mapping with 100 rows before importing the full dataset
Add metadata        | Include source information for traceability
Validate fields     | Check that your field mapping produces valid datapoints (see the sketch below)
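
As a pre-flight check for the "Validate fields" recommendation, a small sketch like the following (not part of the SDK) can catch malformed rows before any API calls are made. It reuses the classification example's columns; swap in your own mapping.

Python
def validate_datapoint(inputs, ground_truth=None):
    """Raise if a mapped datapoint is obviously malformed."""
    if not isinstance(inputs, dict) or not inputs:
        raise ValueError("inputs must be a non-empty dict")
    if any(value is None for value in inputs.values()):
        raise ValueError("inputs contains None values")
    if ground_truth is not None and not isinstance(ground_truth, dict):
        raise ValueError("ground_truth must be a dict when provided")

# Spot-check the first 100 rows before running the full import
for row in hf_dataset.select(range(min(100, len(hf_dataset)))):
    validate_datapoint(inputs={"text": row["text"]})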

Next Steps