HoneyHive supports flexible dataset schemas, making it easy to import datasets from Hugging Face or any other data source.
## Prerequisites

You'll need a HoneyHive API key.
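The import script below reads the key from the `HH_API_KEY` environment variable, so export it before running (the value shown is a placeholder):

```bash
export HH_API_KEY="your-api-key"
```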
## Import a Dataset

### Install dependencies

```bash
pip install "honeyhive>=1.0.0rc0" datasets
```

### Load and import
```python
import os

from datasets import load_dataset
from honeyhive import HoneyHive
from honeyhive.models import CreateDatasetRequest, CreateDatapointRequest

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

# Load the Hugging Face dataset
hf_dataset = load_dataset("lhoestq/demo1", split="train")

# Create datapoints in batches
datapoint_ids = []
batch_size = 100

for i in range(0, len(hf_dataset), batch_size):
    batch = hf_dataset[i:i + batch_size]  # dict of column name -> list of values
    for j in range(len(batch["review"])):
        response = client.datapoints.create(CreateDatapointRequest(
            inputs={"review": batch["review"][j]},
            # Add ground_truth if your HF dataset has labels
            # ground_truth={"label": batch["label"][j]},
        ))
        datapoint_ids.append(response.result["insertedId"])
    print(f"Imported {min(i + batch_size, len(hf_dataset))}/{len(hf_dataset)} datapoints")

# Create a dataset containing all the datapoints
dataset = client.datasets.create(CreateDatasetRequest(
    name="HuggingFace Demo Dataset",
    description="Imported from lhoestq/demo1",
    datapoints=datapoint_ids,
))
print(f"Created dataset with {len(datapoint_ids)} datapoints")
```
## Field Mapping

Map Hugging Face dataset columns to HoneyHive datapoint fields:

| Hugging Face | HoneyHive | Use For |
|---|---|---|
| Input columns | `inputs` | Data fed to your function |
| Label/answer columns | `ground_truth` | Expected outputs for evaluation |
| Other columns | `metadata` | Additional context |
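As a sketch, here's one way to express that mapping as a reusable helper (the column-name arguments are illustrative, not part of the HoneyHive SDK):

```python
def map_row(row, input_cols, label_cols=(), meta_cols=()):
    """Split a Hugging Face row dict into HoneyHive datapoint fields."""
    return {
        "inputs": {col: row[col] for col in input_cols},
        "ground_truth": {col: row[col] for col in label_cols} or None,
        "metadata": {col: row[col] for col in meta_cols} or None,
    }

# e.g. map_row(row, input_cols=["question", "context"], label_cols=["answers"])
```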
### Example: Q&A Dataset

```python
# For a Q&A dataset with "question" and "answer" columns
hf_dataset = load_dataset("squad", split="train[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"question": row["question"], "context": row["context"]},
        ground_truth={"answer": row["answers"]["text"][0]},
        metadata={"source": "squad"},
    ))
    datapoint_ids.append(response.result["insertedId"])
```
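SQuAD stores its reference answers as a list in `row["answers"]["text"]`; the snippet above keeps only the first. If your evaluator can score against multiple references, a small variation keeps them all (a sketch):

```python
# Keep every reference answer instead of just the first one.
ground_truth = {"answers": list(row["answers"]["text"])}
```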
### Example: Classification Dataset

```python
# For a classification dataset with "text" and "label" columns
hf_dataset = load_dataset("imdb", split="test[:100]")

datapoint_ids = []
for row in hf_dataset:
    response = client.datapoints.create(CreateDatapointRequest(
        inputs={"text": row["text"]},
        ground_truth={"label": "positive" if row["label"] == 1 else "negative"},
    ))
    datapoint_ids.append(response.result["insertedId"])
```
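Rather than hardcoding the integer-to-string mapping, you can resolve label names through the dataset's `ClassLabel` feature (for IMDB the names are `neg` and `pos`):

```python
# Resolve integer labels to their string names via the ClassLabel feature.
label_feature = hf_dataset.features["label"]
ground_truth = {"label": label_feature.int2str(row["label"])}
```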
## Best Practices

**Batch imports:** For large datasets (1,000+ rows), process in batches of 100 to avoid timeouts and memory issues.
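One way to make large imports more resilient is a small retry wrapper around each create call; this is a sketch, and the retry count and backoff values are illustrative assumptions:

```python
import time

def create_with_retry(client, request, retries=3, backoff=2.0):
    """Create a datapoint, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return client.datapoints.create(request)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```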
| Recommendation | Reason |
|---|---|
| Start with a subset | Test your mapping with 100 rows before importing the full dataset |
| Add metadata | Include source information for traceability |
| Validate fields | Check that your field mapping produces valid datapoints (see the sketch below) |
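For example, a lightweight validation pass before each create call might look like this (a sketch; the required shape is an assumption based on the field mapping above):

```python
def validate_datapoint(inputs, ground_truth=None):
    """Catch obviously malformed datapoints before sending them."""
    if not isinstance(inputs, dict) or not inputs:
        raise ValueError("inputs must be a non-empty dict")
    if any(value is None for value in inputs.values()):
        raise ValueError("inputs contains missing values")
    if ground_truth is not None and not isinstance(ground_truth, dict):
        raise ValueError("ground_truth must be a dict")
```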
## Next Steps