HoneyHive supports flexible dataset schemas, making it easy to import datasets from Hugging Face or any other data source.
Prerequisites
A HoneyHive API key, exported in your environment as HH_API_KEY; the examples below read it with os.environ.
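As an optional sanity check (not part of the HoneyHive SDK), you can confirm the key is set before running the examples:

import os
assert os.environ.get("HH_API_KEY"), "Set the HH_API_KEY environment variable first"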
Import a Dataset
Install dependencies
pip install honeyhive datasets
Load and import
import os
from datasets import load_dataset
from honeyhive import HoneyHive
from honeyhive.models import (
    CreateDatasetRequest,
    AddDatapointsToDatasetRequest,
    DatapointMapping,
)

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

# Load the HuggingFace dataset
hf_dataset = load_dataset("lhoestq/demo1", split="train[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="HuggingFace Demo Dataset",
    description="Imported from lhoestq/demo1",
))
dataset_id = dataset.result.insertedId

# Step 2: Add datapoints in batches with field mapping
batch_size = 100
total = len(hf_dataset)
for i in range(0, total, batch_size):
    batch = hf_dataset[i:i + batch_size]
    rows = [
        {"review": review, "star": star}
        for review, star in zip(batch["review"], batch["star"])
    ]
    client.datasets.add_datapoints(
        dataset_id,
        AddDatapointsToDatasetRequest(
            data=rows,
            mapping=DatapointMapping(
                inputs=["review"],
                ground_truth=["star"],
            ),
        ),
    )
    print(f"Imported {min(i + batch_size, total)}/{total} datapoints")

print(f"Created dataset with {total} datapoints")
Field Mapping
Use DatapointMapping to map HuggingFace dataset columns to HoneyHive datapoint fields:
| HuggingFace column | DatapointMapping field | Use for |
| --- | --- | --- |
| Input columns | inputs | Data fed to your function |
| Label/answer columns | ground_truth | Expected outputs for evaluation |
| Chat history columns | history | Conversational context |
Any columns not listed in the mapping are automatically stored as metadata.
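The history field is not demonstrated in the examples below, so here is a minimal sketch of mapping a conversational dataset. The column names and sample row are hypothetical, but the DatapointMapping usage follows the same pattern as the other examples:

# Hypothetical chat dataset: prior turns, the latest user message, and the expected reply
rows = [
    {
        "chat_history": [
            {"role": "user", "content": "Do you ship internationally?"},
            {"role": "assistant", "content": "Yes, to over 40 countries."},
        ],
        "question": "How long does shipping to Canada take?",
        "answer": "Typically 5-7 business days.",
    },
]
client.datasets.add_datapoints(
    dataset.result.insertedId,
    AddDatapointsToDatasetRequest(
        data=rows,
        mapping=DatapointMapping(
            inputs=["question"],
            history=["chat_history"],
            ground_truth=["answer"],
        ),
    ),
)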
Example: Q&A Dataset
# For a Q&A dataset with "question", "context", and "answers" columns
hf_dataset = load_dataset("squad", split="train[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="SQuAD Q&A Dataset",
    description="Imported from squad",
))

# Step 2: Flatten answers and add datapoints
rows = [
    {
        "question": row["question"],
        "context": row["context"],
        "answer": row["answers"]["text"][0],
        "source": "squad",
    }
    for row in hf_dataset
]
client.datasets.add_datapoints(
    dataset.result.insertedId,
    AddDatapointsToDatasetRequest(
        data=rows,
        mapping=DatapointMapping(
            inputs=["question", "context"],
            ground_truth=["answer"],
        ),
    ),
)
# "source" is automatically stored as metadata
Example: Classification Dataset
# For a classification dataset with "text" and "label" columns
hf_dataset = load_dataset("imdb", split="test[:100]")

# Step 1: Create an empty dataset
dataset = client.datasets.create(CreateDatasetRequest(
    name="IMDB Classification Dataset",
    description="Imported from imdb",
))

# Step 2: Add datapoints
rows = [
    {
        "text": row["text"],
        "label": "positive" if row["label"] == 1 else "negative",
    }
    for row in hf_dataset
]
client.datasets.add_datapoints(
    dataset.result.insertedId,
    AddDatapointsToDatasetRequest(
        data=rows,
        mapping=DatapointMapping(
            inputs=["text"],
            ground_truth=["label"],
        ),
    ),
)
Best Practices
Batch imports : For large datasets (1000+ rows), use the batching pattern from the main example above — split your data into chunks of 100 rows per add_datapoints call to avoid timeouts.
| Recommendation | Reason |
| --- | --- |
| Start with a subset | Test your mapping with 100 rows before importing the full dataset |
| Add metadata | Include source information for traceability |
| Validate fields | Check that your field mapping produces valid datapoints (see the sketch below) |
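For the validation step, a minimal pre-flight check like the following can catch mapping mistakes before any datapoints are uploaded; validate_rows is a hypothetical helper, not part of the HoneyHive SDK:

def validate_rows(rows, mapped_columns):
    """Raise if any row is missing a column referenced by the DatapointMapping."""
    for i, row in enumerate(rows):
        for column in mapped_columns:
            if column not in row or row[column] is None:
                raise ValueError(f"Row {i} is missing mapped column {column!r}")

# Columns referenced by the mapping in the main example above
validate_rows(rows, mapped_columns=["review", "star"])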
Next Steps
Run Experiments: Evaluate your application using the imported dataset
Export Datasets: Export datasets for external use