Why curate a dataset

Fine-tuning an LLM requires a dataset that aligns with the specific goals of the application. Here are some key reasons to curate a dataset:

  1. Domain Specificity: To make the LLM more knowledgeable in a particular domain, you need data relevant to that domain.
  2. Customization: If you want the LLM to adopt a specific style or tone, the dataset should reflect that.
  3. Bias Reduction: Curating a dataset allows you to control and reduce biases that might be present in generic datasets.
  4. Task Specificity: For specific tasks, a tailored dataset ensures the LLM is trained on relevant examples.
  5. Data Privacy: Curating your dataset ensures that proprietary or sensitive information is handled appropriately.

To achieve the desired outcomes, it’s essential to curate the dataset with care and precision. Let’s delve into the process.

Steps to curate your dataset

  1. Define Your Goals: Clearly outline what you want to achieve with the fine-tuned model. This will guide your data collection and curation process.

  2. Data Collection:

    • Existing Datasets: Check if there are existing datasets that align with your goals. You’ll likely need to transform your data to fit a specific format (e.g., prompt-and-response pairs).
    • Create New Data: If existing datasets are insufficient, consider labeling new data for your use case, or using sanitized logs from your production traffic for model distillation.
  3. Data Cleaning:

    • Remove Duplicates: Ensure that the dataset doesn’t have repeated entries.
    • Handle Missing Values: Decide whether to impute or remove entries with missing values.
    • Standardize Format: Ensure all data is in a consistent format.
  4. Data Annotation:

    • Manual Annotation: If your dataset requires labels, consider manual annotation. Use domain experts for best results.
    • Annotation Tools: Use tools to streamline the annotation process and ensure consistency.
  5. Data Augmentation: Enhance the dataset by creating variations of the existing data. This can help in increasing the diversity and size of the dataset.

  6. Dataset Splitting: Divide the dataset into training, validation, and test sets. This ensures that you can train and evaluate the model effectively.
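The cleaning and splitting steps above can be sketched in Python. This is a minimal illustration, not part of the HoneyHive SDK; the `prompt`/`response` field names and split fractions are assumptions you should adapt to your own schema.

```python
import random

def clean_and_split(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Deduplicate, standardize, and split prompt/response records."""
    seen = set()
    cleaned = []
    for r in records:
        prompt, response = r.get("prompt"), r.get("response")
        if not prompt or not response:
            continue  # handle missing values by dropping the entry
        key = (prompt.strip(), response.strip())
        if key in seen:
            continue  # remove exact duplicates
        seen.add(key)
        # standardize format: consistent keys, stripped whitespace
        cleaned.append({"prompt": key[0], "response": key[1]})

    # shuffle deterministically, then carve out test and validation sets
    random.Random(seed).shuffle(cleaned)
    n_test = int(len(cleaned) * test_frac)
    n_val = int(len(cleaned) * val_frac)
    return {
        "test": cleaned[:n_test],
        "validation": cleaned[n_test:n_test + n_val],
        "train": cleaned[n_test + n_val:],
    }

raw = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "What is 2+2?", "response": "4"},      # duplicate
    {"prompt": "Capital of France?", "response": ""},  # missing value
    {"prompt": "Define LLM.", "response": "A large language model."},
]
splits = clean_and_split(raw)
```

In practice you would also log how many entries each cleaning rule dropped, so you can spot systematic problems in your collection pipeline.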

Collaborate and Review

Once the dataset is curated, it’s essential to review and collaborate:

  1. Peer Review: Have domain experts or peers review the dataset for accuracy and relevance.
  2. Iterative Refinement: Based on feedback, refine the dataset to better align with your goals.
  3. Documentation: Document the dataset’s sources, annotation guidelines, and any other relevant information. This ensures transparency and reproducibility.
  4. Upload to HoneyHive: Upload the dataset to HoneyHive to share it with stakeholders and collaborators for further insights.

By meticulously curating the dataset and collaborating with experts, you can ensure that the fine-tuning process yields a model that aligns closely with your objectives.

Uploading your dataset to HoneyHive via the UI

Upload the dataset to HoneyHive by going to the “Datasets -> Fine-Tuning” page and clicking the “Add Dataset” button.


Curating your dataset from production logs in HoneyHive

In addition to uploading complete datasets, you can easily add individual datapoints from one dataset to another by selecting the relevant datapoint in the “Completions” table and clicking the “Add to Dataset” button in the top-right.


You can also add a group of datapoints at once: filter your datapoints by some criteria, select all datapoints that match, and add them to a dataset, as shown in the following screenshot.


Uploading your dataset to HoneyHive via the SDK

import honeyhive
from honeyhive.sdk.datasets import create_dataset

honeyhive.api_key = "HONEYHIVE_API_KEY"

# read the dataset file
file = open("data.json", "r")

# upload the dataset
create_dataset(
    file=file,  # parameter names may vary by SDK version
    name="Simple Dataset",
    project="New Project",
    description="A simple dataset for testing"
)
Note that data.json needs to be a list of valid JSON objects. Within each individual JSON object, ground_truth is a reserved field that is used in evaluations for calculating metrics.
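To make the expected format concrete, here is a sketch that writes a valid data.json and checks its shape. The prompt/completion field names are illustrative; only ground_truth is the reserved field the source describes.

```python
import json

# Illustrative records; ground_truth is the reserved field used in
# evaluations for calculating metrics (other keys are up to you)
records = [
    {
        "prompt": "What is the capital of France?",
        "completion": "Paris is the capital of France.",
        "ground_truth": "Paris",
    },
    {
        "prompt": "What is 2 + 2?",
        "completion": "2 + 2 equals 4.",
        "ground_truth": "4",
    },
]

# data.json must be a list of valid JSON objects
with open("data.json", "w") as f:
    json.dump(records, f, indent=2)

# verify the file round-trips as a list of objects
with open("data.json") as f:
    loaded = json.load(f)
assert isinstance(loaded, list)
assert all(isinstance(r, dict) for r in loaded)
```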