Creating and Retrieving Datasets in Acadia

In Acadia, datasets are the central unit for managing and processing data. This guide covers how to create a dataset from a configuration, and how to list and retrieve datasets that already exist.

Creating a Dataset

To create a dataset, first describe it with the DatasetConfig class, which names the dataset, declares its columns, and points to its data source.

Configuration

A dataset configuration includes several parameters:

  • name: A unique identifier for the dataset.
  • description: A text description of what the dataset contains.
  • columns: A list of column names expected in the data.
  • id_column: Specifies which column should be used as a unique identifier for each row.
  • text_preview_column: Specifies which column to use for text previews.
  • image_preview_column: Specifies which column to use for image previews.
  • source: The data source from which the dataset is loaded.
  • topic_tree: An optional TopicTree that organizes the topics associated with the dataset.

The following example configures a dataset backed by a CSV file. It does not set image_preview_column or topic_tree:

from acadia.data_sources import CSVDataSource
from acadia.models import DatasetConfig

# Read the dataset from a CSV file.
data_source = CSVDataSource("path/to/data.csv", sample_size=1000)
config = DatasetConfig(
    name="human_eval",
    description="Test description",
    columns=["id", "caption_0", "caption_1"],
    id_column="id",
    text_preview_column="caption_0",
    source=data_source
)
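
A configuration like the one above only works if its column names agree with the file it points to. The sketch below is not part of Acadia: it uses only the Python standard library, and the helper name check_config_columns is hypothetical. It checks that every configured column appears in the CSV header, that the identifier and preview columns are among the configured columns, and that the identifier column holds no repeated values.

```python
import csv
import io


def check_config_columns(csv_file, columns, id_column, text_preview_column=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []

    # Every configured column should appear in the CSV header.
    for col in columns:
        if col not in header:
            problems.append(f"column {col!r} is missing from the CSV header")

    # The special-purpose columns should be among the configured columns.
    roles = (("id_column", id_column), ("text_preview_column", text_preview_column))
    for role, col in roles:
        if col is not None and col not in columns:
            problems.append(f"{role} {col!r} is not listed in columns")

    # The id column should hold a distinct value on every row.
    if id_column in header:
        seen = set()
        for row in reader:
            value = row[id_column]
            if value in seen:
                problems.append(f"duplicate id {value!r}")
            seen.add(value)
    return problems


# Small in-memory CSV standing in for path/to/data.csv.
sample = io.StringIO("id,caption_0,caption_1\n1,a cat,a dog\n2,a tree,a car\n")
problems = check_config_columns(
    sample,
    columns=["id", "caption_0", "caption_1"],
    id_column="id",
    text_preview_column="caption_0",
)
print(problems or "configuration matches the CSV")
```

For a file on disk, open it with open(path, newline="") and pass the file object in place of the in-memory sample.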

Creating the Dataset

Once the configuration is defined, pass it to the create_dataset function to create the dataset:

import acadia

dataset = acadia.dataset.create_dataset(config)

This function initializes the dataset, loads data based on the specified source, and prepares it for further processing and analysis.

Retrieving Datasets

Listing Datasets

To see which datasets are available, list them:

datasets = acadia.dataset.list_datasets()
print(datasets)

This function retrieves all datasets that have been created and registered within the system.

Getting a Specific Dataset

If you know a dataset's name (the name value from its configuration), you can retrieve it directly:

dataset = acadia.dataset.get_dataset("human_eval")

This function fetches the dataset with the specified name, allowing for direct access to its data and associated operations.
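
Listing and retrieval can be combined into a "get or create" step, so that a script reuses a dataset when it exists and creates it otherwise. The sketch below is only an illustration under stated assumptions: the helper get_or_create is hypothetical, the three functions are passed in as arguments rather than called on Acadia directly, and it assumes each entry returned by the listing function is either a name string or an object with a name attribute, which this guide does not specify. In-memory stand-ins replace the real calls so the example runs on its own.

```python
def get_or_create(name, list_fn, get_fn, create_fn):
    """Fetch the dataset called `name`, creating it first if it is not listed.

    list_fn, get_fn and create_fn are stand-ins for list_datasets,
    get_dataset and a zero-argument call that creates the dataset.
    """
    # Assumption: each listed entry is a name string or has a .name attribute.
    existing = {getattr(entry, "name", entry) for entry in list_fn()}
    if name not in existing:
        create_fn()
    return get_fn(name)


# In-memory stand-ins so the sketch runs without Acadia installed.
registry = {}


def fake_create():
    registry["human_eval"] = {"name": "human_eval", "rows": []}


dataset = get_or_create(
    "human_eval",
    list_fn=lambda: list(registry),
    get_fn=registry.__getitem__,
    create_fn=fake_create,
)
print(dataset["name"])
```

With Acadia, the arguments would presumably be acadia.dataset.list_datasets, acadia.dataset.get_dataset, and a lambda that calls acadia.dataset.create_dataset(config), provided the listing behaves as assumed here.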

Conclusion

Working with datasets in Acadia comes down to three steps: define a DatasetConfig, pass it to create_dataset, and use list_datasets and get_dataset to find and reopen datasets later. A correctly configured dataset is the starting point for Acadia's data processing, analysis, and visualization tasks.