Creating and Retrieving Datasets in Acadia
In Acadia, datasets are central to managing and processing data. This guide covers how to create a dataset from a configuration and how to retrieve existing datasets, either by listing them or by name.
Creating a Dataset
To create a dataset, you must first configure it using the DatasetConfig class, which specifies various settings and parameters.
Configuration
A dataset configuration includes several parameters:
- name: A unique identifier for the dataset.
- description: A text description of what the dataset contains.
- columns: A list of column names expected in the data.
- id_column: Specifies which column should be used as a unique identifier for each row.
- text_preview_column: Specifies which column to use for text previews.
- image_preview_column: Specifies which column to use for image previews.
- source: The data source from which the dataset is loaded.
- topic_tree: Optionally, a TopicTree that organizes topics associated with the dataset.
Here is an example of setting up a dataset configuration:
from acadia.data_sources import CSVDataSource
from acadia.models import DatasetConfig

# Data source: a CSV file, with sample_size set to 1000
data_source = CSVDataSource("path/to/data.csv", sample_size=1000)

# image_preview_column and topic_tree are not set in this example
config = DatasetConfig(
    name="human_eval",
    description="Test description",
    columns=["id", "caption_0", "caption_1"],
    id_column="id",
    text_preview_column="caption_0",
    source=data_source,
)
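The configuration parameters refer to one another: id_column and the two preview columns each name an entry that should also appear in columns. The following is a minimal sketch of that consistency check, not Acadia's own validation. ConfigSketch and validate_config are hypothetical stand-ins that mirror the documented fields, and the rules they enforce are assumptions about what a coherent configuration looks like.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring the documented DatasetConfig fields."""
    name: str
    description: str
    columns: List[str]
    id_column: str
    text_preview_column: Optional[str] = None
    image_preview_column: Optional[str] = None


def validate_config(config: ConfigSketch) -> List[str]:
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []
    if not config.name:
        problems.append("name must not be empty")
    if len(set(config.columns)) != len(config.columns):
        problems.append("columns contains duplicates")
    # Assumed rule: every column referenced by role must be a declared column.
    roles = {
        "id_column": config.id_column,
        "text_preview_column": config.text_preview_column,
        "image_preview_column": config.image_preview_column,
    }
    for role, column in roles.items():
        if column is not None and column not in config.columns:
            problems.append(f"{role} '{column}' is not listed in columns")
    return problems


good = ConfigSketch(
    name="human_eval",
    description="Test description",
    columns=["id", "caption_0", "caption_1"],
    id_column="id",
    text_preview_column="caption_0",
)
bad = ConfigSketch(
    name="human_eval",
    description="Test description",
    columns=["id", "caption_0"],
    id_column="row_id",
)
print(validate_config(good))  # []
print(validate_config(bad))   # ["id_column 'row_id' is not listed in columns"]
```

Returning a list of problems rather than raising on the first one lets a caller report every inconsistency in a single pass.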
Creating the Dataset
Once the configuration is set, use the create_dataset function to instantiate the dataset:
import acadia

dataset = acadia.dataset.create_dataset(config)
This function initializes the dataset, loads data based on the specified source, and prepares it for further processing and analysis.
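To make the loading step concrete, here is a rough sketch of what reading a CSV source against a configuration could involve. It is not how create_dataset is implemented. load_rows is a hypothetical helper built on Python's standard csv module, and it assumes that sample_size keeps the first N rows and that values in the ID column must be unique; Acadia may handle both differently.

```python
import csv
import io
from typing import Dict, List, Optional


def load_rows(csv_text: str, columns: List[str], id_column: str,
              sample_size: Optional[int] = None) -> List[Dict[str, str]]:
    """Hypothetical loader: parse CSV text, keep the expected columns, cap the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in columns if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows: List[Dict[str, str]] = []
    seen_ids = set()
    for record in reader:
        if sample_size is not None and len(rows) >= sample_size:
            break  # assumption: sample_size keeps the first N rows
        if record[id_column] in seen_ids:
            raise ValueError(f"duplicate id: {record[id_column]}")
        seen_ids.add(record[id_column])
        # Drop any columns the configuration did not ask for.
        rows.append({c: record[c] for c in columns})
    return rows


csv_text = (
    "id,caption_0,caption_1,extra\n"
    "1,a cat,a dog,x\n"
    "2,a car,a bus,y\n"
    "3,a hat,a cap,z\n"
)
rows = load_rows(csv_text, ["id", "caption_0", "caption_1"], "id", sample_size=2)
print(len(rows))  # 2
print(rows[0])    # {'id': '1', 'caption_0': 'a cat', 'caption_1': 'a dog'}
```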
Retrieving Datasets
Listing Datasets
To see what datasets are available, you can list them using:
datasets = acadia.dataset.list_datasets()
print(datasets)
This function retrieves all datasets that have been created and registered within the system.
Getting a Specific Dataset
If you know the name of the dataset you want to interact with, you can retrieve it directly:
dataset = acadia.dataset.get_dataset("human_eval")
This function fetches the dataset with the specified name, allowing for direct access to its data and associated operations.
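Listing and name-based retrieval both rely on dataset names being unique. The sketch below illustrates that idea with a hypothetical in-memory DatasetRegistry; it is a stand-in for explanation only and says nothing about how Acadia actually stores or registers datasets. The error raised for an unknown name is likewise an assumption.

```python
from typing import Dict, List


class DatasetRegistry:
    """Hypothetical in-memory registry keyed by dataset name."""

    def __init__(self) -> None:
        self._datasets: Dict[str, dict] = {}

    def register(self, name: str, dataset: dict) -> None:
        # Names act as unique identifiers, so refuse to overwrite an entry.
        if name in self._datasets:
            raise ValueError(f"dataset '{name}' already exists")
        self._datasets[name] = dataset

    def list_names(self) -> List[str]:
        return sorted(self._datasets)

    def get(self, name: str) -> dict:
        try:
            return self._datasets[name]
        except KeyError:
            raise KeyError(
                f"no dataset named '{name}'; available: {self.list_names()}"
            ) from None


registry = DatasetRegistry()
registry.register("human_eval", {"description": "Test description"})
print(registry.list_names())       # ['human_eval']
print(registry.get("human_eval"))  # {'description': 'Test description'}
```

Including the available names in the error message makes a mistyped dataset name easy to spot.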
Conclusion
Creating and retrieving datasets in Acadia comes down to three calls: create_dataset with a DatasetConfig, list_datasets to see what has been registered, and get_dataset to fetch a dataset by name. A properly configured dataset is the starting point for Acadia's data processing, analysis, and visualization tasks.