Creating Custom Data Sources in Acadia

Custom data sources let you handle different data formats and plug your own loading logic into Acadia. This guide explains how to create one by extending the DataSource abstract base class that Acadia provides.

Understanding the DataSource Base Class

The DataSource abstract base class defines the interface that every data source in Acadia must implement. The class lives in acadia.data_sources.base. Its main requirement is that subclasses implement the generate_data_batches method, which supplies data to Acadia in batches.
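The shape of this interface can be sketched with Python's abc module. The class below, DataSourceSketch, is a hypothetical stand-in written for illustration, not Acadia's actual DataSource; the real class may define additional methods or a different signature. InMemorySource is likewise an invented toy subclass. The sketch shows the two properties that matter here: an abstract class cannot be instantiated until generate_data_batches is implemented, and an implementation yields lists of row dictionaries.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generator, List


# Hypothetical stand-in for acadia.data_sources.base.DataSource.
# The real base class may differ; this only illustrates the contract.
class DataSourceSketch(ABC):
    @abstractmethod
    def generate_data_batches(
        self, batch_size: int
    ) -> Generator[List[Dict[str, Any]], None, None]:
        """Yield lists of row dictionaries, at most batch_size rows each."""


class InMemorySource(DataSourceSketch):
    """Toy subclass that serves rows from a list held in memory."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows

    def generate_data_batches(self, batch_size: int = 1000):
        # Slice the list into consecutive chunks of batch_size rows.
        for start in range(0, len(self.rows), batch_size):
            yield self.rows[start:start + batch_size]


source = InMemorySource([{"id": 1}, {"id": 2}, {"id": 3}])
batches = list(source.generate_data_batches(batch_size=2))
```

Calling DataSourceSketch() directly raises TypeError, because the abstract method has no implementation. That is the standard behavior of abc.ABC and is the reason a subclass must provide generate_data_batches.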

Importing the Base Class

Before you can define a custom data source, you need to import the DataSource class from the Acadia framework:

from acadia.data_sources.base import DataSource

This import statement brings the abstract base class into your module, enabling you to define a subclass that implements the required methods.

Implementing the generate_data_batches Method

The generate_data_batches method is responsible for loading data. Implement it as a generator that yields data in batches, so that large datasets can be processed without holding every row in memory at once.

Parameters:

  • batch_size (int): Specifies the number of rows per batch. This helps manage memory usage and processing efficiency.

Yields:

  • DataSourceBatchType: One batch of data per iteration. Each batch is a List[Dict], with each dictionary representing one data row. The method's return type is therefore Generator[DataSourceBatchType, None, None].

Type Definitions:

  • DataSourceBatchType: Defined as List[DataSourceRowType].

  • DataSourceRowType: A dictionary that maps column names to data values. Because every row has this shape, batches from different data sources can be processed the same way.
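The definitions above can be sketched as plain type aliases. The aliases below are assumptions that mirror the description, not a copy of Acadia's source; in particular, the value type Any is a guess. The helper batch_rows is a hypothetical function, included only to show how an iterable of rows can be grouped into batches of the described shape.

```python
from itertools import islice
from typing import Any, Dict, Generator, Iterable, List

# Assumed shapes, mirroring the definitions above (value type is a guess).
DataSourceRowType = Dict[str, Any]
DataSourceBatchType = List[DataSourceRowType]


def batch_rows(
    rows: Iterable[DataSourceRowType], batch_size: int
) -> Generator[DataSourceBatchType, None, None]:
    """Group any iterable of row dicts into lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(rows)
    while True:
        # islice pulls up to batch_size rows without reading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five rows grouped in twos: the last batch holds the single leftover row.
rows = ({"id": i, "name": f"user{i}"} for i in range(5))
sizes = [len(batch) for batch in batch_rows(rows, batch_size=2)]
```

Because batch_rows consumes its input lazily, it works with generators as well as lists, which suits sources too large to hold in memory.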

Creating a Custom Data Source Example

Below is an example custom data source that reads rows from a CSV file by subclassing Acadia's DataSource:

import csv
from typing import Generator, Optional

from acadia.data_sources.base import DataSource
from acadia.types import DataSourceBatchType


class CSVDataSource(DataSource):
    """
    Custom CSV data source for loading data from CSV files.
    """

    def __init__(self, file_path: str, sample_size: Optional[int] = None):
        """
        Initialize with the path to the CSV file and an optional sample size.

        Args:
            file_path (str): Path to the CSV file.
            sample_size (int, optional): Maximum number of rows to read.
                None (the default) reads the whole file.
        """
        self.file_path = file_path
        self.sample_size = sample_size

    def generate_data_batches(
        self, batch_size: int = 1000
    ) -> Generator[DataSourceBatchType, None, None]:
        """
        Yield data batches from the CSV file.

        Args:
            batch_size (int): Rows per batch.

        Yields:
            DataSourceBatchType: Each batch as a list of dictionaries.
        """
        # newline="" lets the csv module handle line endings itself.
        with open(self.file_path, mode="r", encoding="utf-8", newline="") as csvfile:
            reader = csv.DictReader(csvfile)
            batch = []
            for i, row in enumerate(reader):
                # Compare against None so that sample_size=0 reads no rows.
                if self.sample_size is not None and i >= self.sample_size:
                    break
                batch.append(row)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            # Yield the final, possibly smaller, batch.
            if batch:
                yield batch
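To try the batching behavior without Acadia installed, the logic of generate_data_batches can be exercised as a standalone function. The sketch below is an illustration under that assumption: read_csv_batches is a hypothetical helper that mirrors the method body, and the temporary file and its contents are invented for the demonstration.

```python
import csv
import os
import tempfile
from typing import Dict, Generator, List, Optional


def read_csv_batches(
    file_path: str, batch_size: int = 1000, sample_size: Optional[int] = None
) -> Generator[List[Dict[str, str]], None, None]:
    """Standalone mirror of the batching logic in generate_data_batches."""
    with open(file_path, mode="r", encoding="utf-8", newline="") as csvfile:
        batch = []
        for i, row in enumerate(csv.DictReader(csvfile)):
            # Stop once sample_size rows have been collected.
            if sample_size is not None and i >= sample_size:
                break
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


# Write a five-row CSV to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "name"])
    writer.writerows([[i, f"user{i}"] for i in range(5)])
    csv_path = tmp.name

try:
    all_sizes = [len(b) for b in read_csv_batches(csv_path, batch_size=2)]
    sampled_sizes = [
        len(b) for b in read_csv_batches(csv_path, batch_size=2, sample_size=3)
    ]
    first_row = list(read_csv_batches(csv_path, batch_size=2))[0][0]
finally:
    os.remove(csv_path)
```

With five rows and batch_size=2, the batches hold 2, 2, and 1 rows; with sample_size=3, reading stops after the third row. Note that csv.DictReader returns every value as a string, so a data source that needs numeric columns has to convert them itself.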

Conclusion

Custom data sources adapt Acadia to specific data ingestion requirements, such as custom data formats or preprocessing steps. To integrate one into Acadia's data processing pipeline, extend the DataSource abstract base class and implement the generate_data_batches method.