Creating Custom Data Sources in Acadia
Custom data sources in Acadia let you handle a variety of data formats and plug in loading logic tailored to your needs. This guide explains how to create your own data source by extending the DataSource abstract base class provided by Acadia.
Understanding the DataSource Base Class
The DataSource abstract base class defines the interface that every data source in Acadia must implement. The class is found in acadia.data_sources.base, and its main requirement is the generate_data_batches method, which supplies data in batches rather than all at once.
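The contract described above can be sketched without installing Acadia. The DataSource class below is a hypothetical stand-in built with Python's abc module; the real base class may declare its method differently. IncompleteSource and InMemorySource are illustrative names only. The sketch shows that a subclass which omits generate_data_batches cannot be instantiated, while one that implements it can.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generator, List


class DataSource(ABC):
    """Hypothetical stand-in for acadia.data_sources.base.DataSource."""

    @abstractmethod
    def generate_data_batches(
        self, batch_size: int
    ) -> Generator[List[Dict[str, Any]], None, None]:
        """Yield lists of row dictionaries, at most batch_size rows each."""


class IncompleteSource(DataSource):
    """Illustrative subclass that forgets to implement generate_data_batches."""


class InMemorySource(DataSource):
    """Illustrative subclass that serves rows from a list already in memory."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self.rows = rows

    def generate_data_batches(self, batch_size: int = 1000):
        # Slice the list into consecutive chunks of at most batch_size rows.
        for start in range(0, len(self.rows), batch_size):
            yield self.rows[start:start + batch_size]


try:
    IncompleteSource()
    instantiated = True
except TypeError:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    instantiated = False

source = InMemorySource([{"id": 1}, {"id": 2}, {"id": 3}])
batches = list(source.generate_data_batches(batch_size=2))
```

If the real base class follows the same pattern, a missing method surfaces as a TypeError when the source is constructed, before any data is loaded.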
Importing the Base Class
Before you can define a custom data source, you need to import the DataSource class from the Acadia framework:
from acadia.data_sources.base import DataSource
This import statement brings the abstract base class into your module, enabling you to define a subclass that implements the required methods.
Implementing the generate_data_batches Method
The generate_data_batches method does the actual data loading. Implement it as a generator that yields data in manageable batches, so that large datasets can be processed without holding every row in memory at once.
Parameters:
- batch_size (int): Specifies the number of rows per batch. This helps manage memory usage and processing efficiency.
Yields:
- Generator[DataSourceBatchType, None, None]: A generator that yields batches of data, where each batch is a List[Dict] and each dictionary represents one data row.
Type Definitions:
- DataSourceBatchType: Defined as List[DataSourceRowType].
- DataSourceRowType: A dictionary where keys are column names and values are data values, so that every row in a batch has the same shape and is easy to process.
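To make the two aliases concrete, here is a minimal sketch of how they could be written with the typing module. These definitions are assumptions based on the description above: the real aliases ship with Acadia, and the value type Any is a guess. The helper has_consistent_columns is hypothetical and only illustrates what a consistent batch means.

```python
from typing import Any, Dict, List

# Assumed definitions; the real aliases are provided by Acadia and may differ.
DataSourceRowType = Dict[str, Any]
DataSourceBatchType = List[DataSourceRowType]


def has_consistent_columns(batch: DataSourceBatchType) -> bool:
    """Hypothetical helper: True if every row in the batch has the same keys."""
    if not batch:
        return True
    expected = set(batch[0])
    return all(set(row) == expected for row in batch)


# A well-formed batch: two rows with identical column names.
batch: DataSourceBatchType = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]

# A ragged batch: the last row is missing the "name" column.
ragged = batch + [{"id": 3}]
```

A check like this can be handy while developing a new source, since a row with missing columns is easier to catch at the source than further down the pipeline.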
Creating a Custom Data Source Example
Below is an example implementation of a custom data source that reads data from CSV files, utilizing the base class from Acadia:
import csv

from acadia.data_sources.base import DataSource
from acadia.types import DataSourceBatchType


class CSVDataSource(DataSource):
    """
    Custom CSV data source for loading data from CSV files.
    """

    def __init__(self, file_path, sample_size=None):
        """
        Initialize with the path to the CSV file and an optional sample size.

        Args:
            file_path (str): Path to the CSV file.
            sample_size (int, optional): Maximum number of rows to read.
        """
        self.file_path = file_path
        self.sample_size = sample_size

    def generate_data_batches(self, batch_size=1000):
        """
        Yield data batches from the CSV file.

        Args:
            batch_size (int): Rows per batch.

        Yields:
            DataSourceBatchType: Each batch as a list of dictionaries.
        """
        # newline="" is what the csv module recommends for files it reads.
        with open(self.file_path, mode="r", encoding="utf-8", newline="") as csvfile:
            reader = csv.DictReader(csvfile)
            batch = []
            for i, row in enumerate(reader):
                # Stop once sample_size rows have been read (None means no limit).
                if self.sample_size is not None and i >= self.sample_size:
                    break
                batch.append(row)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            # Emit the final, possibly smaller, batch.
            if batch:
                yield batch
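The batching loop above can be tried out without Acadia installed. The sketch below is a hypothetical standalone function, read_csv_in_batches, that follows the same logic but uses itertools.islice in place of the manual counter; it is a variation for illustration, not part of Acadia. It writes a five-row temporary CSV file and reads it back in batches of two.

```python
import csv
import os
import tempfile
from itertools import islice
from typing import Dict, Iterator, List, Optional


def read_csv_in_batches(
    file_path: str, batch_size: int = 1000, sample_size: Optional[int] = None
) -> Iterator[List[Dict[str, str]]]:
    """Hypothetical standalone version of the batching loop, using islice."""
    with open(file_path, mode="r", encoding="utf-8", newline="") as csvfile:
        # islice(reader, None) passes every row through; a number caps the count.
        rows = islice(csv.DictReader(csvfile), sample_size)
        while True:
            # Pull up to batch_size rows; an empty list means the file is done.
            batch = list(islice(rows, batch_size))
            if not batch:
                return
            yield batch


# Write a small throwaway CSV file to read back.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", newline="", encoding="utf-8", delete=False
) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "name"])
    writer.writerows([[i, f"user{i}"] for i in range(5)])
    path = tmp.name

try:
    batch_lengths = [len(b) for b in read_csv_in_batches(path, batch_size=2)]
    sampled = list(read_csv_in_batches(path, batch_size=2, sample_size=3))
finally:
    os.remove(path)
```

Two details are worth noting. The last batch may be shorter than batch_size, so consumers should not assume a fixed length. Also, csv.DictReader returns every value as a string, so a source that needs numeric or date columns has to convert them itself.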
Conclusion
Creating custom data sources is integral for adapting Acadia to specific data ingestion requirements, such as custom data formats or preprocessing steps. By extending the DataSource abstract base class and properly implementing the generate_data_batches method, developers can effectively integrate their custom data sources into Acadia's data processing pipeline.