Creating Custom Datum Embedding Models

Overview

In Acadia, datum embedding models transform raw data into numerical vectors that machine learning algorithms can analyze and process. A custom datum embedding model lets you tailor the embedding process to the specific characteristics and requirements of your data.

Understanding Datum Embedding Models

A Datum Embedding Model is a component that converts data from its original form (e.g., text, images) into a dense vector of floats. This vector represents the data in a lower-dimensional space while attempting to preserve relevant information.

DatumEmbeddingModel Abstract Base Class (ABC)

The DatumEmbeddingModel ABC in Acadia requires you to implement the following method:

  • get_datum_ids_to_embeddings(self, dataset: Dataset) -> IdToEmbeddingDictType: This method should return a dictionary mapping datum IDs to their corresponding embeddings. This mapping ensures that each datum's embedding is correctly associated with its ID for further processing or storage.

IdToEmbeddingDictType Explained:

  • EmbeddingType: A list of floats representing the datum's embedding.
  • IdToEmbeddingDictType: A dictionary where the key is the datum ID and the value is the EmbeddingType.

IdToEmbeddingDictType is the return type of get_datum_ids_to_embeddings, so it is what ties each embedding your model generates to its datum.
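As a rough illustration of the two types described above, the sketch below spells them out as plain Python type aliases. The aliases here are assumptions written for illustration; the real definitions live in acadia.types and may differ. In particular, DatumIdType and validate_embeddings are hypothetical names, the source does not state the type of a datum ID, and the requirement that all embeddings share one length is a common expectation rather than something the source states.

```python
from typing import Dict, List, Union

# Assumed aliases for illustration; the real ones come from acadia.types.
EmbeddingType = List[float]
DatumIdType = Union[int, str]  # hypothetical: the ID type is not given in the source
IdToEmbeddingDictType = Dict[DatumIdType, EmbeddingType]


def validate_embeddings(mapping: IdToEmbeddingDictType) -> int:
    """Hypothetical helper: check the mapping's shape and return the embedding dimension.

    Raises if an embedding is empty, contains a non-float value, or if the
    embeddings do not all have the same length. Returns 0 for an empty mapping.
    """
    dims = set()
    for datum_id, embedding in mapping.items():
        if not embedding:
            raise ValueError(f"datum {datum_id!r} has an empty embedding")
        if not all(isinstance(value, float) for value in embedding):
            raise TypeError(f"datum {datum_id!r} has a non-float value")
        dims.add(len(embedding))
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(dims)}")
    return dims.pop() if dims else 0


example: IdToEmbeddingDictType = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}
dimension = validate_embeddings(example)
print(dimension)  # 3
```

A check like this can be run on a model's output before handing it to later processing, so that a malformed mapping fails early.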

Implementing a Custom Model

To implement a custom datum embedding model, extend the DatumEmbeddingModel ABC and implement its required method, get_datum_ids_to_embeddings. Here is an example:

from acadia.models.datum_embedding_models.base import DatumEmbeddingModel
from acadia.types import IdToEmbeddingDictType, EmbeddingType
from acadia.database.schemas import Dataset, Datum


class MyCustomEmbeddingModel(DatumEmbeddingModel):
    def get_datum_ids_to_embeddings(self, dataset: Dataset) -> IdToEmbeddingDictType:
        """Return a dictionary mapping each datum ID in the dataset to its embedding."""
        embeddings: IdToEmbeddingDictType = {}

        # Iterate through each datum in the dataset
        for datum in dataset.datums:
            # Generate the embedding and store it under the datum's ID
            embeddings[datum.id] = self.process_datum(datum)

        return embeddings

    def process_datum(self, datum: Datum) -> EmbeddingType:
        """Helper for this example; not a method required by the ABC."""
        # Custom logic to generate embeddings
        # This could involve neural networks, statistical models, etc.
        return [0.1, 0.2, 0.3]  # Placeholder: the same fixed embedding for every datum
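
The example above returns the same fixed vector for every datum. As a minimal sketch of what less trivial per-datum logic might look like, the block below uses a simple hashing scheme to turn text into a normalized vector. It is written to run without Acadia installed, so FakeDatum and FakeDataset are hypothetical stand-ins for Datum and Dataset, the text attribute on a datum is an assumption (the source does not list a datum's fields), and HashingEmbeddingModel does not subclass DatumEmbeddingModel as a real model would.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

EmbeddingType = List[float]


@dataclass
class FakeDatum:
    """Hypothetical stand-in for acadia's Datum; the `text` field is assumed."""
    id: int
    text: str


@dataclass
class FakeDataset:
    """Hypothetical stand-in for acadia's Dataset."""
    datums: List[FakeDatum] = field(default_factory=list)


class HashingEmbeddingModel:
    """Sketch only; in Acadia this would extend DatumEmbeddingModel."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def get_datum_ids_to_embeddings(self, dataset: FakeDataset) -> Dict[int, EmbeddingType]:
        # Map every datum ID to the embedding computed for that datum.
        return {datum.id: self.process_datum(datum) for datum in dataset.datums}

    def process_datum(self, datum: FakeDatum) -> EmbeddingType:
        # Count whitespace-separated tokens into `dim` buckets chosen by a stable hash.
        vector = [0.0] * self.dim
        for token in datum.text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vector[bucket] += 1.0
        # Normalize so the entries sum to 1; an empty text stays all zeros.
        total = sum(vector)
        return [value / total for value in vector] if total else vector


dataset = FakeDataset([
    FakeDatum(1, "red apple"),
    FakeDatum(2, "red apple"),
    FakeDatum(3, ""),
])
result = HashingEmbeddingModel(dim=8).get_datum_ids_to_embeddings(dataset)
print(sorted(result))          # [1, 2, 3]
print(result[1] == result[2])  # True
```

Because the hash is deterministic, identical inputs always receive identical embeddings, which makes the output easy to check before swapping in a neural network or statistical model.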