Creating Custom Tag Embedding Models

Introduction

In Acadia, a custom tag embedding model transforms tag data into numerical vectors that are better suited to analysis, machine learning, and other data processing workflows. A tag embedding model should capture the semantic meaning of each tag and represent it as a point in a vector space.

Overview of Tag Embedding Models

A tag embedding model in Acadia should conform to an abstract base class that defines the necessary methods for generating embeddings. This structure ensures that custom models integrate seamlessly with the broader Acadia framework, particularly during data processing and analysis phases.

Implementing the TagEmbeddingModel

To create a custom tag embedding model, developers must extend the TagEmbeddingModel abstract base class (ABC). This class requires the implementation of the following method:

get_tag_ids_to_embeddings

This method should return a dictionary mapping tag IDs to their corresponding embeddings. This dictionary facilitates efficient storage and retrieval of tag embeddings, linking them directly to their respective tags in the database.

from abc import ABC, abstractmethod
from acadia.database.schemas import Dataset
from acadia.types import IdToEmbeddingDictType
 
class TagEmbeddingModel(ABC):
 
    @abstractmethod
    def get_tag_ids_to_embeddings(self, dataset: Dataset) -> IdToEmbeddingDictType:
        """
        Retrieves embeddings for each tag in a dataset.
        Returns a dictionary where each key is a tag ID and the value is the corresponding embedding.
        """
        pass
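
To illustrate what conforming to the abstract base class means in practice, the following self-contained sketch redefines a local stand-in for TagEmbeddingModel and uses plain Python types in place of Acadia's Dataset and IdToEmbeddingDictType. The stand-in types, the IncompleteModel and ConstantModel classes, and the can_instantiate helper are all assumptions made for illustration so that the snippet runs without Acadia installed.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

# Stand-in for acadia.types.IdToEmbeddingDictType (assumed shape: tag ID -> vector).
IdToEmbedding = Dict[int, List[float]]


class TagEmbeddingModel(ABC):
    """Local stand-in mirroring the abstract base class shown above."""

    @abstractmethod
    def get_tag_ids_to_embeddings(self, dataset) -> IdToEmbedding:
        ...


class IncompleteModel(TagEmbeddingModel):
    """Hypothetical subclass that forgets to implement the abstract method."""


class ConstantModel(TagEmbeddingModel):
    """Hypothetical subclass that maps every tag ID to the same fixed vector."""

    def get_tag_ids_to_embeddings(self, dataset) -> IdToEmbedding:
        # `dataset` is treated here as a plain iterable of tag IDs (an assumption).
        return {tag_id: [0.0, 1.0] for tag_id in dataset}


def can_instantiate(cls) -> bool:
    """Return True if `cls` can be constructed with no arguments."""
    try:
        cls()
        return True
    except TypeError:
        # Python raises TypeError when an abstract method is left unimplemented.
        return False


incomplete_ok = can_instantiate(IncompleteModel)
constant_result = ConstantModel().get_tag_ids_to_embeddings([1, 2])
```

Python refuses to instantiate a subclass that leaves an abstract method unimplemented, so a missing get_tag_ids_to_embeddings surfaces when the model is constructed instead of later in a processing run.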

Example Implementation

Below is a mocked implementation of TagEmbeddingModel. It generates a random vector for each tag and then passes the resulting dictionary through a dimension reduction model. It is intended for demonstration purposes only.

import random
from typing import List

from acadia.database.schemas import Dataset
from acadia.models.dimension_reduction_models.base import DimensionReductionModel
from acadia.types import IdToEmbeddingDictType, EmbeddingType
# Tag and TagEmbeddingModel must also be in scope; import them from wherever
# your Acadia installation defines them.

class MockTagEmbeddingModel(TagEmbeddingModel):

    def __init__(self, task_context: str, tag_content_to_embed: List[str], dim_reduction_model: DimensionReductionModel, max_dims: int = 10):
        self.task_context = task_context
        self.tag_content_to_embed = tag_content_to_embed
        self.dim_reduction_model = dim_reduction_model
        self.max_dims = max_dims

    def get_tag_ids_to_embeddings(self, dataset: Dataset) -> IdToEmbeddingDictType:
        # Build a raw embedding for every tag in the dataset, keyed by tag ID.
        tag_id_to_raw_embedding_dict = {}
        for tag in dataset.tags:
            tag_id_to_raw_embedding_dict[tag.id] = self.embed_tag(tag)

        # Reduce the raw embeddings before returning them.
        return self.dim_reduction_model.reduce_dimensions(tag_id_to_raw_embedding_dict)

    def embed_tag(self, tag: Tag) -> EmbeddingType:
        # Mock embedding: max_dims random floats in [0, 1); the tag content is ignored.
        return [random.random() for _ in range(self.max_dims)]
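
The flow of MockTagEmbeddingModel (embed each tag, then reduce) can be exercised end to end with the rough sketch below. It does not use Acadia itself: Tag, Dataset, TruncatingReducer, and RandomTagEmbedder are hypothetical stand-ins defined locally, and the only thing assumed about a dimension reduction model is that it exposes reduce_dimensions taking and returning a dictionary, as in the example above.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tag:  # hypothetical stand-in for Acadia's Tag
    id: int
    name: str


@dataclass
class Dataset:  # hypothetical stand-in for acadia.database.schemas.Dataset
    tags: List[Tag]


class TruncatingReducer:
    """Hypothetical reducer: keeps the first `n_dims` components of each vector."""

    def __init__(self, n_dims: int):
        self.n_dims = n_dims

    def reduce_dimensions(self, id_to_embedding: Dict[int, List[float]]) -> Dict[int, List[float]]:
        return {key: vec[: self.n_dims] for key, vec in id_to_embedding.items()}


class RandomTagEmbedder:
    """Mirrors the flow of MockTagEmbeddingModel: embed each tag, then reduce."""

    def __init__(self, dim_reduction_model, max_dims: int = 10, seed: int = 0):
        self.dim_reduction_model = dim_reduction_model
        self.max_dims = max_dims
        self.rng = random.Random(seed)  # seeded so repeated runs give the same vectors

    def embed_tag(self, tag: Tag) -> List[float]:
        # Random floats in [0, 1); the tag content is ignored, as in the mock.
        return [self.rng.random() for _ in range(self.max_dims)]

    def get_tag_ids_to_embeddings(self, dataset: Dataset) -> Dict[int, List[float]]:
        raw = {tag.id: self.embed_tag(tag) for tag in dataset.tags}
        return self.dim_reduction_model.reduce_dimensions(raw)


dataset = Dataset(tags=[Tag(1, "urgent"), Tag(2, "billing"), Tag(3, "bug")])
model = RandomTagEmbedder(TruncatingReducer(n_dims=3), max_dims=10)
embeddings = model.get_tag_ids_to_embeddings(dataset)
```

Each of the three tag IDs ends up mapped to a three-component vector. Truncation is used here only because it is trivial to verify; a real reducer would normally apply a learned projection.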

Explanation of IdToEmbeddingDictType

IdToEmbeddingDictType is a dictionary type that maps each tag's unique identifier to its embedding vector. Because every embedding is keyed by tag ID, it can always be associated with the correct tag in downstream tasks such as querying, visualization, and machine learning.

Keying embeddings by tag ID also keeps them aligned with their tags as the dictionary is passed from one step to the next, for example from the embedding step to dimension reduction. This matters most with large datasets, where matching embeddings to tags by position would be error-prone.
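
As a small illustration of why an ID-keyed mapping is convenient for querying, the sketch below looks up the tag most similar to a given tag. It assumes IdToEmbeddingDictType behaves like a plain dictionary from integer tag IDs to lists of floats; the alias IdToEmbedding, the helper functions, and the sample vectors are hypothetical and not part of Acadia.

```python
import math
from typing import Dict, List

# Assumed shape of IdToEmbeddingDictType: tag ID -> list of floats.
IdToEmbedding = Dict[int, List[float]]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_tag(tag_id: int, embeddings: IdToEmbedding) -> int:
    """Return the ID of the other tag whose vector is closest by cosine similarity."""
    query = embeddings[tag_id]
    candidates = (other for other in embeddings if other != tag_id)
    return max(candidates, key=lambda other: cosine_similarity(query, embeddings[other]))


# Hand-picked sample vectors, chosen so the answer is easy to check by eye.
embeddings: IdToEmbedding = {
    101: [1.0, 0.0],
    102: [0.9, 0.1],
    103: [0.0, 1.0],
}
nearest = most_similar_tag(101, embeddings)  # tag 102 points in nearly the same direction
```

Because the result is a tag ID and not a position in a list, it can be used directly to fetch the matching tag record.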