Identity Graph and Identity Resolution in SQL | RudderStack

By neub9



The Importance of Identity Resolution and Building an Identity Graph with SQL

Organizations collect customer data from many online and offline sources, and the same person often appears under different identifiers in each. Identity resolution uses matching algorithms to link those records into a single view of each customer, which improves customer experiences and simplifies data management. This article explains how to perform identity resolution and build an identity graph with SQL. It also covers the technical challenges behind identity resolution and proposes an algorithm to address them. Let’s start with the basics.

Identity Resolution and Identity Graphs

What is identity resolution?
Identity resolution is a data management technique used to link multiple identities and attributes related to the same individual or entity across different data sources or touchpoints.

What is an identity graph?
An identity graph is a data structure that connects an individual’s identifiers (such as email addresses, device IDs, and social handles) across multiple platforms and channels. It is built through identity resolution and gives a unified, single view of a customer, which supports personalized marketing, better customer experiences, and data privacy and security efforts.
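To make the data structure concrete, here is a minimal sketch of an identity graph held as an adjacency map. The event fields (`email`, `device_id`, `social_handle`), the sample values, and the function name `build_identity_graph` are hypothetical and chosen only for illustration; a real pipeline would read them from its own event tables.

```python
from collections import defaultdict

def build_identity_graph(events):
    """Return an adjacency map linking identifiers that co-occur in an event."""
    graph = defaultdict(set)
    for event in events:
        # Keep each identifier together with its type so that, for example,
        # an email and a device ID with the same text never collide.
        ids = [(kind, value) for kind, value in event.items() if value is not None]
        for node in ids:
            graph[node]  # touch the key so isolated identifiers are kept too
        # Identifiers seen in the same event are connected by an edge.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph

# Hypothetical sample events, not taken from the article.
events = [
    {"email": "ann@example.com", "device_id": "d-1"},
    {"device_id": "d-1", "social_handle": "@ann"},
    {"email": "bob@example.com", "device_id": "d-2"},
]
graph = build_identity_graph(events)
```

In this sketch the email `ann@example.com` and the handle `@ann` are never seen together, yet both are linked to device `d-1`, so they end up in the same part of the graph.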

The Impact of Identity Mapping

Successfully implementing ID mapping can lead to deep customer understanding, optimized marketing efforts, and a more personalized and seamless customer experience.

ID Mapping and Customer Data Platforms (CDPs)

Customer Data Platforms (CDPs) often use ID mapping to integrate and organize customer data from multiple sources. By resolving entity identities accurately, CDPs can provide a single source of truth for all customer data, leading to more insightful customer profiles and enabling better personalization strategies.

The Problem of Identity Resolution

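Identity resolution can be framed as a graph problem. Each identifier, such as an anonymous ID or a user ID, is a node, and each observed association between two identifiers is an edge. All identifiers that belong to the same customer then form one connected component, so resolving identities means finding the connected components of the identity graph. The algorithm proposed in this article does this in SQL by propagating identifiers across edges, keeping the smaller ID at each step, until every node in a component carries the same ID. It is designed to work across different sources of data.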
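The propagation idea can be sketched outside SQL in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: it assumes edges arrive as pairs of string IDs, that IDs are unique across identifier types, and it picks the lexicographically smallest ID as the component label. The function name and sample IDs are hypothetical.

```python
def propagate_min_ids(edges):
    """Label every node with the smallest node ID in its connected component."""
    # Every node starts out labelled with its own ID.
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)
        labels.setdefault(b, b)

    # Push the smaller label across each edge until nothing changes.
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            smallest = min(labels[a], labels[b])
            for node in (a, b):
                if labels[node] != smallest:
                    labels[node] = smallest
                    changed = True
    return labels

# Hypothetical edges between anonymous IDs and user IDs.
edges = [("anon_1", "user_1"), ("anon_2", "user_1"), ("anon_3", "user_2")]
labels = propagate_min_ids(edges)
```

Here `anon_1`, `anon_2`, and `user_1` end up sharing one label, while `anon_3` and `user_2` share another, which corresponds to two resolved customers.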

Example: Code Implementation in SQL

The following SQL code shows part of the proposed identity resolution algorithm. It finds connected components and assigns each node in the graph a component-wide ID, which gives a unified view of customer identities.


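-- Starting table: the orig_* columns hold the IDs as observed, and the
-- curr_* columns hold the labels that each pass may replace.
CREATE TABLE ID_GRAPH_0 ( orig_anon_id varchar(32),
                          orig_user_id varchar(32),
                          curr_anon_id varchar(32),
                          curr_user_id varchar(32),
                          version_anon_id int,
                          version_user_id int);

-- One propagation pass: keep the smaller of the current label and the
-- candidate label tmp_anon_id (presumably produced in the elided part).
CREATE TABLE ID_GRAPH_1 AS (
    SELECT
        orig_anon_id,
        orig_user_id,
        CASE
            WHEN curr_anon_id IS NULL THEN NULL
            WHEN tmp_anon_id < curr_anon_id THEN tmp_anon_id
            ELSE curr_anon_id
        END AS curr_anon_id,
        ...
        ...
        ...
    );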

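You should run this step as many times as the largest diameter among the connected components, so that the IDs stabilize across every component. Each pass carries an ID only a limited distance along the edges, which is why a single pass is not enough for long chains of associations.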
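The repeated passes can be sketched with SQLite driven from Python. This is a simplified illustration, not the article's schema: the `edges` and `labels` tables, their columns, and the function name `resolve_ids` are assumptions made for the example, and it assumes anonymous IDs and user IDs never share the same string. Each pass builds a fresh table, much like going from `ID_GRAPH_0` to `ID_GRAPH_1`, and the loop stops once a pass changes nothing.

```python
import sqlite3

def resolve_ids(edges):
    """Run min-label passes in SQL until stable; return (labels, passes)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (anon_id TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    # Every node starts out labelled with its own ID.
    conn.execute("""
        CREATE TABLE labels AS
        SELECT anon_id AS node, anon_id AS label FROM edges
        UNION
        SELECT user_id, user_id FROM edges
    """)
    passes = 0
    while True:
        # One pass: each node takes the smallest label found on itself
        # or on any node it shares an edge with.
        conn.execute("""
            CREATE TABLE labels_next AS
            SELECT l.node AS node, MIN(n.label) AS label
            FROM labels l
            JOIN edges e ON l.node IN (e.anon_id, e.user_id)
            JOIN labels n ON n.node IN (e.anon_id, e.user_id)
            GROUP BY l.node
        """)
        changed = conn.execute("""
            SELECT COUNT(*)
            FROM labels l JOIN labels_next n ON n.node = l.node
            WHERE n.label <> l.label
        """).fetchone()[0]
        conn.execute("DROP TABLE labels")
        conn.execute("ALTER TABLE labels_next RENAME TO labels")
        if changed == 0:
            break  # labels are stable across all components
        passes += 1
    result = dict(conn.execute("SELECT node, label FROM labels"))
    conn.close()
    return result, passes

# Hypothetical chain a1 - u1 - a2 - u2 - a3 plus a separate pair.
edges = [("a1", "u1"), ("a2", "u1"), ("a2", "u2"), ("a3", "u2"), ("a9", "u9")]
labels, passes = resolve_ids(edges)
```

In this sketch the chain has a diameter of four edges, and the smallest ID `a1` needs that many label-changing passes to reach `a3`, while the isolated pair settles immediately.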

Incremental Updates

To create a real-time customer 360, the system must continuously update the identity graph as new users sign up or form new associations. Running the proposed identity resolution algorithm at regular intervals keeps the identity map up to date. Each run must process the new nodes, the new edges, and the connected components touched by those edges to maintain an accurate and comprehensive view of customer identities.
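One way to picture an incremental run is to start from the labels produced by the previous run and revisit only the components that a new edge touches. The sketch below is an illustration under assumptions, not the article's method: it assumes labels are kept as a node-to-component-ID mapping where each component ID is the smallest node ID in that component, and the function name `apply_new_edges` and the sample IDs are hypothetical.

```python
def apply_new_edges(labels, new_edges):
    """Update labels in place; return the IDs of components that changed."""
    # Index the existing nodes by component so merges touch only members.
    members = {}
    for node, label in labels.items():
        members.setdefault(label, set()).add(node)

    touched = set()
    for a, b in new_edges:
        for node in (a, b):
            if node not in labels:
                # A brand-new node starts as its own component.
                labels[node] = node
                members[node] = {node}
        la, lb = labels[a], labels[b]
        if la == lb:
            continue  # both ends already resolve to the same customer
        keep, drop = min(la, lb), max(la, lb)
        # Merge: relabel only the members of the component being absorbed.
        moved = members.pop(drop)
        for node in moved:
            labels[node] = keep
        members[keep] |= moved
        touched.discard(drop)
        touched.add(keep)
    return touched

# Labels from a hypothetical earlier run, then two newly observed edges.
labels = {"a1": "a1", "u1": "a1", "a2": "a2", "u2": "a2", "a3": "a3"}
touched = apply_new_edges(labels, [("a2", "u1"), ("a4", "u3")])
```

In this example the first new edge merges two existing components, the second creates a new one, and the component containing only `a3` is never visited.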

