Linking the unlinkables; simple, automated, scalable data linking with Databricks ARC

By neub9

Enhanced Data Linking with Databricks ARC

In April 2023, we announced the release of Databricks ARC to enable simple, automated data linking within a single table. Today, we are excited to announce an enhancement that allows ARC to find links between two different tables, using the same open, scalable, and simple framework.

Data linking is a common challenge across Government. Splink, developed by the UK Ministry of Justice, is the linking engine within ARC and provides a powerful, open, and explainable entity resolution package.

Linking data between two different tables can be a complex task. Traditionally, it relied on hard-coded rules created by expert developers, leading to complex and brittle systems. However, probabilistic data linking offers a more flexible solution, using statistical similarities between records as the basis for decision-making. Databricks ARC simplifies this approach by applying standards and heuristics to remove the need for manually defined rules, allowing for a more efficient and scalable data linking system.
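As a rough illustration of how statistical similarity can drive a linking decision, the sketch below scores a pair of records in the Fellegi-Sunter style that underpins probabilistic linkage tools such as Splink. The field names, the m and u probabilities, and the prior are made-up values for illustration only; in practice they are estimated from the data, and none of this is ARC's or Splink's actual API.

```python
import math

# Hypothetical per-field parameters (illustrative values, not estimated from data).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname": {"m": 0.95, "u": 0.005},
    "postcode": {"m": 0.80, "u": 0.001},
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum the log2 Bayes factors of each field comparison."""
    weight = 0.0
    for field, p in params.items():
        if record_a[field] == record_b[field]:
            # Agreement is evidence for a match when m > u.
            weight += math.log2(p["m"] / p["u"])
        else:
            # Disagreement is evidence against a match when m > u.
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))
    return weight


def match_probability(weight, prior=0.001):
    """Turn a match weight into a posterior probability, given an assumed prior."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Two records from different tables with no shared key.
left = {"first_name": "ann", "surname": "lee", "postcode": "AB1 2CD"}
right = {"first_name": "ann", "surname": "lee", "postcode": "AB1 2CD"}
other = {"first_name": "bob", "surname": "ray", "postcode": "ZZ9 9ZZ"}

print(match_probability(match_weight(left, right)))  # close to 1
print(match_probability(match_weight(left, other)))  # close to 0
```

No rule says "match on surname and postcode"; the decision comes from how much more likely each agreement is among true matches than among non-matches, which is why this approach tends to be less brittle than hand-written rules.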

Benefits of ARC

Automated, low-effort linking with ARC creates a variety of opportunities:

  • Reduce the time to value and cost of migrations and integrations.
  • Enable interdepartmental and inter-government collaboration.
  • Link data with models tailored to the data’s characteristics.

The addition of automated data linking to ARC is an important contribution to the realm of entity resolution and data integration. By connecting datasets without a common key, the public sector can harness the true power of their data, drive internal innovation and modernization, and better serve their citizens.

Explore ARC and get started today by trying the example notebooks from the ARC GitHub repository. ARC is a fully open source project, available on PyPI and installable with pip, and requires no prior data linking or entity resolution experience to get started.

Accuracy Metrics

The perennial challenge of data linking in the real world is accuracy. There are three common ways of measuring accuracy: Precision, Recall, and F1-score. However, these metrics are only applicable when one has access to a set of labels showing the true links.
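For reference, the three metrics can be computed from a set of predicted links and a set of labeled true links as in the minimal sketch below. The function name and the representation of a link as a pair of record IDs are assumptions made for this example, not part of ARC.

```python
def linking_scores(predicted_links, true_links):
    """Return (precision, recall, f1) for predicted vs. labeled record pairs.

    Each link is a pair of record IDs; order within a pair is ignored.
    """
    predicted = {frozenset(pair) for pair in predicted_links}
    actual = {frozenset(pair) for pair in true_links}

    true_positives = len(predicted & actual)

    # Precision: share of predicted links that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true links that were found.
    recall = true_positives / len(actual) if actual else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up record IDs.
predicted = [(1, 2), (3, 4), (5, 6)]
labeled = [(2, 1), (3, 4), (7, 8), (9, 10)]
precision, recall, f1 = linking_scores(predicted, labeled)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

All three values depend on `true_links`, which is exactly what is missing in most real-world linking projects.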

To evaluate ARC’s performance, we used FEBRL to create a synthetic data set and tested our hypothesis by optimizing solely for our metric over 100 runs for each data set. The positive correlation observed between our metric and the empirical F1-score suggests that maximizing our metric in the absence of labeled data is a good proxy for correct data linking.


