Introduction

Databricks is a company that provides a cloud-based platform to help enterprises build, scale, and govern data and AI, including generative AI and other machine learning models. Its data layer, Delta Lake, is an optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. Delta Lake tables can be queried with SQL statements and exported as CSV files. AllegroGraph integrates with Databricks so that you can define and run an extract, transform, load (ETL) pipeline that transfers data into your Knowledge Graph.
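
If you have not used Databricks from Python before, the following sketch shows the SQL access path that the integration relies on. It is only an illustration and assumes the databricks-sql-connector package; the warehouse HTTP path, access token, and table name are placeholders you would replace with your own values.

    # Hedged sketch: query a Delta table over SQL using databricks-sql-connector.
    from databricks import sql

    with sql.connect(
        server_hostname="dbc-xxxxxxxx-09a9.cloud.databricks.com",  # your Databricks host
        http_path="/sql/1.0/warehouses/<warehouse-id>",            # a SQL warehouse you may use
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM workspace.default.<table> LIMIT 10")
            for row in cursor.fetchall():
                print(row)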

Setup

Prerequisite

  1. Make sure you have an Enterprise Databricks account (not Community Edition) and an assigned host, for example, https://dbc-xxxxxxxx-09a9.cloud.databricks.com.
  2. Make sure the databricks CLI is available on the same machine where your AllegroGraph instance is running (a quick verification sketch follows this list).
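
A quick way to verify the second prerequisite, sketched in Python (an illustrative check only, not part of AllegroGraph):

    # Check that the databricks CLI is installed on the machine running AllegroGraph.
    import shutil
    import subprocess

    cli = shutil.which("databricks")
    if cli is None:
        raise SystemExit("databricks CLI not found on PATH")
    subprocess.run([cli, "--version"], check=True)  # prints the installed CLI version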

Connect Databricks and AllegroGraph

  1. You followed the instructions and have a valid Databricks config (~/.databrickscfg), for example:
     [test-profile]  
     host          = https://dbc-xxxxxxxx-09a9.cloud.databricks.com  
     client_id     = XXX  
     client_secret = YYY 
  2. You allowed access to a volume and a workspace folder for the service principal from (1) and set those values in the AllegroGraph configuration file:

     Databricks export-table-notebook="/Workspace/<folder-with-read-write-access-for-the-principal>/import-sql-table.py", \
                volume-to-store-sql-tables="/Volumes/workspace/default/<volume-with-read-write-access-for-the-principal>"

    where:

    • /Workspace/<folder-with-read-write-access-for-the-principal>/import-sql-table.py is a Python notebook created automatically by AllegroGraph on the first SQL table export.
    • /Volumes/workspace/default/<volume-with-read-write-access-for-the-principal> is a volume where exported SQL tables will be written.
  3. You allowed access to at least one SQL warehouse for the service principal from (1). (A verification sketch for these permissions follows this list.)
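
Before pointing AllegroGraph at Databricks you can double-check the service principal's access with the Databricks SDK for Python. The sketch below is only an illustration: it assumes the databricks-sdk package, the test-profile profile from above, and the placeholder folder, catalog, and schema names used in this section.

    # Hedged sketch: confirm the profile works and the principal can see the
    # workspace folder, the volume, and at least one SQL warehouse.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient(profile="test-profile")  # reads ~/.databrickscfg

    # Folder that will hold import-sql-table.py (placeholder path).
    for obj in w.workspace.list("/Workspace/<folder-with-read-write-access-for-the-principal>"):
        print(obj.path)

    # Volumes in the catalog and schema used above (placeholder names).
    for vol in w.volumes.list(catalog_name="workspace", schema_name="default"):
        print(vol.full_name)

    # At least one SQL warehouse must appear here.
    for wh in w.warehouses.list():
        print(wh.name, wh.id)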

Usage

Importing data from Databricks is a two-step process:

  1. Databricks data is transformed into a CSV file and downloaded to the AllegroGraph side (roughly sketched after this list).
  2. You define an ETL pipeline, using WebView or agtool, to ingest the data from the previous step.
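
AllegroGraph performs the first step for you, but for illustration it roughly amounts to copying the exported file out of the configured volume. A minimal sketch, assuming the databricks CLI, the test-profile profile, and a hypothetical export file name:

    # Hedged sketch of step 1; the export file name below is hypothetical.
    import subprocess

    remote = "dbfs:/Volumes/workspace/default/<volume>/my_table.csv"
    local = "/tmp/my_table.csv"
    subprocess.run(
        ["databricks", "fs", "cp", remote, local, "--profile", "test-profile"],
        check=True,
    )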

Both of these steps are available in WebView on the Import CSV data from Databricks page. You can find the link to this page when you open a repository and click the Add, delete & import data link in the right sidebar. The page offers a selection of several ways in which data from Databricks can be downloaded to the AllegroGraph side:

[Screenshot: Databricks download options]

The explanation for each of the selectable items:

After acquiring a CSV file, you can define transformation rules to convert rows into triples. Please see the CSV load in WebView section for more details.
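
If you prefer to script the transformation instead of using WebView, the following sketch shows the same idea with the agraph-python client. It is only an illustration and assumes the agraph-python package, a repository named databricks-demo, placeholder credentials, and a downloaded CSV file with id and name columns.

    # Hedged sketch: turn CSV rows into triples with agraph-python.
    import csv
    from franz.openrdf.connect import ag_connect

    EX = "http://example.org/"

    with ag_connect("databricks-demo", host="localhost", port=10035,
                    user="<user>", password="<password>") as conn:
        with open("/tmp/my_table.csv", newline="") as f:
            for row in csv.DictReader(f):
                subj = conn.createURI(EX + row["id"])
                conn.addTriple(subj, conn.createURI(EX + "name"),
                               conn.createLiteral(row["name"]))
        print(conn.size(), "triples in the repository")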