To securely access Azure Data Lake Storage Gen 2 files from Azure Databricks, a good solution is to set up a Service Principal with the appropriate permissions granted on your Data Lake, so that it serves as an intermediary. That way, you don’t have to expose your storage account keys to all Databricks workspace users.

First of all, you need to register an application with Azure Active Directory (AAD). Later, you will use it from within Azure Databricks, with OAuth 2.0, to authenticate against ADLS Gen 2 and connect a specific container (file system) or directory in the Data Lake to the Databricks File System (DBFS). These connections are called mount points. They let workspace users access those files or directories through a standard file system abstraction that is easy for everyone to understand.

Didn’t understand a single word of what I just said? Come along and I’ll explain everything to you!

Configuring a Service Principal in Azure AD

First, you need to register an application in Azure Active Directory (AAD), following the steps below.

  1. Log in to your account through the Azure portal and select Azure Active Directory.
  2. Select App registrations.
  3. Select New registration.
  4. Give your application a name, fill in the other fields, and select Register.

After registering your application, you can see the Application (client) ID and Directory (tenant) ID on its Overview page. Take note of them, because they will be used later in this configuration.

Creating a secret to access your Service Principal

The Azure AD application you created is your Service Principal. It works like a user inside your Azure Active Directory tenant, being able to log in to other Azure resources and access their data as needed.

But to authenticate as it from other tools, you must create a secret for the application, which is basically your Service Principal’s password. This password can be stored in an Azure Key Vault in your subscription, so you don’t have to reference it directly in your code.

  1. Within your application, select Certificates & secrets.
  2. Under Client secrets, select New client secret.
  3. Provide a description for the secret and its duration. When done, select Add.
  4. After creating the secret, its value will be displayed. Copy this value, because it will not be possible to recover it later. You will use this secret, together with the Application ID you noted earlier, to authenticate as the Service Principal.

  5. Now, we need to store the secret value in Azure Key Vault (AKV) so that it can later be referenced securely from Azure Databricks. Within your AKV resource, select Secrets, then Generate/Import, and create a secret named ‘ADLS-DATABRICKS-KEY’, pasting the copied secret value into the ‘Value’ field.

Creating a Databricks secret scope

In order to access the secrets stored within your Azure Key Vault, a secret scope must be configured in the Databricks workspace.

  1. Make sure you have Contributor permission on the Azure Key Vault resource you want to use to back the secret scope.
  2. Go to https://<your-instance-on-databricks>#secrets/createScope.
  3. Enter the name of your secret scope.
  4. In the Manage Principal menu, allow all users to manage this secret scope. Ideally, each user of your Databricks workspace would have only the permissions appropriate to how the secret scope will be used; you could, for example, limit some users to only reading or listing the secrets stored in your Azure Key Vault. However, this fine-grained permission control is only available with the Databricks Premium plan.
  5. Fill in the DNS Name and Resource ID of your Azure Key Vault. This information is available in the Properties section of your AKV in the Azure portal; the DNS Name corresponds to the ‘Vault URI’ field.

  6. Click Create.
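If you prefer to script this step instead of clicking through the UI, the same scope can be created with the Databricks Secrets API (the POST /api/2.0/secrets/scopes/create endpoint). The snippet below is only a sketch: the workspace URL, token and Key Vault identifiers are placeholders, and, depending on your workspace configuration, creating a Key Vault-backed scope through the API may require an Azure AD token rather than a personal access token.

import requests

# All values below are placeholders; replace them with your own workspace URL,
# token and Key Vault details before running.
workspace_url = "https://<your-instance-on-databricks>"
token = "<azure-ad-or-personal-access-token>"

payload = {
    "scope": "<scope-name>",
    "scope_backend_type": "AZURE_KEYVAULT",
    "backend_azure_keyvault": {
        "resource_id": "<resource-id-of-your-key-vault>",
        "dns_name": "<dns-name-of-your-key-vault>"
    },
    "initial_manage_principal": "users"
}

response = requests.post(
    workspace_url + "/api/2.0/secrets/scopes/create",
    headers={"Authorization": "Bearer " + token},
    json=payload)
response.raise_for_status()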

Now, whenever you need to access a Key Vault secret from within the Databricks workspace (for example, an API user password), you can use this secret scope to reference the Azure Key Vault where the secrets are stored and retrieve them.
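As a quick sanity check, you can list the scopes and secrets visible to your workspace and fetch the key created earlier. A minimal sketch using the Databricks Utilities secrets API, assuming the scope name placeholder below and the ‘ADLS-DATABRICKS-KEY’ secret created before:

# List the secret scopes configured in this workspace.
display(dbutils.secrets.listScopes())

# List the secrets exposed through the Key Vault-backed scope (names only, never values).
display(dbutils.secrets.list("<scope-name>"))

# Retrieve the Service Principal secret stored earlier; the value can be used in code,
# but Databricks redacts it if you try to print it in a notebook.
sp_secret = dbutils.secrets.get(scope="<scope-name>", key="ADLS-DATABRICKS-KEY")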

Granting Permissions to the Service Principal in ADLS Gen2

Your Service Principal must have a role assignment configured on the storage account to be able to access the files within it. The appropriate role depends on what you want to do with the files in the Data Lake:

  • Storage Blob Data Owner: use to define ownership and manage POSIX access controls for Azure Data Lake Storage Gen2.
  • Storage Blob Data Contributor: use to grant read/write/delete permissions on Blob storage resources.
  • Storage Blob Data Reader: use to grant read-only permissions on Blob storage resources.

Note:
Only the roles explicitly defined for data access allow the Service Principal to read the content of the files stored in your storage account. Broader management roles, such as Owner, Contributor and Storage Account Contributor, allow the Service Principal to manage the storage account, but do not grant access to the data stored in it.

To configure this, follow the steps below:

  1. Within your storage account, select Access Control (IAM) to display its access control settings. Select Add and then Add role assignment to add a new role.
  2. In the Add role assignment window, select the role appropriate for your situation and then search for the name of the Service Principal to assign that role to it. After saving, the new assignment will appear in the Role Assignments list.

Creating an Azure Data Lake Storage Gen2 Mount Point using a service principal and OAuth 2.0

After defining the access control rules, you can mount an Azure Data Lake Storage Gen2 container on the Databricks File System (DBFS), using the Service Principal and the OAuth 2.0 protocol. Mount points act as pointers to the Azure Data Lake storage account, so data is never synchronized locally and can be accessed from any notebook within the same Databricks workspace.

In this process, we will use the following values:

  • application-id: Application (client) ID that you get when you register your application with Azure AD.
  • directory-id: Directory (tenant) ID that you get when you register your application with Azure AD.
  • storage-account-name: The name of your Azure Data Lake Gen2 storage account.
  • scope-name: The name of your Secret Scope connecting Azure Databricks to Azure Key Vault.
  • service-credential-key-name: The name of the secret in Azure Key Vault that you created to store your Service Principal’s password.

Within your Databricks workspace, create a notebook called ‘mount_ADLS’. Then execute the following Python commands, filling in the values listed above with the correct information:

configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"} 

# Optionally, you can add <your-directory> to the URL of your mount point. 

dbutils.fs.mount(
  source = "abfss://<name-of-your-container>@<name-of-your-storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/<name-of-your-mount>",
  extra_configs = configs) 

<name-of-your-container> refers to the Data Lake file system (container) you want to mount in DBFS, and <name-of-your-mount> is the directory under /mnt in DBFS where that container will be exposed. To keep this configuration simple, you can use the same name for the mount as for your ADLS container.

After your ADLS container has been mounted on DBFS, you can refer to the mount point directly (with or without the dbfs: prefix) to access the files inside it:

df = spark.read.csv("/mnt/%s/...." % <name-of-your-mount>) 
df = spark.read.csv("dbfs:/mnt/<name-of-your-mount>/....") 

If you need to unmount your mount point, use the following command:

dbutils.fs.unmount("/mnt/<name-of-your-mount>") 
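If you are not sure whether the container is already mounted (mounting to an existing mount point raises an error), you can inspect the current mount points first. A small sketch using dbutils.fs.mounts(); the mount name below is a placeholder:

# List every mount point currently configured on DBFS for this workspace.
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)

# Unmount only if the mount point actually exists.
if any(mount.mountPoint == "/mnt/<name-of-your-mount>" for mount in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/<name-of-your-mount>")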

Just to illustrate our configuration, I tried to access the ‘world-best-strikers.csv’ file, located inside the ‘test’ container.

Using the Python commands described earlier, I was able to create the mount point for the ‘test’ container.

After that, just use the mount point to read the csv file directly.
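For reference, the read looked roughly like the sketch below, assuming the container was mounted at /mnt/test and the file sits at the container root (the header and schema options are assumptions about the file layout):

# Read the CSV through the mount point and display it in the notebook.
df = spark.read.csv("/mnt/test/world-best-strikers.csv", header=True, inferSchema=True)
display(df)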

Finally, looking at the contents of my file, we can confirm that the configuration was done correctly, right?

Note:
If you encounter a 403 error when trying to access a file through its mount point, review the steps outlined above. This can happen if you try to access the file right after granting the Service Principal its role on ADLS Gen 2; in that case, wait a few minutes for Azure to propagate the permissions and try again.

I hope you now understand how to access files in your Azure Data Lake Storage Gen 2 account from Azure Databricks using a simple and secure solution. There are more sophisticated approaches, which involve creating Access Control Lists (ACLs) within the Data Lake using Azure Storage Explorer, but we can dive into those in another post. Until later!