Databricks: Check for Orphaned Workspace Directories

In this post I will show you how to check if your Databricks workspace has user directories that are orphaned, meaning they belong to users or service principals that no longer exist.

  import requests
  from requests.auth import HTTPBasicAuth
  import json

  DATABRICKS_INSTANCE = dbutils.widgets.get('url')
  TOKEN = dbutils.widgets.get('token')

  #Get list of directories
  endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/workspace/list'
  params = {
      'path': '/Users/'
  }
  response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
  response.raise_for_status()
  json_response = response.json()

  for i, result in json_response.items():
      for value in result:
          directory = value["path"]
          user = directory.replace("/Users/", "")

          if '@' in user:
              #The directory name looks like an email address, so check the SCIM Users API
              endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/preview/scim/v2/Users'
              params = {
                  'filter': 'userName eq "%s"' % (user)
              }
              response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
              account = response.json().get('Resources', [])
              if not account:
                  print('Account %s does not exist' % (user))
          else:
              #Otherwise treat the directory name as a service principal application id
              endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/preview/scim/v2/ServicePrincipals'
              params = {
                  'filter': 'applicationId eq "%s"' % (user)
              }
              response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
              sp = response.json().get('Resources', [])
              if not sp:
                  print('SP %s does not exist' % (user))
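The script reads the workspace URL and token from notebook widgets. A minimal sketch of creating those widgets before running it (the values here are placeholders; ideally pull the token from a secret scope rather than pasting it in):

  #Create the two text widgets the script reads from. The defaults are placeholders.
  dbutils.widgets.text('url', 'https://<DATABRICKS_INSTANCE>')
  dbutils.widgets.text('token', '<TOKEN>')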

SourceTree Installation on Windows

This post is how to install and configure SourceTree.

Step 1

Install SourceTree.

Step 2

Install Beyond Compare.

Step 3

Install Git.

Step 4

Create or import your SSH key by going to SourceTree, clicking Tools, then clicking “Create or import SSH Keys”. Follow the instructions, then click “Save public key” and “Save private key” next to “Save the generated key”. Save the public key to C:\Users\<USER>\.ssh\id_rsa.pub and the private key to C:\Users\<USER>\.ssh\id_rsa.

Step 5

Update to the latest SourceTree build by opening SourceTree, clicking Tools –> Options, then clicking the “Updates” tab and updating if required.

Step 6

Configure any options you want by going to Tools –> Options.

General Tab

  • Choose your theme
  • Setup SSH Client Configuration
  • Setup Default user information
  • Set your remote check for updates
  • Reopen repository tabs at startup
  • Spell check commit messages

Diff Tab

  • Convert tabs to spaces with width 4
  • External Diff tool Beyond Compare

Git Tab

  • Set Global Ignore List
  • Prune tracking branches no longer present on remote(s)
  • Push all tags to remotes
  • Fetch all tags

Authentication Tab

Setup any auth you have.

Step 7

Setup C:\Users\<USER>\.gitconfig like the below.

  [core]
      autocrlf = true
  [user]
      name = GaudreaultPC\\Gaudreault
      email = oliver@gaudreault.ca
  [difftool "sourcetree"]
      cmd = 'C:/Program Files/Beyond Compare 5/BCompare.exe' \"$LOCAL\" \"$REMOTE\"
  [mergetool "sourcetree"]
      cmd = "'' "
      trustExitCode = true
  [safe]
      bareRepository = explicit

Step 8

If you are trying to transfer your SourceTree setup from one PC to another, the below are the steps you will need to perform.

  1. Copy your data from one PC to the other in the same exact folder structure. Include .git and .gitignore files and folders.
  2. On the original PC go to %USERPROFILE%\AppData\Local\Atlassian\SourceTree and copy the following files to the new PC.
    1. bookmarks.xml
    2. opentabs.xml
  3. On the new PC, close SourceTree and copy the two files from the previous step to the same location as on the original PC.
  4. Ensure the new PC has the same .gitconfig as the old PC.

Step 9

Sometimes you might have issues with the repository settings for a repo. If you do, just go in and fix them: remove the origin and add it back, and ensure you select “Default Remote”.

 

Installing Jenkins Using Docker on Windows

This post is how to install Jenkins using Docker on Windows.

Install Docker

Go to Docker and download “Docker Desktop for Windows – x86_64”. Then run as administrator and install. If you don’t have an account you can create one.

Then open Docker Desktop, go to Settings, and enable “Enable Docker terminal”.

Once you are done you need to start the Docker Engine.

If you get the following error “Failed to initialize: unable to resolve docker endpoint: open C:\Users\<USER>\.docker\machine\machines\default\ca.pem: The system cannot find the path specified.” Then go to environment variables and modify DOCKER_CERT_PATH and set to “C:\Users\<USER>\.docker\machine\certs”.

Then open the docker terminal.

Install Jenkins

Create a network called “jenkins”. This will give you back the id.

  1. docker network create jenkins

You can check that it was created successfully by running the following command.

  1. docker network ls

Next we run a docker:dind Docker image. If you don’t have the docker image it will download it.

dind (Docker in Docker) allows developers to run a Docker container within an already running Docker container, to support CI/CD pipelines and create sandboxed container environments.

  1. docker run --name jenkins-docker --rm --detach --privileged --network jenkins --network-alias docker --env DOCKER_TLS_CERTDIR=/certs --volume jenkins-docker-certs:/certs/client --volume jenkins-data:/var/jenkins_home --publish 2376:2376 docker:dind

Next we create a Dockerfile

  FROM jenkins/jenkins:2.462.2-jdk17
  USER root
  RUN apt-get update && apt-get install -y lsb-release
  RUN curl -fsSLo /usr/share/keyrings/docker-archive-keyring.asc \
      https://download.docker.com/linux/debian/gpg
  RUN echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/usr/share/keyrings/docker-archive-keyring.asc] \
      https://download.docker.com/linux/debian \
      $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list
  RUN apt-get update && apt-get install -y docker-ce-cli
  USER jenkins
  RUN jenkins-plugin-cli --plugins "blueocean docker-workflow"

Next we will build a new docker image with the Dockerfile from the previous step. Be sure to name it appropriately and include versions. If you don’t have the right image it will get it for you. Make sure you are in the directory where you saved your Dockerfile.

  1. docker build -t jenkins-blueocean:2.462.2-1 .

Next we run the image as a container in Docker

  1. docker run --name jenkins-blueocean --restart=on-failure --detach --network jenkins --env DOCKER_HOST=tcp://docker:2376 --env DOCKER_CERT_PATH=/certs/client --env DOCKER_TLS_VERIFY=1 --volume jenkins-data:/var/jenkins_home --volume jenkins-docker-certs:/certs/client:ro --publish 8080:8080 --publish 50000:50000 jenkins-blueocean:2.462.2-1

View the Docker container logs using the below command. This will show you the password.

  1. docker logs jenkins-blueocean

Record the password located at “/var/jenkins_home/secrets/initialAdminPassword”.

  docker exec jenkins-blueocean cat /var/jenkins_home/secrets/initialAdminPassword

  or

  docker exec -it jenkins-blueocean bash
  cat /var/jenkins_home/secrets/initialAdminPassword
  exit

Then navigate to http://localhost:8080 and enter the password in the “Unlock Jenkins” screen.

Then select the packages you want to install and click install.

Once the packages are installed you need to create the admin user.

Then select your url and click “Save and Finish”.

Oracle JDBC

This post is how to connect to Oracle using JDBC.

There are four main connection examples:

TCP to Single Host
  1. jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<HOST>)(PORT=<PORT>))(CONNECT_DATA=(SERVICE_NAME=<SERVICE_NAME>)))
TCP to Multiple Hosts
  1. jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<HOST1>)(PORT=<PORT1>))(ADDRESS=(PROTOCOL=TCP)(HOST=<HOST2>)(PORT=<PORT2>))(CONNECT_DATA=(SERVICE_NAME=<SERVICE_NAME>)))
SSL to Single Host
  1. jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=<HOST>)(PORT=<PORT>))(CONNECT_DATA=(SERVICE_NAME=<SERVICE_NAME>)))
SSL to Multiple Hosts
  1. jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=<HOST1>)(PORT=<PORT1>))(ADDRESS=(PROTOCOL=TCPS)(HOST=<HOST2>)(PORT=<PORT2>))(CONNECT_DATA=(SERVICE_NAME=<SERVICE_NAME>)))
Connect Using PySpark With Databricks

You will need to get the secret from a keyvault. Refer to Databricks: Get Secret

  secret = <Get_Secret>

  df = spark.read \
      .format("jdbc") \
      .option("url", "<URL>") \
      .option("dbtable", "<SCHEMA>.<TABLE>") \
      .option("user", "<ACCOUNT>") \
      .option("password", secret) \
      .option("driver", "oracle.jdbc.driver.OracleDriver") \
      .load()
Connect Using SQL With Databricks

It should be noted that this isn’t a particularly good way and I don’t believe it is completely secure.

Add cluster configuration

  spark.databricks.secureVariableSubstitute.enabled false
  spark.<SOME_VARIABLE_NAME> {{secrets/<SCOPE>/<SECRET>}}

Code

  %sql

  CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME>;

  DROP VIEW IF EXISTS <VIEW_NAME>;

  CREATE TEMPORARY VIEW <SCHEMA_NAME>.<VIEW_NAME>
  USING org.apache.spark.sql.jdbc
  OPTIONS (
      url "<URL>",
      dbtable "<SCHEMA_NAME>.<TABLE_NAME>",
      user "<ACCOUNT>",
      password "${spark.<SOME_VARIABLE_NAME>}",
      driver "oracle.jdbc.driver.OracleDriver"
  )

 

Azure: Tags

This post is how to work with tags for Azure resources.

Install Graph Extension

  1. az extension add --name resource-graph

Tag List

  1. az tag list --subscription <NAME>

Query for Specific Tag Value

  1. az graph query -q "project name, resourceGroup, type, tags | where tags.<TAGNAME>=~'<VALUE>'"

Query for Multiple Tags

  1. az graph query -q "project name, resourceGroup, type, tags | where tags.<TAGNAME>=~'<VALUE>' | where tags.<TAGNAME>=~'<VALUE>'"

Query for Resource Groups

  1. az graph query -q "ResourceContainers | project name, type, tags | where tags.<TAGNAME>=~'<VALUE>'"

Query For Multiple Resource Types

  1. az graph query -q "project name, resourceGroup, type, tags | where tags.<TAGNAME>=~'<VALUE>' | where type =~ 'microsoft.sql/servers/databases' or type =~ 'microsoft.storage/storageaccounts'"

 

Databricks: Python SDK

This post is how to use the Databricks Python SDK.

Install the Package

  1. pip install databricks-sdk

Update Package

  1. pip install databricks-sdk --upgrade

Check Package Version

  1. pip show databricks-sdk | grep -oP '(?<=Version: )\S+'

Setup WorkspaceClient

  from databricks.sdk import WorkspaceClient

  secret = dbutils.secrets.get(scope = "<SCOPE>", key = "<KEY>")

  w = WorkspaceClient(
      host = 'https://<URL>/',
      azure_workspace_resource_id = '<RESOURCE_ID_OF_DATABRICKS>',
      azure_tenant_id = '<TENANT_ID>',
      azure_client_id = '<CLIENT_ID>',
      azure_client_secret = secret
  )
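As a quick sanity check that the client authenticated correctly, you can ask it who you are. A minimal sketch, assuming the WorkspaceClient w from the block above:

  #Returns the user or service principal the client is authenticated as.
  me = w.current_user.me()
  print(me.user_name)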

Setup AccountClient

You can get the account_id from the Databricks account portal, under your id in the top right hand corner.

  from databricks.sdk import AccountClient

  secret = dbutils.secrets.get(scope = "<SCOPE>", key = "<KEY>")

  a = AccountClient(
      host = 'https://accounts.azuredatabricks.net',
      account_id = '<ACCOUNT_ID>',
      azure_tenant_id = '<TENANT_ID>',
      azure_client_id = '<CLIENT_ID>',
      azure_client_secret = secret
  )

List Workspace Groups

NOTE: You must first set up the WorkspaceClient to do this.

  1. w.groups.list()
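The call returns an iterator of group objects rather than a plain list, so you will usually loop over it. A small sketch, again assuming the WorkspaceClient w from above:

  #Print the display name of every workspace group.
  for group in w.groups.list():
      print(group.display_name)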

List Account Groups

NOTE: You must first set up the AccountClient to do this. You must also be an account admin.

  1. a.groups.list()

Create Storage Credential

NOTE: Your SPN must be an account admin to do this. You must also first set up the WorkspaceClient.

  from databricks.sdk.service.catalog import AzureManagedIdentity

  storage_credential_name = '<CREDENTIAL_NAME>'
  comment = '<COMMENT>'
  connector_id = '<DATABRICKS_ACCESS_CONNECTOR>'
  az_mi = AzureManagedIdentity(access_connector_id = connector_id)

  w.storage_credentials.create(
      name = storage_credential_name,
      azure_managed_identity = az_mi,
      comment = comment
  )

 

PySpark DataFrame Methods

This post shows different methods of a DataFrame.

Get the first value in a column

  df = some_dataframe_definition

  value = df.select("SOME_COLUMN_NAME").first()[0]

Convert Dataframe to JSON

  df = some_dataframe_definition

  result_json = df.toJSON()
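It should be noted that toJSON() returns an RDD of JSON strings (one per row), so if you want the values back on the driver you still need to collect it. A small sketch, using the result_json from above:

  #toJSON() is lazy; collect() materializes the JSON strings as a Python list.
  json_rows = result_json.collect()
  for json_row in json_rows:
      print(json_row)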

Get a Row

  df = some_dataframe_definition

  row = df.collect()[0] #You can switch out 0 for whatever row you want.
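Once you have a Row you can access individual columns by name. A small sketch, assuming the DataFrame has a column called SOME_COLUMN_NAME:

  #Rows support access by column name or by attribute.
  value = row["SOME_COLUMN_NAME"]
  same_value = row.SOME_COLUMN_NAME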

Count rows of Dataframe

  df = some_dataframe_definition

  num_rows = df.count()

 

 

Databricks Unity Catalog SQL Commands

This post is basic commands you will need to know for working with Unity Catalog.

Display Current Metastore
  1. SELECT CURRENT_METASTORE();
Display Current Catalog
  1. SELECT CURRENT_CATALOG();
Create Catalog
  1. CREATE CATALOG IF NOT EXISTS <Catalog_Name> COMMENT 'A COMMENT';
Create Catalog With Location
  1. CREATE CATALOG IF NOT EXISTS <Catalog_Name> MANAGED LOCATION 'abfss://<METASTORE_CONTAINER_NAME>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<CATALOG_NAME>' COMMENT 'A COMMENT';
Describe Catalog
  1. DESCRIBE CATALOG <Catalog_Name>;
Create Schema
  1. CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME> COMMENT '<COMMENT>';
Create Schema With Location
  1. CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME> MANAGED LOCATION 'abfss://<METASTORE_CONTAINER_NAME>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<CATALOG_NAME>/<SCHEMA_NAME>' COMMENT '<COMMENT>';
Show All Storage Credentials
  1. SHOW STORAGE CREDENTIALS;
Describe Credential
  1. DESCRIBE STORAGE CREDENTIAL <CREDENTIAL_NAME>;
Create External Location

You will first need a storage credential.

You can reference down to the full table path or keep it at the container level.

  CREATE EXTERNAL LOCATION IF NOT EXISTS <NAME>
  URL 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL <CREDENTIAL_NAME>)
  COMMENT '<COMMENT>';
Create External Table
  CREATE TABLE <CATALOG_NAME>.<SCHEMA_NAME>.<TABLE_NAME>
  USING <FORMAT>
  LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/FOLDER/PATH';
Grant Create Storage Credential on Metastore
  1. GRANT CREATE STORAGE CREDENTIAL ON METASTORE TO `<USER>`;
Grant Permission to Create External Locations on Storage Credential
  1. GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL <CREDENTIAL_NAME> TO `<USER>`;
Grant Permission to Create External Location On Metastore
  1. GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `<USER>`;
Grant Permission to Use Catalog
  1. GRANT USE_CATALOG ON CATALOG <CATALOG_NAME> TO `<USER>`;
Show all Grants On Metastore
  1. SHOW GRANTS `<USER>` ON METASTORE;
Grant Permission to Use Schema
  1. GRANT USE_SCHEMA ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO `<USER>`;
Grant Permission to Create Table
  1. GRANT CREATE TABLE ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO <USER>;

 

Databricks Unity Catalog REST APIs

This post is how to work with the Databricks Unity Catalog REST APIs.

Set Catalog Isolation Mode to ISOLATED

  curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/catalogs/<CATALOG_NAME>' \
  --header 'Authorization: Bearer <TOKEN>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "isolation_mode": "ISOLATED"
  }'

Bind Workspace to Catalog

  curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
  --header 'Authorization: Bearer <TOKEN>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "add": [{ "workspace_id": "<WORKSPACE_ID>", "binding_type": "BINDING_TYPE_READ_WRITE" }],
      "remove": []
  }'

Unbind Workspace from Catalog

  curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
  --header 'Authorization: Bearer <TOKEN>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "unassign_workspaces": ["<WORKSPACE_ID>"]
  }'

List Workspaces Assigned to Catalog

  curl --location --request GET 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
  --header 'Authorization: Bearer <TOKEN>' \
  --header 'Content-Type: application/json'

 

Synapse: Get KeyVault Properties Using Token Library

This post is how to get the key vault properties using the token library.

Ensure you have a spark session created. Refer to PySpark: Create a Spark Session

  import sys
  from pyspark.sql import SparkSession

  linked_service_name = '<KEYVAULT_LINKED_SERVICE_NAME>'
  spark = <GET_SPARK_SESSION>
  token_library = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
  key_vault_url = token_library.getFullConnectionStringAsMap(linked_service_name).get('url')

  print(key_vault_url)
  print(token_library.getFullConnectionStringAsMap(linked_service_name))
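If what you are really after is a secret from that key vault, you don’t have to go through the Token Library at all; mssparkutils can fetch it through the linked service. A minimal sketch, assuming a secret named <SECRET_NAME> exists in the key vault behind the linked service:

  from notebookutils import mssparkutils

  #Fetch a secret through the Key Vault linked service.
  secret_value = mssparkutils.credentials.getSecret('<KEYVAULT_NAME>', '<SECRET_NAME>', '<KEYVAULT_LINKED_SERVICE_NAME>')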

 

Synapse: SAS Token

This post is how to get the SAS token from a notebook.

Ensure you have a spark session created. Refer to PySpark: Create a Spark Session

  from notebookutils import mssparkutils

  linked_service_storage_account_name = '<LINKED_SERVICE_STORAGE_NAME>'
  blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_storage_account_name)

  spark.conf.set('fs.azure.sas.<CONTAINER_NAME>.<ADLS_STORAGE_ACCOUNT_NAME>.blob.core.windows.net', blob_sas_token)
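With the SAS token set in the Spark config you can then read from that container over the wasbs endpoint. A minimal sketch, assuming a parquet file at <PATH> in the container (swap the format for whatever you actually store):

  #Read from blob storage using the SAS token configured above.
  df = spark.read.parquet('wasbs://<CONTAINER_NAME>@<ADLS_STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<PATH>')
  df.show(5)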

 

Synapse: Environment Variables

This post is how to work with environment variables in Synapse.

Ensure you have a spark session created. Refer to PySpark: Create a Spark Session

Get Environment Variable

It should be noted that “str” is the type of the variable. You can change it to whatever is required.

  1. var: str = spark.conf.get('spark.executorEnv.<ENV_NAME>')
Set Environment Variable
  1. spark.conf.set('spark.executorEnv.<ENV_NAME>', '<VALUE>')

 

Synapse: List Python Packages

This post is how to list the python packages in various ways.

You can use %pip to list the python packages that are installed.

  1. %pip freeze

However doing it that way may not give you the exact versions that are installed. To get a comprehensive list do the following.

  import pkg_resources

  for package in pkg_resources.working_set:
      print(package)
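On newer Python runtimes pkg_resources is deprecated, so as an alternative you can get the same list from the standard library’s importlib.metadata. A small sketch:

  from importlib.metadata import distributions

  #List every installed distribution with its version.
  for dist in distributions():
      print(dist.metadata['Name'], dist.version)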

 

Synapse: Help Command

This post is just how to use the help command from mssparkutils.

You can use help at various levels of Synapse.

Root

The following command will tell you what areas help can assist you in. This will respond with

  • fs
  • notebook
  • credentials
  • env
  from notebookutils import mssparkutils

  mssparkutils.help()
FileSystem

If you leave the help command empty it will just return all options that are available for help. If you put a command in then it will explain that command in greater detail.

  from notebookutils import mssparkutils

  mssparkutils.fs.help()

  mssparkutils.fs.help('cp')

 

Synapse: Mounts

This post is how to work with mounts on Synapse.

I suggest mounting to an ADLS storage account. That is what I will assume in the below examples.

List Mounts
  from notebookutils import mssparkutils

  mssparkutils.fs.mounts()
Get Mount Path

The output of this command will produce ‘/synfs/<number>/mnt/<CONTAINER_NAME>’

  from notebookutils import mssparkutils

  mount_name = "/mnt/<CONTAINER_NAME>"
  mount_path = mssparkutils.fs.getMountPath(mount_name)
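To actually read through the mount you reference it with the synfs scheme, which includes the job id shown in the path above. A rough sketch, assuming a CSV file sitting in the mounted container:

  from notebookutils import mssparkutils

  #Build the synfs path for the current job and read a file through the mount.
  job_id = mssparkutils.env.getJobId()
  df = spark.read.csv('synfs:/%s/mnt/<CONTAINER_NAME>/<FILE_NAME>.csv' % (job_id), header=True)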
Unmount
  from notebookutils import mssparkutils

  mount_name = "/mnt/<CONTAINER_NAME>"
  mssparkutils.fs.unmount(mount_name)
Mount Using a Linked Service

First you must have a linked service created to the storage account. This linked service must be hard-coded and not parameterized in any way.

  from notebookutils import mssparkutils

  container = '<CONTAINER_NAME>'
  storage_account = '<STORAGE_ACCOUNT_NAME>'
  sub_folder = '<SUB_FOLDER>' #it should be noted that this isn't required.
  linked_service_name = '<LINKED_SERVICE_NAME>'

  mssparkutils.fs.mount(
      source='abfss://%s@%s.dfs.core.windows.net/%s/' % (container, storage_account, sub_folder),
      mountPoint='/mnt/%s' % (container),
      extraConfigs={'linkedService':linked_service_name, 'fileCacheTimeout': 120, 'timeout': 120}
  )
Mount Using Configs

You will need to get the secret. Refer to Synapse: Get Secret

  from notebookutils import mssparkutils

  client_id = '<CLIENT_ID>'
  tenant_id = '<TENANT_ID>'
  container = '<CONTAINER_NAME>'
  storage_account = '<STORAGE_ACCOUNT_NAME>'
  sub_folder = '<SUB_FOLDER>' #it should be noted that this isn't required.
  secret = <Get_Secret> #Refer to Synapse: Get Secret

  configs = {
      "fs.azure.account.auth.type": "OAuth",
      "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id": client_id,
      "fs.azure.account.oauth2.client.secret": secret,
      "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
  }

  mssparkutils.fs.mount(
      source='abfss://%s@%s.dfs.core.windows.net/%s' % (container, storage_account, sub_folder),
      mountPoint='/mnt/%s' % (container),
      extraConfigs=configs
  )

 

PySpark: Delta Lake

This post is how to use pyspark to work with Delta Tables.

For more information on Delta Lake you can refer here.

First you need to install the “delta-spark” package for whatever version you require.

  1. pip install delta-spark==3.1.0

Setup a Spark Session.
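If you are running outside Databricks or Synapse and need to build the session yourself, the delta-spark package ships a helper that wires in the Delta extensions. A minimal sketch, assuming a local Spark install with delta-spark available:

  from pyspark.sql import SparkSession
  from delta import configure_spark_with_delta_pip

  #Configure a Spark session with the Delta Lake extensions and catalog.
  builder = SparkSession.builder.appName("delta-example") \
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  spark = configure_spark_with_delta_pip(builder).getOrCreate()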

To read delta tables you can refer to PySpark: Read From ADLS to DataFrame.

To write delta tables you can refer to PySpark: Save a DataFrame To ADLS.

Vacuum Delta Table

  from delta.tables import DeltaTable

  vacuum_hrs = 100
  path = 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'

  delta_table = DeltaTable.forPath(spark, path)
  delta_table.vacuum(vacuum_hrs)
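It should be noted that the example uses 100 hours, which is below the default 168 hour (7 day) retention threshold, so Delta will refuse to vacuum that aggressively unless you explicitly disable the safety check. A sketch of the config involved (only do this if you understand the impact on time travel and concurrent readers):

  #Allow vacuum below the default retention period. Use with care.
  spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")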

Compaction

Improves reads by merging small files into larger ones.

  from delta.tables import DeltaTable

  path = 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'

  delta_table = DeltaTable.forPath(spark, path)
  delta_table.optimize().executeCompaction()

Z-Order

  from delta.tables import DeltaTable

  path = 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'
  columns = '<COLUMN_NAME>'

  delta_table = DeltaTable.forPath(spark, path)
  delta_table.optimize().executeZOrderBy(columns)

Delete

  from delta.tables import DeltaTable
  import pyspark.sql.functions as F

  path = 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'

  delta_table = DeltaTable.forPath(spark, path)
  delta_table.delete(F.col('<MY_COL>') == '<SOME_VAL>')

  #You can also use sql
  delta_table.delete("column == 'some_VALUE'")

Modify Properties

You can refer here for more properties.

dataSkippingNumIndexedCols

You would do this if you have over the max columns that the delta lake can collect statistics on. Default value is 32.

  path = 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'
  skip_cols = <SOME_VALUE>

  spark.sql("ALTER TABLE delta.`%s` SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '%s')" % (path, skip_cols))