Databricks: Check for Orphaned Workspace Directories

In this post I will show you how to check whether your Databricks workspace contains directories under /Users/ that are orphaned, i.e. that belong to users or service principals which no longer exist.

import requests
from requests.auth import HTTPBasicAuth
import json

DATABRICKS_INSTANCE = dbutils.widgets.get('url')
TOKEN = dbutils.widgets.get('token')

# Get the list of directories under /Users/
endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/workspace/list'
params = {
    'path': '/Users/'
}
response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
response.raise_for_status()
json_response = response.json()

for i, result in json_response.items():
    for value in result:
        directory = value["path"]
        user = directory.replace("/Users/", "")

        if '@' in user:
            # Directory name looks like a user email: check the SCIM Users API
            endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/preview/scim/v2/Users'
            params = {
                'filter': 'userName eq "%s"' % (user)
            }
            response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
            account = response.json().get('Resources', [])
            if not account:
                print('Account %s does not exist' % (user))
        else:
            # Directory name looks like an application id: check the SCIM ServicePrincipals API
            endpoint = f'{DATABRICKS_INSTANCE}/api/2.0/preview/scim/v2/ServicePrincipals'
            params = {
                'filter': 'applicationId eq "%s"' % (user)
            }
            response = requests.get(endpoint, params=params, auth=HTTPBasicAuth('token', TOKEN))
            sp = response.json().get('Resources', [])
            if not sp:
                print('SP %s does not exist' % (user))

Databricks: Python SDK

This post is how to use the Databricks Python SDK.

Install the Package

pip install databricks-sdk

Update Package

pip install databricks-sdk --upgrade

Check Package Version

pip show databricks-sdk | grep -oP '(?<=Version: )\S+'

Setup WorkspaceClient

from databricks.sdk import WorkspaceClient

secret = dbutils.secrets.get(scope = "<SCOPE>", key = "<KEY>")

w = WorkspaceClient(
    host = 'https://<URL>/',
    azure_workspace_resource_id = '<RESOURCE_ID_OF_DATABRICKS>',
    azure_tenant_id = '<TENANT_ID>',
    azure_client_id = '<CLIENT_ID>',
    azure_client_secret = secret
)
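
A quick way to confirm the client authenticated is to ask the SDK for the current identity. A minimal sketch, assuming the SDK's current_user.me() call and its user_name attribute:

# Should print the user or service principal the client authenticated as
me = w.current_user.me()
print(me.user_name)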

Setup AccountClient

You can get the account_id from the Databricks account portal. It is shown by your ID in the top right-hand corner.

from databricks.sdk import AccountClient

secret = dbutils.secrets.get(scope = "<SCOPE>", key = "<KEY>")

a = AccountClient(
    host = 'https://accounts.azuredatabricks.net',
    account_id = '<ACCOUNT_ID>',
    azure_tenant_id = '<TENANT_ID>',
    azure_client_id = '<CLIENT_ID>',
    azure_client_secret = secret
)

List Workspace Groups

NOTE: You must also set up the WorkspaceClient (above) to do this.

w.groups.list()
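
The call returns an iterable of group objects. A small usage sketch, assuming each group exposes a display_name attribute:

# Print the display name of every workspace group
for group in w.groups.list():
    print(group.display_name)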

List Account Groups

NOTE: You must also set up the AccountClient to do this, and you must be an account admin.

a.groups.list()

Create Storage Credential

NOTE: Your SPN must be an account admin to do this. You must also set up the WorkspaceClient.

from databricks.sdk.service.catalog import AzureManagedIdentity

storage_credential_name = '<CREDENTIAL_NAME>'
comment = '<COMMENT>'
connector_id = '<DATABRICKS_ACCESS_CONNECTOR>'
az_mi = AzureManagedIdentity(access_connector_id = connector_id)

w.storage_credentials.create(
    name = storage_credential_name,
    azure_managed_identity = az_mi,
    comment = comment
)
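
To verify the credential was created, one option is to list the storage credentials back. A sketch, assuming the SDK's storage_credentials.list() call and the name attribute on each entry:

# Print the names of all storage credentials in the metastore
for cred in w.storage_credentials.list():
    print(cred.name)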

 

Databricks Unity Catalog SQL Commands

This post covers the basic commands you will need to know for working with Unity Catalog.

Display Current Metastore

SELECT CURRENT_METASTORE();

Display Current Catalog

SELECT CURRENT_CATALOG();

Create Catalog

CREATE CATALOG IF NOT EXISTS <CATALOG_NAME> COMMENT 'A COMMENT';

Create Catalog With Location

CREATE CATALOG IF NOT EXISTS <CATALOG_NAME> MANAGED LOCATION 'abfss://<METASTORE_CONTAINER_NAME>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<CATALOG_NAME>' COMMENT 'A COMMENT';

Describe Catalog

DESCRIBE CATALOG <CATALOG_NAME>;

Create Schema

CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME> COMMENT '<COMMENT>';

Create Schema With Location

CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME> MANAGED LOCATION 'abfss://<METASTORE_CONTAINER_NAME>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<CATALOG_NAME>/<SCHEMA_NAME>' COMMENT '<COMMENT>';

Show All Storage Credentials

SHOW STORAGE CREDENTIALS;

Describe Credential

DESCRIBE STORAGE CREDENTIAL <CREDENTIAL_NAME>;

Create External Location

You will first need a storage credential.

The URL can reference all the way down to the full table path or stay at the container level.

CREATE EXTERNAL LOCATION IF NOT EXISTS <NAME>
URL 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL <CREDENTIAL_NAME>)
COMMENT '<COMMENT>';

Create External Table

CREATE TABLE <CATALOG_NAME>.<SCHEMA_NAME>.<TABLE_NAME>
USING <FORMAT>
LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/FOLDER/PATH';

Grant Create Storage Credential on Metastore

GRANT CREATE STORAGE CREDENTIAL ON METASTORE TO `<USER>`;

Grant Permission to Create External Locations on Storage Credential

GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL <CREDENTIAL_NAME> TO `<USER>`;

Grant Permission to Create External Location On Metastore

GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `<USER>`;

Grant Permission to Use Catalog

GRANT USE_CATALOG ON CATALOG <CATALOG_NAME> TO `<USER>`;

Show all Grants On Metastore

SHOW GRANTS `<USER>` ON METASTORE;

Grant Permission to Use Schema

GRANT USE_SCHEMA ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO `<USER>`;

Grant Permission to Create Table

GRANT CREATE TABLE ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO `<USER>`;

 

Databricks Unity Catalog REST APIs

This post is how to work with the Databricks Unity Catalog REST APIs.

Set Catalog Isolation Mode to ISOLATED

curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/catalogs/<CATALOG_NAME>' \
--header 'Authorization: Bearer <TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "isolation_mode": "ISOLATED"
}'

Bind Workspace to Catalog

curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
--header 'Authorization: Bearer <TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "add": [{ "workspace_id": "<WORKSPACE_ID>", "binding_type": "BINDING_TYPE_READ_WRITE" }],
  "remove": []
}'

Unbind Workspace from Catalog

curl --location --request PATCH 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
--header 'Authorization: Bearer <TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "unassign_workspaces": ["<WORKSPACE_ID>"]
}'

List Workspaces Assigned to Catalog

curl --location --request GET 'https://<DATABRICK_URL>/api/2.1/unity-catalog/bindings/catalog/<CATALOG_NAME>' \
--header 'Authorization: Bearer <TOKEN>' \
--header 'Content-Type: application/json'

 

Databricks: Get Secret

This post is how to get a secret from a key vault in Databricks.

First you need to set up dbutils.

Next, make sure your Databricks workspace has a Key Vault-backed secret scope set up.

Then make sure that Databricks is allowed to communicate with your Key Vault.

Then you can query your Key Vault for the secret with the following command.

secret = dbutils.secrets.get(scope='<SCOPE>', key='<SECRET_KEY>')
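
If you are unsure of the scope or key names, dbutils can list them (the secret values themselves are redacted in notebook output):

# List the secret scopes configured in the workspace
print(dbutils.secrets.listScopes())

# List the keys available in a given scope
print(dbutils.secrets.list('<SCOPE>'))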

 

Databricks: Notebook SQL

This post is how to work with Databricks SQL through a Notebook.

Create a Temp View of a DataFrame.

df = <SOMETHING>
df.createOrReplaceTempView("<TABLE_NAME>")
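
Once registered, the temp view can be queried like any other table from the same Spark session; for example:

# Read the temp view back through the Spark session
spark.table("<TABLE_NAME>").show()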

Drop a Table

%sql
drop table <SCHEMA>.<TABLE>;

Describe Table

%sql
desc table extended <SCHEMA>.<TABLE>;

Describe Detail

%sql
describe detail <SCHEMA>.<TABLE>;

Show Table Properties

%sql
SHOW TBLPROPERTIES <SCHEMA>.<TABLE>;

Describe History

%sql
describe history <SCHEMA>.<TABLE>;

Create Schema

%sql
CREATE SCHEMA IF NOT EXISTS <SCHEMA>;

Create Parquet Table

%sql
CREATE TABLE <SCHEMA>.<TABLE> USING PARQUET LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'

Create Delta Table

%sql
CREATE TABLE <SCHEMA>.<TABLE> USING DELTA LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/'

Upsert

MERGE INTO schema.table t
USING (
    SELECT columns
    FROM table
) AS source ON (source.column = t.column)
WHEN NOT MATCHED THEN
    INSERT (
        column, column2
    )
    VALUES (
        source.column, source.column2
    )
WHEN MATCHED THEN
    UPDATE SET
        t.column = source.column
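
If you would rather run the upsert from a Python cell than a %sql cell, a minimal sketch is to pass the same statement to spark.sql as a multi-line string (the table and column names are placeholders):

# Run the MERGE through the Spark session instead of a %sql cell
spark.sql("""
    MERGE INTO schema.table t
    USING (
        SELECT columns
        FROM table
    ) AS source ON (source.column = t.column)
    WHEN NOT MATCHED THEN
        INSERT (column, column2)
        VALUES (source.column, source.column2)
    WHEN MATCHED THEN
        UPDATE SET t.column = source.column
""")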

 

Databricks: Mounts

This post is how to mount on Databricks.

Notes

  • Security issue: mounts are shared across all clusters and users
  • Mounts should always be unmounted after use
    • Due to service principal password rotations
    • For reliability, especially in BCDR scenarios
  • Databricks recommends using Unity Catalog instead of mounts, as mounts are legacy.
  • Naming conflicts can occur with other projects
  • Do not create mounts manually. Always have your project mount at the start and unmount at the end (see the sketch after the Mount example below).

List Mounts

dbutils.fs.mounts()

Unmount

dbutils.fs.unmount("<MOUNT>")

Mount

client_id = "<CLIENTID>"
secret = dbutils.secrets.get(scope = "<SCOPE_NAME>", key = "<SECRET_NAME>")
tenant_id = "<TENANT_ID>"
storage_account_name = "<STORAGE_ACCOUNT_NAME>"
container_name = "<CONTAINER_NAME>"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": secret,
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
}

path = "abfss://%s@%s.dfs.core.windows.net/" % (container_name, storage_account_name)

dbutils.fs.mount(
    source = path,
    mount_point = "/mnt/<MOUNT_NAME>",
    extra_configs = configs
)
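
As noted above, a project should mount at the start and unmount at the end. A minimal sketch of that pattern, guarding against a mount left behind by a previous run (the mount point is a placeholder, and path and configs come from the Mount example above):

mount_point = "/mnt/<MOUNT_NAME>"

# Unmount first if a previous run left the mount in place
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(source = path, mount_point = mount_point, extra_configs = configs)

# ... do the work that needs the mount ...

dbutils.fs.unmount(mount_point)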

 

Databricks: Notebook Commands

This post is all about notebook commands.

List a directory on DBFS using Shell

%sh
ls /dbfs

List a Directory on DBFS using FS

%fs
ls "<DIRECTORY>"

List Python Packages

%pip list

Install a Python Requirements.txt

%pip install --index-url <URL> -r requirements.txt

Install a Single Python Package

%pip install --index-url <URL> <PACKAGE>==<VERSION>

 

Databricks: Bearer Token CLI

This post is how to get a bearer token using the CLI and set the environment variables.

First install Azure CLI.

Databricks Resource ID = 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

Get Access Token

az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

Set Access Token

Linux

export DATABRICKS_AAD_TOKEN="<TOKEN>"

Windows

set DATABRICKS_AAD_TOKEN="<TOKEN>"

Set Config File

Linux

export DATABRICKS_CONFIG_FILE="<LOCATION>"

Windows

set DATABRICKS_CONFIG_FILE="<LOCATION>"

 

Databricks: REST API

This post is how to communicate with Databricks using REST APIs.

Databricks Resource ID = 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

Get Bearer Token for Service Principal

curl -X POST https://login.microsoftonline.com/<TENANTID>/oauth2/token -H 'Content-Type: application/x-www-form-urlencoded' -d 'grant_type=client_credentials&client_id=<CLIENTID>&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=<SECRET>'

Get Bearer Token for Service Principal Using management.core.windows.net

curl -X POST https://login.microsoftonline.com/<TENANTID>/oauth2/token -H 'Content-Type: application/x-www-form-urlencoded' -d 'grant_type=client_credentials&client_id=<CLIENTID>&resource=https://management.core.windows.net/&client_secret=<SECRET>'

Start Cluster

curl --location -g --request POST -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/clusters/start -d '{ "cluster_id": "<CLUSTER_ID>"}'

Stop Cluster

curl --location -g --request POST -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/clusters/stop -d '{ "cluster_id": "<CLUSTER_ID>"}'

List Clusters

curl --location -g --request GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/clusters/list

Job List

curl --location -g --request GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/jobs/list

Job Python Run

curl --location -g --request POST -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/jobs/run-now -d '{"job_id": <JOB_ID>, "python_params": [] }'

Job Get

curl --location -g --request GET -H 'Authorization: Bearer <TOKEN>' 'https://<DATABRICKS_url>/api/2.0/jobs/runs/get?run_id=<JOB_RUN_ID>'

Create Job

Databricks Create Job

curl --location -g --request POST -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/jobs/create -d '<PAYLOAD>'

Create Job Payload

{
  "name": "<NAME>",
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "<TASK_KEY>",
      "run_if": "ALL_SUCCESS",
      "max_retries": 1,
      "timeout_seconds": <TIMEOUT_SECONDS>,
      "notebook_task": {
        "notebook_path": "<PATH>",
        "source": "WORKSPACE",
        "base_parameters": {
          "<KEY>": "<VALUE>",
          "<KEY2>": "<VALUE2>"
        }
      },
      "libraries": [
        {
          "pypi": {
            "package": "<PACKAGE_NAME==VERSION>",
            "coordinates": ""
          }
        },
        {
          "jar": "<LOCATION>"
        }
      ],
      "new_cluster": {
        "custom_tags": {
          "<TAG_NAME>": "<TAG_VALUE>"
        },
        "azure_attributes": {
          "first_on_demand": 1,
          "availability": "SPOT_AZURE",
          "spot_bid_max_price": 75
        },
        "instance_pool_id": "<WORKER_INSTANCE_POOL_ID>",
        "driver_instance_pool_id": "<DRIVER_INSTANCE_POOL_ID>",
        "data_security_mode": "SINGLE_USER",
        "spark_version": "<SPARK_VERSION>",
        "node_type_id": "<NODE_TYPE_ID>",
        "runtime_engine": "STANDARD",
        "policy_id": "<POLICY_ID>",
        "autoscale": {
          "min_workers": <MIN_WORKERS>,
          "max_workers": <MAX_WORKERS>
        },
        "spark_conf": {
          "<CONFIG_KEY>": "<CONFIG_VALUE>"
        },
        "cluster_log_conf": {
          "dbfs": {
            "destination": "<LOG_DESTINATION>"
          }
        },
        "spark_env_vars": {
          "<ENV_NAME>": "<ENV_VALUE>"
        },
        "init_scripts": [
          {
            "volumes": {
              "destination": "<INIT_SCRIPT_LOCATION>"
            }
          }
        ]
      }
    }
  ],
  "format": "SINGLE_TASK"
}

Job Permission Patch

curl --location -g --request PATCH -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/permissions/jobs/<JOB_ID> -d '{ "access_control_list": [{ "group_name": "<GROUP_NAME>", "permission_level": "<PERMISSION>"}]}'

Get Service Principal List

curl -X GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/preview/scim/v2/ServicePrincipals

Delete Service Principal From Databricks ONLY

curl --location -g --request DELETE -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/preview/scim/v2/ServicePrincipals/<APPLICATION_ID>

Add Service Principal To Databricks

curl --location --request POST 'https://<DATABRICKS_url>/api/2.0/preview/scim/v2/ServicePrincipals' --header 'Authorization: Bearer <TOKEN>' --header 'Content-Type: application/json' --data-raw '{ "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"], "applicationId": "<CLIENTID>", "displayName": "<DISPLAYNAME>", "groups": [{"value": "<GROUP_ID>"}], "entitlements": [{ "value": "allow-cluster-create"}] }'

List Secret Scopes

curl --location -g --request GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/secrets/scopes/list

Create KeyVault Secret Scope

curl --location -g --request POST -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/secrets/scopes/create -d '{"scope": "<Keyvault_name>", "scope_backend_type": "AZURE_KEYVAULT", "backend_azure_keyvault": {"resource_id": "<RESOURCE_ID>", "dns_name": "<KEYVAULT_URL>"}, "initial_manage_principal": "users"}'

IP Access Lists

curl -X GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/ip-access-lists

List Git Repos

curl --location -g --request GET -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/repos

Update Git Repo

curl --location -g --request PATCH -H 'Authorization: Bearer <TOKEN>' https://<DATABRICKS_url>/api/2.0/repos/<REPO_ID> -d '{ "branch": "<BRANCH_NAME>" }'

 

Databricks: Set Spark Configs

This post is how to set the spark configs on Databricks or Synapse Notebooks.

First you will need a spark session. Refer to PySpark: Create a Spark Session for more details.
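
If you are in a Databricks or Synapse notebook, spark is already defined for you. Outside a notebook, a minimal sketch of creating one yourself looks like this:

from pyspark.sql import SparkSession

# Reuses an existing session if one is active, otherwise creates one
spark = SparkSession.builder.appName("set-spark-configs").getOrCreate()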

secret = 'value'  # I highly suggest you get the secret from the Key Vault instead of hard-coding it
storage_account = ''
application_id = ''
tenant_id = ''

spark.conf.set('fs.azure.account.auth.type.{}.dfs.core.windows.net'.format(storage_account), 'OAuth')

spark.conf.set('fs.azure.account.oauth.provider.type.{}.dfs.core.windows.net'.format(storage_account), 'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider')

spark.conf.set('fs.azure.account.oauth2.client.id.{}.dfs.core.windows.net'.format(storage_account), application_id)

spark.conf.set('fs.azure.account.oauth2.client.secret.{}.dfs.core.windows.net'.format(storage_account), secret)

spark.conf.set('fs.azure.account.oauth2.client.endpoint.{}.dfs.core.windows.net'.format(storage_account), 'https://login.microsoftonline.com/{}/oauth2/token'.format(tenant_id))
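
With those configs set, you can read from the storage account directly over abfss. A quick sketch, where the container and folder are placeholders and the data is assumed to be Parquet:

# Read a Parquet folder from the storage account configured above
df = spark.read.parquet('abfss://<CONTAINER>@{}.dfs.core.windows.net/<FOLDER>/'.format(storage_account))
df.show(5)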

If you are running in Databricks you could also add them to the cluster's Spark config at start-up, although I recommend doing it in a notebook instead.

spark.hadoop.fs.azure.account.auth.type.<STORAGE_ACCOUNT>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<STORAGE_ACCOUNT>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT>.dfs.core.windows.net <CLIENT_ID>
fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT>.dfs.core.windows.net secret
fs.azure.account.oauth2.client.endpoint.<STORAGE_ACCOUNT>.dfs.core.windows.net https://login.microsoftonline.com/<TENANT_ID>/oauth2/token