What is Databricks
Creating an Account
User Interface
Computes
Notebooks
Importing Data
Creating Tables
Visualizing Data
Founded by the original creators of Spark
Offers a web-based platform for working with Spark
Provides automatic cluster-management
Offers training sections (free and paid)
Access to Spark clusters from your browser
GUI tools to organize and maintain your notebooks and datasets
GUI tools to display and sort your results
Databricks has a free community version
This version allows for: 15Gb of data, 6Gb of RAM, two cores, and no worker nodes
Databricks for business runs on either AWS or Azure
Create your account at: databricks.com (Try Databricks)
You can access Databricks UI to:
→ Create new clusters (Compute)
→ Create new notebooks and run code
→ Set up workflows
→ Import data
→ Import libraries
All processing happens inside computes (formerly known as clusters)
Before running commands, you need to create a new compute
Use its LTS version (12.2 - Spark 3.3.2, Scala 2.12)
It might takes a few minutes for the it to start
Notebooks are simply Jupyter Notebooks
Each cell can execute a sequence of Python commands
Switch languages with: %md
and %sql
To run, a notebook needs to be attached to a compute
You can access notebooks via Workspace
You can import notebooks from your computer
Click on New, then on Notebook
Then, from the untitled notebook, click on import and select your file
You can organize your notebooks in folders
Databricks has a tool to import data from a local machine to its FileSystem - DBFS
Click on New and select Table
Then, drag and drop or browse the data file
Delete a file with:
dbutils.fs.rm("FILE_PATH", True)
After uploaded, the file is stored at the path: /FileStore/tables/FILENAME
Once in the DBFS, the file can be loaded as a DataFrame with:
df = spark.read.csv(FILE_PATH)
After importing, datasets can be saved as a table
Tables can be loaded as a DataFrame with:
df = spark.table("TABLE_NAME")
Tables reside in memory (require an active cluster)
We will not use tables in this course
The method .show()
displays the first 20 rows of the DataFrame
A number can be provided to change the number of rows to be shown
The methods head()
and tail()
exist. But, their output is far from pretty
Databricks introduces a display
method
Allows for quick visualization of DataFrames
Provides a GUI for plotting
Provides a GUI for downloading CSV
You can download external python libraries
These add new capabilities to standard Python
Click on Compute, select the desired compute, click on Libraries, and then Install new
Libraries can be fetched from PyPi, Maven, CRAN, or from a file
What is Databricks (video)
Notebook Basics (video)