Spark on Databricks
Hello! In my last post, I wrote a brief introduction to Apache Spark. In this post, we'll explore Databricks. Founded by the creators of Spark, Databricks is a unified platform for data and AI. It is by far the easiest way to get started with Spark and requires very little setup. It offers a free Community Edition, which is a great fit for learners.
Topics covered in this post are:
- Setting up Databricks Community Edition Account
- Brief Walkthrough of the tool
- Creating a Notebook
- Setting up the Cluster
- Adding a data file as a table
- Reading the table in the Notebook
1. Setting up Databricks Community Edition Account
To set up the Community Edition, follow these steps:
1.1. Register an account on the Databricks Community Edition signup page. Provide all the necessary details on the signup page as seen below:
Note: "Work Email" is mandatory and will be the primary address for email communications.
1.2. Select "Get Started" under Community Edition
1.3. ...and it's time to check your work email and confirm your account.
Note: We'll be redirected to the Reset Password page, and the password chosen there becomes the account password.
1.4. Next, go to the Community Edition login page and log in.
1.5. You should now be on the Databricks home page.
2. Brief Walkthrough of the tool
The Databricks homepage is arranged in the following sections in the left pane under the logo:
- Home/Workspace - Gives the folder structure where the files are arranged
- Recents - Recent files
- Data - Data from different sources such as file uploads, AWS S3, DBFS and others.
- Clusters - Compute units that are provisioned to run the notebooks
- Jobs - A way of running a notebook; currently available to paying customers only
- Search - File search
All of the functionality that Databricks offers is neatly arranged in the left pane. The Common Tasks section of the home page provides the following shortcuts:
- New Notebook - Jupyter Notebook style workspace which provides an interactive development environment
- Create Table - Create tables directly from imported data.
- New Cluster - A Databricks cluster is a set of computation resources and configurations on which you run different workloads
- New Job (for Enterprise customers only)
- New MLflow Experiment - End-to-end machine learning lifecycle for ML experiments
- Import Library - Import a third-party library
- Read Documentation
3. Creating a Notebook
If you're familiar with Jupyter Notebook, Databricks Notebooks offer a similar form and functionality. Databricks Notebooks support Python, Scala, SQL and R for developing a variety of Spark applications. The following steps show how to create a notebook:
3.1. Click on "New Notebook" under Common Tasks section. Enter the name of the notebook.
3.2. Select the language of your choice (for example, Python) from the Default Language dropdown and click on "Create".
3.3. We can attach an already existing cluster from the Cluster dropdown. If no cluster is available, cluster creation is explained in the next section.
3.4. We can edit the notebook at this point, but a cluster must be attached to execute cells and obtain results.
4. Setting up the Cluster
A Databricks cluster can be created in three ways:
4.1. Click on the "Detached" dropdown under the notebook's name; the "Create a Cluster" option will lead us to the "Create Cluster" page.
4.2. Click on the "Cluster" tab on the left pane, then click on the "Create Cluster" button on the Cluster page.
4.3. When running the notebook or a cell, an alert is displayed as below. Click "Launch and Run", and the cluster is created and attached to the notebook. A cluster named "My Cluster" is attached and displayed under the notebook's title.
Let's discuss the first and second methods further below. The third method is pretty straightforward.
Note: Cluster creation will take around a minute or two to complete.
Method 1 and Method 2
With both methods 1 and 2, we are taken to the "Create Cluster" page as shown below.
- No. of Driver and Worker nodes. For Community Edition, only the configuration shown, a single cluster with one driver node, is permitted
- Provide the cluster name in the field. Ex: "My Cluster"
- Databricks Runtime Version lists the latest stable versions of the Spark runtime. It also includes beta versions, which are handy for testing new releases
- Since the backend compute instances run on AWS, we can select an Availability Zone from us-east-2a, us-east-2b and us-east-2c
- After the name is provided, we can select "Create Cluster"
- A similar configuration can also be provided as JSON (not available in Community Edition)
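To give a rough idea of what a JSON cluster configuration might look like, here is a small Python sketch that builds one and prints it. The field names (cluster_name, spark_version, node_type_id, num_workers, aws_attributes.zone_id) follow the Databricks Clusters API as I understand it, and the runtime version and instance type are placeholders I picked for illustration, so check the page for the exact values your workspace offers.

```python
import json

# Hypothetical cluster specification. Field names follow the Databricks
# Clusters API as I understand it; all values are illustrative placeholders.
cluster_spec = {
    "cluster_name": "My Cluster",
    "spark_version": "7.3.x-scala2.12",  # placeholder runtime version string
    "node_type_id": "i3.xlarge",         # placeholder AWS instance type
    "num_workers": 1,
    "aws_attributes": {"zone_id": "us-east-2a"},
}

# Serialize the specification to the JSON text the UI would accept.
cluster_json = json.dumps(cluster_spec, indent=2)
print(cluster_json)
```

The cluster name and Availability Zone are the same choices made in the UI fields above.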
The Cluster page after the cluster is up and running:
A Databricks Unit (“DBU”) is a unit of processing capability per hour, billed on per-second usage. Databricks supports many AWS EC2 instance types. The larger the instance is, the more DBUs you will be consuming on an hourly basis. For example, 1 DBU is the equivalent of Databricks running on a c4.2xlarge machine for an hour.
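To make the per-second billing concrete, here is a minimal sketch of the arithmetic. The 1 DBU per hour rate for c4.2xlarge comes from the text above; the 2 DBU per hour rate for a "larger instance" and the function name are assumptions made up for the example.

```python
def dbus_consumed(dbu_per_hour, seconds_running):
    """DBUs used by one instance, prorated per second of runtime."""
    return dbu_per_hour * seconds_running / 3600


# Rate quoted in the text: a c4.2xlarge consumes 1 DBU per hour.
c4_2xlarge_rate = 1.0
full_hour = dbus_consumed(c4_2xlarge_rate, 3600)
ninety_seconds = dbus_consumed(c4_2xlarge_rate, 90)

# Hypothetical larger instance rated at 2 DBUs per hour, run for 30 minutes.
larger_instance_half_hour = dbus_consumed(2.0, 30 * 60)

print(full_hour, ninety_seconds, larger_instance_half_hour)
```

So a cluster that runs for only a minute and a half is charged a small fraction of a DBU rather than a full hour.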
5. Adding a Data File
In the Data tab on the left pane, we can quickly check the data present in the workspace for a particular cluster. The example dataset in our case is the San Francisco Fire Incidents dataset. The following are the steps to add a CSV file to the workspace:
5.1. Click on the Data tab on the left pane and click on "Add Data" in the navigation pane.
5.2. After clicking on "Add Data", you'll see the following page
5.3. Once the file is uploaded, the file path is shown in the UI
5.4. After uploading the file successfully, we can create a table from the file and read it in our notebooks. To create the table, select "Create Table with UI" and select the cluster. Then select the "Preview Table" option to display the table.
5.5. To save the table in the "default" database, select "Create Table". The properties that can be modified are:
- Table Name - Name of the table to be created
- Database - Database to store the table
- File Type - CSV, Avro or JSON
- Column Delimiter - Character to separate the fields in the file
- First row is header - Use first row as header
- Infer schema - Infer the data type of each column in the table and assign it. Ex: INT, STRING
5.6. After the table is saved, its sample schema and data are displayed.
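To make the "Column Delimiter", "First row is header" and "Infer schema" properties a bit more concrete, here is a rough plain-Python sketch of what they mean. It is only an illustration and not how Databricks or Spark implement them: the function names and sample rows are made up, and only INT and STRING are distinguished. The _c0, _c1 fallback names mirror the default column names Spark assigns when a CSV has no header.

```python
import csv
import io


def infer_type(values):
    """Return "INT" if every value parses as an integer, else "STRING"."""
    try:
        for value in values:
            int(value)
        return "INT"
    except ValueError:
        return "STRING"


def preview_table(text, delimiter=",", first_row_is_header=True):
    """Split text into rows and guess a column-name -> type mapping."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if first_row_is_header:
        header, data = rows[0], rows[1:]
    else:
        # No header row: fall back to positional names such as _c0, _c1.
        header = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows
    columns = list(zip(*data))
    schema = {name: infer_type(col) for name, col in zip(header, columns)}
    return schema, data


# Made-up sample rows, not taken from the real dataset.
sample = "incident_number,city\n101,San Francisco\n102,San Francisco\n"
schema, data = preview_table(sample)
schema_no_header, _ = preview_table(sample, first_row_is_header=False)
print(schema)
print(schema_no_header)
```

Without the header option, the header text itself is treated as data, so the first column can no longer be inferred as INT.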
6. Reading the Data file in the Notebook
After opening the "TestPySpark" notebook from the Workspace (Workspace→users→), attach the cluster if it is not already attached. The following controls are present beside the Cluster dropdown (in left-to-right order):
- Notebook actions - New, Clone, Delete, etc.
- Cell actions - Undo, Cut cells, Copy cells, etc.
- View Mode
- Run all cells
- Clear dropdown
We are now all set to run code in the notebook, and every cell's output will be displayed below it. So, let's start with the simple "spark" command.
Since Spark is running in interactive mode, the SparkSession is readily available, and hence the above result is displayed for the command. Now, let's read the data from the table we created earlier, i.e., fire_incidents_2_csv.
# Reading data from the table 'fire_incidents_2_csv'
fire_df = spark.table('fire_incidents_2_csv')
display(fire_df.select('*'))
The output for the above code is as follows:
This concludes the exercise of setting up the Databricks Community Edition account, getting to know the different components and functionality of the Databricks UI, loading the data and creating a table from it, and finally reading the data from the table and displaying it.
The account setup is a one-time process. After that, very few steps are needed to start building Spark applications interactively. This is great for learners and for conducting fun experiments too! A number of features are missing from the Community Edition and are available only to Enterprise customers.