BDA420 - High Performance Computing

This course covers advanced tools and techniques for handling very large datasets. Students will learn to work with the Spark framework and to write Jupyter notebooks that take advantage of the parallel processing Spark provides. The course consists of a sequence of lectures, each immediately followed by a workshop in which students practice the concepts just introduced. An assignment and two tests will also be used for assessment.



Academic Honesty

Make sure to learn and abide by Seneca's academic honesty policies. Not knowing a particular policy will never be accepted as a valid excuse.
Remember that, with every test, exam, and assignment you submit, you are implicitly stating that it contains your own work. The three most common forms of academic dishonesty are: using material obtained from the internet, using AI to generate your answers, and using material obtained from another student.

Evaluation

Item              Quantity  Marks
Workshops         8         5% each (40% total)
Midterm           1         15%
Final Exam        1         15%
Final Assignment  1         30%


Important Dates

Midterm Feb 19th
Final Exam March 26th
Final Project Apr 2nd, 9th, and 16th

Calendar

Week 01
Jan 8th

Welcome and Jupyter Notebooks

In this class, students are introduced to the course and its evaluation methods. We will also review the use of Jupyter Notebooks to perform basic data analysis.

Slides
Workshop 1 (5.0%)
Week 02
Jan 15th

Introduction to Apache Spark

In this class, students are introduced to Apache Spark, an analytics engine for large-scale data processing, and to its libraries. We will also discuss the concept of parallelism in computations.
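As a rough illustration of the parallelism idea, the sketch below splits a list into partitions, processes each partition independently, and then combines the partial results. It is plain Python rather than Spark, and the function names (split_into_partitions, parallel_sum_of_squares) are made up for this example. Threads are used only to show the structure; in CPython they do not give a real speed-up for CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(values) // num_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(values, num_partitions=4):
    """Process each partition in its own task, then combine the results."""
    partitions = split_into_partitions(values, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum_of_squares, partitions))
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```

The point to notice is that each partition can be processed without knowing anything about the others; only the final combine step needs all the partial results.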

Slides
Workshop 2 (5.0%)
Week 03
Jan 22nd

Working with Databricks: clusters, data, and notebooks

In this class, students are introduced to the Databricks platform for data engineering using Apache Spark. We will cover how to create an account on the platform and how to perform basic actions such as creating clusters, writing notebooks, and uploading datasets.

Slides
Workshop 3 (5.0%)
Week 04
Jan 29th

DataFrames and Schemas

In this class we will cover how to work with DataFrames, Apache Spark's most widely used data abstraction. We will also introduce some basic methods that can be applied to DataFrames. Finally, we will discuss how to provide a schema to a DataFrame.

Slides
Workshop 4 (5.0%)
Week 05
Feb 5th

Columns and GroupBy

In this class we will cover column operations, including withColumn. We will also discuss the groupBy method to categorize rows of data.
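To give a feel for what these two operations do, here is a minimal plain-Python sketch that mimics them on a list of dictionaries. It is not Spark code: the helpers with_column and group_by_sum and the sample sales rows are invented for this illustration, and in Spark the same steps would go through the DataFrame methods withColumn and groupBy.

```python
from collections import defaultdict

# Made-up sample data: one dictionary per row.
rows = [
    {"city": "Toronto", "units": 3, "unit_price": 2.0},
    {"city": "Ottawa", "units": 1, "unit_price": 5.0},
    {"city": "Toronto", "units": 2, "unit_price": 4.0},
]


def with_column(rows, name, fn):
    """Return new rows with one extra column computed from each row."""
    return [{**row, name: fn(row)} for row in rows]


def group_by_sum(rows, key, value):
    """Group rows by `key` and sum the `value` column within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Derive a new column, then aggregate it per category.
with_total = with_column(rows, "total", lambda r: r["units"] * r["unit_price"])
totals_by_city = group_by_sum(with_total, "city", "total")
print(totals_by_city)
```

Like withColumn, the helper returns new rows instead of modifying the originals.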

Slides
Workshop 5 (5.0%)
Week 06
Feb 12th

Rows and Filtering

In this class we will cover basic operations to sanitize our dataset. We will also present filtering operations. Finally, we will discuss the when method applied in conjunction with withColumn.
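The three ideas in this class can be sketched in plain Python as follows: dropping rows with missing values, keeping only rows that satisfy a condition, and adding a column whose value depends on a condition. This is an analogue, not Spark code; the sensor readings, the thresholds, and the helper names are assumptions made for the example.

```python
# Made-up sample data; None marks a missing reading.
readings = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},
    {"sensor": "c", "temp_c": 35.0},
    {"sensor": "d", "temp_c": -3.0},
]


def drop_missing(rows, column):
    """Sanitize: remove rows where `column` has no value."""
    return [row for row in rows if row.get(column) is not None]


def label_temperature(temp_c):
    """Conditional value, similar in spirit to a when/otherwise chain."""
    if temp_c >= 30:
        return "hot"
    if temp_c <= 0:
        return "freezing"
    return "mild"


clean = drop_missing(readings, "temp_c")

# Filter: keep only the rows with a temperature above zero.
above_zero = [row for row in clean if row["temp_c"] > 0]

# Conditional column: add a label derived from the temperature.
labelled = [{**row, "label": label_temperature(row["temp_c"])} for row in clean]
print(labelled)
```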

Slides
Workshop 6 (5.0%)
Week 07
Feb 19th

Midterm


Midterm (15%)
Week 08
Feb 24th - 28th

Study Week

Week 09
Mar 5th

Datetime Functions and Window Operations

In this class we will cover datetime functions that can be used to work with fields containing dates and times. Also, we will discuss Window operations that can be used to smooth data for presentation.
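As a small illustration of smoothing with a window, the sketch below parses date strings, orders the records by date, and computes a trailing moving average over the current row and the rows just before it. It is plain Python, not Spark's Window API; the dates, values, and function name are invented for the example.

```python
from collections import deque
from datetime import date

# Made-up daily measurements, deliberately out of order.
records = [
    ("2025-03-03", 60),
    ("2025-03-01", 10),
    ("2025-03-04", 40),
    ("2025-03-02", 20),
]


def moving_average(values, window_size):
    """Average each value with up to `window_size - 1` preceding values."""
    window = deque(maxlen=window_size)  # old values fall out automatically
    averages = []
    for value in values:
        window.append(value)
        averages.append(sum(window) / len(window))
    return averages


# Parse the date strings so the rows can be ordered chronologically.
ordered = sorted((date.fromisoformat(day), value) for day, value in records)
smoothed = moving_average([value for _, value in ordered], window_size=3)

for (day, value), average in zip(ordered, smoothed):
    print(day.isoformat(), value, average)
```

The ordering step matters: a window is defined relative to neighbouring rows, so the result only makes sense once the rows are sorted.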

Slides
Workshop 7 (5.0%)
Week 10
Mar 12th

UDFs, Joins, RDDs, and other data formats

In this class we will cover User Defined Functions (UDFs). Also, we will discuss join operations. Finally, we will present RDDs.
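The sketch below shows, in plain Python, the idea behind two of these topics: an inner join that matches rows from two tables on a shared key, and a user-defined function applied to every row of the result. It is an analogue rather than Spark code; the student and mark data, the grading thresholds, and the names inner_join and letter_grade are all hypothetical.

```python
from collections import defaultdict

# Made-up sample tables sharing the "id" column.
students = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cy"},
]
marks = [
    {"id": 1, "mark": 82},
    {"id": 3, "mark": 67},
    {"id": 3, "mark": 91},
]


def inner_join(left, right, key):
    """Pair every left row with each right row that has the same key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def letter_grade(mark):
    """User-defined function; the thresholds are illustrative only."""
    if mark >= 80:
        return "A"
    if mark >= 70:
        return "B"
    if mark >= 60:
        return "C"
    return "F"


joined = inner_join(students, marks, "id")
graded = [{**row, "grade": letter_grade(row["mark"])} for row in joined]
print(graded)
```

Ben has no matching mark, so an inner join leaves that row out, while Cy appears twice because two marks match.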

Slides
Workshop 8 (5.0%)
Week 11
Mar 19th

Real-Time Data Analysis and Machine Learning with Spark

In this class we will cover JSON datasets and processing of real-time data (streaming) using Spark's Streaming Library.


Final Exam Review
Week 12
Mar 26th

Final Exam


Final Exam (15%)
Week 13
Apr 2nd

Final Project Presentations

During this week, students will present their projects.


Final Project (30%)
Week 14
Apr 9th

Final Project Presentations

During this week, students will present their projects.


Final Project (30%)