BDA420 - High Performance Computing

This course covers advanced tools and techniques for handling very large datasets. Students will learn to work with the Spark framework and to write Jupyter notebooks that take advantage of the parallelization techniques Spark provides. Each class consists of a lecture immediately followed by a workshop in which students practice the concepts just introduced. An assignment and two tests are also used for assessment.

Academic Honesty

Make sure to learn and abide by Seneca's academic honesty policies. Not knowing a particular policy will never be accepted as a valid excuse.
Remember that with every test, exam, and assignment you submit, you are implicitly stating that it contains your own work. The three most common forms of academic dishonesty are: using material obtained from the internet, using AI to generate your answers, and using material obtained from another student.

Evaluation

Item	Qty	Marks
Workshops	8	5% each (40% total)
Midterm	1	15%
Final Exam	1	15%
Final Assignment	1	30%

Important Dates

Midterm	Feb 19th
Final Exam	Mar 26th
Final Project	Apr 2nd, 9th, and 16th

Calendar

Week 01

Jan 8th

Welcome and Jupyter Notebooks

In this class, students are introduced to the course and its evaluation methods. We will also review the use of Jupyter Notebooks to perform basic data analysis.

Slides

Workshop 1 (5.0%)

Week 02

Jan 15th

Introduction to Apache Spark

In this class, students are introduced to Apache Spark, an analytics engine for large-scale data processing, and to its libraries. We will also discuss the concept of parallelism in computations.
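As a rough illustration of parallelism in computations, the sketch below splits a list into partitions, processes each partition independently, and then combines the partial results. It uses only the Python standard library; the function names are hypothetical, and the threads merely stand in for Spark's workers, so it shows the shape of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions receive one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done independently on one partition (the 'map' step)."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # Each worker handles one partition; no worker needs another's data.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    # Combine the partial results (the 'reduce' step).
    return sum(partial_results)


numbers = list(range(1, 101))
total = parallel_sum_of_squares(numbers)
print(total)  # 338350
```

Because each partition is processed without reference to the others, the same work could in principle be spread across many machines, which is the idea the class builds on.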

Slides

Workshop 2 (5.0%)

Week 03

Jan 22nd

Working with Databricks: clusters, data, and notebooks

In this class, students are introduced to the Databricks platform for data engineering using Apache Spark. We will cover how students can create accounts on this platform, as well as how to perform basic actions such as creating clusters, writing notebooks, and uploading datasets.

Slides

Workshop 3 (5.0%)

Week 04

Jan 29th

DataFrames and Schemas

In this class we will cover how to work with DataFrames, Apache Spark's most widely used data abstraction. We will also introduce some basic methods that can be applied to DataFrames. Finally, we will discuss how to provide a schema to a DataFrame.

Slides

Workshop 4 (5.0%)

Week 05

Feb 5th

Columns and GroupBy

In this class we will cover column operations, including withColumn. We will also discuss the groupBy method to categorize rows of data.

Slides

Workshop 5 (5.0%)

Week 06

Feb 12th

Rows and Filtering

In this class we will cover basic operations to sanitize our dataset. We will also present filtering operations. Finally, we will discuss the when method applied in conjunction with withColumn.

Slides

Workshop 6 (5.0%)

Week 07

Feb 19th

Midterm

Midterm (15%)

Week 08

Feb 24th - 28th

Study Week

Week 09

Mar 5th

Datetime Functions and Window Operations

In this class we will cover datetime functions that can be used to work with fields containing dates and times. Also, we will discuss Window operations that can be used to smooth data for presentation.
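To make the idea of smoothing with a window concrete, here is a minimal plain-Python sketch of a trailing moving average: each value is replaced by the mean of itself and the values just before it. The function name and sample readings are hypothetical, and this is not Spark code; it is only meant to show roughly what an average over a window of the current row and a few preceding rows computes.

```python
from collections import deque


def moving_average(values, window_size):
    """Average each value with up to window_size - 1 preceding values."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)
    running_sum = 0.0
    smoothed = []
    for value in values:
        if len(window) == window.maxlen:
            # The oldest value is about to fall out of the window.
            running_sum -= window[0]
        window.append(value)
        running_sum += value
        # Early rows have fewer predecessors, so divide by the actual count.
        smoothed.append(running_sum / len(window))
    return smoothed


daily_readings = [10, 20, 30, 40, 50]
smoothed_readings = moving_average(daily_readings, window_size=3)
print(smoothed_readings)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

A larger window gives a smoother curve but reacts more slowly to real changes in the data.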

Slides

Workshop 7 (5.0%)

Week 10

Mar 12th

UDFs, Joins, and RDDs

In this class we will cover User Defined Functions (UDFs). We will also discuss join operations. Finally, we will present Resilient Distributed Datasets (RDDs).

Slides

Workshop 8 (5.0%)

Week 11

Mar 19th

JSON datasets, Real-Time Data Analysis, and Review

In this class we will cover JSON datasets and processing of real-time data (streaming) using Spark's Streaming Library.

Slides

Final Exam Review

Week 12

Mar 26th

Final Exam

Final Exam (15%)

Week 13

Apr 2nd

Final Project Presentations

During this week, students will present their projects.

Final Project (30.0%)

Week 14

Apr 9th

Final Project Presentations

During this week, students will present their projects.

Final Project (30.0%)