This course covers advanced tools and techniques for handling very large datasets. Students will learn to work with the Spark framework and to write Jupyter notebooks that take advantage of the parallelization techniques Spark provides. Each lecture is immediately followed by a workshop in which students practice the concepts just introduced. Assessment consists of the workshops, a final assignment, and two tests (a midterm and a final exam).
Item | Qty | Marks |
---|---|---|
Workshops | 8 | 5% each (40% total) |
Midterm | 1 | 15% |
Final Exam | 1 | 15% |
Final Assignment | 1 | 30% |

Assessment | Date |
---|---|
Midterm | Feb 19th |
Final Exam | March 26th |
Final Project | Apr 2nd, 9th, and 16th |
In this class, students are introduced to the course and its evaluation methods. We will also review the use of Jupyter Notebooks to perform basic data analysis.
In this class, students are introduced to Apache Spark, an analytics engine for large-scale data processing, and to its libraries. We will also discuss the concept of parallelism in computations.
In this class, students are introduced to the Databricks platform for data engineering using Apache Spark. We will cover how to create an account on the platform and how to perform basic actions such as creating clusters, writing notebooks, and uploading datasets.
In this class, we will cover how to work with dataframes, Apache Spark's most widely used data abstraction. We will also introduce some basic methods that can be applied to dataframes. Finally, we will discuss how to provide a schema to a dataframe.
In this class, we will cover column operations, including withColumn. We will also discuss the groupBy method, which groups rows of data by category so that they can be aggregated.
In this class, we will cover basic operations to sanitize a dataset. We will also present filtering operations. Finally, we will discuss the when method applied in conjunction with withColumn.
In this class, we will cover datetime functions for working with fields that contain dates and times. We will also discuss Window operations, which can be used to smooth data for presentation.
In this class, we will cover User Defined Functions (UDFs). We will also discuss join operations. Finally, we will present RDDs.
In this class, we will cover JSON datasets and the processing of real-time data (streaming) using Spark's Streaming Library.
During this week, students will present their projects.
During this week, students will present their projects.