BDA420


DataFrames and Schemas

Summary

Data for this Lecture

DataFrames

DataFrame Methods

orderBy

DataFrame Attributes

Schemas

createDataFrame

Data for this Lecture

Data for this Lecture

We will use the grades dataset in this lecture

You can download it from Blackboard

DataFrames

DataFrames Overview

DataFrames are the most widely used data abstraction in Spark

They are immutable distributed collections of data

Equivalent to a table in a relational database

Similar to Pandas DataFrames

Organized into named columns with defined types

Read CSV

The spark.read.csv method reads a CSV file and returns a DataFrame

In PySpark, DataFrames are Python objects:

  They have properties and methods

Read CSV

path = "/FileStore/tables/grades.csv"
df = spark.read.csv(path, header=True, inferSchema=True)
display(df)

Use header=True if the file starts with a header

Use inferSchema=True to let Spark guess the type of data in each column
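
As a small illustrative check (not on the original slide), you can confirm the result is a DataFrame object and inspect the types that inferSchema guessed:

print(type(df))
df.printSchema()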

DataFrame Methods

Methods Overview

Multiple methods are available for DataFrames

To call a method, use the . operator, followed by the method's name and its parameters:

df.methodName(parametersList)

Methods Overview

Methods never change the original DataFrame

Rather, they generally return a new DataFrame:

newDf = dfName.methodName(parameters)

or (to replace* the original)

df = df.methodName(parameters)

*eventually, the original will be garbage collected
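
A minimal sketch of this behaviour, using orderBy() (covered below): the call returns a brand-new DataFrame, and df itself is unchanged

sortedDf = df.orderBy("grades")
print(df is sortedDf)   # False: orderBy() returned a new DataFrame object
print(df.columns)       # the original DataFrame still has all its columns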

Chaining Methods

Chaining methods is also possible

df.method1(params1).method2(params2)

is more Pythonic than

df = df.method1(params1)

df = df.method2(params2)

Chained methods are applied from left to right, as in the sketch below
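
A concrete illustrative chain, assuming the grades DataFrame from earlier: sort by grade in descending order, keep the top five rows, and display them

df.orderBy("grades", ascending=False).limit(5).show()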

Simple Methods

Describe

Like Pandas, Spark has a describe() method

It returns a DataFrame with count, mean, stddev, min, and max for each column

Specific column names can be used as parameters

Describe

display(df.describe())
display(df.describe("grades"))

Summary

A summary() method is also available; it lets you request specific statistics

By default it adds percentiles: 25%, 50%, and 75%

You can specify other stats as well

df.summary().show()
df.summary("max", "min", "90%").show()

orderBy

orderBy() can be used to sort DataFrames

It returns a new DataFrame sorted according to a specified column

Additional column names can be provided as secondary sort criteria

For descending order, pass ascending=False

orderBy

df.orderBy("grades").show()
df.orderBy("grades", "last_name").show()
df.orderBy("grades", "last_name", ascending=False).show()
df.sort("grades").show()

A method called sort() is almost identical and, given the same parameters, produces the same results
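
Both methods also accept a list of booleans for ascending, one flag per sort column; a small illustrative example:

df.orderBy("grades", "last_name", ascending=[False, True]).show()
df.sort("grades", "last_name", ascending=[False, True]).show()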

DataFrame Attributes

Attributes Overview

Attributes expose properties of a DataFrame

Like methods, they use the . operator

Unlike methods, attributes do not take parameters

Commonly used attributes:

  columns

  dtypes

  schema

Attributes

print(type(df.columns))   # <class 'list'>
print(df.columns)         # list of column names (str)

print(type(df.dtypes))    # <class 'list'>
print(df.dtypes)          # list of (column name, type name) tuples

print(type(df.schema))    # <class 'pyspark.sql.types.StructType'>
print(df.schema)          # the full schema as a StructType

Schemas

StructType and StructField

A schema can be created using StructType and StructField

More specifically, a schema is generally a StructType containing a list of StructField elements

These types are defined in pyspark.sql.types

StructType and StructField

from pyspark.sql.types import *

schema = StructType([StructField("first name", StringType(),True),\
                      StructField("last name", StringType(),True),\
                      StructField("gender", StringType(),True),\
                      StructField("grades", IntegerType(),True)])

newDf = spark.read.csv(path, header=True, schema=schema)

display(newDf)
newDf.printSchema()

StructField Elements

A schema provides a blueprint of the data (metadata)

Each StructField should contain:

  the names of the columns (string)

  the types of values in each column (DataType)

  a flag indicating whether the column may contain nulls (nullable)

StructField Elements

from pyspark.sql.types import *

schema = StructType([StructField("first name", StringType(),True),\
                      StructField("last name", StringType(),True),\
                      StructField("gender", StringType(),True),\
                      StructField("grades", IntegerType(),True)])

newDf = spark.read.csv(path, header=True, schema=schema)

display(newDf)
newDf.printSchema()

Inferred Schemas

Schemas are frequently inferred from the data with: inferSchema=True

With header=True, the first row is used to populate the field names

Data types are inferred by analyzing the data itself, which requires an extra pass over the file; inferred columns are marked nullable

This extra pass can be slow, and the guessed types may be suboptimal
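
A small comparison sketch (assuming the same grades.csv path and the explicit schema defined earlier): read the file with inferSchema and compare the two resulting schemas

inferredDf = spark.read.csv(path, header=True, inferSchema=True)
inferredDf.printSchema()   # types guessed by Spark
newDf.printSchema()        # types fixed by the StructType above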

createDataFrame

createDataFrame

A DataFrame can be created from a list of rows (e.g., a list of tuples) and a schema

This is done with the spark.createDataFrame() method

An RDD or a Pandas.DataFrame can also be used as a data source

Remember: avoid using Pandas on large datasets!!!

createDataFrame

from pyspark.sql.types import * 
	
students = [("john", "smith", 43222, "fall_2024", "m", 3.0),\
          ("kim", "pak", 43878, "winter_2025", "f", 3.4),\
          ("boon", "pak", 41359, "fall_2024", "m", 3.4),\
          ("juan", "lopez", 99271, "fall_2024", "m", 2.5),\
          ("ryan", "rey", 76419, "winter_2024", "m", 3.9),\
          ("leah", "lukez", 31315, "fall_2025", "f", 2.7)]

schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("cohort", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("gpa", DoubleType(), True) \
  ])
  
bda420 = spark.createDataFrame(data=students, schema=schema)	
display(bda420)
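
A pandas DataFrame is mentioned above as an alternative source; a minimal sketch, for small data only and assuming pandas is available on the cluster:

import pandas as pd

pdf = pd.DataFrame(students, columns=["firstname", "lastname", "id",
                                      "cohort", "gender", "gpa"])
bda420FromPandas = spark.createDataFrame(pdf)   # schema inferred from pandas dtypes
display(bda420FromPandas)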

Reading Material

DataFrame Official Documentation

Spark by Example

Spark Docs - csv