Data for this Lecture
Dataframes
Dataframe Methods
OrderBy
Dataframe Attributes
Schemas
CreateDataFrame
We will use the grades dataset in this lecture
You can download it from Blackboard
DataFrames are the most widely used data abstraction in Spark
They are immutable distributed collections of data
Equivalent to a table in a relational database
Similar to Pandas DataFrames
Organized into named columns with defined types
The spark.read.csv() method reads a CSV file and returns a DataFrame
In PySpark, DataFrames are Python objects:
→ They have properties and methods
path = "/FileStore/tables/grades.csv"
df = spark.read.csv(path, header=True, inferSchema=True)
display(df)
Use header=True if the file starts with a header row
Use inferSchema=True to let Spark guess the type of data in each column
Multiple methods are available for DataFrames
To call a method, use the . operator, followed by the method's name and its parameters:
df.methodName(parametersList)
Methods never change the original dataframe
Rather, they generally return a new dataframe:
newDf = dfName.methodName(parameters)
or, to replace* the original:
df = df.methodName(parameters)
*the name df is simply rebound to the new DataFrame; the original object will eventually be garbage collected
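For example, a minimal sketch (reusing the df loaded above; select() is a standard DataFrame method):
# select() returns a new DataFrame; df itself is not modified
gradesOnly = df.select("grades")
print(gradesOnly is df)   # False: two distinct DataFrame objects
df.show(3)                # df still contains all the original columns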
Chaining methods is also possible
df.method1(params1).method2(params2)
Is more Pythonic than
df = df.method1(params1)
df.method2(params2)
Chaining is normally conducted from left to right
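As a sketch (reusing df; select() and orderBy() are standard DataFrame methods, with orderBy() covered later in this lecture):
# Each call returns a new DataFrame, so the next method applies to that result
df.select("last_name", "grades").orderBy("grades").show(5)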
Like Pandas, Spark has a describe() method
It returns a DataFrame with: count, mean, stddev, min, and max for each column
Specific column names can be used as parameters
display(df.describe())
display(df.describe("grades"))
A summary() method is also available; it lets you request user-specified statistics
By default it adds the percentiles: 25%, 50%, and 75%
You can specify other stats as well
df.summary().show()
df.summary("max", "min", "90%").show()
orderBy() can be used to sort DataFrames
It returns a new DataFrame sorted according to a specified column
Additional column names can be provided as extra sorting criteria
For descending order, specify ascending=False
df.orderBy("grades").show()
df.orderBy("grades", "last_name").show()
df.orderBy("grades", "last_name", ascending=False).show()
df.sort("grades").show()
A method called sort() is almost identical; called with the same parameters, it produces the same results
Attributes access properties of dataframes
Like methods, they use the . operator
Unlike methods, attributes do not take parameters
Commonly used attributes:
→ columns
→ dtypes
→ schema
print(type(df.columns))
print(df.columns)
print(type(df.dtypes))
print(df.dtypes)
print(type(df.schema))
print(df.schema)
A schema can be created using StructType and StructField
More specifically, a schema is generally a StructType containing a list of StructField elements
These types are defined in pyspark.sql.types
from pyspark.sql.types import *
schema = StructType([StructField("first name", StringType(), True),\
    StructField("last name", StringType(), True),\
    StructField("gender", StringType(), True),\
    StructField("grades", IntegerType(), True)])
newDF = spark.read.csv(path, header=True, schema=schema)
display(newDF)
newDF.printSchema()
A schema provides a blueprint of the data (metadata)
Each StructField should contain:
→ the name of the column (string)
→ the type of values in the column (DataType)
→ a flag indicating whether the column can contain nulls (nullable)
Schemas are frequently inferred from the data with: inferSchema=True
With header=True, the first row is used to populate the field names
Data types and nullability are inferred by analyzing the whole dataset
This could lead to suboptimal results
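As a quick check, a sketch inspecting what Spark inferred for the df loaded earlier with inferSchema=True:
# The column types below come from Spark's guess, not from an explicit schema
df.printSchema()
print(df.dtypes)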
A DataFrame can be created from a two-dimensional list (e.g., a list of tuples, one per row) and a schema
It uses the spark.createDataFrame() function
An RDD or a pandas DataFrame can also be used as the data source
Remember: avoid using Pandas on large datasets!!!
from pyspark.sql.types import *
students =[("john", "smith", 43222, "fall_2024","m",3.0),\
("kim", "pak", 43878, "winter_2025", "f", 3.4),\
("boon", "pak", 41359, "fall_2024", "m", 3.4),\
("juan", "lopez", 99271, "fall_2024", "m", 2.5),\
("ryan", "rey", 76419, "winter_2024", "m", 3.9),\
("leah", "lukez", 31315, "fall_2025", "f", 2.7)]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("cohort", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("gpa", DoubleType(), True) \
])
bda420 = spark.createDataFrame(data=students, schema=schema)
display(bda420)
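For the pandas route, a minimal sketch (the tiny sample frame below is hypothetical; avoid this path for large datasets):
import pandas as pd
# A small pandas DataFrame (illustrative data only)
pdf = pd.DataFrame({"firstname": ["ana", "raj"], "gpa": [3.2, 3.8]})
# createDataFrame() also accepts a pandas DataFrame as the data source
bdaFromPandas = spark.createDataFrame(pdf)
bdaFromPandas.printSchema()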