This tutorial is adapated from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark).
1.1 Data Aggregation and Grouping
The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering.
With pandas you can do the following tasks:
1.2 Sample Data Set
KPI|Value|Scale|Movement A|390|MEDIUM|DOWN B|72|SMALL|DOWN C|284|SMALL|UP D|754|LARGE|DOWN ... U|706|LARGE|DOWN V|363|MEDIUM|DOWN W|275|SMALL|UP X|538|MEDIUM|DOWN Y|974|LARGE|UP Z|696|MEDIUM|DOWN
1.3 The pandas.core.groupby.SeriesGroupBy Object
groupValueByScale = df['Value'].groupby([df['Scale']])
groupValueByScale.mean() groupValueByScale.max() groupValueByScale.count() groupValueByScale.sum()
SELECT Scale, AVG(Value) FROM DF GROUP BY Scale SELECT Scale, MAX(Value) FROM DF GROUP BY Scale …
df.groupby().mean() #or simply like df.groupby('Scale').mean()
Your output would look something like that:
Scale LARGE 803.285714 MEDIUM 512.714286 SMALL 168.200000 Name: Value, dtype: float64
1.4 Grouping by Two or More Columns
groupValueByScaleAndMovement = df.groupby(, df])
df.groupby()
Scale Movement LARGE DOWN 722.500000 UP 911.000000 MEDIUM DOWN 523.571429 UP 501.857143 SMALL DOWN 56.500000 UP 242.666667 Name: Value, dtype: float64
1.5 Emulating SQL’s WHERE Clause
df > 333]
Notes:
If you issue this command instead:
df > 333
You will produce a logical vector:
0 True 1 False 2 False 3 True 4 True 5 True 6 True 7 True 8 True ...
1.6 The Pivot Tables
df.pivot(index = 'KPI',columns='Movement',values='Value')
Movement DOWN UP KPI A 390.0 NaN B 72.0 NaN C NaN 284.0 D 754.0 NaN E 718.0 NaN F NaN 493.0 G NaN 483.0
Notes:
You can replace the NaN values in the above output with dashes (-) using this command:
dfPivo.fillna('-', inplace =True) making it a bit less noisy.
Here is a good illustration of how Pivot reshaping works (borrowed from DataCamp’s pandas cheat sheet):
1.7 Cross-Tabulation
pd.crosstab(df, df )
Movement DOWN UP Scale LARGE 4 3 MEDIUM 7 7 SMALL 2 3
Notes:
The above report could have been generated using this Group By command; the report’s output layout will be a bit different:
df.groupby(, df]).count() Scale Movement LARGE DOWN 4 UP 3 MEDIUM DOWN 7 UP 7 SMALL DOWN 2 UP 3 Name: Movement, dtype: int64