This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.
1.1 What is pandas?
- pandas (https://pandas.pydata.org/) is an open-source library that provides high-performance, memory-efficient, easy-to-use data structures, as well as support for data manipulation and analysis for Python
- The core pandas data structures are the Series and DataFrame objects
- The DataFrame's major influence was R's data.frame object
1.2 The DataFrame Object
- A DataFrame is a two-dimensional data structure with integrated indexing (the row and column labeling scheme), similar to a relational table, an Excel spreadsheet, or other tabular dataset containers
- Essentially, a DataFrame object is a collection of Series objects that share the same index (acting as the row id attribute), as illustrated in the sketch below
- Supports row-wise and column-wise operations
- The common layout is a two-dimensional format; higher-order formats are possible using hierarchical indexing
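As a quick sketch of that idea (the column names Col1 and Col2 here are just illustrative), a DataFrame can be assembled from Series objects that end up sharing one index:
import pandas as pd
s1 = pd.Series([10, 20, 30])
s2 = pd.Series([0.1, 0.2, 0.3])
df = pd.DataFrame({'Col1': s1, 'Col2': s2})  # both Series share the default RangeIndex
print(type(df['Col1']))   # <class 'pandas.core.series.Series'>
print(df.index)           # RangeIndex(start=0, stop=3, step=1)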
1.3 The DataFrame’s Value Proposition
- pandas DataFrames offer compact and efficient interfaces for reading and writing data stored in different formats:
- CSV, JSON, Parquet, Microsoft Excel, SQL databases, HDF5, etc.
- DataFrame offers integrated handling of missing data and other mechanisms for repairing datasets (see the sketch at the end of this list)
- Supported operations include:
- dataset reshaping,
- grouping,
- aggregation,
- pivoting,
- joining,
- querying,
- and similar operations
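As a brief sketch of these capabilities (the file names and column names below are hypothetical), a typical read / repair / write round trip might look like this:
import pandas as pd
df = pd.read_csv('sales.csv')        # hypothetical CSV input
df = df.dropna(subset=['price'])     # drop rows where the hypothetical 'price' column is missing
df['qty'] = df['qty'].fillna(0)      # replace missing quantities with 0
df.to_parquet('sales.parquet')       # write in another format (requires a Parquet engine such as pyarrow)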
1.4 Creating a pandas DataFrame
- You can build a DataFrame as a matrix (an array of arrays) from a reshaped NumPy 1-D array. The columns named parameter assigns the column labels (indexes)
import pandas as pd
import numpy as np
m = np.array([1,2,3,4,5,6,7,8]).reshape(4,2)
df = pd.DataFrame(m, columns = ["Col1", "Col2"])
- The df will have this structure:
   Col1  Col2
0     1     2
1     3     4
2     5     6
3     7     8
Notes:
You can customize and set up your own index in the same way you do with a Series object.
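For example (a small sketch reusing the m matrix from above), the index parameter supplies custom row labels:
df = pd.DataFrame(m, columns = ["Col1", "Col2"], index = ["r1", "r2", "r3", "r4"])
print(df.index)   # Index(['r1', 'r2', 'r3', 'r4'], dtype='object')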
1.5 Getting DataFrame Metrics
- DataFrames can be queried for the following structural information (examples refer to the df object from the previous slide):
- Note: Some of the metrics attributes are properties, others are methods
- df.shape # (4,2)
- df.index # its index: RangeIndex(start=0, stop=4, step=1)
- df.columns # The column labels; often converted to a plain list with list(df.columns)
- df.info() # The DataFrame description*
- df.count() # The number of non-N/A values per column*
- In many practical cases, developers access the shape property to get the number of rows and columns in a DataFrame:
df.shape[0] # Number of rows
df.shape[1] # Number of columns
Notes:
The df.info() call on the DataFrame from the previous slide will reveal the following information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
Col1    4 non-null int64
Col2    4 non-null int32
dtypes: int32(1), int64(1)
memory usage: 128.0 bytes

print(df.count()) will print:

Col1    4
Col2    4
dtype: int64
1.6 Accessing DataFrame Columns
- A column in a DataFrame is retrieved by its name; the retrieved column is a Series object
- You have two options: use the column name as a DataFrame attribute via the dot notation, or as a dictionary key:
df.Col1 # or df['Col1']
- The following information will be returned in either case (the left column in the output, if printed, is the index):

0    1
1    3
2    5
3    7
Name: Col1, dtype: int32
- All returned Series objects share the source DataFrame’s index
- A retrieved column can be updated by assignment; assigning a scalar sets all of its elements to that value, as illustrated in the example below
df.Col1 = 777 # All the values in Col1 will now have the same value of 777
Notes:
You can use NumPy's arange() function to assign auto-incremented values to a column (Series)
Example:
df.Col1 = np.arange(df.shape[0]) # .shape[0] returns the number of rows
Now df.Col1 will contain these values:
0 1 2 3
1.7 Accessing DataFrame Rows
- For row access, you can slice over the row index, which works similarly to NumPy's array slicing
df[0:2] # Will select the first two rows, returned as a DataFrame
Notes:
Accessing the last several rows can be done this way:
size = df.shape[0]
df[size - 2:] # Access the last two rows
1.8 Accessing DataFrame Cells
- A cell lies at the intersection of a row and column
- You have two main indexing schemes for accessing DataFrame cells:
- .iloc, and
- .loc
- Details on how to use those properties are shown in the slide’s notes
Notes:
Using iloc
iloc[ ] specifies the integer index location of a single cell:
[ [row_idx], [column_idx] ], e.g.
df.iloc[[2], [1]]
will return the cell at the intersection of the third row and the second column (with list arguments like these, the result comes back as a 1x1 DataFrame)
You can retrieve multiple cells by using NumPy’s indexing, e.g.
df.iloc[0:2, 1] # will retrieve cells located at the intersection of the two first rows and the second column
df.iloc[0:2, 1:3] # will retrieve cells located at the intersection of the two first rows and the second and third columns (the upper index is non-inclusive)
Getting the whole row (as a Series object)
df.iloc[1, :] # will retrieve the whole second row
A single cell referenced by iloc can also be used for assignments, e.g.
df.iloc[1, 1] = 'New Value'
Using loc
- loc[ ] accesses cells by their labels (an integer index, e.g. 7, is interpreted as the label 7, not as a position), returning either a DataFrame or a Series object depending on how the row index is passed:
- If the row index is passed as a list, a DataFrame is returned
- If a single row label is passed, a Series is returned
- With loc you can access columns by their names, e.g. df.loc[0, 'Col1'], which is not possible with iloc, where you have to provide the ordinal index of the column, e.g. df.iloc[0, 1], if Col1's index is 1 (the second column in the DataFrame)
Examples of Using loc
For DataFrame df with these contents:
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
The following loc[] operations are possible:
df.loc[[1], ['A', 'D']] # will return a DataFrame (a row list, here containing just the single element 1, is passed):
A D
1 4 7
df.loc[1, ['A', 'D']] # will return a Series object (the row label 1 is unbracketed):
A 4
D 7
df.loc[1:2, 'A'] # Will return a Series of this structure:
1 4
2 8
df.loc[2, 'C'] = 333 # That's how you perform an assignment
1.9 Deleting Rows and Columns
- You can delete rows or columns by specifying their indexes (labels) passed as a parameter to the drop() method
- By default, rows are deleted (axis = 0)
- Columns are deleted by specifying the axis = 1 parameter and giving the list of column names, e.g.
df.drop('NEW_COL', axis = 1)
- The drop operation does not affect the Series or DataFrame object on which you call the drop() method unless you specify the inplace=True parameter
- Examples of using the drop() method are given in the Slide’s Notes
Notes:
The drop() method’s signature is as follows:
drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
The drop() usage examples below are adapted from pandas documentation.
>>> df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns:

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index (using the default axis = 0):

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
1.10 Adding a New Column to a DataFrame
- You can add a column to an existing DataFrame on the fly using this syntax:
df['NEW_COL_NAME'] = <LIST_OF_COLUMN_VALUES>
- The column is appended to the list of existing columns; the column index and the shape property of the DataFrame are updated accordingly
- The length of the list of elements to be inserted into the new column must match the length of the row index (the number of rows in the DataFrame); otherwise, you will get an error (see the example below):

Length of values does not match length of index
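A short sketch for the four-row df created earlier (the column name NEW_COL is arbitrary):
df['NEW_COL'] = [10, 20, 30, 40]   # the list length (4) matches the number of rows
print(df.shape)                    # (4, 3) -- the shape property is updated
print(list(df.columns))            # ['Col1', 'Col2', 'NEW_COL']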
Notes:
Appending / Concatenating DataFrames
pandas supports DataFrame appending (which is equivalent to DataFrame concatenation).
Here is a basic example of DataFrame appending (concatenating)
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df_all = df1.append(df2, ignore_index = True); df_all

   A  B
0  1  2
1  3  4
2  5  6
3  7  8
Note 1: The ignore_index = True parameter above helps to rebuild the index as an auto-incremented sequence starting from 0.
Note 2: The above code is functionally equivalent to this one:
df_all = pd.concat([df1, df2], ignore_index = True)
Note 3: DataFrame.append() was deprecated and has been removed in pandas 2.0, so in current pandas versions the pd.concat() form is the one to use.
1.11 Getting Descriptive Statistics of DataFrame Columns
- Getting a DataFrame’s column statistics is similar to what you can do in SQL, e.g.:
SELECT SUM(col_name) FROM my_table;
SELECT MAX(col_name) - MIN(col_name) FROM my_table;
- pandas offers these useful DataFrame methods for column statistics:
- sum()*, min(), max(), mean(), median()
- You invoke the above methods like so:
df.COL_NAME.sum()
df.COL_NAME.mean()
Notes:
When sum() is invoked on a column of str type, all the values in that column get concatenated, which can be slow and may even exhaust memory (MemoryError) on large datasets.
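A quick sketch of that behavior (the values are made up for illustration):
names = pd.DataFrame({'city': ['Boston', 'Austin', 'Dallas']})
print(names.city.sum())   # 'BostonAustinDallas' -- string concatenation, not an arithmetic sum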
1.12 Getting Descriptive Statistics of DataFrames
- If you do not qualify the target of a statistics method (e.g. sum() or mean()) with a column name, and all the DataFrame's columns are of a numeric type, pandas calculates and returns the relevant statistic across the whole DataFrame as follows:
- By default (axis = 0), the statistic is computed down each column (over the rows); to compute it across the columns within each row, pass the axis = 1 named argument to the statistics method, e.g.
df.min()
- (axis = 0 is assumed) will return a Series that contains the smallest elements in each column
df.sum(axis = 1)
- will return a Series that contains the sum of elements across columns in each row
- See examples in the Slide’s Notes
Notes:
For a DataFrame of this structure named df34:
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Note: You can generate that structure using this command:
df34 = pd.DataFrame (np.arange(12).reshape(3,4), columns = list('ABCD'))
The following operations are possible:
df34.min() # axis = 0 is assumed
will return a Series with this content (here it happens to coincide with the first row; note the index: it is the column labels, traversed column-wise: A, B, C, D):

A    0
B    1
C    2
D    3

You read it as follows: for column A, the minimum value (out of 0, 4, and 8) is 0; for column B, it is 1, etc. If, say, the df34.loc[0, 'A'] cell contained 999 rather than 0, then the minimum value for column A would be 4 (now the smallest number in that column).
df34.max(axis = 1)
will return a Series with this content:
0     3
1     7
2    11
If, instead, the cell at the first row and column 'C' contained 999 rather than 2 (you could do this using this command: df34.loc[0, 'C'] = 999), then that value, 999, would be returned instead of 3:

0    999
1      7
2     11
dtype: int32
df34.sum() # axis = 0 is assumed
will return a Series with this content (it will be just one row that carries the sum of all the elements in each column – pay attention to the indexing: it is column-wise, A, B, C, D):
A    12
B    15
C    18
D    21
What may help to visualize the outcome is to think of this operation as SQL's SUM() applied to each column.
df34.sum(axis = 1)
will return a Series with this content:
0     6
1    22
2    38
df34.mean()
will return a Series with this content:
A    4.0
B    5.0
C    6.0
D    7.0

df34.mean(axis = 1)
will return a Series with this content:
0    1.5
1    5.5
2    9.5
1.13 Sorting DataFrames
- Sorting in DataFrames is performed column-wise
- Two types of syntax are supported:
df.COL_NAME.sort_values()
and
df.sort_values(by='COL_NAME')
- df.COL_NAME-based sorting returns a Series of the specified column's sorted values
- The (by='COL_NAME') form returns a DataFrame with whole rows sorted by the values in the specified column
- Pass the ascending=False parameter to order the values in descending order (see the sketch below)
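A minimal sketch of both forms (using a small made-up DataFrame named dfs):
dfs = pd.DataFrame({'Col1': [3, 1, 2], 'Col2': ['b', 'c', 'a']})
print(dfs.Col1.sort_values())                       # a Series: 1, 2, 3 (with their original row labels)
print(dfs.sort_values(by='Col1', ascending=False))  # a DataFrame with whole rows ordered by Col1: 3, 2, 1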
Notes:
The sort_values() method has this signature:
sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
The supported sorting algorithms are:
{'quicksort', 'mergesort', 'heapsort'}
1.14 Reading From CSV Files
- Files in the popular CSV format are loaded into a pandas DataFrame using the read_csv() method*:
df_from_file = pd.read_csv('my_file.csv')
- The default column delimiter is a comma ‘,’
- You can load the first N rows of the file by specifying the nrows=N parameter
- If a header row is present in the source file, the header labels become the column names in the constructed DataFrame
- header=None assigns integer column names (0, 1, 2, ...); if the file actually contains a header row, that row is read as data, which typically makes all the columns of the object (string) type
- By default (header = 'infer'), pandas tries to determine the header labels from the file; as a bonus, pandas will also try to determine suitable column data types (see the sketch below)
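A brief sketch of those parameters in action (my_file.csv and my_file.tsv are hypothetical files):
df_head = pd.read_csv('my_file.csv', nrows=10)      # load only the first 10 rows
df_raw  = pd.read_csv('my_file.csv', header=None)   # no header row: columns are named 0, 1, 2, ...
df_tab  = pd.read_csv('my_file.tsv', sep='\t')      # a non-default (tab) delimiter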
Notes:
The full signature of the read_csv() method:
pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
pandas can also read the following file formats:
pd.read_clipboard
pd.read_csv
pd.read_excel
pd.read_feather
pd.read_gbq
pd.read_hdf
pd.read_html
pd.read_json
pd.read_msgpack
pd.read_parquet
pd.read_pickle
pd.read_sas
pd.read_sql
pd.read_sql_query
pd.read_sql_table
pd.read_stata
pd.read_table
1.15 Writing to a CSV File
- You can save your DataFrame object in the CSV format using the DataFrame's to_csv() method:
df.to_csv('my_file.csv')
- The default column delimiter is a comma ‘,’
- pandas saves the DataFrame with the header row and the row index (as the first column in the CSV file)
- To prevent saving the row index, pass the index = False parameter (see the example below)
- The default text encoding is 'utf-8' in Python 3+
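For example (a sketch with arbitrary file names):
df.to_csv('my_file_no_index.csv', index=False)   # do not write the row index column
df.to_csv('my_file.tsv', sep='\t')               # use a tab instead of the default comma delimiter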
Notes:
The full to_csv method signature:
to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
1.16 Summary
- In this tutorial, we discussed the following topics related to the pandas library:
- The DataFrame object
- Accessing DataFrames
- Deleting DataFrame rows and columns
- Read and Write operations