Home > Resources > Blog

Data Visualization in Python for Architects

December 29, 2021 by Mikhail Vladimirov
Categories: Architecture , Data Science and Business Analytics

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.

1.1 Why Do I Need Data Visualization?

The common wisdom states that:
- Seeing is believing and a picture is worth a thousand words
Data visualization techniques help users understand the data, underlying trends and patterns by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.)
Data visualization is also a great vehicle for communicating analysis results to stakeholders
Data visualization is an indispensable activity in exploratory data analysis (EDA)
Business intelligence software vendors usually bundle data visualization tools into their products
There are a number of free tools that may offer similar capabilities in certain areas of data visualization

1.2 Data Visualization in Python

The two most popular data visualization libraries with Python developers are:
- matplotlib, and
- seaborn

Notes:

Visualization with pandas

The pandas module leverages the matplotlib module for creating an easy-to-use visualization API (https://pandas.pydata.org/pandas-docs/stable/visualization.html) against the target DataFrame – you just need to import the matplotlib.pyplot module first.

Matplotlib

Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization.
The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface.
Depends on NumPy.
You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.
Matplotlib’s main focus is 2D plotting; 3D plotting is possible with the mplot3d package.
The 2D and 3D desktop plotting package for Python.
3D plots are supported through the mtplot3d toolkit.
It supports different graphics platforms and toolkits, as well as all the common vector and raster graphics formats (JPG, PNG, GIF, SVG, PDF, etc.)
Matplotlib can be used in Python scripts, IPython REPL, and Jupyter notebooks.

1.3 Getting Started with matplotlib

In your Python program or Jupyter notebook, you start by importing the matplotlib.pyplot module and aliasing it like so:

	import matplotlib.pyplot as plt

In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs with the notebook page with this “magic” command:

%matplotlib inline

The generated graphics will be in-lined in your notebook and there will be no plotting window popping up as in stand-alone Python (including IPython)
You can now use the matplotlib.pyplot object to draw your plots using its graphics functions
- Note: Any methods mentioned in the text below are those of the matplotlib.pyplot object
When done, invoke the show() method (of the matplotlib.pyplot object) to render your plot

1.4A Basic Plot

%matplotlib inline
import matplotlib.pyplot as plt
x = [i for i in range(10)]
y = [x[i] * 2 for i in x]
plt.grid()
plt.plot(x,y)
plt.xticks(x)
plt.yticks(y)
plt.show()

Notes:

Without the xticks() and yticks() methods, the plot will use some calculated values for the x and y coordinates that may not be quite user-friendly:

1.5 Scatter Plots

import random
random.seed(1011)
dots_count = 10
x = [i for i in range(dots_count)]
y = [random.randint(0,dots_count)
for i in x]
plt.grid()
plt.scatter(x,y)
plt.xticks(range(dots_count))
plt.yticks(range(dots_count + 1))
for i in range(dots_count):
plt.annotate(“#” + str(i), (x[i], y[i]))
plt.show()

Notes:

Here is the unwrapped code from the slide:

import random 
random.seed(1011)
dots_count = 10
x = [i for i in range(dots_count)]
y = [random.randint(0,dots_count) for i in x]
plt.grid()    
plt.scatter(x,y)
plt.xticks(range(dots_count))
plt.yticks(range(dots_count + 1))
for i in range(dots_count):
    plt.annotate("#" + str(i), (x[i], y[i]))
plt.show()

The y list holds the following random numbers:

[5, 4, 0, 6, 2, 10, 7, 6, 10, 7]

The plt.annotate() method annotates a single data point on the plot.

1.6 Figures

The figure() method call will launch the plotting window and render the image there
You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows
You can optionally pass the function a number or a string as a parameter representing the figure coordinates to help the moving back and forth between the figures
An important function parameter is figsize which holds a tuple of the figure width and height in inches, e.g. plt.figure(figsize=[12,8])
- the default figsize values are 6.4 and 4.8 inches

1.7 Saving Figures to a File

Use the savefig() method to save the generated figure to a file
Matplotlib will try to figure out the file’s format using the file’s extension
Supported formats are:
- eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff
  - No, gif is not supported
Example:

    # Build a plot ... 
    # Then save it to a local file ...
    plt.savefig('img/justRedLineToX.jpeg', dpi=600)

Note: The destination directory must exist
For more details, visit https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html#matplotlib.pyplot.savefig

1.8 Seaborn

The seaborn data visualization package [https://seaborn.pydata.org/] is based and operates on top of matplotlib; it is tightly integrated with pandas data structures
It has a number of attractive features, including:
- A dataset-oriented API for examining relationships between multiple variables
- Convenient views of complex datasets
- High-level abstractions for structuring multi-plot grids
- Concise control over matplotlib figure styling with several built-in themes

1.9 Getting Started with seaborn

The required imports are as follows:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:

sns.set()

1.10 Histograms and KDE

You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.

sns.distplot (pandas_df.column_name)

1.11 Plotting Bivariate Distributions

In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the joinplot() function:

sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.12 Scatter Plots in seaborn

Scatter plots are rendered using the scatterplot() function, for example:

sns.scatterplot(x, y, hue=[list of color levels]);

1.13 Pair plots in seaborn

The pairplot() function automatically plots pairwise relationships between variables in a dataset
A sample output of the function is shown below
- Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot

1.14 Heatmaps

Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function
A sample output of the function is shown below

1.15A Seaborn Scatterplot with Varying Point Sizes and Hues

Notes:

The plot on the slide is borrowed from seaborn’s documentation: https://seaborn.pydata.org/examples/scatter_bubbles.html?highlight=hue

The relevant code is shown below; note that seaborn uses a pandas DataFrame to help with data rendering aspects organization (the mpg object is a DataFrame; values for hue, size, and other named parameters are taken from mpg’s columns referenced by their column index names):

import seaborn as sns
sns.set_theme(style="white")
# Load the example mpg dataset
mpg = sns.load_dataset("mpg")
# Plot miles per gallon against horsepower with other semantics
sns.relplot(x="horsepower", y="mpg", hue="origin", size="weight",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=mpg)