The violin plot is among the less popular visualisation techniques, largely because of its rather ambiguous character. Like a pie chart, it does not give exact numbers; instead it gives a visual impression of possible trends in the data, which helps direct subsequent in-depth analysis. This makes violin plots well suited to exploratory data analysis, before hypotheses are formed and the analysis proper begins.
import seaborn as sns
import pandas as pd
# Read the data
titanic = pd.read_csv('titanic-data.csv')
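# The following is entirely optional.
# It only makes an aesthetic difference in
# the final plot: the values in 'Survived'
# are replaced with readable labels.
def survival_status(survival):
    if survival == 1 or survival == 'Survived':
        return 'Survived'
    return 'Did not survive'

titanic.Survived = titanic.Survived.apply(survival_status)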
# The following line deletes rows with null values
# in the fields which we will analyse with the plot
titanic = titanic.dropna(subset=['Age', 'Sex', 'Survived'])
# The following line describes the violin plot
# Three values are taken into consideration:
# Survival on the X axis, because it takes only
# two values
# Age on the Y axis
# Sex is marked by colour
# The 'split' parameter takes boolean values,
# True or False. If you also set 'hue' to a field
# with two levels, split=True makes the plot more
# compact by drawing half a violin for each
# of the levels marked by hue
age_plot = sns.violinplot(x='Survived', y='Age', hue='Sex',
                          data=titanic, palette='Set2', split=True)
# You can set a title for the plot:
age_plot.set_title('Exploratory violin plot')
The resulting plot should look like the following:
We can also explore other pieces of information from the dataset. The following code can be used to produce violin plots analysing social class (the introductory part of the code changes the numeric values in 'Pclass' to nominal values):
# The following function changes numeric
# values in the 'Pclass' column to nominal ones
def pclass_to_class(pclass):
    if pclass == 1 or pclass == 'Upper':
        return 'Upper'
    elif pclass == 2 or pclass == 'Middle':
        return 'Middle'
    return 'Lower'

titanic.Pclass = titanic.Pclass.apply(pclass_to_class)
# The following line defines the plot, much like in
# the code given above:
class_plot = sns.violinplot(x='Survived', y='Age', hue='Pclass',
                            data=titanic, palette='Set2')
# The following code sets the title
# (here the same title as for the first plot):
class_plot.set_title('Exploratory violin plot')
This should produce the following plot:
The plots are easy to produce and offer good insight into the general relations between different fields in the dataset. Colour coding in the violin plot lets the user include an additional field, which is described in the automatically produced legend.
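The automatically produced legend can also be adjusted after plotting. The following is a minimal sketch, not part of the original code: it uses synthetic stand-in data rather than the Titanic file, and the legend title 'Social class' and the `hue_order` shown are assumptions made for illustration. It relies on seaborn returning a regular matplotlib Axes object, whose legend can be retitled and inspected.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT the Titanic dataset), for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Survived": np.repeat(["Survived", "Did not survive"], 60),
    "Age": rng.normal(30, 12, 120).clip(1, 80),
    "Pclass": np.tile(["Upper", "Middle", "Lower"], 40),
})

# hue_order fixes the order in which the levels appear in the legend
ax = sns.violinplot(x="Survived", y="Age", hue="Pclass",
                    hue_order=["Upper", "Middle", "Lower"],
                    data=toy, palette="Set2")

# Retitle the legend that seaborn created automatically
ax.get_legend().set_title("Social class")

# Read the legend entries back from the Axes
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

Fixing the order of the hue levels keeps the colours consistent between plots, which makes several exploratory plots easier to compare.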
In the Titanic data presented here, we can form several cautious hypotheses about the passengers. We can see that the survival rate was relatively high among young boys, but not among young girls. We can also see that high survival among young passengers is characteristic of the lower and middle classes, but not necessarily of the upper class.
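Hypotheses of this kind can then be checked numerically. The following is a minimal sketch of one way to do so with pandas: the helper `survival_rate` and the age cut-off of 12 are hypothetical choices made for illustration, and the rows below are toy values, not real Titanic records. The column names ('Age', 'Sex', 'Survived') follow the dataset used above.

```python
import pandas as pd

def survival_rate(df, max_age, by):
    """Share of survivors among passengers younger than max_age, per group.

    Hypothetical helper: accepts 'Survived' either as 0/1 or as the
    'Survived' / 'Did not survive' labels used for plotting.
    """
    young = df[df["Age"] < max_age]
    young = young.assign(survived_flag=young["Survived"].isin([1, "Survived"]))
    return young.groupby(by)["survived_flag"].mean()

# Toy rows for illustration only (NOT real Titanic records)
toy = pd.DataFrame({
    "Age": [4, 6, 8, 9, 30, 45],
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Survival rate of passengers under 12, grouped by sex
rates = survival_rate(toy, max_age=12, by=["Sex"])
```

With the real data, the same call could be made as `survival_rate(titanic, max_age=12, by=['Sex'])`, or with `by=['Pclass']` to compare social classes.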
This exploratory analysis with the use of violin plots lets us aim the analysis proper in a more precise direction, and can be very helpful at the outset of data analysis.