
Data tidying with Python and Pandas

Overview

This workshop covers practical approaches for handling data in Python using the Pandas library. It is a recommended prerequisite for the Data Visualisation workshop.

To do effective data analysis or visualisation, we usually need our data to be clean and in a consistent format. We will cover the concepts of "tidy", long-form, and wide-form data, along with hands-on approaches for manipulating data and fixing common problems. This workshop concentrates on tabular data, like that found in spreadsheets or databases.
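As a rough illustration of the difference between wide-form and long-form ("tidy") data, here is a minimal sketch using Pandas. The table, its column names, and its values are invented for this example and are not taken from the workshop data.

```python
import pandas as pd

# Hypothetical wide-form table: one row per city, one column per year.
# All values are made up for illustration.
wide = pd.DataFrame({
    "city": ["Melbourne", "Sydney"],
    "2019": [20.1, 22.3],
    "2020": [20.4, 22.8],
})

# Wide -> long ("tidy"): one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="mean_temp")

# Long -> wide again: one column per year.
wide_again = tidy.pivot(index="city", columns="year", values="mean_temp").reset_index()

print(tidy)
print(wide_again)
```

In the long-form table each row holds a single observation, which is usually the layout plotting and grouping functions expect. The wide form is often easier to read by eye.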

Learning Objectives

At the end of this workshop you will be able to:

  • read tabular data into Python using Pandas, and manipulate it
  • identify problems in datasets that will hinder analysis
  • use Python to fix common problems
  • understand and convert between different data layouts such as wide-form and "tidy" as appropriate for the problem being solved
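
The first three objectives can be sketched roughly as follows. The CSV content, column names, and the specific problems shown are assumptions made up for this example. The workshop notebooks use their own datasets and may take a different approach.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw = io.StringIO(
    "site,species,count\n"
    "A,wren,3\n"
    "a,Wren ,?\n"
    "B,robin,5\n"
)

# Read tabular data into a DataFrame.
df = pd.read_csv(raw)

# Identify problems: the count column is not numeric because of the "?"
# entry, and site and species labels are inconsistently written.
print(df.dtypes)
print(df["site"].unique())
print(df["species"].unique())

# Fix common problems: normalise labels and convert counts to numbers.
df["site"] = df["site"].str.upper()
df["species"] = df["species"].str.strip().str.lower()
# Values that cannot be parsed become NaN (missing).
df["count"] = pd.to_numeric(df["count"], errors="coerce")

print(df)
```

Using `errors="coerce"` turns unparseable entries into missing values rather than raising an error. Whether that is appropriate depends on the dataset.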

Requirements

This workshop is designed for participants with a basic knowledge of Python, but is also appropriate for attendees who do not know Python but have significant experience using another programming language. If you are new to Python, you will probably want to refer to a Python primer.

Attendees are required to bring their own laptop computers.

You should install the Anaconda Python distribution before attending:

  • Go to: https://www.anaconda.com/distribution/#download-section
  • Select your operating system
  • Select the Python 3.7 (not 2.7) option to download and install. This is a large download (over 600MB). If you aren't able to install it prior to the workshop, we can work around this, but please contact us beforehand.
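
Once the installation has finished, one way to check that it worked is to run a short script like the one below in a Python session. This is only a suggested sanity check, assuming Pandas was installed as part of Anaconda. It is not an official step of the workshop setup.

```python
import sys
import pandas as pd

# Report the interpreter and Pandas versions that are actually in use.
python_version = sys.version.split()[0]
print("Python:", python_version)
print("pandas:", pd.__version__)

# The workshop expects Python 3, not Python 2.
assert sys.version_info.major == 3, "This workshop expects Python 3"
```

If the import fails or the assertion is triggered, the Anaconda installation is probably not the Python being run.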

Notebooks and Data

This workshop is implemented as a set of Jupyter Notebooks, and we will use (and introduce) Jupyter during the workshop.

You can find all notebooks and data in this GitHub repository. For this workshop, we will use the Pandas_and_tidying.ipynb notebook.