Home and Learn: Data Analysis


Getting Started with Data Analysis in Python

What is Pandas? What is a DataFrame?

Pandas is an open source Python library used for data manipulation and analysis. The idea is that you create something called a DataFrame, which is like a table, then perform operations on this DataFrame to extract information about your data. Here's what a DataFrame might look like when printed out:


    Pet  Number
0   Cat       3
1   Dog       2
2  Fish       7

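Simple, hey! This data is a list of pets that people own and how many. The printed table shows three columns, but only two of them, Pet and Number, hold your data. The first, unlabelled column is called the index. It's just a label that identifies each row (by default, the numbers 0, 1, 2, and so on).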

There are three rows in the table: row 0, row 1, and row 2. It might be that each row represents a pet owner. The owner in the first row, row 0, keeps cats and has three of them. The owner in the second row, row 1, keeps 2 dogs, while the owner in the third row, row 2, has 7 fish.
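
As a rough sketch of the points above, the snippet below rebuilds the same table (using the pd.DataFrame syntax explained later in this lesson) and asks Pandas for its index, its column names, and a single row. The loc accessor isn't covered in this lesson. It's used here only as an illustration, to show that row 1 holds the dog owner.

```python
import pandas as pd

# Rebuild the pets table from this lesson.
df_pets = pd.DataFrame({
    'Pet': ['Cat', 'Dog', 'Fish'],
    'Number': [3, 2, 7]
})

# The index labels the rows; it is not one of the data columns.
print(list(df_pets.index))    # [0, 1, 2]
print(list(df_pets.columns))  # ['Pet', 'Number']
print(df_pets.shape)          # (3, 2) means 3 rows, 2 data columns

# Look up one row by its index label.
row_1 = df_pets.loc[1]
print(row_1['Pet'], row_1['Number'])  # Dog 2
```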

A single column of data, by the way, is called a Series. You can, for example, get just the Pet column from the Dataframe and do something with this single column (Series).
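
To make the idea of a Series a little more concrete, here's a minimal sketch that pulls single columns out of the pets table. The variable names pet_column and total_pets are made up for this example.

```python
import pandas as pd

df_pets = pd.DataFrame({
    'Pet': ['Cat', 'Dog', 'Fish'],
    'Number': [3, 2, 7]
})

# Square brackets with a column name give you that column as a Series.
pet_column = df_pets['Pet']
print(isinstance(pet_column, pd.Series))  # True
print(pet_column.tolist())                # ['Cat', 'Dog', 'Fish']

# A Series has its own methods, such as sum() for numeric data.
total_pets = df_pets['Number'].sum()
print(total_pets)                         # 12
```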

Start up your Jupyter Notebooks app. Create a new Notebook and rename it to anything you like. (If you're not sure how to do this, see here: Install Jupyter.)

To construct such a table in Pandas, you first import the library:

import pandas as pd

The pd here is just a variable name. You could call it almost anything you like:

import pandas as pan

The variable is now called pan. You can use this pan variable from now on whenever you want to use something from the pandas library. (You'll see how it works shortly.) However, we'll stick with pd as the variable name, as it's the naming convention almost everybody uses.
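
If you want to convince yourself that the name really is up to you, this small sketch imports the library under both names and checks that they point to the same thing:

```python
import pandas as pd
import pandas as pan

# Both names refer to the very same pandas module.
print(pd is pan)                      # True
print(pd.DataFrame is pan.DataFrame)  # True
```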

To create the simple table above, we use this syntax:

df_pets = pd.DataFrame(
	{
		'COL_1_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3'],
		'COL_2_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3']
	}
)

The df_pets above is just another variable name. The DataFrame will be stored inside this variable. After the equals sign, we have this:

pd.DataFrame

The pd variable is the one holding a reference to the pandas library, the one we imported as pd.

All the round, curly, and square brackets that come after DataFrame can be a pain. Miss one out and you'll get errors. But you need a pair of round brackets after DataFrame:

pd.DataFrame()

Inside the round brackets you need a pair of curly brackets:

pd.DataFrame( { } )

Curly brackets in Python mean you want a dictionary object. A dictionary is a collection of Key/Value pairs. Like this:

Name: Ken
Age: 34
Job: Writer

The Keys here are Name, Age and Job. The values are Ken, 34, and Writer.
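
Written out as actual Python, that dictionary might look like the sketch below. The variable name person is just one we've picked for the example.

```python
# Keys go on the left of each colon, values on the right.
person = {'Name': 'Ken', 'Age': 34, 'Job': 'Writer'}

print(list(person.keys()))    # ['Name', 'Age', 'Job']
print(list(person.values()))  # ['Ken', 34, 'Writer']

# Use a key in square brackets to look up its value.
print(person['Job'])          # Writer
```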

Inside the curly brackets, you can add your column names and values:

import pandas as pd
df_pets = pd.DataFrame( { 
	'Pet': ['Cat', 'Dog', 'Fish'],
	'Number': [3, 2, 7]
} )
df_pets

Notice the format for column names: They go between quotation marks, followed by a colon:

'Pet':
'Number':

You can use single or double quotes.

After the colon, you can add a Python list (the square brackets) for your values.

['Cat', 'Dog', 'Fish'],

Notice the comma after the list. You need it to separate one column and its values from the next. (The final column doesn't need a comma after it.)
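
If the nested brackets are hard to follow, one way to see what each pair does is to build the pieces one at a time. This is only a sketch, and the names pet_names, pet_counts and columns are invented for the example:

```python
import pandas as pd

# Square brackets: two ordinary Python lists, one per column.
pet_names = ['Cat', 'Dog', 'Fish']
pet_counts = [3, 2, 7]

# Curly brackets: a dictionary that maps each column name to its list.
columns = {'Pet': pet_names, 'Number': pet_counts}

# Round brackets: hand the dictionary to DataFrame.
df_pets = pd.DataFrame(columns)
print(df_pets)
```

The result is the same table you get from typing everything inside pd.DataFrame( ) in one go.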

So, copy and paste the following into the first empty cell of your new Jupyter Notebook. (The indent is one press of the TAB key on your keyboard. Python ignores indentation inside brackets, so it's only there to make the code easier to read.)

import pandas as pd
df_pets = pd.DataFrame( { 
	'Pet': ['Cat', 'Dog', 'Fish'],
	'Number': [3, 2, 7]
} )
df_pets

Press the Run button and you should see this:

[Image: A simple Pandas DataFrame.]

Incidentally, you can type the code all on one line. We've spread it over a few lines just so that you can see the syntax better. So you could do this instead:

df_pets = pd.DataFrame( { 'Pet': ['Cat', 'Dog', 'Fish'], 'Number': [3, 2, 7] } )

(The spaces inside the brackets don't matter, either.)

[Image: A Pandas DataFrame without line breaks.]
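
As a quick check that the layout really makes no difference, this sketch builds the table both ways and compares the results. The names df_multi and df_one are ours, chosen just for the comparison.

```python
import pandas as pd

# Spread over several lines, with indents.
df_multi = pd.DataFrame( {
    'Pet': ['Cat', 'Dog', 'Fish'],
    'Number': [3, 2, 7]
} )

# All on one line, with no spaces inside the brackets.
df_one = pd.DataFrame({'Pet':['Cat','Dog','Fish'],'Number':[3,2,7]})

# equals() reports whether two DataFrames hold the same data.
print(df_multi.equals(df_one))  # True
```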

Let's move on and import a file that we can use as a Dataframe.


 

