Home and Learn: Data Analysis
In the previous lesson, you ran some basic Pandas commands to inspect your data. We mentioned something about null values. In this lesson, you'll learn more about these null values.
Whenever you get some data, it's typical to find some values missing. In Excel, for example, these would be cells with no data in them. In Pandas, and in Data Analysis in general, these missing values are called Nulls.
As an example, let's load a new data set. Download the following CSV file:
Student Scores Data Set (right click, Save As)
Save the file to your computer and remember the file location. Now copy and paste this code into a new cell in a Jupyter Notebook:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head(10)
Obviously, replace PATH_TO_FILE to point to a location on your own computer. (You saw how to do this in a previous lesson.)
When you run the code above, you should see the first 40 rows of the data print out. Notice row 36:
Row 36 has a value of NaN for the Physics exam. NaN stands for Not a Number. Pandas will enter this NaN when it finds a blank value - missing data, in other words.
To see how many null values you have, run the infor command in a new cell:
df_students.info()
You should see this:
The Non-Null Count column in the image above is referring to the NaN values in the data set. The Range Index runs from 0 to 101, for a total of 102 entries. We're OK with the first two columns, First and Last. They don't have any null values. Three of the columns have null values, though (NaN values). Math has 2, Physics has 3 and Computers has 4. To see this more clearly, you can get a sum and a list of your null values. Run this command in a new cell in your Jupyter Notebook:
df_students.isnull().sum()
You should see this when you run your code:
We chain two functions together, here, the isnull function and the sum function we used in the previous lesson. The result is a table of the null values in each column of the data set.
There are built-in functions to deal with your NaN values. But you could just leave them alone. If you wanted something like the average score, Pandas will ignore NaN values.
Depending on your data, you can fill your NaN values with anything you like. A simple way is this:
df_students.fillna(0)
The fillna function will fill the NaN values with whatever you have between its round brackets. In the line above, we're telling Pandas to fill the NaN values with zeros. It will do this for the entire dataset, whichever column the NaN value is in.
This can be hit and miss, though. You get better results if you use fillna on a column-by-column basis. In which case, use something like this:
df_students['Math'] = df_students['Math'].fillna(0)
This selects just the Math column and replaces the NaN values with zeros. Replace Math with whatever your column is called. (You'll learn more about how square brackets are used in this way later.)
You can also choose to drop rows and columns, if leaving them will have an impact on your results.
To drop all your rows that have NaN values, use the following:
df_students.dropna()
To make sure these changes filter down into your dataset, add inplace=True between the round brackets:
df_students.dropna(inplace=True)
You can drop all the columns that have NaN values with this:
df_students.dropna(axis=1, inplace=True)
By setting axis=1, you're specifying columns as the thing you want to drop. The alternative is axis=0, which drops rows. Rows is the default, so you can miss it out.
Let's fill our NaN values with zeros. We might be skewing the results a little if we were to do something like average scores, but not by much, as we only have a few NaN values.
So, add the following four lines to a new cell in your Jupyter Notebook:
df_students['Math'] = df_students['Math'].fillna(0)
df_students['Physics'] = df_students['Physics'].fillna(0)
df_students['Computers'] = df_students['Computers'].fillna(0)
df_students.isnull().sum()
Run your code and you should see this:
Once last point on Nulls - if you use Aggregate function like sum and count, NaN values are ignored.
Dtypes are the type of data going into your columns. In a new cell in your Notebook, enter the following:
df_students.dtypes
You should see this:
We have a list of column names and the type of data going into each column. The First and Last columns are both of the object data type. This means they will be able to hold string of text. The Math, Physics, and Computers columns are storing scores in the float64 data type.
We'd like to change some of these types. The object types are OK. But the rest of the types can be changed. The scores should be integers rather than float. After all, we don't have scores like 58.7 out of 100. Using the correct types means you won't be using up so much memory in your data analysis. Not crucial here, with only 100 or so rows. But it would be crucial if you had a Dataframe with hundreds of thousands of records.
To convert a column to a different type, use the function astype(). In between the round brackets of astype, you need the data type you're converting to. We'd like to convert to the int type, so the format is this:
df_students['Math'] = df_students['Math'].astype(int)
Before the equal sign, we're specifying that we want to convert the Math column. This goes between square brackets after the data frame name, df_students. After the equal sign, we do the same - specify the column we want to convert. This time, we type a dot and then add astype(int).
With that in mind, add the following five lines to a new cell in your Notebook:
df_students['Math'] = df_students['Math'].astype(int)
df_students['Physics'] = df_students['Physics'].astype(int)
df_students['Computers'] = df_students['Computers'].astype(int)
df_students.dtypes
Run your code to see the following:
Our final four columns are now converted to int32 data types.
(A word of warning: If you have NaN values in your column, you won't be able to convert to integers from float. You'll get an error.)
In the next lesson below, you'll learn how to filter your Pandas Dataframes so that you can return only the data you're interested in.
<< Basic Operations | Filtering Data >>
Email us: enquiry at homeandlearn.co.uk