Home and Learn: Data Analysis
Let's load up our student data again and see if we can't answer a few questions with the help of Pandas. If you haven't yet downloaded the student data, you can grab it here:
Student Scores Data Set (right click, Save As)
We're going to do some filtering, so it helps to know what these symbols mean (the symbols are called conditional operators):
| Symbol | Meaning |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
| == | Equal to |
| != | Not equal to |
| & | And |
| | | Or |
With that in mind, off we go.
Load up the Student data with these lines, changing PATH_TO_FILE to point to a location on your own computer:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
To see is loaded OK, view the first 5 records:
df_students.head()
Nothing we haven't already done in previous lessons. But let's start with a simple question: Did anyone score 100 on the Math exam?
We could just do this:
df_students['Math'] == 100
In between the square brackets after our Dataframe name, we have the name of one of our Columns. Because the column name is text, Math needs to go between quote marks. The double equal sign means equal to. So we're asking if any value in the Math column of the df_students Dataframe is equal to 100. When you run the line, you should see a printout of True or False values:

To see more details, you need to wrap the filter code in a Dataframe object, which is df_students for us. Try this in a new Jupyter Notebook cell:
df_students[df_students['Math'] == 100]
When you run the code, you should see that only Alex Huffman scored 100 on the Math exam:

If you like, you can assign the conditional test to a variable:
math_filter = df_students['Math'] == 100
Now, the results of the equality test are placed into the variable we have called math_filter. To get some output, wrap the variable name in a Dataframe object:
math_filter = df_students['Math']
== 100
df_students[math_filter]
The result is the same as above, but it might be easier to read. Let's ask another question.
Which students scored more than 90 on the Physics exam?
We can anwer this question in a similar way to the previous one. We only need to change a couple of things. Try this code in a new Notebook cell:
physics_filter = df_students['Physics']
> 90
df_students[physics_filter]
Now, the column we want from our df_students Dataframe is Physics. The conditional operator we're using is the greater than symbol (>)
If we wanted to say 'greater than or equal to 90', we'd add an equal sign:
physics_filter = df_students['Physics'] >= 90
We can change the symbol to a less than sign to find out who scored less than 90:
physics_filter = df_students['Physics'] < 90
Less than or equal to 90 is this:
physics_filter = df_students['Physics'] <= 90
df_students[physics_filter]
Run the code and note that only 1 student, Oliver Miah, scored 100 in Physics. If we wanted to exclude this result, we'd write this:
math_filter = df_students['Physics'] !=
100
df_students[math_filter]
We'd then get 101 rows returned, the missing one being poor Oliver. He's missing because we use the operators for not equal to (!=)
Let's ask another question.
Who scored more than 85 on the Computer exam? Ignoring the index numbers, display just their last name and the score.
We clearly need a greater than symbol here. But how do you specify which columns to display? The answer is to put your column names into a python list:
df_students[ ['Last', 'Computers'] ]
You can add as many column names as you want. Here, we add the 'Physics' columns as well:
df_students[ ['Last', 'Computers', 'Physics'] ]
Notice where all the square brackets are in the line above. If you were to run the line, you'd see just your selected columns displayed:

This prints out all the students, though. We need to narrow it down to just those who scored more than 85. To do that, add the conditional statement on the end:
df_students[['Last', 'Computers']][df_students['Computers'] > 85]
Run the code to see the results:

Let's do a query that involves AND.
Find all the students who scored more than 80 on both the Math and Physics exams.
We're better off using variables to answer this, as it could be messy to read, otherwise. First, we can store the Math query into a variable:
math_exam_result = df_students['Math'] > 85
Then do the Physics query:
phys_exam_result = df_students['Physics'] > 85
Now display the results with this line:
df_students[ math_exam_result & phys_exam_result ]
Our two variable go between the square brackets of the Dataframe called df_students. Separating the two is the and symbol (&). Run the code to see this:

If you wanted it all on one line, it would be this:
df_students[(df_students['Math'] > 85) & (df_students['Physics'] > 85)]
Very messy!
In the image above, we have the Computers results. If we only wanted to display certain column, we can change out code to this:
df_students[['Last', 'Math', 'Physics']][math_exam_result & phys_exam_result]
When the code is run, the result is this:

If you wanted to know who got more than 85 in Math OR Physics change the and (&) symbol to an or symbol (|).
Counting
If you wnated to know how many results are in your query, you can wrap them in a len function. Like this:
len( df_students[df_students['Math'] < 20] )
Run the code and you'll see an answer of 21 displayed.
If you wanted to add some text as well, wrap the above line in a str function: (str is short for string)
"Scored less than 20 in Math: " + str( len(df_students[df_students['Math'] < 20]) )
The result would be:
Scored less than 20 in Math: 21
OK. Let's move on and explore something else. But getting good at filtering is the way forward, if you want to get good at Pandas. In the next lesson, you'll learn about loc and iloc, two more ways to filter your data.
<< Dealing with Nulls | Pandas loc, iloc >>
Email us: enquiry at homeandlearn.co.uk