- Load data with Pandas
- Filter data by Columns
- Filter data by Rows
- Graph Data
- Create New Columns
In the previous exercise,
we loaded some data and trained (fit) a model on it.
Several aspects of this were simplified,
particularly that the data was hard-coded into our Python script,
and we didn't spend any time really looking at the data itself.
Here, we'll
load data from a file,
filter it,
and graph it.
Doing so is an important first step toward building sound models,
or understanding their limitations.
1. Load data with Pandas
There is a large variety of libraries
that help you work with data.
In Python, one of the most common is Pandas.
We used pandas briefly in the previous exercise.
Pandas can
open data saved as text files and
store it in an organized table called a DataFrame.
Let's open some text data that's stored on disk.
Our data is saved in a file called 'doggy-boot-harness.csv'.
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv
# Read the text file containing data using pandas
dataset = pandas.read_csv('doggy-boot-harness.csv')
# Print the data
# Because there is a lot of data, use head() to print only the first few rows
dataset.head()

As you can see,
this dataset contains information about dogs,
including their doggy boot size, harness size, sex, and age in years.
Data is stored as columns and rows, similar to a table you might see in Excel.
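To make the rows-and-columns layout concrete, here is a minimal sketch that builds a tiny DataFrame by hand and inspects its shape. The values are made up for illustration and are not taken from doggy-boot-harness.csv, and the column name boot_size is an assumption (the text above only says the file contains a doggy boot size).

```python
import pandas

# Hypothetical values for illustration only (not from doggy-boot-harness.csv).
# The column name "boot_size" is an assumption.
example = pandas.DataFrame({
    "boot_size": [39, 38, 37, 39],
    "harness_size": [58, 58, 52, 58],
    "sex": ["male", "male", "female", "male"],
    "age_years": [12.0, 9.6, 8.6, 10.2],
})

# shape is a (rows, columns) tuple
n_rows, n_columns = example.shape
print(f"{n_rows} rows and {n_columns} columns")

# Each column has a name and a single data type
print(example.dtypes)
```

The same shape and dtypes attributes should work on the dataset loaded with read_csv above.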
2. Filter data by Columns
Data is easy to filter by columns.
We can either use attribute access,
like dataset.my_column_name,
or
bracket access, like dataset["my_column_name"].
We can use this to either extract data, or to delete data.
Let's take a look at the harness sizes, and delete the sex and age_years columns.
# Look at the harness sizes
print("Harness sizes")
print(dataset.harness_size)
# Remove the sex and age-in-years columns.
del dataset["sex"]
del dataset["age_years"]
# Print the column names
print("\nAvailable columns after deleting sex and age information:")
print(dataset.columns.values)
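Note that del changes the DataFrame in place. As a hedged alternative, the sketch below shows two ways to get a column subset while leaving the original table untouched. It uses a small hand-made DataFrame with hypothetical values, and the column name boot_size is an assumption.

```python
import pandas

# Hypothetical values for illustration only; "boot_size" is an assumed column name
example = pandas.DataFrame({
    "boot_size": [39, 38, 37, 39],
    "harness_size": [58, 58, 52, 58],
    "sex": ["male", "male", "female", "male"],
    "age_years": [12.0, 9.6, 8.6, 10.2],
})

# Attribute access and bracket access return the same column
same = example.harness_size.equals(example["harness_size"])
print(f"Both access styles give the same column: {same}")

# A list of names selects several columns into a new DataFrame
sizes_only = example[["boot_size", "harness_size"]]

# drop() returns a new DataFrame without the named columns;
# 'example' itself still has all four columns
without_sex_age = example.drop(columns=["sex", "age_years"])

print(sizes_only.columns.values)
print(without_sex_age.columns.values)
print(example.columns.values)
```

Bracket access is the only option when a column name contains spaces or clashes with an existing DataFrame attribute.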

3. Filter data by Rows
We can get data from the top of the table by using the
head() function,
or
from the bottom of the table by using the tail() function.
Both functions return a new DataFrame holding just that section of our table,
which we can pass to the print() function.
The head and tail results can also be used for other purposes,
such as analyses or graphs.
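A minimal sketch of head() and tail() in use, assuming a small stand-in table with hypothetical harness sizes rather than the real file:

```python
import pandas

# Hypothetical values for illustration only
example = pandas.DataFrame({
    "harness_size": [58, 58, 52, 58, 57, 52, 55, 53],
})

# head() returns the first 5 rows unless told otherwise
top = example.head()

# Passing a number changes how many rows are returned
bottom = example.tail(3)

print("TOP OF TABLE")
print(top)

print("\nBOTTOM OF TABLE")
print(bottom)
```

The row labels in the tail() output keep their original positions in the table, so the last three rows here are labeled 5, 6, and 7.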


We can also filter logically.
For example,
we can look at data for dogs whose harness is smaller than size 55.
This works by calculating a True or False value for each row,
then keeping only those rows where the value is True.
# Print how many rows of data we have
print(f"We have {len(dataset)} rows of data")
# Determine whether each avalanche dog's harness size is < 55
# This creates a True or False value for each row where True means
# they are smaller than 55
is_small = dataset.harness_size < 55
print("\nWhether the dog's harness was smaller than size 55:")
print(is_small)
# Now apply this 'mask' to our data to keep the smaller dogs
data_from_small_dogs = dataset[is_small]
print("\nData for dogs with harness smaller than size 55:")
print(data_from_small_dogs)
# Print the number of small dogs
print(f"\nNumber of dogs with harness size less than 55: {len(data_from_small_dogs)}")
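Because a mask is just a column of True and False values, masks can also be combined or inverted before they are applied. The sketch below illustrates this on a small hand-made table. The values are hypothetical, and both the boot_size column name and the boot-size threshold of 37 are assumptions chosen only for the example.

```python
import pandas

# Hypothetical values for illustration only; "boot_size" is an assumed column name
example = pandas.DataFrame({
    "boot_size": [39, 38, 37, 39, 36],
    "harness_size": [58, 58, 52, 58, 50],
})

# One True/False value per row
is_small = example.harness_size < 55
has_big_boots = example.boot_size >= 37  # arbitrary threshold for the example

# & keeps rows where both masks are True
small_with_big_boots = example[is_small & has_big_boots]

# ~ flips the mask, keeping rows where it was False
not_small = example[~is_small]

# True counts as 1, so sum() counts the matching rows
n_small = int(is_small.sum())

print(small_with_big_boots)
print(not_small)
print(f"Number of small dogs: {n_small}")
```

When writing the comparisons inline, wrap each one in parentheses, as in example[(example.harness_size < 55) & (example.boot_size >= 37)], because & binds more tightly than < and >=.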