We've learned that models are computer code that processes information to make a prediction or a decision. Here, we'll train a model to guess a comfortable boot size for a dog, based on the size of the harness that fits them.
The first thing we do with a model is load data. We'll cover this in more detail in a later exercise. For now, we'll just write our data directly in our code. Review and run the following code to get started:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv
!pip install statsmodels
# Make a dictionary of data for boot sizes
# and harness size in cm
data = {
'boot_size' : [ 39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
42, 35, 36, 41, 41, 41, 39, 39, 35, 39
],
'harness_size': [ 58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
60, 51, 52, 56, 55, 57, 58, 57, 51, 59
]
}
# Convert it into a table using pandas
dataset = pandas.DataFrame(data)
# Print the data
# In normal Python we would write
# print(dataset)
# but in Jupyter notebooks, if we simply write the name
# of the variable, it is printed nicely
dataset
As you can see, we have the sizes of boots and harnesses for 50 avalanche dogs. We want to use harness size to estimate boot size. This means harness_size is our input. We want a model that will process the input and make its own estimations of the boot size (output).
The second thing we must do is select a model. We're just getting started, so we'll start with a very simple model called OLS (ordinary least squares). This is just a straight line (sometimes called a trendline).
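To make the idea of a straight-line model concrete, here is a minimal sketch in plain Python. The function name `estimate_boot_size` and the parameter values are hypothetical and chosen only for illustration; they are not trained values.

```python
def estimate_boot_size(harness_size, slope, intercept):
    """Straight-line model: output = slope * input + intercept."""
    return slope * harness_size + intercept

# Hypothetical, untrained parameter values used only for illustration
example_slope = 0.5
example_intercept = 10.0

# Estimate a boot size for a harness size of 58
estimate = estimate_boot_size(58, example_slope, example_intercept)
print(estimate)
```

Changing the slope tilts the line, and changing the intercept moves it up or down. Training is the process of choosing these two numbers from data rather than by hand.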
Let's use an existing library to create our model, but we won't train it yet.
# Load a library to do the hard work for us
import statsmodels.formula.api as smf
# First, we define our formula using a special syntax
# This says that boot_size is explained by harness_size
formula = "boot_size ~ harness_size"
# Create the model, but don't train it yet
model = smf.ols(formula = formula, data = dataset)
# Note that we have created our model but it does not
# have internal parameters set yet
if not hasattr(model, 'params'):
    print("Model selected but it does not have parameters set. We need to train it!")
OLS models have two parameters (a slope, β, and an offset, α), but these haven't been set in our model yet. We need to train (fit) our model to find these values so that the model can reliably estimate a dog's boot size based on its harness size.
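As a rough illustration of what "finding these values" means, the sketch below scores two candidate lines by their sum of squared errors, the quantity that least-squares fitting minimizes. It uses only the first five dogs from the data above. The two candidate parameter pairs and the helper name `sum_squared_error` are hypothetical, picked by hand for the comparison.

```python
# First five rows of the dataset defined earlier
harness_size = [58, 58, 52, 58, 57]
boot_size = [39, 38, 37, 39, 38]

def sum_squared_error(slope, intercept, xs, ys):
    """Add up the squared gaps between the line's estimates and the real values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Two hypothetical candidate lines (not trained values)
error_a = sum_squared_error(0.5, 10.0, harness_size, boot_size)
error_b = sum_squared_error(1.0, -20.0, harness_size, boot_size)

print(f"Candidate A error: {error_a}")
print(f"Candidate B error: {error_b}")
print("Better candidate:", "A" if error_a < error_b else "B")
```

The candidate with the smaller error sits closer to the data. Fitting searches for the slope and offset with the smallest such error over all of the data.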
The following code fits (trains) our model to the data you've now seen:
# Load some libraries to do the hard work for us
import graphing
# Train (fit) the model so that it creates a line that
# fits our data. This method does the hard work for
# us. We will look at how this method works in a later unit.
fitted_model = model.fit()
# Print information about our model now that it has been fit.
# params is indexed by term name: 'harness_size' holds the slope
print("The following model parameters have been found:\n" +
      f"Line slope: {fitted_model.params['harness_size']}\n" +
      f"Line Intercept: {fitted_model.params['Intercept']}")
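One way to sanity-check the printed parameters is to compute a least-squares line by hand. The sketch below uses the standard closed-form formulas for a single input and plain Python only. The helper name `fit_line` and the 55 cm example harness are assumptions made for illustration, and this is not necessarily how the library computes its result internally.

```python
# Data copied from the dictionary defined earlier in this exercise
harness_size = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59]
boot_size = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
             40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
             42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
             35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
             42, 35, 36, 41, 41, 41, 39, 39, 35, 39]

def fit_line(xs, ys):
    """Closed-form least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: makes the line pass through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(harness_size, boot_size)
print(f"Line slope: {slope}")
print(f"Line Intercept: {intercept}")

# Estimate the boot size for a hypothetical dog with a 55 cm harness
estimated_boot_size = slope * 55 + intercept
print(f"Estimated boot size: {estimated_boot_size}")
```

If the hand-computed slope and intercept closely match the values printed by the fitted model, that is a good sign the model was trained on the data as intended.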