Making machine learning pipelines with scikit-learn

  • Level: Intermediate
  • Time to read: 30 minutes
  • Libraries: scikit-learn
  • Prerequisites: machine learning experience with scikit-learn
  • Keywords: pipelines, machine learning, data mining, topic modelling, classification, scikit-learn, python

Machine learning programs can get quite complicated, and we need to take care when developing them for a few reasons:

  1. We want our results to be reproducible, which helps with getting accurate evaluations, getting consistent results in different environments, and recreating models if, for example, our server dies and we need to rebuild.
  2. We want our programs to be easy to read, easy to update with new features or bugfixes, and not get in the way of our development.
  3. Separation of concerns is a key programming aspect to maintaining "Clean code" (as Uncle Bob tells us), and therefore the management of the machine learning pipeline should be separated from other concerns of our program.

Finally, while scikit-learn and other libraries like TensorFlow and PyTorch are great libraries with easy-to-use interfaces, they quickly get complicated in non-trivial programs.

To see this in action, take a look at the following example, where we get some data, train a model and evaluate the classifier. We'll see why pipelines are needed, and use them to drastically simplify a machine learning workflow.

import sklearn
print(f"Scikit-learn version {sklearn.__version__}")
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Output:

Scikit-learn version 0.22.1

The standard workflow

We'll use the classic 20 newsgroups dataset, which is a topic modelling exercise where we attempt to predict the category of each document:

news = datasets.fetch_20newsgroups()
document_index = 0
document_class_index = news.target[document_index]
document_class = news.target_names[document_class_index]
print(f"Document {document_index} is of category {document_class}")
print("Document:\n")
print(news.data[document_index])

Output:

Document 0 is of category rec.autos
Document:

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The standard text mining approach is to vectorise this data into weighted word frequencies, then pass those into our model:

documents_train, documents_test, targets_train, targets_test = train_test_split(news.data, news.target)
transformer = TfidfVectorizer()  # Note this is already two-transformers in one, simplifying our workflow
transformer.fit(documents_train, targets_train)

Output:

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
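
To see what the vectoriser produces, here is a minimal sketch on a tiny corpus. The three toy documents and the toy_ variable names are hypothetical and not part of the newsgroups data; the behaviour shown assumes the default TfidfVectorizer settings printed above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_documents = [
    "the car is fast",
    "the car is red",
    "the engine is loud",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_documents)

# One row per document, one column per distinct word in the vocabulary.
print(toy_matrix.shape)  # (3, 7)

# The learned vocabulary maps each word to its column index.
toy_vocabulary = toy_vectorizer.vocabulary_
print(sorted(toy_vocabulary))

# Within a document, a rarer word ("fast", in one document) gets a higher
# weight than a word that appears in every document ("the").
weight_fast = toy_matrix[0, toy_vocabulary["fast"]]
weight_the = toy_matrix[0, toy_vocabulary["the"]]
print(weight_fast > weight_the)  # True

# With norm='l2' (the default), every row has unit length.
toy_row_norms = np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel()
print(toy_row_norms)
```

The matrix is sparse, which matters on the real dataset, where the vocabulary is far larger than this toy one.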

Now we use that to convert our documents to word-count matrices (which are also normalised), and it is that data that we can pass into our LogisticRegression algorithm.

X_train = transformer.transform(documents_train)

X_test = transformer.transform(documents_test)
classifier = LogisticRegression()

classifier.fit(X_train, targets_train)

predictions = classifier.predict(X_test)

Finally, we evaluate:

print(classification_report(targets_test, predictions))

Output:

              precision    recall  f1-score   support

           0       0.91      0.84      0.88       122
           1       0.70      0.89      0.79       142
           2       0.81      0.86      0.84       153
           3       0.79      0.78      0.78       147
           4       0.91      0.86      0.88       132
           5       0.88      0.88      0.88       146
           6       0.78      0.85      0.81       162
           7       0.96      0.88      0.91       171
           8       0.94      0.96      0.95       138
           9       0.92      0.96      0.94       137
          10       0.98      0.96      0.97       151
          11       0.98      0.94      0.96       139
          12       0.88      0.82      0.85       149
          13       0.97      0.93      0.95       166
          14       0.96      0.94      0.95       163
          15       0.84      0.95      0.89       142
          16       0.95      0.92      0.94       135
          17       0.97      0.97      0.97       147
          18       0.97      0.95      0.96       103
          19       0.88      0.67      0.76        84

    accuracy                           0.89      2829
   macro avg       0.90      0.89      0.89      2829
weighted avg       0.90      0.89      0.89      2829

Not bad! Overall, about 89% accuracy across 20 categories, and we haven't done much work to get there.

That's the problem though - we haven't done much to the data. We used a single transformer and passed the results straight into our classifier. Even so, to do this we have the following variables to keep track of:

  • documents_train, the documents in our training set
  • documents_test, the documents in our testing set
  • targets_train, the classes of the training set
  • targets_test, the classes of the testing set
  • X_train, the vectorised version of the training set documents
  • X_test, the vectorised version of the testing set documents
  • transformer, the object that converts from text documents to matrices
  • classifier, the object that converts from matrices to predictions

That's a lot to work with! Further, let's complicate it only slightly by adding one more step. A common requirement with this dataset is to remove the header lines (those lines at the top in the format Header: Value). We need to remove those, because otherwise our classifier is not predicting from the content, but predicting from the usernames! In other words, the classifier learns that user lerxst@wam.umd.edu posts in rec.autos, and it has trouble generalising from there.

Let's write a function that removes these lines:

def remove_headers(document):
    # Headers are all lines up to the first blank line.
    # We could also check that each header line contains a ":".
    trimmed_document_lines = []
    found_blank_line = False

    for line in document.splitlines():
        if not found_blank_line and line.strip() == "":
            found_blank_line = True
            continue

        if found_blank_line:
            trimmed_document_lines.append(line)

    return "\n".join(trimmed_document_lines)


print(remove_headers(news.data[0]))

Output:

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

As a scikit-learn transformer this is not much code, because we aren't fitting a model or doing anything complicated, just mapping a function over the documents. FunctionTransformer does the wrapping for us:

from sklearn.preprocessing import FunctionTransformer

Now, let us add it to our workflow. The process will be:

  1. Remove headers from all documents
  2. Pass results to TfidfVectorizer
  3. Pass those results to LogisticRegression

def remove_headers_all(documents_list, y=None):
    return [remove_headers(document) for document in documents_list]
trimmer = FunctionTransformer(remove_headers_all)
trimmer.fit(documents_train)
trimmed_documents_train = trimmer.transform(documents_train)
trimmed_documents_test = trimmer.transform(documents_test)

Now the rest of the workflow is as before, but using the trimmed versions of the documents instead:

transformer = TfidfVectorizer()  # Note this is already two-transformers in one, simplifying our workflow
transformer.fit(trimmed_documents_train)  # Fit on the trimmed training documents only

Output:

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

X_train = transformer.transform(trimmed_documents_train)

X_test = transformer.transform(trimmed_documents_test)
classifier = LogisticRegression()

classifier.fit(X_train, targets_train)

predictions = classifier.predict(X_test)
print(classification_report(targets_test, predictions))

Accuracy dropped only marginally, and we can be more confident in the results. Let's try the model on a couple of new documents:

# The newline at the start stops our trimmer removing all the data.
eval_documents = ["\nI'd like to sell my 2002 Corolla. Asking price is $5000", 
                  "\nI'd like to fix the engine on my 2002 Corolla"]
trimmed_eval_documents = trimmer.transform(eval_documents)
X_eval = transformer.transform(trimmed_eval_documents)
trimmed_eval_documents
eval_predictions = classifier.predict(X_eval)
eval_predictions
for i, p in enumerate(eval_predictions):
    print(f"Eval document {i} is predicted as being from category {news.target_names[p]}")
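
The comment about the leading newline is worth a closer look. The sketch below uses a compact restatement of remove_headers (intended to behave the same as the version above, with no scikit-learn needed) to show what happens to a document that has no blank line at all. The two example strings are hypothetical.

```python
def remove_headers(document):
    """Keep only the lines that come after the first blank line."""
    lines = document.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "":
            return "\n".join(lines[index + 1:])
    # No blank line found: the whole document is treated as headers.
    return ""


without_newline = "I'd like to sell my 2002 Corolla. Asking price is $5000"
with_newline = "\n" + without_newline

# Without a blank line, everything counts as header and is removed.
print(repr(remove_headers(without_newline)))  # ''

# With a leading newline, the first line is blank, so the text survives.
print(remove_headers(with_newline) == without_newline)  # True
```

In other words, any new document passed through the trimmer needs the same header/blank line/body layout as the training data, or its content is silently dropped.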

That's great, but the problem we have now is that we have a new transformer, new intermediate documents and a new step in our process. Also, we need to get the steps in the right order, every time, and ensure that the correct output from one step is the input to the next step. (Fun fact, I got this wrong when I first did the code above.)

Pipelines to the rescue

This is the use case for Pipelines: they are scikit-learn's model for how a data mining workflow is managed, and they simplify the process. A pipeline is a multi-step process, where the last step is a classifier (or regression algorithm) and all steps preceding it are transformers. The output of each step becomes the input to the next step, until the final step, whose output is the predictions of the classifier (or regression algorithm).
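
As a rough illustration of that chaining, the sketch below fits the same two steps by hand and as a pipeline, then compares the predictions. The toy documents, labels and toy_ variable names are hypothetical, chosen only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = cars, 1 = computers.
toy_train_documents = [
    "my car needs a new engine",
    "selling my old car cheap",
    "the engine of this car is loud",
    "computer for sale at a good price",
    "my computer needs more memory",
    "new graphics card for my computer",
]
toy_train_targets = [0, 0, 0, 1, 1, 1]
toy_new_documents = ["is this car engine any good", "memory for my computer"]

# Manual workflow: fit each step ourselves and pass the output along.
toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_documents)
toy_classifier = LogisticRegression()
toy_classifier.fit(toy_X_train, toy_train_targets)
manual_predictions = toy_classifier.predict(toy_vectorizer.transform(toy_new_documents))

# Pipeline workflow: the same steps, with the hand-offs managed for us.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_train_documents, toy_train_targets)
pipeline_predictions = toy_pipeline.predict(toy_new_documents)

# Both routes apply the same fitted steps in the same order.
print(list(manual_predictions) == list(pipeline_predictions))  # True
```

Calling fit on the pipeline fits and transforms with each transformer in turn before fitting the final estimator; predict only transforms, then predicts.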

It is vastly less code to work with:

from sklearn.pipeline import make_pipeline
# Pass all the steps, in order, to make_pipeline
my_pipeline = make_pipeline(FunctionTransformer(remove_headers_all), TfidfVectorizer(), LogisticRegression())
my_pipeline.fit(documents_train, targets_train)
predictions = my_pipeline.predict(documents_test)

print(classification_report(targets_test, predictions))

Easy

(Ignore the slight drop in performance; it's not significant and will fluctuate when rerunning the algorithm.)
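
One source of that run-to-run fluctuation is the random train/test split. If repeatable numbers matter, a sketch like the following fixes the split by passing random_state to train_test_split. The toy data is hypothetical and the seed value 14 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for news.data and news.target.
toy_documents = [f"document {i}" for i in range(20)]
toy_targets = [i % 2 for i in range(20)]

# The same random_state gives the same shuffle, and so the same split.
first_split = train_test_split(toy_documents, toy_targets, random_state=14)
second_split = train_test_split(toy_documents, toy_targets, random_state=14)

print(first_split == second_split)  # True
```

With the split pinned, differences between two runs are easier to attribute to changes in the workflow itself.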

Going further, we can add more steps to the above, simply by adding another transformer to make_pipeline - easy! We can swap steps out, change parameters and so on. Having a pipeline manage the complexity makes your data mining workflows easier to work with.
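
Here is a small sketch of both ideas: adding an extra transformer and changing parameters on an existing pipeline. The strip_whitespace_all helper, the toy documents and the parameter values are hypothetical; set_params with the step__parameter naming is scikit-learn's standard way to reach into a step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def strip_whitespace_all(documents_list, y=None):
    # Hypothetical extra cleaning step, used only for illustration.
    return [document.strip() for document in documents_list]


# Adding a step is just one more argument, placed where it should run.
toy_pipeline = make_pipeline(
    FunctionTransformer(strip_whitespace_all),
    TfidfVectorizer(),
    LogisticRegression(),
)

# make_pipeline names each step after its lowercased class name.
print(list(toy_pipeline.named_steps))

# Parameters of any step are addressed as <step name>__<parameter>.
toy_pipeline.set_params(
    tfidfvectorizer__ngram_range=(1, 2),  # arbitrary example value
    logisticregression__C=10.0,           # arbitrary example value
)

# Hypothetical toy data: 0 = cars, 1 = computers.
toy_documents = [
    "  my car needs a new engine",
    "selling my old car cheap  ",
    "computer for sale at a good price",
    "  my computer needs more memory  ",
]
toy_targets = [0, 0, 1, 1]

toy_pipeline.fit(toy_documents, toy_targets)
toy_predictions = toy_pipeline.predict(["  engine for my car  "])
print(toy_predictions)
```

Because every setting lives on one object, the whole workflow can be reconfigured without touching the fit and predict calls.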

You can also reference individual steps, if you need to get information out of them:

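my_pipeline.named_steps
# get_feature_names() matches the scikit-learn 0.22 used here; from scikit-learn 1.0 onwards, use get_feature_names_out() instead
my_pipeline.named_steps['tfidfvectorizer'].get_feature_names()[:5]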

Pipelines simplify your workflow and make updates and changes easier. Further, they take less code even in a simple example, so it is worth switching to them for even very simple machine learning workflows.

Check out the references here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

Python Charmers can provide expert-led training on Data Mining, through our Python for Predictive Data Analytics course (https://pythoncharmers.com/training/python-for-predictive-data-analytics/) or through bespoke courses.