How to Create Pipelines in Python

Jay Shaw Feb 02, 2024
  1. Create a Pipeline in Python for a Custom Dataset
  2. Create a Pipeline in Python for a Scikit-Learn Dataset

This article will demonstrate how to create a machine learning pipeline in Python, both for sklearn datasets and for custom datasets.

Create a Pipeline in Python for a Custom Dataset

We need two packages to create a Python pipeline: Pandas to generate the data frame and sklearn for the pipeline itself. From sklearn, we also use two components, Pipeline and LinearRegression.

Below is the list of all the packages used.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

Form a Dataset With Values of an Equation

This program intends to create a pipeline that predicts the subsequent values of an equation once the model has been trained on enough earlier values.

The equation used here is:

c = a + 3\sqrt[3]{b}
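As a quick check, the formula can be evaluated directly for the first row of the dataset (a = 15, b = 8):

a, b = 15, 8
c = a + 3 * b ** (1 / 3)  # the cube root of 8 is 2, so c = 15 + 3 * 2
print(c)  # 21.0 (up to floating-point rounding)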

We create a Pandas DataFrame with values that satisfy the equation.

df = pd.DataFrame(columns=["col1", "col2", "col3"], data=[[15, 8, 21], [16, 27, 25]])

Split Data Into Train and Test Sets

Every machine learning model requires splitting the data into two unequal parts. After separating, we use these two sets to train and test the model.

The larger part is used to train the model, and the smaller part is used to test it.

In the code snippet below, the first 8 rows are taken for training the model and the rest for testing it.

learn = df.iloc[:8]
evaluate = df.iloc[8:]
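A quick print confirms the split sizes; with the full 12-row frame used in the complete program below, this yields 8 training rows and 4 test rows:

print(len(learn), len(evaluate))  # 8 4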

The scikit-learn pipeline works by taking values into the pipeline and then giving out the results. Values are provided through two input variables: X (the features) and y (the target).

In the equation used, c is a function of a and b. So, to make the pipeline fit the values in the linear regression model, we transfer the a and b values into X and the c values into y.

Note that we need an (X, y) pair for both the learning rows and the evaluating rows, which produces the four variables below.

learn_X = learn.drop("col3", axis=1)
learn_y = learn.col3

evaluate_X = evaluate.drop("col3", axis=1)
evaluate_y = evaluate.col3

In the code above, the Pandas drop() function removes col3 (the c values) when the remaining columns are fed into the learn_X variable. The learn_y variable receives the values of col3.

axis=1 refers to columns, while axis=0 refers to rows.
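For illustration, here is a minimal sketch (on a tiny throwaway frame) of how the axis argument changes what drop() removes:

tmp = pd.DataFrame(columns=["col1", "col2"], data=[[1, 2], [3, 4]])
print(tmp.drop("col2", axis=1))  # removes the column "col2"
print(tmp.drop(0, axis=0))       # removes the row with index label 0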

Create a Python Pipeline and Fit Values in It

We create a pipeline in Python using the Pipeline class. We must save it in a variable before use.

Here, a variable named rock is declared for this purpose.

Inside the pipeline, each step is given as a (name, estimator) tuple; here there is a single step: ("Model for Linear Regression", LinearRegression()).

rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])
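Because steps is a list of (name, estimator) tuples, a pipeline can chain several stages. Below is a minimal sketch of a hypothetical two-step variant that scales the features before the regression; it is not used in this article's example:

from sklearn.preprocessing import StandardScaler

rock2 = Pipeline(
    steps=[
        ("Feature Scaler", StandardScaler()),  # preprocessing step
        ("Model for Linear Regression", LinearRegression()),  # final estimator
    ]
)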

Once the pipeline is created, it needs to be fitted with the learning values so that the linear regression model can train on the data provided.

rock.fit(learn_X, learn_y)
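Once fitted, the trained estimator can be retrieved from the pipeline through its named_steps mapping, which is handy for inspecting what the model learned. A small sketch (the printed values depend on the training data):

model = rock.named_steps["Model for Linear Regression"]
print(model.coef_)       # learned weights for col1 and col2
print(model.intercept_)  # learned bias term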

After the pipeline is trained, the variable evaluate_X is used to predict the following values through the rock.predict() function.

The predicted values are stored in a new variable, evalve, and printed.

evalve = rock.predict(evaluate_X)
print(f"\n{evalve}")
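Since evaluate_y holds the true col3 values for the test rows, the predictions can be checked against it. For regressors, the pipeline's score() method reports the R² statistic, which is 1.0 for a perfect fit:

print(evaluate_y.values)                   # true values of col3
print(rock.score(evaluate_X, evaluate_y))  # R² on the held-out rows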

Let’s put everything together to observe how a pipeline is created and its performance.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

df = pd.DataFrame(
    columns=["col1", "col2", "col3"],
    data=[
        [15, 8, 21],
        [16, 27, 25],
        [17, 64, 29],
        [18, 125, 33],
        [19, 216, 37],
        [20, 343, 41],
        [21, 512, 45],
        [22, 729, 49],
        [23, 1000, 53],
        [24, 1331, 57],
        [25, 1728, 61],
        [26, 2197, 65],
    ],
)

learn = df.iloc[:8]
evaluate = df.iloc[8:]

learn_X = learn.drop("col3", axis=1)
learn_y = learn.col3

evaluate_X = evaluate.drop("col3", axis=1)
evaluate_y = evaluate.col3

print("\n step: Here, the pipeline is formed")
rock = Pipeline(steps=[("Model for Linear Regression", LinearRegression())])
print("\n Step: Fitting the data inside")
rock.fit(learn_X, learn_y)
print("\n Searching for outcomes after evaluation")
evalve = rock.predict(evaluate_X)
print(f"\n{evalve}")

Output:

"C:/Users/Win 10/pipe.py"

 Step: Here, the pipeline is formed

 Step: Fitting the data inside

 Searching for outcomes after evaluation

[53. 57. 61. 65.]


As we can see, the pipeline predicts the exact values of col3 (53, 57, 61, 65). It can do so because, in this dataset, b = (a − 13)³, so c = a + 3(a − 13) = 4a − 39, which is a linear relationship the model can fit perfectly.

Create a Pipeline in Python for a Scikit-Learn Dataset

This example demonstrates how to create a pipeline in Python for a scikit-learn dataset. Performing pipeline operations on large datasets is a little different from doing so on small ones.

When dealing with large datasets, the pipeline needs additional steps to scale and preprocess the data before the model sees it.

Below are the import packages we need.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import datasets

A dataset from sklearn is used. It ships with several attributes, but we will specifically use two of them: data (the features) and target (the labels).

Load and Split the Dataset into Train and Test Sets

We will be loading the dataset into the variable bc and storing the feature and label arrays in the variables X and y.

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
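It can be helpful to inspect the dataset's dimensions before splitting; the breast cancer dataset ships with 569 samples and 30 features:

print(X.shape)  # (569, 30): 569 samples, 30 features
print(y.shape)  # (569,): one binary label per sample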

Once the dataset is loaded, it must be split into train and test sets, which define the learn and evaluate variables.

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

We allocate the dataset to 4 primary variables: a_learn, a_evaluate, b_learn, and b_evaluate. Unlike the previous program, here the allocation is done through the train_test_split() function.

test_size=0.4 directs the function to reserve 40% of the dataset for testing, while the remaining 60% is kept for training.

random_state=1 fixes the seed of the shuffling so that the split, and therefore the prediction, is the same every time the program is run. Leaving random_state unset produces a different split, and possibly a different outcome, on each run.

stratify=y ensures that both splits preserve the class proportions of y. If the labels are 15% 1's and 85% 0's, stratify will ensure that roughly 15% 1's and 85% 0's appear in every random split.
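To see stratification at work, we can compare the fraction of 1's in the two label splits; with stratify=y the two figures are nearly identical (a quick sketch, exact values depend on the data):

print(b_learn.mean())     # fraction of 1's among the training labels
print(b_evaluate.mean())  # fraction of 1's among the test labels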

Create a Python Pipeline and Fit Values in It

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=10, max_features=5, max_depth=2, random_state=1
    ),
)

Where:

  • make_pipeline() is a scikit-learn helper that creates a pipeline and names each step automatically (see the sketch after this list).
  • StandardScaler() standardizes each feature by removing its mean and scaling it to unit variance.
  • RandomForestClassifier() is a decision-making model that draws sample subsets from the dataset, builds a decision tree for each sample, and predicts a result from each tree. The predicted results are then put to a vote, and the result with the most votes is chosen as the final prediction.
  • n_estimators indicates the number of decision trees to be created before voting.
  • max_features decides how many features are considered when the splitting of a node is executed.
  • max_depth indicates the maximum depth of each tree.
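Unlike Pipeline, make_pipeline() does not take step names; it generates them from the lowercased class names. A minimal sketch showing the generated names:

print(pipeline.named_steps.keys())
# dict_keys(['standardscaler', 'randomforestclassifier'])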

After creating the pipeline, it is fitted on the training values, and predictions are made for the test set.

pipeline.fit(a_learn, b_learn)
y_pred = pipeline.predict(a_evaluate)
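The article prints the raw predictions; to judge their quality, they can be compared against the held-out labels in b_evaluate, for example with scikit-learn's accuracy_score (a sketch; the exact figure depends on the split and the model settings):

from sklearn.metrics import accuracy_score

print(accuracy_score(b_evaluate, y_pred))  # fraction of correctly classified test samples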

Let’s look at the complete program.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn import datasets

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

a_learn, a_evaluate, b_learn, b_evaluate = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

# Create the pipeline

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=10, max_features=5, max_depth=2, random_state=1
    ),
)


pipeline.fit(a_learn, b_learn)

y_pred = pipeline.predict(a_evaluate)

print(y_pred)

Output:

"C:/Users/Win 10/test_cleaned.py"
[0 0 0 1 1 1 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1
 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1
 1 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1
 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1
 1 1 1 1 1 0]
