What is the Pipeline in data engineering?

April 14, 2020

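A pipeline is a sequence of transformers, usually ending with a model. It keeps the preprocessing and modeling code organized in one object, so the same steps run in the same order every time you fit or predict. As far as I have found, a pipeline is similar to OOP: it is a single object that groups other objects, namely the transformers and a model.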
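To make this concrete, here is a minimal sketch of a pipeline that chains two transformers and a model. The toy data, the step names, and the choice of StandardScaler and LinearRegression are my own assumptions for illustration, not something the rest of this post depends on.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data made up for this sketch: one feature with a missing value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Each step is a (name, object) pair; the last step is the model.
model = Pipeline(steps=[
    ("impute", SimpleImputer()),       # default strategy is 'mean'
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])

# fit() runs every transformer in order and then fits the model.
model.fit(X, y)

# predict() reuses the fitted transformers before calling the model.
predictions = model.predict(X)
step_names = list(model.named_steps)
```

Calling fit and predict on the pipeline object is enough; you do not have to call each transformer yourself.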
ColumnTransformer works similarly to a pipeline. The difference is that ColumnTransformer lets you set the list of columns that each transformer is applied to. For example, suppose your data set has both numerical and categorical columns, and you decide to handle missing data differently for each type. In that case, you need to designate which columns belong to each type of data, and ColumnTransformer lets you do that.

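from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical columns: fill missing values with the most frequent value, then one-hot encode.
preprocessing_categorical = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))])

# num_cols and obj_cols are the lists of numerical and categorical column names.
preprocessing = ColumnTransformer(transformers=[
    ('num', SimpleImputer(), num_cols),
    ('cat', preprocessing_categorical, obj_cols)])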

As you can see, each transformer entry in ColumnTransformer has a third element, which is a list of columns. I used SimpleImputer to fill in missing data. SimpleImputer has a strategy parameter that lets you choose how the missing values are filled. In this case, I used the 'most_frequent' strategy for categorical data and the default strategy, 'mean', for numerical data. Depending on your needs, you can combine Pipeline and ColumnTransformer as appropriate.
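To see what the two strategies do, here is a small self-contained sketch with the same structure as the snippet above. The DataFrame, its column names ("age", "city"), and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one numerical and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": pd.Series(["Seoul", "Busan", "Seoul", np.nan], dtype=object),
})
num_cols = ["age"]
obj_cols = ["city"]

preprocessing_categorical = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessing = ColumnTransformer(transformers=[
    ("num", SimpleImputer(), num_cols),               # default 'mean' strategy
    ("cat", preprocessing_categorical, obj_cols),
])

transformed = preprocessing.fit_transform(df)

# The result can be a sparse matrix, so convert it to a dense array if needed.
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)
print(dense)
```

The first output column is the imputed age, where the missing value becomes the mean of 20, 40 and 30, which is 30. The remaining columns are the one-hot encoding of city, where the missing value is treated as "Seoul" because it is the most frequent category.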
