It's time to learn what feature engineering is with Kaggle

What is feature engineering?

 Several things affect the accuracy of a prediction, and one of the most powerful is feature engineering. Simply put, feature engineering is a set of techniques for choosing the right features, the ones that correlate with the target, and dropping the wrong ones. Usually you don't want to use every column in the data set; feature engineering helps you decide which features to select and train on.

baseline model

 As soon as you start, you need to get the raw data ready. Raw data contains categorical columns, date information that isn't ready to use, and so on. Your model may prefer numerical data over the other types, so you need to prep your data first.
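For the categorical part, here is a minimal sketch of one common prep step: turning string labels into integer codes with pandas. The tiny DataFrame and its column names are made up for illustration, not from the course.

```python
import pandas as pd

# Hypothetical categorical data, just for demonstration
clicks = pd.DataFrame({
    'app': ['maps', 'mail', 'maps', 'game'],
    'device': ['phone', 'tablet', 'phone', 'phone'],
})

# astype('category').cat.codes maps each distinct label to an integer,
# giving the model numerical input instead of raw strings
for col in ['app', 'device']:
    clicks[col + '_code'] = clicks[col].astype('category').cat.codes

print(clicks['app_code'].tolist())  # [2, 1, 2, 0]
```

Note that the codes follow the alphabetical order of the labels here; the course itself may use a different encoder, so treat this as just one way to do it.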


 Some data looks like this. It seems numerical, but it isn't, so you need to convert the timestamp type into an integer type.
* The whole code and introduction are in the Kaggle feature engineering course; I referenced the web page at the very bottom of this article.

clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')


Tada! Looks cool. Before you convert, you need to tell pandas that the click_time column is a date, using parse_dates.
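Here is a small sketch of what that looks like, assuming the data is loaded with read_csv (the two-row CSV below is made up):

```python
import io
import pandas as pd

# parse_dates tells read_csv to parse click_time as a datetime,
# which is what makes the .dt accessor (dt.day, dt.hour, ...) work
csv = io.StringIO("click_time\n2017-11-06 14:32:21\n2017-11-06 14:33:34\n")
clicks = pd.read_csv(csv, parse_dates=['click_time'])

print(clicks['click_time'].dt.hour.tolist())  # [14, 14]
```

Without parse_dates, click_time would come in as plain strings and the conversion above would fail.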

Data leakage in time-series data

 After preparing the data, you will train and test your model on the split data. Here is what you have to be careful about: this is time-series data, and the goal is to predict events in the future. So the training set must contain the earlier time-series data, and the test set must be later in time than the training set. If you use time-mixed data, your model will look extremely accurate at predicting the target, but that is just the result of data leakage.

clicks_srt = clicks.sort_values('click_time')

train_split = int(-1*len(clicks)*0.2)
valid_split = int(-1*len(clicks)*0.1)

train = clicks_srt[:train_split]
valid = clicks_srt[train_split:valid_split]
test = clicks_srt[valid_split:]
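As a quick sanity check (my own sketch, not from the course), you can verify that the split really is chronological: every training timestamp should come no later than the earliest validation timestamp, and likewise for validation versus test. The toy data below is made up.

```python
import pandas as pd

# Hypothetical data: ten hourly timestamps, shuffled to mimic unsorted raw data
clicks = pd.DataFrame({
    'click_time': pd.date_range('2017-11-06', periods=10, freq='h')
}).sample(frac=1, random_state=0)

clicks_srt = clicks.sort_values('click_time')
train_split = int(-1 * len(clicks) * 0.2)   # last 20% held out
valid_split = int(-1 * len(clicks) * 0.1)   # last 10% is the test set
train = clicks_srt[:train_split]
valid = clicks_srt[train_split:valid_split]
test = clicks_srt[valid_split:]

# No overlap in time between the splits means no leakage from the future
assert train['click_time'].max() <= valid['click_time'].min()
assert valid['click_time'].max() <= test['click_time'].min()
```

If either assertion fails, the model would be peeking at the future during training.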


Yes, it is ready to be trained and tested. In the course, the instructor uses LightGBM. I don't actually know how that algorithm works with this data yet, so I can't post it, but it is enough for a beginner like me to know how to prep raw data. While I was taking this course, I thought that prepping data is quite similar to prepping ingredients for cooking. It was so interesting.

what to do

 - Hands-on: implement the algorithms from scratch.
 - Keep going with the Kaggle micro-course.



reference

[1] https://www.kaggle.com/matleonard/baseline-model



