Making Sure All Your Data Sources End Up in The Same Shape

I was reading a thread a while ago and the OP asked:

I have data from different input sources that vary in the number of columns they have. It looks like for Keras I need to make sure every file has the same number of columns. My first thought is to just add 0's where the columns should be but I don't know what this does to weighting so I'm asking here. Thanks!

This looks like another feature engineering problem. My best answer is to build one combined dataset from these various sources.

This can be done with pandas, using the concat and merge functions.

result = pd.concat([df1, df4], axis=1)

An example from the pandas documentation.
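
To make that snippet runnable end to end, here is a minimal sketch; the frame names follow the docs, but the columns and values are made up for illustration:

import pandas as pd

# Hypothetical frames standing in for two input sources.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df4 = pd.DataFrame({"B": ["B2", "B3"], "D": ["D2", "D3"]}, index=[2, 3])

# axis=1 places the frames side by side, aligning on the index.
# Rows that exist in only one frame get NaN in the other's columns.
result = pd.concat([df1, df4], axis=1)
print(result)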

Another example from the documentation:

result = pd.concat([df1, df4], ignore_index=True, sort=False)

To get rid of the NaN values, either use the dropna method afterwards, or pass join='inner' while concatenating the DataFrames so that only the shared columns are kept.
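
Here is a sketch of both options; the frames and values are made up for illustration:

import pandas as pd

# Two hypothetical sources whose columns only partly overlap.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df4 = pd.DataFrame({"A": [5, 6], "B": [7, 8], "D": [9, 10]})

# Stacking row-wise leaves NaN wherever a column is missing.
result = pd.concat([df1, df4], ignore_index=True, sort=False)

# Option 1: drop any rows that still contain NaN afterwards.
cleaned = result.dropna()

# Option 2: keep only the columns shared by every frame.
shared = pd.concat([df1, df4], join="inner", ignore_index=True)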

Check out the pandas documentation on merging, joining, and concatenating for the full details.


A person in the thread recommended a decent course of action:

It might be best to manually prepare the data to use only (or mostly) the columns which are present across all datasets, and make sure they match together.

This is a good idea, but important features may be dropped in the process. Because of that, I would recommend the first option: combine everything, then decide which columns to drop after training. Depending on your data, you may also want to add an extra feature yourself indicating the input source, as shown in the sketch below.
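
Here is a quick sketch of that last idea; the source names and columns are hypothetical:

import pandas as pd

# Hypothetical frames from two different input sources.
sensors = pd.DataFrame({"temp": [20.1, 19.8], "humidity": [40, 42]})
logs = pd.DataFrame({"temp": [21.0], "errors": [3]})

# Tag each frame with its origin before combining, so the model
# can still tell the sources apart after concatenation.
sensors["source"] = "sensors"
logs["source"] = "logs"

combined = pd.concat([sensors, logs], ignore_index=True, sort=False)

# For Keras, one-hot encode the text label, e.g. with get_dummies.
combined = pd.get_dummies(combined, columns=["source"])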


The main task is some extra pre-processing of your data. Combining the sources is the best bet. From there you can apply various feature selection techniques to decide which columns you would like to keep. Check out the pandas documentation if you're not sure how to deal with DataFrames.
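
If you want a concrete starting point for that feature selection step, one option is scikit-learn's VarianceThreshold, which drops near-constant columns. This is only a sketch, and the data and threshold are assumptions:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical combined dataset; the values are made up.
combined = pd.DataFrame({
    "temp": [20.1, 19.8, 21.0],
    "humidity": [40, 42, None],
    "flag": [1, 1, 1],  # a constant column carries no information
})

# VarianceThreshold cannot handle NaN, so fill them in first
# (zero is just a placeholder choice here).
features = combined.fillna(0)

# Drop columns whose variance falls below a made-up threshold.
selector = VarianceThreshold(threshold=0.05)
selector.fit(features)
print(list(features.columns[selector.get_support()]))  # 'flag' is dropped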

If you found this article interesting, then check out my mailing list, where I write more stuff like this.