Detect Missing Values
Missing data are often recorded with placeholder values such as 'NA', 'NaN', '-', or '.'. To detect these values, we can
- use pandas.Series.unique() to find the unique values, sort them with numpy.sort(), and check the first and last elements of the sorted result;
- use pandas.DataFrame.info() to check, column by column, whether a pandas DataFrame contains default missing values ('NA's or 'NaN's);
  > check numeric variables further if they are reported as object with no missing values, since a numeric variable will be treated as an object (a string variable) if its numeric values are mixed with placeholder strings that pandas does not recognize as missing
- use pandas.DataFrame.describe() to check descriptive statistics of variables and their counts of non-missing values.
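The three checks above can be sketched on a small made-up DataFrame. The column names and values are hypothetical, and the final pandas.to_numeric() step is an extra check that is not part of the list above; it is one possible way to follow up on an object column that should be numeric.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: "age" hides placeholders, "income" has real NaNs
raw = pd.DataFrame({
    "age": ["23", "31", "-", "45", "."],
    "income": [50.0, np.nan, 62.5, 58.0, np.nan],
})

# 1) Sort the unique values; odd placeholders tend to land at either end
unique_sorted = np.sort(raw["age"].unique())
print(unique_sorted[0], unique_sorted[-1])

# 2) info() reports the dtype and the non-null count of every column
buffer = io.StringIO()
raw.info(buf=buffer)
print(buffer.getvalue())

# 3) describe() reports the count of non-missing values per column
summary = raw.describe(include="all")
print(summary.loc["count"])

# Extra check (assumption, not from the list above): entries that cannot be
# parsed as numbers are likely placeholders for missing values
suspects = raw.loc[pd.to_numeric(raw["age"], errors="coerce").isna(), "age"]
print(suspects.tolist())
```

Here "age" shows no missing values in info(), yet it is not numeric, which is the signal to look closer.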
Once missing values have been identified, we can reload the data using pandas.read_csv() with the na_values argument specified.
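As a minimal sketch of the reload step, the snippet below reads the same CSV text twice, once with defaults and once with na_values. The CSV content is invented for illustration and is read from an in-memory string rather than a file.

```python
import io

import pandas as pd

# Hypothetical CSV where '-' and '.' were used to mark missing entries
csv_text = "age,income\n23,50.0\n-,NA\n45,.\n"

# Default parsing: 'NA' is recognized as missing, but '-' and '.' are not,
# so both columns are read as strings
default_df = pd.read_csv(io.StringIO(csv_text))

# Adding the extra placeholders to na_values turns them into NaN
clean_df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "."])

print(default_df.dtypes)
print(clean_df.dtypes)
print(clean_df.isna().sum())
```

The values passed to na_values are added to the default list of missing-value strings, so 'NA' is still recognized in the second call.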
We can then either fill missing values by imputation or delete them using pandas.DataFrame.dropna().
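For the deletion route, a short sketch of a few common dropna() variants on a toy DataFrame (the data are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [1, 2, 3, 4],
})

# Drop every row that has at least one missing value
rows_complete = df.dropna()

# Drop rows only when a specific column is missing
rows_with_a = df.dropna(subset=["a"])

# Keep rows that have at least two non-missing values
rows_thresh = df.dropna(thresh=2)

# Drop columns (instead of rows) that contain missing values
cols_complete = df.dropna(axis=1)
```

Which variant is appropriate depends on how much data can be sacrificed; dropping all incomplete rows keeps only one of the four rows here.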
Detect Types of Missing
Reference: https://en.wikipedia.org/wiki/Missing_data#Types
Missing Completely at Random (MCAR)
Missing at Random (MAR)
There is a systematic relationship between the missingness of a value and other observed data, but not the value of the variable itself. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. Wikipedia Example: Males are less likely to fill in a depression survey but this has nothing to do with their level of depression, after accounting for maleness.
Missing not at Random (MNAR)
To extend the previous example, this would occur if men failed to fill in a depression survey because of their level of depression. This also raises another concern of selection bias.
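To make the three mechanisms concrete, here is a small simulation loosely based on the survey example. All numbers (sample size, missingness rates, the score distribution, the cutoff of 55) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
male = rng.integers(0, 2, size=n)                  # observed covariate
depression = rng.normal(loc=50, scale=10, size=n)  # score that may go missing

# MCAR: every response has the same 30% chance of being missing
mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed covariate (sex)
mar = rng.random(n) < np.where(male == 1, 0.5, 0.1)
# MNAR: missingness depends on the (unobserved) score itself
mnar = rng.random(n) < np.where(depression > 55, 0.6, 0.1)

df = pd.DataFrame({"male": male, "depression": depression,
                   "mcar": mcar, "mar": mar, "mnar": mnar})

# Missingness rate by sex: flat under MCAR, very different under MAR
rates_by_sex = df.groupby("male")[["mcar", "mar", "mnar"]].mean()
print(rates_by_sex)

# Mean of the scores that remain observed under each mechanism
observed_mean = {m: df.loc[~df[m], "depression"].mean()
                 for m in ["mcar", "mar", "mnar"]}
print(observed_mean)
```

Under MNAR the observed mean is pulled down because high scores are dropped more often, and nothing in the observed data alone reveals that.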
missingno.heatmap(), missingno.dendrogram(), and missingno.matrix() in the Python package missingno can help us visualize the patterns of missing values.
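As a rough, pandas-only sketch of the kind of pattern these plots reveal, the snippet below computes the correlation between the missingness indicators of each pair of columns. To my understanding this nullity correlation is what missingno.heatmap() displays, but treat that as an assumption; the data are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],  # missing exactly when "a" is
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],  # missing exactly when "a" is present
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],           # never missing
})

# 1 where a value is missing, 0 where it is present
nullity = df.isna().astype(int)

# Columns that are never (or always) missing have no variation to correlate
nullity = nullity.loc[:, nullity.nunique() > 1]

# +1: two columns go missing together; -1: one is missing when the other is present
nullity_corr = nullity.corr()
print(nullity_corr)
```

With missingno installed, calling missingno.heatmap(df) on the same DataFrame should show a comparable picture graphically.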
Impute Values
Impute with the mean, median, mode, or a constant using sklearn.impute.SimpleImputer() with the strategy argument specified.
Impute with lead or lag values in a time-series data frame using pandas.DataFrame.fillna() with the method argument specified ('bfill', 'ffill', etc.), or using pandas.DataFrame.interpolate() with the method argument specified ('linear', 'nearest', 'quadratic', etc.). In pandas 2.1 and later, the method argument of fillna() is deprecated in favor of pandas.DataFrame.ffill() and pandas.DataFrame.bfill().
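A minimal sketch of both simple approaches on toy data follows. The arrays are invented, and ffill()/bfill() are used in place of fillna(method=...) because they behave the same way and also work in recent pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Cross-sectional data: fill each column with a summary statistic
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Time-series data: fill from neighboring observations
ts = pd.Series([1.0, np.nan, np.nan, 4.0],
               index=pd.date_range("2024-01-01", periods=4, freq="D"))
forward = ts.ffill()                      # carry the last observation forward
backward = ts.bfill()                     # carry the next observation backward
linear = ts.interpolate(method="linear")  # straight line between neighbors
```

Forward and backward fills repeat an observed value, while interpolation produces intermediate values, which matters when the series has a trend.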
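import matplotlib.pyplot as plt

# airquality is assumed to be a DataFrame indexed by date with an Ozone column.
# The three interpolated DataFrames are assumed to come from interpolate().
linear = airquality.interpolate(method='linear')
quadratic = airquality.interpolate(method='quadratic')
nearest = airquality.interpolate(method='nearest')

# Set nrows to 3 and ncols to 1
fig, axes = plt.subplots(3, 1, figsize=(30, 20))
# Create a dictionary of interpolated DataFrames for looping
interpolations = {'Linear Interpolation': linear,
                  'Quadratic Interpolation': quadratic,
                  'Nearest Interpolation': nearest}
# Loop over axes and interpolations
for ax, df_key in zip(axes, interpolations):
    # Plot the interpolated series in red, then overlay the original and set the title
    interpolations[df_key].Ozone.plot(color='red', marker='o', linestyle='dotted', ax=ax)
    airquality.Ozone.plot(title=df_key + ' - Ozone', marker='o', ax=ax)
plt.show()

Impute with advanced models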
References: https://scikit-learn.org/stable/modules/impute.html and
https://github.com/iskandr/fancyimpute
Nearest neighbors (KNN) imputation
When imputing missing values for a categorical variable, we first transform the categorical variable to a numerical variable using OrdinalEncoder().fit_transform(X), then apply the imputation method, and lastly transform the imputed encoded variable back to the categorical variable by calling inverse_transform(X) on the same fitted OrdinalEncoder. Because KNN imputation can average the codes of several neighbors, the imputed codes should be rounded to valid integer codes before they are transformed back.
Multivariate feature imputation
This method iteratively estimates a regression of every feature with missing values on the other features and uses the regression predictions for imputation. It is very robust. However, we should keep an eye on the concern of information leakage if the dependent/target variable (or any target-related variable) of the final model is used in the imputation.
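The two model-based approaches can be sketched with scikit-learn as follows. The data are toy values, n_neighbors=1 is chosen only to keep the result easy to follow, and encoding only the observed categories is one possible way to keep missing entries as NaN. IterativeImputer is still marked experimental in scikit-learn, hence the extra enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# --- KNN imputation of a categorical column (hypothetical data) ---
df = pd.DataFrame({
    "size": [1.0, 1.1, 0.9, 5.0, 5.2, 1.02, 5.1],
    "color": ["red", "red", "red", "blue", "blue", np.nan, np.nan],
})

# Encode only the observed categories so missing entries stay NaN
encoder = OrdinalEncoder()
observed = df["color"].notna()
codes = pd.Series(np.nan, index=df.index)
codes[observed] = encoder.fit_transform(df.loc[observed, ["color"]]).ravel()

# Impute the encoded column from the nearest neighbor in "size"
features = np.column_stack([df["size"], codes])
imputed = KNNImputer(n_neighbors=1).fit_transform(features)

# Round to a valid integer code, then decode with the same fitted encoder
n_categories = len(encoder.categories_[0])
imputed_codes = np.clip(np.round(imputed[:, 1]), 0, n_categories - 1)
df["color_imputed"] = encoder.inverse_transform(
    imputed_codes.reshape(-1, 1)).ravel()
print(df)

# --- Multivariate (iterative) imputation of a numeric column ---
x1 = np.arange(1.0, 11.0)
X = np.column_stack([x1, 2.0 * x1])  # second column is exactly 2 * first
X[4, 1] = np.nan                     # hide one value (true value is 10)
X_filled = IterativeImputer(random_state=0).fit_transform(X)
print(X_filled[4, 1])
```

Note that only the features are passed to the imputers here; the target of the final model is deliberately left out, in line with the leakage concern above.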
Takeaway: The best imputation solution might vary by data sets. We can select the solution by comparing the final model performance using data sets processed by different imputation methods.
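One way to run such a comparison is sketched below, assuming a regression task scored by cross-validated R². The synthetic data, the 20% missingness rate, the choice of LinearRegression, and the three candidate imputers are all illustrative assumptions. Each imputer sits inside a pipeline so that it is fitted on the training folds only.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Knock out about 20% of the values in the first three columns
mask = rng.random(X.shape) < 0.2
mask[:, 3] = False  # keep one column complete so no row is entirely missing
X_missing = X.copy()
X_missing[mask] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

# Score the same final model with each imputation method
scores = {}
for name, imputer in candidates.items():
    model = make_pipeline(imputer, LinearRegression())
    scores[name] = cross_val_score(model, X_missing, y, cv=5, scoring="r2").mean()

best = max(scores, key=scores.get)
print(scores, best)
```

The ranking depends on the data and the final model, so the same loop would need to be rerun for each new data set.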