Toolbox: Dask – Drop Rows with Specific Substrings
The Dask package is very similar to pandas, but not all operations work as expected. One of them is getting the index of specific records as an iterable, which in pandas lets us drop unwanted rows like this:
# This will work with pandas but not with Dask
idxs = df[df[some_column] >= n].index
ndf = df.drop(idxs)
It won’t work with Dask! Instead, the idea is to filter records with a condition: keeping the rows that do not meet a condition is the same as dropping the rows that do. To make this article more interesting, we build a solution that removes rows containing a specific substring. The function for it is straightforward:
def remove_rows_with_substrings(df, column_name, substrings):
    """
    Function drops unwanted records based on the condition:
    drop record if df[column_name] for this record contains one of substrings.

    :param df: (pandas DataFrame) or (dask DataFrame),
    :param column_name: column which is filtered,
    :param substrings: (list) or list-like substrings to remove from DataFrame.
    :return: (pandas DataFrame) or (dask DataFrame).
    """
    # Remove rows with substrings
    for sub in substrings:
        df = df[~df[column_name].str.contains(sub)]
    return df
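The loop filters the DataFrame once per substring, and str.contains treats each substring as a regular expression by default. As a rough sketch of one possible variant (not the only way to do it), the substrings can be escaped and joined into a single pattern so that one mask is built. The name remove_rows_matching_any, the sample data, and the choice to keep rows with missing values are assumptions made for illustration.

```python
import re

import pandas as pd


def remove_rows_matching_any(df, column_name, substrings):
    """Hypothetical variant: drop rows whose df[column_name] contains
    any of the given substrings, using a single boolean mask."""
    substrings = list(substrings)
    if not substrings:
        # Nothing to remove
        return df
    # re.escape makes every substring literal, so "." matches only a dot
    pattern = "|".join(re.escape(sub) for sub in substrings)
    # na=False treats missing values as "no match", so those rows are kept
    mask = df[column_name].str.contains(pattern, regex=True, na=False)
    return df[~mask]


# Small pandas example; the data is made up for illustration
pdf = pd.DataFrame(
    {"name": ["report_final", "tmp_report", "test.csv", None, "testing"]}
)
cleaned = remove_rows_matching_any(pdf, "name", ["tmp", "test."])
print(cleaned.index.tolist())  # [0, 3, 4]
```

Because the function only uses boolean indexing and the str accessor, it should also accept a Dask DataFrame, for example one built with dd.from_pandas(pdf, npartitions=2), with .compute() called on the result.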
You can easily make it work for other cases and conditions.
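For instance, the numeric condition from the first snippet can be rewritten as a filter. The sketch below is a minimal illustration on a pandas DataFrame; the function name drop_rows_at_or_above and the sample column are assumptions, and the same boolean indexing is expected to work on a Dask DataFrame.

```python
import pandas as pd


def drop_rows_at_or_above(df, column_name, n):
    """Hypothetical helper: drop rows where df[column_name] >= n
    by keeping the rows that do not satisfy that condition."""
    # Negating the drop condition (rather than writing "< n") keeps rows
    # with missing values, just like the index-based pandas version
    return df[~(df[column_name] >= n)]


# Made-up sample data
pdf = pd.DataFrame({"value": [1.0, 5.0, 10.0, 3.0, None]})

filtered = drop_rows_at_or_above(pdf, "value", 5)

# The pandas-only approach from the start of the article, for comparison
dropped = pdf.drop(pdf[pdf["value"] >= 5].index)

print(filtered.index.tolist())  # [0, 3, 4]
print(filtered.equals(dropped))  # True
```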