
Toolbox: Dask – Drop Rows with Specific Substrings

The Dask package is very similar to pandas, but not all operations work as expected. One of them is getting the index of specific records as an iterable, which in pandas lets us drop unwanted rows like this:

# This works with pandas but not with Dask
idxs = df[df[some_column] >= n].index
ndf = df.drop(idxs)

It won’t work with Dask, whose drop is limited to dropping columns. The idea is therefore to filter records with a condition instead: keeping the records that do not meet a condition is equivalent to dropping the ones that do. To make this article more interesting, we build a solution that removes rows containing specific substrings. The function is straightforward:

def remove_rows_with_substrings(df, column_name, substrings):
    """
    Drop unwanted records based on the condition: drop a record if df[column_name] for this
        record contains one of the substrings.

    :param df: (pandas DataFrame) or (dask DataFrame),
    :param column_name: column which is filtered,
    :param substrings: (list) or list-like substrings to remove from DataFrame.
    :return: (pandas DataFrame) or (dask DataFrame).
    """

    # Remove rows with substrings
    for sub in substrings:
        # regex=False matches the substring literally; na=False keeps rows with missing values
        df = df[~df[column_name].str.contains(sub, regex=False, na=False)]

    return df
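
The loop above filters the DataFrame once per substring. As a possible variation, the substrings can be joined into one escaped pattern so that only a single mask is built. This is a sketch rather than part of the original solution: the function name remove_rows_with_substrings_single_pass and the sample data are made up for illustration, and it is shown on a pandas DataFrame. Since it relies on the same str.contains call, it should carry over to a Dask DataFrame as well.

```python
import re

import pandas as pd


def remove_rows_with_substrings_single_pass(df, column_name, substrings):
    """Hypothetical variant: build one mask for all substrings instead of looping."""
    substrings = list(substrings)
    if not substrings:
        # An empty pattern would match every row, so return the data unchanged
        return df

    # Escape each substring so characters like "+" or "." are matched literally
    pattern = "|".join(re.escape(sub) for sub in substrings)

    # na=False treats missing values as "no match", so those rows are kept
    mask = df[column_name].str.contains(pattern, regex=True, na=False)
    return df[~mask]


# Illustrative data, not taken from the article
pdf = pd.DataFrame({"text": ["spam offer", "hello", None, "c++ tips", "weekly news"]})
cleaned = remove_rows_with_substrings_single_pass(pdf, "text", ["spam", "c++"])
print(cleaned)
```

Escaping matters here because the joined pattern is interpreted as a regular expression; without re.escape, a substring such as "c++" would raise an error or match something unintended.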

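You can easily make it work for other cases and conditions: build a boolean mask for the rows you want to remove and keep only the rows where the mask is False.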
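
As one way to generalize, the numeric condition from the first snippet can be expressed as a mask and inverted. The helper name drop_rows_where, the column names, and the sample values below are assumptions for illustration only. The example runs on pandas; the commented lines show how the same helper would presumably be called on a Dask DataFrame.

```python
import pandas as pd


def drop_rows_where(df, mask):
    """Hypothetical helper: return df without the rows where mask is True."""
    return df[~mask]


# Illustrative data, not taken from the article
pdf = pd.DataFrame({"value": [1, 5, 3, 8], "label": ["a", "b", "c", "d"]})
n = 4

# Equivalent of df.drop(df[df["value"] >= n].index), without touching the index
filtered = drop_rows_where(pdf, pdf["value"] >= n)
print(filtered)

# Conditions can be combined with | (or) and & (and) before they are inverted
combined = drop_rows_where(pdf, (pdf["value"] >= n) | (pdf["label"] == "a"))
print(combined)

# With Dask the same helper is expected to work lazily, for example:
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(pdf, npartitions=2)
#   drop_rows_where(ddf, ddf["value"] >= n).compute()
```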

Szymon