{"id":581,"date":"2021-11-03T17:27:34","date_gmt":"2021-11-03T17:27:34","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=581"},"modified":"2021-11-03T17:27:35","modified_gmt":"2021-11-03T17:27:35","slug":"toolbox-dask-drop-rows-with-specific-substrings","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2021\/11\/03\/toolbox-dask-drop-rows-with-specific-substrings\/","title":{"rendered":"Toolbox: Dask &#8211; Drop Rows with Specific Substrings"},"content":{"rendered":"\n<p>The <code>Dask<\/code> package is very similar to <code>pandas<\/code>, but not all operations work as expected. One of them is getting the index of selected records as an iterable, which in pandas lets us drop unwanted rows like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># This works with pandas but not with Dask\n\nidxs = df[df[some_column] >= n].index\nndf = df.drop(idxs)<\/pre>\n\n\n\n<p>It won&#8217;t work with <code>Dask<\/code>! That&#8217;s why the idea is to <strong>use conditions to filter records<\/strong>. Keeping only the records that fail a condition is the same as dropping the records that meet it. To make this article more interesting, we build a solution that removes rows containing specific substrings. 
The function is straightforward:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def remove_rows_with_substrings(df, column_name, substrings):\n    \"\"\"\n    Drop every record whose df[column_name] value contains at least one of the given substrings.\n\n    :param df: (pandas DataFrame) or (dask DataFrame),\n    :param column_name: (str) name of the text column to filter on,\n    :param substrings: (list) or list-like, substrings that mark a row for removal.\n    :return: (pandas DataFrame) or (dask DataFrame) without the matching rows.\n    \"\"\"\n\n    # Keep only the rows that do not contain the substring.\n    # regex=False matches each substring literally; na=False keeps rows with missing values.\n    for sub in substrings:\n        df = df[~df[column_name].str.contains(sub, regex=False, na=False)]\n\n    return df<\/pre>\n\n\n\n<p>The same pattern works for other conditions: build a boolean mask and keep only the rows where it is false.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Remove rows with a specific substring from a Dask 
DataFrame<\/p>\n","protected":false},"author":1,"featured_media":583,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[124,2,79,3,17],"tags":[130,125,126,127,128,64,7,129],"class_list":["post-581","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dask","category-data-engineering","category-pandas","category-python","category-scripts","tag-contains","tag-dask","tag-drop","tag-drop-rows","tag-index","tag-pandas","tag-python","tag-substring"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/581","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=581"}],"version-history":[{"count":3,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/581\/revisions"}],"predecessor-version":[{"id":585,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/581\/revisions\/585"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/583"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=581"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=581"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=581"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}