{"id":1241,"date":"2024-11-05T10:44:27","date_gmt":"2024-11-05T10:44:27","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=1241"},"modified":"2024-11-05T10:44:28","modified_gmt":"2024-11-05T10:44:28","slug":"toolbox-drop-duplicated-geometries-from-geodataframe-in-python-2024-update","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2024\/11\/05\/toolbox-drop-duplicated-geometries-from-geodataframe-in-python-2024-update\/","title":{"rendered":"Toolbox: drop duplicated geometries from GeoDataFrame in Python &#8211; 2024 update"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">You came here to get the code for dropping duplicated geometries in a GeoPandas <code>GeoDataFrame<\/code>, so let&#8217;s start with the code. If you are interested in a detailed explanation, you can read the text below the code.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd\n\n\ndef norm_drop_geometries(geoseries: gpd.GeoSeries):\n    \"\"\"\n    Function normalizes geometry in a geoseries, and then drops repeating records.\n    \n    INPUT:\n    \n    :param geoseries: (gpd.GeoSeries)\n    \n    OUTPUT:\n    \n    :returns: (gpd.GeoSeries)\n    \"\"\"\n    \n    # Bring every geometry to its normalized (canonical) form first\n    normalized = geoseries.normalize()\n    # Geometries that are equal after normalization are dropped as duplicates\n    deduplicated = normalized.drop_duplicates()\n    return deduplicated<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><br>The article is an <a href=\"https:\/\/ml-gis-service.com\/index.php\/2021\/09\/24\/toolbox-drop-duplicated-geometries-from-geodataframe\/\" data-type=\"post\" data-id=\"475\">updated version of the note from 2021<\/a>. 
Many thanks to Freddy Fingers for pointing out the <code>normalize()<\/code> method from GeoPandas.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In 2021, I shared an article on how to deduplicate complex geometries in a GeoDataFrame. The biggest obstacle in this procedure was the ordering of points in Polygons. Two geometries can be the same but start from a different point, and then the <code>drop_duplicates()<\/code> method won&#8217;t treat them as duplicates. To overcome this problem, I shared this function:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def drop_duplicated_geometries(geoseries: gpd.GeoSeries):\n    \"\"\"\n    Function drops duplicated geometries from a geoseries. It works as follows:\n    \n        1. Take record from the dataset. Check its index against list of indexes-to-skip.\n        If it's not there then move to the next step.\n        2. Store record's index in the list of processed indexes (to re-create geoseries without duplicates)\n        and in the list of indexes-to-skip.\n        3. Compare this record to all other records. If any of them is a duplicate then store its index in\n        the indexes-to-skip.\n        4. 
If all records are checked then re-create the geoseries without duplicates based on the list\n        of processed indexes.\n        \n    INPUT:\n    \n    :param geoseries: (gpd.GeoSeries)\n    \n    OUTPUT:\n    \n    :returns: (gpd.GeoSeries)\n    \"\"\"\n    \n    indexes_to_skip = []\n    processed_indexes = []\n    \n    for index, geom in geoseries.items():\n        if index not in indexes_to_skip:\n            processed_indexes.append(index)\n            indexes_to_skip.append(index)\n            for other_index, other_geom in geoseries.items():\n                if other_index in indexes_to_skip:\n                    pass\n                else:\n                    if geom.equals(other_geom):\n                        indexes_to_skip.append(other_index)\n                    else:\n                        pass\n    output_gs = geoseries[processed_indexes].copy()\n    return output_gs\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Three years passed, and I got a comment from Freddy Fingers about a different method: first normalize the geometries in the GeoSeries, and then use the <code>drop_duplicates()<\/code> method. The only open question was performance. Is this other method faster? Short answer: it is.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Normalize and drop<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Usually, built-in methods provided by package maintainers are better than ad hoc solutions. Thus, if we can, we should use those. The other important practice is benchmarking. To be sure that the function based on GeoPandas <code>normalize()<\/code> is faster than the function based on the Shapely <code>equals()<\/code> method, I ran multiple tests on datasets of growing size. 
The full test suite is available in <a href=\"https:\/\/github.com\/SimonMolinsky\/articles\/blob\/main\/2024-11\/drop-geometries-2\/drop-geoms-compare.ipynb\">this notebook<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Going straight to the results, here is the plot of processing time for each function:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"997\" height=\"525\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2024\/11\/comp.png\" alt=\"\" class=\"wp-image-1243\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2024\/11\/comp.png 997w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2024\/11\/comp-300x158.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2024\/11\/comp-768x404.png 768w\" sizes=\"auto, (max-width: 997px) 100vw, 997px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\">As you can see, the custom function is much slower, and the gap between processing times widens sharply as the dataset grows. Code readability is another argument for the new function.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We started this note with working code, and we will end it with the same code. 
Remember to update your projects if you are using the old function!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd\n\n\ndef norm_drop_geometries(geoseries: gpd.GeoSeries):\n    \"\"\"\n    Function normalizes geometry in a geoseries, and then drops repeating records.\n    \n    INPUT:\n    \n    :param geoseries: (gpd.GeoSeries)\n    \n    OUTPUT:\n    \n    :returns: (gpd.GeoSeries)\n    \"\"\"\n    \n    # Bring every geometry to its normalized (canonical) form first\n    normalized = geoseries.normalize()\n    # Geometries that are equal after normalization are dropped as duplicates\n    deduplicated = normalized.drop_duplicates()\n    return deduplicated<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>You came here to get the code for dropping duplicated geometries in GeoPandas GeoDataFrame, so let&#8217;s start with the code. If you are interested in a detailed explanation, you can read the text below the code. The article is an updated version of the 
note&#8230;<\/p>\n","protected":false},"author":1,"featured_media":1244,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,170,79,3,17],"tags":[272,271,273,57,7,62],"class_list":["post-1241","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-geocomputation","category-pandas","category-python","category-scripts","tag-drop-duplicated-geometries","tag-drop-duplicates","tag-duplicated-polygons","tag-geopandas","tag-python","tag-spatial"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/1241","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=1241"}],"version-history":[{"count":2,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/1241\/revisions"}],"predecessor-version":[{"id":1245,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/1241\/revisions\/1245"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/1244"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=1241"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=1241"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=1241"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}