{"id":475,"date":"2021-09-24T10:41:02","date_gmt":"2021-09-24T10:41:02","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=475"},"modified":"2024-11-05T10:46:02","modified_gmt":"2024-11-05T10:46:02","slug":"toolbox-drop-duplicated-geometries-from-geodataframe","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2021\/09\/24\/toolbox-drop-duplicated-geometries-from-geodataframe\/","title":{"rendered":"Toolbox: Drop Duplicated Geometries from GeoDataFrame"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">NOTE<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The updated version of this post with a better solution is <a href=\"https:\/\/ml-gis-service.com\/index.php\/2024\/11\/05\/toolbox-drop-duplicated-geometries-from-geodataframe-in-python-2024-update\/\" data-type=\"post\" data-id=\"1241\">available here.<\/a><\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Old version (2021) of the article<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Have you ever encountered problems with duplicated geometries in your dataset? I have! Duplicates can cause hard-to-debug errors in an analysis. That&#8217;s why it is crucial to track them down and exclude them. Data like that in the table below is more common in GIS applications than one might expect:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>ID<\/strong><\/td><td><strong>x<\/strong><\/td><td><strong>y<\/strong><\/td><td><strong>value<\/strong><\/td><\/tr><tr><td>1<\/td><td>10.78<\/td><td>52.11<\/td><td>10<\/td><\/tr><tr><td>2<\/td><td>9.11<\/td><td>51.05<\/td><td>5<\/td><\/tr><tr><td>3<\/td><td>9.11<\/td><td>51.05<\/td><td>12<\/td><\/tr><tr><td>4<\/td><td>7.42<\/td><td>52.98<\/td><td>3<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Sample readings<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Look at points 2 and 3: their coordinates are the same, but their values are different. 
Are both the same point? Are they different locations, and did we make a mistake when rounding the floating-point coordinates? Sometimes there&#8217;s no way to answer those questions, and the only option is to remove duplicated geometries.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>Pandas<\/code> package in Python provides a <code>drop_duplicates()<\/code> method, which we could potentially use to remove duplicates:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The <code>drop_duplicates()<\/code> method from Pandas<\/h2>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd\n\ngdf = gpd.read_file('geoseries.shp')  # We assume that the geometries are stored in the 'geometry' column\ncleaned = gdf.drop_duplicates('geometry')<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This method should work fine for <code>Point<\/code> geometries. The problem arises when we compare <code>LineString<\/code> and <code>Polygon<\/code> geometries. Look at an example:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd\nfrom shapely.geometry import Polygon\n\n# The same triangle twice; only the starting vertex of the ring differs\npolygon_a = Polygon([(0, 1), (2, 3), (2, 6), (0, 1)])\npolygon_b = Polygon([(2, 3), (2, 6), (0, 1), (2, 3)])\n\ngs = gpd.GeoSeries(data=[polygon_a, polygon_b])\n\nprint(len(gs.drop_duplicates()))<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">2<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What went wrong? 
Both polygons describe the same shape, but for Pandas they are different, because the order of points matters in its comparison. That&#8217;s why we should be very cautious with methods designed for non-spatial datasets. So how can we drop duplicates? The <code>shapely<\/code> package provides an <code>equals()<\/code> method, which we can use to test whether two geometries are the same. If they are, we can drop the unwanted rows from our GeoDataFrame.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Drop duplicated geometries with <code>shapely<\/code><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The function works as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Take a record from the dataset. Check its index against the list of indexes-to-skip. If it&#8217;s not there, move to the next step.<\/li>\n\n\n\n<li>Store the record&#8217;s index in the list of processed indexes (to re-create the geoseries without duplicates) and in the list of indexes-to-skip.<\/li>\n\n\n\n<li>Compare this record to all other records. If any of them is a duplicate, store its index in the indexes-to-skip.<\/li>\n\n\n\n<li>When all records are checked, re-create the geoseries without duplicates based on the list of processed indexes.<\/li>\n<\/ol>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def drop_duplicated_geometries(geoseries: gpd.GeoSeries):\n    \"\"\"\n    Function drops duplicated geometries from a geoseries. It works as follows:\n    \n        1. Take record from the dataset. Check its index against list of indexes-to-skip.\n        If it's not there then move to the next step.\n        2. Store record's index in the list of processed indexes (to re-create geoseries without duplicates)\n        and in the list of indexes-to-skip.\n        3. Compare this record to all other records. 
If any of them is a duplicate then store its index in\n        the indexes-to-skip.\n        4. If all records are checked then re-create geoseries without duplicates based on the list\n        of processed indexes.\n        \n    INPUT:\n    \n    :param geoseries: (gpd.GeoSeries)\n    \n    OUTPUT:\n    \n    :returns: (gpd.GeoSeries)\n    \"\"\"\n    \n    indexes_to_skip = set()  # a set keeps membership checks cheap\n    processed_indexes = []\n    \n    for index, geom in geoseries.items():\n        if index in indexes_to_skip:\n            continue\n        processed_indexes.append(index)\n        indexes_to_skip.add(index)\n        for other_index, other_geom in geoseries.items():\n            # Mark every not-yet-skipped geometry that equals the current one\n            if other_index not in indexes_to_skip and geom.equals(other_geom):\n                indexes_to_skip.add(other_index)\n    # Keep only the first occurrence of each geometry, selected by index label\n    output_gs = geoseries.loc[processed_indexes].copy()\n    return output_gs<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">For geometries, this function is much safer than the <code>.drop_duplicates()<\/code> method of Pandas. The only problem is its complexity: in the worst case, when there are no duplicates, every geometry is compared with every other one, so the number of comparisons grows as O(n^2). The more duplicates the dataset contains, the fewer comparisons are needed. 
A sample curve of execution time versus the number of records is shown below:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"655\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/09\/exectime-1024x655.png\" alt=\"\" class=\"wp-image-484\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/09\/exectime-1024x655.png 1024w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/09\/exectime-300x192.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/09\/exectime-768x491.png 768w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/09\/exectime-1536x983.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Execution time versus the number of records in a GeoSeries.<\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We should avoid .drop_duplicates() method from 
Pandas!<\/p>\n","protected":false},"author":1,"featured_media":485,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,17],"tags":[106,105,104,7,62],"class_list":["post-475","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-scripts","tag-data-engineering","tag-drop-duplicated-geometry","tag-drop-duplicates-geopandas","tag-python","tag-spatial"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/475","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=475"}],"version-history":[{"count":11,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/475\/revisions"}],"predecessor-version":[{"id":1247,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/475\/revisions\/1247"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/485"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=475"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=475"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=475"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}