{"id":232,"date":"2021-07-02T15:41:48","date_gmt":"2021-07-02T15:41:48","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=232"},"modified":"2021-07-02T15:41:49","modified_gmt":"2021-07-02T15:41:49","slug":"data-science-text-matching-with-python-and-fuzzywuzzy","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2021\/07\/02\/data-science-text-matching-with-python-and-fuzzywuzzy\/","title":{"rendered":"Data Science: Text Matching with Python and fuzzywuzzy"},"content":{"rendered":"\n<p>Do you have database with records indicating the same object but described in multiple ways by users? Think of the <em>Main Square Street<\/em> record which could be written as:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><em>The Main Sq Street<\/em>,<\/li><li><em>Main square<\/em>,<\/li><li><em>M. Square street,<\/em><\/li><li>or much, much more&#8230;<\/li><\/ol>\n\n\n\n<p>The good news is that <strong>you&#8217;re not alone<\/strong> with the problem of multiple ways to describe one record! This pattern is extremely common. It is unavoidable when users have opportunity to name features as they like. The bad news is that <strong>string matching is not a trivial task<\/strong> and it&#8217;s rather a semi-supervised problem. Machine Learning algorithms help to find matches but output must be checked by the human operator and those algorithms are not able to find each match. 
In this tutorial we&#8217;ll take a look into the most popular algorithm for string matching: <em>Levenshtein Distance.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Algorithm<\/h2>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>In <a href=\"https:\/\/en.wikipedia.org\/wiki\/Information_theory\">information theory<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Linguistics\">linguistics<\/a>, and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Computer_science\">computer science<\/a>, the <strong>Levenshtein distance<\/strong> is a <a href=\"https:\/\/en.wikipedia.org\/wiki\/String_metric\">string metric<\/a> for measuring the difference between two sequences.<\/p><cite><a href=\"https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">Wikipedia<\/a><\/cite><\/blockquote>\n\n\n\n<p>Algorithm has few rules to calculate distance between phrases. For strings <em>A<\/em> and <em>B<\/em> those rules are:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>If length of A is equal to 0 then distance is equal to the length B.<\/li><li>If length of B is equal to 0 then distance is equal to length A.<\/li><li>If first character of A is equal to the first character of B then skip them and analyze distance for the rest of characters.<\/li><li>Now comparison occurs. 
Add 1 to the distance and continue with whichever of these operations leads to the smallest total distance:<ul><li>remove one element from A,<\/li><li>insert one element to A,<\/li><li>replace one element in A with element from B.<\/li><\/ul><\/li><\/ol>\n\n\n\n<p>Based on those rules we can calculate distances between the words <em>shark<\/em>, <em>trans<\/em>, and <em>mark<\/em>:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Transform <em>shark<\/em> to <em>trans<\/em><\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>S H A R K &lt;-&gt; T R A N S (d = 0)<\/li><li><strong>T<\/strong> H A R K &lt;-&gt; <strong>T<\/strong> R A N S (d = 1) -&gt; replace<\/li><li>T <strong>R<\/strong> A R K &lt;-&gt; T <strong>R<\/strong> A N S (d = 2) -&gt; replace<\/li><li>T R A <strong>N<\/strong> K &lt;-&gt; T R A <strong>N<\/strong> S (d = 3) -&gt; replace<\/li><li>T R A N <strong>S<\/strong> &lt;-&gt; T R A N <strong>S<\/strong> (d = 4) -&gt; replace<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Transform <em>shark<\/em> to <em>mark<\/em><\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>S H A R K &lt;-&gt; M A R K (d = 0)<\/li><li>(<strong>0<\/strong>) H A R K &lt;-&gt; M A R K (d = 1) -&gt; remove<\/li><li><strong>M<\/strong> A R K &lt;-&gt; M A R K (d = 2) -&gt; replace<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Transform <em>mark<\/em> to <em>trans<\/em><\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>M A R K &lt;-&gt; T R A N S (d = 0)<\/li><li><strong>T<\/strong> M A R K &lt;-&gt; T R A N S (d = 1) -&gt; insert<\/li><li>T <strong>R<\/strong> A R K &lt;-&gt; T R A N S (d = 2) -&gt; replace<\/li><li>T R A <strong>N<\/strong> K &lt;-&gt; T R A N S (d = 3) -&gt; replace<\/li><li>T R A N <strong>S<\/strong> &lt;-&gt; T R A N S (d = 4) -&gt; replace<\/li><\/ol>\n\n\n\n<p>Writing a fast implementation of this algorithm is not straightforward. Fortunately, there&#8217;s a Python package, <code>fuzzywuzzy<\/code>, dedicated to measuring how close two strings are. 
Let&#8217;s use it!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Project plan<\/h2>\n\n\n\n<p>Our project is divided into the following sections:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Preparation of <code>conda<\/code> environment.<\/li><li>Download data from the webpage.<\/li><li>Data cleaning.<\/li><li><em>Levenshtein Distance<\/em> function.<\/li><li><code>DataFrame<\/code> for the analysis.<\/li><li>Exploratory Data Analysis: <em>bar plots<\/em>.<\/li><li>Comparison of all records altogether.<\/li><li>Similarity map of records.<\/li><\/ol>\n\n\n\n<p>The main takeaways from this article are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>you will build a system which scrapes data from the webpage (<code>requests<\/code>, <code>BeautifulSoup<\/code>),<\/li><li>you will perform exploratory data analysis on the set of words (<code>fuzzywuzzy<\/code>, <code>pandas<\/code>),<\/li><li>you will learn which data presentation methods are the best for this case (<code>matplotlib<\/code>, <code>seaborn<\/code>).<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Project: Match online text and check similarity of the words<\/h2>\n\n\n\n<p>The project is divided into multiple parts. If you are not interested in some of them then move to the next section. The parts are only loosely dependent on each other.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Part 1: Environment preparation<\/h3>\n\n\n\n<p>Before we start we must have an <em>anaconda<\/em> (<em>conda<\/em>, <em>miniconda<\/em>) distribution installed in our system. With it we can set up our working environment. 
We are going to create environment named <code>stringmatch<\/code> with seven dependencies:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda create -n stringmatch -c conda-forge fuzzywuzzy pandas requests beautifulsoup4 matplotlib seaborn notebook<\/pre>\n\n\n\n<p> To activate environment type:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda activate stringmatch<\/pre>\n\n\n\n<p>And to run <em>Jupyter Notebook<\/em> type:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">jupyter-notebook<\/pre>\n\n\n\n<p>&#8230; and create new notebook.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Part 2: Download data from the webpage<\/h3>\n\n\n\n<p>To make tutorial more interesting we use external data source for our work. It simulates our dependency on the records from database. We use prepared list of words from <a href=\"https:\/\/ml-gis-service.com\/index.php\/teaching\/\">HERE<\/a>. <\/p>\n\n\n\n<p>Working with data from the HTML pages is a tricky thing. We must be aware that the structure of HTML may change over time and the good practice is to monitor if it is the same over and over again when we are scraping page content. Very simple approach is to check if some kind of headers or div elements are present in the webpage. We will do it in this step, along with data downloading. 
In summary we are going to:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>test connection to the webpage,<\/li><li>check changes in the webpage content,<\/li><li>download dataset for analysis.<\/li><\/ul>\n\n\n\n<p>We start all of it with the <code>requests<\/code> Python package. Let&#8217;s go! We begin with the package import and our first constant &#8211; <em>webpage url<\/em>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import requests\n\nWEBPAGE_URL = 'https:\/\/ml-gis-service.com\/index.php\/teaching\/'<\/pre>\n\n\n\n<p>Just a few words about the <code>requests<\/code> package capabilities. I bet that you have seen a <strong>404 error<\/strong> in the past. There is a possibility that you&#8217;ve seen a <strong>502<\/strong> error too. Both kinds of errors inform us that something is wrong. With <strong>4XX<\/strong> errors the problem is on our (client) side (maybe a wrong url?), and errors from the <strong>5XX<\/strong> family are related to server-side issues (maybe the database connection is lost?). If you build any system which automatically crawls the internet then be aware that connection error handling is a must-have skill. The <code>requests<\/code> package defines a few exceptions which we can use to catch connection problems, and its <code>raise_for_status()<\/code> method turns 4XX and 5XX responses into exceptions as well. A good practice is to download page content within a <code>try ... except ...<\/code> statement and, if an error occurs, to log it for the admin. 
Here&#8217;s an example of how to open a connection to the webpage given in the <code>WEBPAGE_URL<\/code> variable with error handling in mind:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Always monitor connections!\n# https:\/\/requests.readthedocs.io\/en\/latest\/user\/quickstart\/#errors-and-exceptions\n\ntry:\n    # Without a timeout (in seconds) requests may wait indefinitely\n    page = requests.get(WEBPAGE_URL, timeout=10)\n    # Turn 4XX and 5XX responses into requests.exceptions.HTTPError\n    page.raise_for_status()\nexcept requests.exceptions.Timeout:\n    print('Info: page is not available now...')\nexcept requests.exceptions.TooManyRedirects:\n    print('Info: probably url has changed...')\nexcept requests.exceptions.RequestException as e:\n    # Any other requests error (including HTTPError) stops the script\n    raise SystemExit(f'Critical Error! {e}')<\/pre>\n\n\n\n<p>Did <code>page<\/code> load successfully? If none of those <code>print<\/code> statements fired, we are on the right path! To test it better we can use the <code>status_code<\/code> attribute of our <code>requests.models.Response<\/code> object:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">page.status_code<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">200<\/pre>\n\n\n\n<p>A status code equal to <strong>200<\/strong> means that our connection is fine and we can start to dig into the page.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">(Optional) Check page content<\/h4>\n\n\n\n<p>You may skip this step &#8211; it&#8217;s optional. Treat it as good advice. 
A successful connection is only the first check (if we build an automated system). The next step is to look into the structure of the HTML document which we are going to scrape. <strong>Chances that something has changed in the website code are pretty high<\/strong> and a good practice is to monitor the elements we depend on, especially their <strong>ids and classes<\/strong>, and the DOM structure of the document. In this example we can do it very simply by checking whether a specific header tag with a specific text exists in the scraped document (it is my webpage and I don&#8217;t plan to change or remove this specific header). How to do it? Very simple: get the full document as <code>bytes<\/code> from the <code>requests<\/code> package, then find the specific HTML tag <code>&lt;h2>&lt;\/h2><\/code> with <code>BeautifulSoup<\/code>. If those tags exist then check the text within them, and if any of the phrases is equal to <code>Data Science workshops materials<\/code> then we are in a good place:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from bs4 import BeautifulSoup\n\n\ndef check_page_tag(page_content, tag_type, unique_text):\n    \"\"\"\n    Function performs check if we're connected to the right page\n    based on the header value. Function raises IOError if header is\n    not detected.\n    \n    :param page_content: (bytes) output of requests package,\n    :param tag_type: (str) HTML tag type to check\n    :param unique_text: (str) unique text inside tag_type element from the page\n\n    \"\"\"\n    error_message = f'You\\'ve not connected to the service.\\n\\\n    HTML tag {tag_type} of value \"{unique_text}\" wasn\\'t detected.'\n    \n    # Parse the raw bytes here, so the function doesn't rely on a global object\n    parsed = BeautifulSoup(page_content, 'html.parser')\n    headers = parsed.find_all(tag_type)\n    \n    # Now we can terminate if there are no headers\n    if not headers:\n        raise IOError(error_message)\n    \n    # Check if unique header exists\n    headers_text = [h.get_text() for h in headers]\n    if unique_text not in headers_text:\n        raise IOError(error_message)\n\n\nWEBPAGE_HEADER = 'Data Science workshops materials'\n\ncheck_page_tag(page.content, 'h2', WEBPAGE_HEADER)<\/pre>\n\n\n\n<p>If the function <code>check_page_tag()<\/code> doesn&#8217;t raise an error then everything is fine and we can retrieve data. We use <code>BeautifulSoup<\/code> for this task. First, parse <code>bytes<\/code> from <code>requests<\/code> into a <code>soup<\/code> object:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup = BeautifulSoup(page.content, 'html.parser')<\/pre>\n\n\n\n<p>The function <code>check_page_tag()<\/code> parses the document in the same way. Within it we use another method of the <code>BeautifulSoup<\/code> package, <code>.find_all()<\/code>, which scans the parsed document and returns a <code>list<\/code> of matching tag elements. 
Then for each tag we extract the text with the <code>.get_text()<\/code> method and compare it to the <code>unique_text<\/code> parameter. <code>IOError<\/code> is raised if <code>unique_text<\/code> isn&#8217;t found anywhere.<\/p>\n\n\n\n<p>This is a very simple implementation of a function which scans a webpage to find structural changes. In reality it will be more complex, especially with pages loaded dynamically. (Keep this in mind!)<\/p>\n\n\n\n<p>Now we return to our main path. We are going to retrieve the text for distance calculation.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Data Transformation with <code>BeautifulSoup<\/code><\/h4>\n\n\n\n<p>Part 2 of the article is a complicated way to retrieve data. In reality <strong>we usually work with a database with a clear schema<\/strong> or with a flat file. But an HTML document has its structure too, and this non-trivial data preparation task teaches us how to handle<em> boundary conditions <\/em>within projects. Take a look into the source code of the webpage <a href=\"https:\/\/ml-gis-service.com\/index.php\/teaching\/\">HERE<\/a>. Words used in this tutorial are presented as a bullet list. Each list item is wrapped in a <code>&lt;li>&lt;\/li><\/code> tag. 
The case here is to obtain concrete list with specific words without tags.<\/p>\n\n\n\n<p>But here&#8217;s a problem, if we just take all <code>&lt;li&gt;&lt;\/li&gt;<\/code> elements with the <code>.find_all()<\/code> method from <code>BeautifulSoup<\/code> we get:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">words_candidates = soup.find_all('li')\nprint(words_candidates)<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"html\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[&lt;li class=\"menu-item menu-item-type-post_type menu-item-object-page narrow\" id=\"nav-menu-item-87\">&lt;a class=\"\" href=\"https:\/\/ml-gis-service.com\/index.php\/about\/\">&lt;span class=\"item_outer\">&lt;span class=\"item_inner\">&lt;span class=\"item_text\">Author&lt;\/span>&lt;\/span>&lt;\/span>&lt;\/a>&lt;\/li>, &lt;li class=\"menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-121 current_page_item mkd-active-item narrow\" id=\"nav-menu-item-124\">&lt;a class=\"current\" href=\"https:\/\/ml-gis-service.com\/index.php\/teaching\/\">&lt;span class=\"item_outer\">&lt;span class=\"item_inner\">&lt;span class=\"item_text\">Teaching&lt;\/span>&lt;\/span>&lt;\/span>&lt;\/a>&lt;\/li>, &lt;li class=\"cat-item cat-item-2\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/data-engineering\/\">Data Engineering&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-18\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/data-science\/\">Data Science&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-19\">&lt;a 
href=\"https:\/\/ml-gis-service.com\/index.php\/category\/machine-learning\/\">Machine Learning&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-50\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/management\/\">Management&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-48\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/personal-development\/\">Personal Development&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-3\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/python\/\">Python&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-49\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/rd\/\">R&amp;amp;D&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-4\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/raster\/\">Raster&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-20\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/remote-sensing\/\">Remote Sensing&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-17\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/scripts\/\">Scripts&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-30\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/spatial-statistics\/\">Spatial Statistics&lt;\/a>\n&lt;\/li>, &lt;li class=\"cat-item cat-item-31\">&lt;a href=\"https:\/\/ml-gis-service.com\/index.php\/category\/tutorials\/\">Tutorials&lt;\/a>\n&lt;\/li>, &lt;li class=\"menu-item menu-item-type-post_type menu-item-object-page\" id=\"mobile-menu-item-87\">&lt;a class=\"\" href=\"https:\/\/ml-gis-service.com\/index.php\/about\/\">&lt;span>Author&lt;\/span>&lt;\/a>&lt;\/li>, &lt;li class=\"menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-121 current_page_item mkd-active-item\" id=\"mobile-menu-item-124\">&lt;a class=\"current\" href=\"https:\/\/ml-gis-service.com\/index.php\/teaching\/\">&lt;span>Teaching&lt;\/span>&lt;\/a>&lt;\/li>, 
&lt;li>time&lt;\/li>, &lt;li>year&lt;\/li>, &lt;li>people&lt;\/li>, &lt;li>way&lt;\/li>, &lt;li>day&lt;\/li>, &lt;li>man&lt;\/li>, &lt;li>thing&lt;\/li>, &lt;li>woman&lt;\/li>, &lt;li>life&lt;\/li>, &lt;li>child&lt;\/li>, &lt;li>world&lt;\/li>, &lt;li>school&lt;\/li>, &lt;li>state&lt;\/li>, &lt;li>family&lt;\/li>, &lt;li>student&lt;\/li>, &lt;li>group&lt;\/li>, &lt;li>country&lt;\/li>, &lt;li>problem&lt;\/li>, &lt;li>hand&lt;\/li>, &lt;li>part&lt;\/li>, &lt;li>place&lt;\/li>, &lt;li>case&lt;\/li>, &lt;li>week&lt;\/li>, &lt;li>company&lt;\/li>, &lt;li>system&lt;\/li>, &lt;li>program&lt;\/li>, &lt;li>question&lt;\/li>, &lt;li>work&lt;\/li>, &lt;li>government&lt;\/li>, &lt;li>number&lt;\/li>, &lt;li>night&lt;\/li>, &lt;li>point&lt;\/li>, &lt;li>home&lt;\/li>, &lt;li>water&lt;\/li>, &lt;li>room&lt;\/li>, &lt;li>mother&lt;\/li>, &lt;li>area&lt;\/li>, &lt;li>money&lt;\/li>, &lt;li>story&lt;\/li>, &lt;li>fact&lt;\/li>, &lt;li>month&lt;\/li>, &lt;li>lot&lt;\/li>, &lt;li>right&lt;\/li>, &lt;li>study&lt;\/li>, &lt;li>book&lt;\/li>, &lt;li>eye&lt;\/li>, &lt;li>job&lt;\/li>, &lt;li>word&lt;\/li>, &lt;li>business&lt;\/li>, &lt;li>issue&lt;\/li>, &lt;li>side&lt;\/li>, &lt;li>kind&lt;\/li>, &lt;li>head&lt;\/li>, &lt;li>house&lt;\/li>, &lt;li>service&lt;\/li>, &lt;li>friend&lt;\/li>, &lt;li>father&lt;\/li>, &lt;li>power&lt;\/li>, &lt;li>hour&lt;\/li>, &lt;li>game&lt;\/li>, &lt;li>line&lt;\/li>, &lt;li>end&lt;\/li>, &lt;li>member&lt;\/li>, &lt;li>law&lt;\/li>, &lt;li>car&lt;\/li>, &lt;li>city&lt;\/li>, &lt;li>community&lt;\/li>, &lt;li>name&lt;\/li>, &lt;li>president&lt;\/li>, &lt;li>team&lt;\/li>, &lt;li>minute&lt;\/li>, &lt;li>idea&lt;\/li>, &lt;li>kid&lt;\/li>, &lt;li>body&lt;\/li>, &lt;li>information&lt;\/li>, &lt;li>back&lt;\/li>, &lt;li>parent&lt;\/li>, &lt;li>face&lt;\/li>, &lt;li>others&lt;\/li>, &lt;li>level&lt;\/li>, &lt;li>office&lt;\/li>, &lt;li>door&lt;\/li>, &lt;li>health&lt;\/li>, &lt;li>person&lt;\/li>, &lt;li>art&lt;\/li>, &lt;li>war&lt;\/li>, 
&lt;li>history&lt;\/li>, &lt;li>party&lt;\/li>, &lt;li>result&lt;\/li>, &lt;li>change&lt;\/li>, &lt;li>morning&lt;\/li>, &lt;li>reason&lt;\/li>, &lt;li>research&lt;\/li>, &lt;li>girl&lt;\/li>, &lt;li>guy&lt;\/li>, &lt;li>moment&lt;\/li>, &lt;li>air&lt;\/li>, &lt;li>teacher&lt;\/li>, &lt;li>force&lt;\/li>, &lt;li>education&lt;\/li>]<\/pre>\n\n\n\n<p>That&#8217;s terribly wrong! We obtained a lot of garbage&#8230; But <strong>we can take a closer look at the <code>&lt;li>&lt;\/li><\/code> tags and check for something which can be used to distinguish trash from the important words<\/strong>. There&#8217;s something. Or rather, there isn&#8217;t! The <code>&lt;li>&lt;\/li><\/code> tags with the words from the list are not members of any <code>class<\/code>. And we can utilize this fact. The method <code>.find_all()<\/code> has a <code>class_<\/code> parameter. If we set it to <code>None<\/code> then the method returns only those elements <strong>which are not members of any class<\/strong>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">words_candidates = soup.find_all('li', class_=None)<\/pre>\n\n\n\n<p>Great! Now we are limited to the list of simple English nouns. To retrieve text only, without tags, we use the <code>.get_text()<\/code> method and a list comprehension:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">words = [x.get_text() for x in words_candidates]<\/pre>\n\n\n\n<p>Excellent! 
Our corpus of sample English words is ready for the most important step.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Part 3: Distance from words to sample word<\/h3>\n\n\n\n<p>Now we have everything we need to move on to the data science algorithms. Let&#8217;s start the implementation with a sample English word. In my case it is <code>shark<\/code> but you should choose another word. Create a variable with your word and import the <code>fuzz<\/code> module from the <code>fuzzywuzzy<\/code> package:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from fuzzywuzzy import fuzz\n\nSAMPLE = 'shark'  # you should choose another word<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Distance calculation function<\/h4>\n\n\n\n<p>We have multiple words to compare and usually we have multiple records in a database to check. That&#8217;s why it is better to write a small function which iterates through words and creates a list of distances to any given word. Note that <code>fuzz.ratio()<\/code> returns a similarity score between 0 and 100 rather than a raw edit distance: the higher the value, the closer the two words. We keep calling it a distance in this tutorial:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def find_distance(word, list_of_words):\n    \"\"\"\n    Function calculates fuzz.ratio similarity (0 - no match,\n    100 - identical) between given word and other words.\n    \n    :param word: (str) single word to test distance from it to other\n        words,\n    :param list_of_words: (list) list of other words to compare\n    \n    :return: (list) list of distances from a given word to other words,\n        position of each distance is the same as other words indexes.\n    \"\"\"\n    \n    distances = []\n    for single_word in list_of_words:\n        distance = fuzz.ratio(word, single_word)\n        distances.append(distance)\n    return distances\n\ndistances = find_distance(SAMPLE, words)<\/pre>\n\n\n\n<p>If we print <code>distances[:3]<\/code> we can check the first three results:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[0, 44, 0]<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Result <code>list<\/code> to <code>DataFrame<\/code><\/h4>\n\n\n\n<p>A Python <code>list<\/code> is not the best object for analytics. We get distances but those are <strong>only single values<\/strong> tied to specific words by their index in the list, which can lead to mistakes and errors in further data handling. A better idea is to create a human-friendly structure which preserves information about words. The best idea is to build a <code>DataFrame<\/code> from our results. We have all the information for this transformation: <code>data<\/code> as a list of distances; <code>index<\/code> as a list of words; and we have only a single column &#8211; the chosen word &#8211; which we will pass into the <code>columns<\/code> parameter of <code>DataFrame<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\n\ndf = pd.DataFrame(data=distances, index=words, columns=[SAMPLE])<\/pre>\n\n\n\n<p>Now we can perform a fast analysis, for example filtering based on the distance value:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">df[df[SAMPLE] >= 50]<\/pre>\n\n\n\n<figure 
class=\"wp-block-table\"><table><tbody><tr><td><\/td><td><strong>shark<\/strong><\/td><\/tr><tr><td><strong>car<\/strong><\/td><td>50<\/td><\/tr><tr><td><strong>art<\/strong><\/td><td>50<\/td><\/tr><tr><td><strong>war<\/strong><\/td><td>50<\/td><\/tr><tr><td><strong>air<\/strong><\/td><td>50<\/td><\/tr><\/tbody><\/table><figcaption>Distance between word <strong>shark<\/strong> and other words greater or equal to 50.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Bar plot of distances<\/h4>\n\n\n\n<p>To make things more pleasant for the human viewer we will plot results as a bar plot in the <code>seaborn<\/code> package. Not all distances are important from an analyst perspective. In reality we are more interested in similar words than dissimilar. To make plot clean we write helper function which transforms passed <code>DataFrame<\/code>. Function should limit results to those greater than <code>lower_thresh<\/code> limit and it should sort values from the biggest to the smallest.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import matplotlib.pyplot as plt\nimport seaborn as sns\n\ndef plot_bars(bardata, column_to_plot, lower_thresh=20):\n    \n    ndf = (bardata[bardata[column_to_plot] > lower_thresh]).sort_values(\n        column_to_plot, ascending=False)\n    \n    plt.figure(figsize=(14, 6))\n    sns.barplot(x=ndf.index,\n                y=ndf[column_to_plot],\n                hue=ndf[column_to_plot],\n                palette=\"rocket\",\n                dodge=False)\n    plt.xticks(rotation=90)\n    plt.title(f'Distance from the word {column_to_plot} to other words')\n    plt.show()\n\nplot_bars(df, SAMPLE)<\/pre>\n\n\n\n<p>I got this plot:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" 
width=\"829\" height=\"406\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/distances_one_sample.png\" alt=\"\" class=\"wp-image-277\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/distances_one_sample.png 829w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/distances_one_sample-300x147.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/distances_one_sample-768x376.png 768w\" sizes=\"auto, (max-width: 829px) 100vw, 829px\" \/><\/figure>\n\n\n\n<p>Is it clearer? A plot like this may be used in a report for a client or for the business division of our company.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Part 4: Distance between all words<\/h3>\n\n\n\n<p>Part 3 of the article was a warm-up. The meaningful task is to write a pipeline which <strong>compares each record with each other record<\/strong> in a database, and that&#8217;s how we try to match records which describe the same object in slightly different ways. A <code>DataFrame<\/code> is a nice simulation of a relational database table. Let&#8217;s create it. This time its structure resembles the adjacency matrix of a graph where each word is a <em>node<\/em>, and we assume that every node is <em>connected<\/em> with all other nodes and the <em>weight<\/em> of each connection is the distance between the two words.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td><strong>car<\/strong><\/td><td><strong>cat<\/strong><\/td><\/tr><tr><td><strong>car<\/strong><\/td><td>100<\/td><td>67<\/td><\/tr><tr><td><strong>cat<\/strong><\/td><td>67<\/td><td>100<\/td><\/tr><\/tbody><\/table><figcaption>Distance matrix between words.<\/figcaption><\/figure>\n\n\n\n<p>The distance from a word to itself is equal to 100 (the closest possible match), so every diagonal cell would hold this value. In the code that follows we leave the diagonal as <code>NaN<\/code> because we are not interested in comparing a word with itself. The matrix is symmetric: the value in position <code>[i][j]<\/code> is equal to the value in position <code>[j][i]<\/code>. 
To create it we simply initialize a new <code>DataFrame<\/code> with columns and index the same as our list of words. (Remember to append your sample word to this list before you create the <code>DataFrame<\/code>).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">words.append(SAMPLE)  # append sample word\ndf = pd.DataFrame(index=words, columns=words)  # Create new DataFrame\ndf.head()<\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td><strong>time<\/strong><\/td><td><strong>year<\/strong><\/td><td><strong>people<\/strong><\/td><td>&#8230;<\/td><\/tr><tr><td><strong>time<\/strong><\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><\/tr><tr><td><strong>year<\/strong><\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><\/tr><tr><td><strong>people<\/strong><\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><\/tr><tr><td>&#8230;<\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><td>NaN<\/td><\/tr><\/tbody><\/table><figcaption>Empty distances table.<\/figcaption><\/figure>\n\n\n\n<p>Next we iterate through each <code>[row, column]<\/code> cell in the table and calculate the ratio of the two words. We add two conditions to the loop:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>If the name of the column is the same as the name of the row then the algorithm does not do anything (the <code>NaN<\/code> value stays &#8211; we are not interested in the same words).<\/li><li>If the value at <code>[row, column]<\/code> is <code>NaN<\/code> <code>and<\/code> the value at <code>[column, row]<\/code> is <code>NaN<\/code> then we assign the <code>fuzz.ratio()<\/code> score to both places. This is a <em>symmetric matrix<\/em> so the calculation may be done once.<\/li><\/ol>\n\n\n\n<p>We create two <code>for<\/code> loops and iterate through each word to assign the distance to the specific cell in the output matrix. 
With condition 2 we speed up the algorithm: each pair is calculated once instead of twice.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">for w_ in words:\n    for _w in words:\n        if w_ == _w:\n            continue  # Condition 1: skip the diagonal (a word compared with itself)\n        # Condition 2: fill the pair only if neither symmetric cell has been set yet\n        if pd.isna(df.at[w_, _w]) and pd.isna(df.at[_w, w_]):\n            distance = fuzz.ratio(w_, _w)\n            df.at[w_, _w] = distance\n            df.at[_w, w_] = distance\n\ndf.head()<\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td><strong>time<\/strong><\/td><td><strong>year<\/strong><\/td><td><strong>people<\/strong><\/td><td>&#8230;<\/td><\/tr><tr><td><strong>time<\/strong><\/td><td>NaN<\/td><td>25<\/td><td>20<\/td><td>&#8230;<\/td><\/tr><tr><td><strong>year<\/strong><\/td><td>25<\/td><td>NaN<\/td><td>20<\/td><td>&#8230;<\/td><\/tr><tr><td><strong>people<\/strong><\/td><td>20<\/td><td>20<\/td><td>NaN<\/td><td>&#8230;<\/td><\/tr><tr><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>&#8230;<\/td><td>NaN<\/td><\/tr><\/tbody><\/table><figcaption>Filled distances table.<\/figcaption><\/figure>\n\n\n\n<p>We can filter this table to find the most similar word pairs. Standard <code>pandas<\/code> operations are enough for that, but to make things more interesting we are going to build a heat map of similarity. It is a beautiful graph. 
It may not be very informative when we deal with a lot of words, but it can be a useful tool for reporting.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">plt.figure(figsize=(14, 12))\nsns.heatmap(df.astype(float), linewidths=1)\nplt.show()<\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"798\" height=\"735\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/heatmap.png\" alt=\"\" class=\"wp-image-279\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/heatmap.png 798w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/heatmap-300x276.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/07\/heatmap-768x707.png 768w\" sizes=\"auto, (max-width: 798px) 100vw, 798px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>In this article we&#8217;ve covered techniques for matching different words with the <code>fuzzywuzzy<\/code> package in Python. Bonus takeaways are web scraping with <code>requests<\/code> and data visualization with <code>seaborn<\/code>. 
I hope that you&#8217;ll use these techniques in your projects!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Links<\/h2>\n\n\n\n<p>The notebook with the code presented in the article is available here: <a href=\"https:\/\/github.com\/szymon-datalions\/articles\/tree\/main\/2021-05\/fuzzywuzzy\">https:\/\/github.com\/szymon-datalions\/articles\/tree\/main\/2021-05\/fuzzywuzzy<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sentence matching in Python.<\/p>\n","protected":false},"author":1,"featured_media":282,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,19,68,3,69],"tags":[73,72,7,75,74,70,76],"class_list":["post-232","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-machine-learning","category-natural-language-processing","category-python","category-web-scraping","tag-levenshtein-distance","tag-match-sentences","tag-python","tag-requests","tag-seaborn","tag-text-matching","tag-web-scraping"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=232"}],"version-history":[{"count":23,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/232\/revisions"}],"predecessor-version":[{"id":289,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/232\/revisions\/289"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/282"}],"wp:attachment":[{"href":"https:\/\/m
l-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}