{"id":830,"date":"2022-04-27T20:25:16","date_gmt":"2022-04-27T20:25:16","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=830"},"modified":"2022-04-27T20:25:17","modified_gmt":"2022-04-27T20:25:17","slug":"toolbox-python-list-of-dicts-to-jsonl-json-lines","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2022\/04\/27\/toolbox-python-list-of-dicts-to-jsonl-json-lines\/","title":{"rendered":"Toolbox: Python List of Dicts to JSONL (JSON Lines)"},"content":{"rendered":"\n<p>Converting a Python dict to JSON is simple. The <code>json<\/code> module provides the <code>json.dump<\/code> function, which writes to a file, and the <code>json.dumps<\/code> function, which returns a string. But nowadays, with unstructured data streams, a related format has become a popular choice: <a href=\"https:\/\/jsonlines.org\">JSON Lines<\/a>. It is a text file in which each line is a valid JSON value and lines are separated by the newline character <code>\\n<\/code>. In Python, it is natural to think about this structure as a <code>list<\/code> of <code>dict<\/code>s. 
It could be a stream of website events with additional properties, such as user actions or viewed products:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{\"user\": \"xyz\", \"action\": \"click\", \"element\": \"submit button\"}\n{\"user\": \"zla\", \"action\": \"products view\", \"items\": [\"product a\", \"product x\"]}\n{\"user\": \"iks\", \"action\": \"add to cart\", \"items\": [\"product b\"], \"item properties\": {\"price\": 3.5, \"color\": \"silver\"}}<\/pre>\n\n\n\n<p>How can we store such a list of dicts in a file?<\/p>\n\n\n\n<p>Here is a function that can be used for this task:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import gzip\nimport json\n\n\ndef dicts_to_jsonl(data: list, filename: str, compress: bool = True) -> None:\n    \"\"\"\n    Save a list of dicts to a JSON Lines file.\n\n    :param data: (list) list of dicts to be stored,\n    :param filename: (str) path to the output file. 
If the .jsonl suffix is missing, the function appends\n        it to the filename.\n    :param compress: (bool) should the file be compressed into a gzip archive?\n    \"\"\"\n\n    sjsonl = '.jsonl'\n    sgz = '.gz'\n\n    # Make sure the filename ends with .jsonl\n\n    if not filename.endswith(sjsonl):\n        filename = filename + sjsonl\n\n    # Save data, one JSON document per line\n\n    if compress:\n        filename = filename + sgz\n        with gzip.open(filename, 'w') as compressed:\n            for ddict in data:\n                jout = json.dumps(ddict) + '\\n'\n                jout = jout.encode('utf-8')\n                compressed.write(jout)\n    else:\n        with open(filename, 'w') as out:\n            for ddict in data:\n                jout = json.dumps(ddict) + '\\n'\n                out.write(jout)<\/pre>\n\n\n\n<p>The function works as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>It takes a list of dicts and a filename as the main parameters. The optional <code>compress<\/code> parameter gzips the output, which is handy if you stream or store those files.<\/li><li>First, it appends the <code>.jsonl<\/code> suffix to the filename if it is missing.<\/li><li>Next, if <code>compress<\/code> is <code>True<\/code>, the function appends <code>.gz<\/code> to the filename, opens the gzipped file, serializes each dict to a JSON string, encodes it as UTF-8 bytes, and writes it to the file.<\/li><li>Otherwise, each dict is serialized to a JSON string and written to a plain JSON Lines file.<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>How to transform a list of dicts to JSON Lines in 
Python<\/p>\n","protected":false},"author":1,"featured_media":837,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,3,17],"tags":[197,194,195,192,193,196,7],"class_list":["post-830","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-python","category-scripts","tag-convert-list-of-dicts-to-jsonl","tag-dict","tag-dicts","tag-json-lines","tag-jsonl","tag-list","tag-python"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/830","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=830"}],"version-history":[{"count":7,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/830\/revisions"}],"predecessor-version":[{"id":838,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/830\/revisions\/838"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/837"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=830"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=830"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=830"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}