Toolbox: MongoDB nested bson to the flattened DataFrame

July 19, 2021

Data Engineering, pandas, Python, Scripts

Nested structures are very common in MongoDB dumps. Converting such documents directly to a DataFrame gives odd results: a single cell ends up holding a whole dictionary. Do you want to parse those nested structures and get a DataFrame with flattened columns? Use this function from the toolbox!

import pandas as pd
from bson import json_util


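def nested_bson_to_df(bson_file):
    """
    Transform a MongoDB export with nested structures into a flattened DataFrame.

    The file is read as text and parsed with json_util.loads(), so it must contain
    MongoDB Extended JSON (a single document or a JSON array of documents).

    INPUT:
    :param bson_file: (str) path to the file exported from the MongoDB database.

    OUTPUT:
    :returns: (pandas.DataFrame) one column per nested key, named with dot-separated paths.
    """
    # Read the whole file as text and parse the Extended JSON into Python objects.
    with open(bson_file, 'r') as inp_str:
        data = json_util.loads(inp_str.read())

    # Expand nested dictionaries into separate columns.
    normalized = pd.json_normalize(data)
    return normalized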

What happened?

First, we read bson_file as text and parse it with the json_util.loads() function from the bson package.
Next, we normalize the nested structures of the data with the pandas.json_normalize() function.
The function returns the flattened DataFrame!
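
To see what the normalization step changes, here is a minimal sketch that compares a direct conversion with pandas.json_normalize(). The records are made up for illustration and stand in for documents that have already been parsed from an export; the field names are assumptions, not taken from any real database.

```python
import pandas as pd

# Hypothetical records, shaped like documents already parsed from a MongoDB export.
records = [
    {"_id": 1, "user": {"name": "Ann", "address": {"city": "Oslo"}}},
    {"_id": 2, "user": {"name": "Bob", "address": {"city": "Lima"}}},
]

# Direct conversion keeps each nested document as a single dict-valued cell.
raw = pd.DataFrame(records)

# json_normalize expands nested keys into dot-separated column names.
flat = pd.json_normalize(records)

print(raw)
print(flat)
```

In the first frame the user column holds whole dictionaries. In the second, the nested keys become the columns user.name and user.address.city. Lists inside a document are not expanded by default and stay as list-valued cells.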