When looking at big data storage options for big business, almost all options fall into two categories: data warehouses and data lakes. The latter has been rising in popularity and MongoDB has risen to the occasion of providing a way to implement, manage, and mine these structures with their own framework: MongoDB Atlas Data Lake.

    Possible approaches to data

    Data warehouses and data lakes take two very different approaches to handling the data that drives your business decisions, from storage and structuring through processing and, finally, analysis. The approaches differ so much that the two are often used in tandem as a complete big data stack.

    However, data lakes are gaining ground thanks to the cost-effectiveness and sheer flexibility they offer data analysts and businesses. A data lake is like that one kitchen drawer where you throw all the random things you come across in your house: a flexible, unstructured storage location where you temporarily (even if it’s for years) put things to keep them off the counter until you need to organize them. Put loosely, data lakes follow a “collect first; sort, process, and analyze later” approach. Storing unstructured data from different sources and in different formats in a single location helps organizations that have flexible plans for their data, or volumes so large that traditional, structured storage options may simply not work.

    This is particularly useful in contexts where business and consumer data intermingle, such as e-commerce. It’s also becoming more prevalent as the IoT (Internet of Things) grows and leads to data collection from ever more disparate sources. Manufacturers, multinationals, and giant retailers or e-commerce businesses are natural breeding grounds for this conglomeration of information.

    Data lakes also fit naturally with cloud-based, serverless storage thanks to the cloud’s intrinsic availability, scalability, and inexpensive storage infrastructure. It’s no surprise that industry leaders in cloud storage, such as Google Cloud and Amazon Web Services (AWS), offer data lake resources.

    MongoDB Atlas Data Lake is a new kind of tool that helps you work with data stored in data lakes, and it is what we’ll be looking at here. MongoDB is already used by many businesses globally as their non-relational data platform, and it is expanding its tool set to give users more power over unstructured data. Atlas Data Lake gives users a way to act immediately on data stored in a data lake, without first running it through a parsing tool to structure it.

    MongoDB Atlas Data Lake: A happy medium? 

    MongoDB Atlas Data Lake allows you to natively query data stored across both MongoDB Atlas and Amazon S3. Data can be queried as long as it’s stored in one of the following formats: JSON, BSON, CSV, TSV, Avro, ORC, or Parquet. Queries can be made using the mongo shell, MongoDB Compass (the official GUI), or any MongoDB-supported driver (library).

    A typical use case for MongoDB Atlas Data Lake would be something like this: A successful online retailer purchases another e-commerce store and absorbs it into the business. Let’s say that Store A uses a MongoDB Atlas cluster for its data storage while Store B uses Amazon S3 buckets. Now, how can you start aggregating the data across these two sources while maintaining their “richness” and historical context?

    Examples of analyses you might want to run across these two data sets include:

    • A combined list of top-performing/under-performing products from each store with the number of units sold and total profit generated.
    • The top customers from both stores as well as which products they bought and how much they’ve spent.
    • Pulling together similar/related/identical products from both stores with the relevant metadata.

    It would be preferable to keep both data sets intact as separate entities. First, each data set may have unique fields. Second, while the two stores are now under the same umbrella, they are still very much separate businesses. If we were to restructure either data set (or both), we would lose some of that richness and context. What’s needed is a framework that allows us to query both data sets at the same time and aggregate the information in a useful way.

    This is exactly the type of problem that MongoDB Atlas Data Lake and similar technologies are meant to solve.

    The superpower of MongoDB Atlas Data Lake

    So, how does MongoDB Atlas Data Lake solve this issue? With a conventional data warehouse, you would have to restructure the new data into the same format as the existing data, usually a SQL-style relational schema.

    However, by adopting the data lake approach, you can keep each source of data in its own format, schema, and even location. You can then use one of the query methods mentioned above (the mongo shell, Compass, or a driver) to run queries that aggregate and process the information dynamically.
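
    To make the idea concrete, here is a minimal pure-Python sketch of aggregating two differently shaped sources at query time. It does not use MongoDB at all; the sample records, field names (`sku`, `qty`, `product_code`, and so on), and helper functions are all hypothetical and only illustrate that neither source needs to be restructured before it is queried.

```python
from collections import defaultdict

# Hypothetical sample records. Store A keeps documents in an Atlas cluster;
# Store B keeps CSV-style rows (all values are strings) in an S3 bucket.
STORE_A_ORDERS = [
    {"sku": "A-100", "qty": 2, "unit_profit": 4.5},
    {"sku": "A-200", "qty": 1, "unit_profit": 10.0},
]
STORE_B_ORDERS = [
    {"product_code": "A-100", "units": "3", "profit_total": "12.00"},
    {"product_code": "B-900", "units": "4", "profit_total": "20.00"},
]


def from_store_a(order):
    """Map a Store A document to (product_id, units, profit)."""
    return order["sku"], order["qty"], order["qty"] * order["unit_profit"]


def from_store_b(row):
    """Map a Store B CSV row to (product_id, units, profit)."""
    return row["product_code"], int(row["units"]), float(row["profit_total"])


def top_products(a_orders, b_orders):
    """Aggregate both sources at query time and rank products by units sold."""
    totals = defaultdict(lambda: {"units_sold": 0, "profit": 0.0})
    readings = [from_store_a(o) for o in a_orders] + [from_store_b(r) for r in b_orders]
    for product_id, units, profit in readings:
        totals[product_id]["units_sold"] += units
        totals[product_id]["profit"] += profit
    ranked = sorted(totals.items(), key=lambda item: item[1]["units_sold"], reverse=True)
    return [{"_id": product_id, **stats} for product_id, stats in ranked]


if __name__ == "__main__":
    for entry in top_products(STORE_A_ORDERS, STORE_B_ORDERS):
        print(entry)
```

    The source lists are left untouched; only the combined view is new. That is the property a data lake query layer is meant to give you at much larger scale.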

    Assuming we are in the scenario described above, the step-by-step approach will look something like this:

    1. You will need two data sources. One is a MongoDB Atlas cluster (Store A’s data) and the other is an Amazon S3 bucket (Store B’s data).
    2. Connect the S3 bucket to your Atlas account and define a database and collection that refer to this data.
    3. Create a new Data Lake from within your MongoDB Atlas dashboard.
    4. Define objects using a query language for both the Atlas collection and S3 stores.
    5. Define databases for both of these data sources within the data lake.
    6. Create one or more new databases where the aggregated data from querying these two data sources can be stored.
    7. Code a data pipeline using query syntax with the business logic to execute your requirements. This includes a $out stage that writes the results to your new Atlas cluster, with its own database and collection.
    8. Run queries that will extract, parse, and combine the data and store it in the unified database.
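
    Steps 4 and 5 come down to a storage configuration that names each store and maps it to databases and collections. The sketch below writes one as a Python dictionary. The key names (`stores`, `databases`, `dataSources`, `storeName`, and so on) follow the Atlas storage configuration format as I understand it and should be checked against the current documentation; every value (cluster name, bucket, region, paths, project ID) is a placeholder, and `undefined_store_refs` is a hypothetical helper, not part of any MongoDB tool.

```python
import json

# Placeholder values throughout: replace the project ID, cluster name,
# bucket, region, and paths with your own. Key names are assumed from the
# Atlas storage configuration format and should be verified in the docs.
STORAGE_CONFIG = {
    "stores": [
        {"name": "storeA", "provider": "atlas",
         "clusterName": "StoreACluster", "projectId": "<PROJECT_ID>"},
        {"name": "storeB", "provider": "s3",
         "bucket": "store-b-exports", "region": "us-east-1", "delimiter": "/"},
    ],
    "databases": [
        {"name": "store_a", "collections": [
            {"name": "orders", "dataSources": [
                {"storeName": "storeA", "database": "shop", "collection": "orders"}]}]},
        {"name": "store_b", "collections": [
            {"name": "orders", "dataSources": [
                {"storeName": "storeB", "path": "/orders/*"}]}]},
    ],
}


def undefined_store_refs(config):
    """Return 'db.collection -> store' strings for data sources whose store is not defined."""
    defined = {store["name"] for store in config["stores"]}
    missing = []
    for db in config["databases"]:
        for coll in db["collections"]:
            for source in coll["dataSources"]:
                if source["storeName"] not in defined:
                    missing.append(f'{db["name"]}.{coll["name"]} -> {source["storeName"]}')
    return missing


if __name__ == "__main__":
    # Fail fast on typos in store names before pasting the JSON into Atlas.
    assert not undefined_store_refs(STORAGE_CONFIG)
    print(json.dumps(STORAGE_CONFIG, indent=2))
```

    Keeping the configuration as data makes it easy to run a quick sanity check locally before applying it.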

    This video goes through the process of an almost identical scenario as well as providing code snippets to implement it.

    The important thing is that this unified data should now have a predictable and structured format. For example, with the first analysis use case above, you might have something like this:

    • Cluster: unified_data
    • Database: data_analysis
    • Collection: top_product

    The collection might then contain entries like these:

    Entry #1

    • _id: 1470
    • units_sold: 671
    • profit: $7,986.41

    Entry #2

    • _id: 0031
    • units_sold: 543
    • profit: $5,923.22
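
    A pipeline that produces documents of this shape and writes them to the location listed above might be built as follows. This is a sketch under assumptions: the input field names (`product_id`, `quantity`, `profit`) are hypothetical, the `projectId` value is a placeholder, and the `$out` form that targets an Atlas cluster is written as I recall it from the Data Lake documentation, so verify it before use. Note that `profit` would be stored as a number here, not as a formatted string like “$7,986.41”.

```python
def build_top_products_pipeline(project_id,
                                cluster="unified_data",
                                db="data_analysis",
                                coll="top_product"):
    """Build an aggregation pipeline as plain Python data.

    Input field names ($product_id, $quantity, $profit) are hypothetical;
    adapt them to the fields your data lake collections actually expose.
    """
    return [
        # Total units and profit per product across all input documents.
        {"$group": {
            "_id": "$product_id",
            "units_sold": {"$sum": "$quantity"},
            "profit": {"$sum": "$profit"},
        }},
        # Best sellers first.
        {"$sort": {"units_sold": -1}},
        # Write the results to the unified Atlas cluster (assumed $out form).
        {"$out": {"atlas": {
            "projectId": project_id,
            "clusterName": cluster,
            "db": db,
            "coll": coll,
        }}},
    ]


if __name__ == "__main__":
    for stage in build_top_products_pipeline("<PROJECT_ID>"):
        print(stage)
```

    With a driver such as PyMongo, a list like this would typically be passed to a collection’s `aggregate()` method while connected to the data lake; that connection step is omitted here.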

    Conclusion: Great for organizations using MongoDB that need to step into hybrid solutions

    MongoDB has been one of the most popular NoSQL database platforms for some time now. It’s only natural that its cloud-based framework, Atlas, would be equally popular. With MongoDB Atlas Data Lake, MongoDB continues to appeal to modern app developers who deal with more complex and intricate data gathering, storage, and analysis.

    However, that doesn’t necessarily mean it’s the best option in each and every scenario. For now, MongoDB Atlas Data Lake serves a very specific use case: native, flexible querying across data sets held in MongoDB Atlas and Amazon S3.

    There’s still something to be said for the uniformity, practicality, and ease of analysis that conventional data warehouses provide. That’s exactly why many businesses with big data use both in their “data storage stack.”

    Data lakes can be used to maintain unmanipulated data with its full richness and flexibility. The original data will always be in place should you need it. This data can be queried and aggregated into a more readable and standardized format to store in a data warehouse. Finally, this formatted and structured data can be queried for frequent analysis, insights, and visualizations.
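
    As a rough sketch of that last hand-off, the snippet below loads aggregated documents into a relational table. SQLite from the Python standard library stands in for a real data warehouse here purely for illustration; the table name, column names, and `load_into_warehouse` helper are assumptions, and the two documents reuse the example entries shown earlier.

```python
import sqlite3

# Aggregated documents, as they might come out of the data lake query.
AGGREGATED_DOCS = [
    {"_id": "1470", "units_sold": 671, "profit": 7986.41},
    {"_id": "0031", "units_sold": 543, "profit": 5923.22},
]


def load_into_warehouse(conn, docs):
    """Flatten documents into rows of a fixed-schema table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top_product ("
        "product_id TEXT PRIMARY KEY, "
        "units_sold INTEGER NOT NULL, "
        "profit REAL NOT NULL)"
    )
    rows = [(d["_id"], d["units_sold"], d["profit"]) for d in docs]
    # INSERT OR REPLACE keeps reloads idempotent for the same product IDs.
    conn.executemany("INSERT OR REPLACE INTO top_product VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_into_warehouse(connection, AGGREGATED_DOCS)
    query = "SELECT product_id, units_sold, profit FROM top_product ORDER BY units_sold DESC"
    for row in connection.execute(query):
        print(row)
```

    Once the data sits in a fixed schema like this, routine reports become plain SQL queries, while the raw data stays untouched in the lake.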

    And if you’re not sure if that’s a perfect fit: keep looking! Amazon S3, Azure Cosmos DB, and Google Cloud provide similar data storage solutions for other contexts and storage ecosystems. 

    Written by Andrew Murray