How to Perform SQL JOIN Equivalent in MongoDB

Tahseen Tauseef Feb 02, 2024
  1. Joins in MongoDB
  2. Perform SQL JOIN Equivalent Query Without an Index in MongoDB
  3. Indexes in MongoDB
  4. Perform SQL JOIN Equivalent Query With the Created Index in MongoDB
  5. Use Pre-aggregation Collections to Simplify Reporting in MongoDB
  6. Ensure Correct Sequence and Close Up Loose Ends

MongoDB has a voracious appetite for unstructured data and belongs in the toolset of any database developer. Even so, some aggregation operations are significantly slower than in a relational database, even with the right indexes in place.

This is the case when joining collections. $lookup, MongoDB’s counterpart of JOIN, cannot currently perform merge or hash joins, so it will never be fast in its current state.

It is far better suited to enumerations with a restricted number of options. You can help $lookup by supplying an index that lets it perform an index nested-loops join, but you will struggle to improve the efficiency of any join beyond that.

Of course, there may be an argument that collections of documents eliminate the need for joins, but this is only true for very static, unchanging data. Data that is likely to change should always be kept in a single location.

This article describes how to make a MongoDB database that reports historic and slowly-changing data behave adequately.

Joins in MongoDB

In a document database, why bother with joins? Because many teams have moved to MongoDB from relational databases, and many new databases are built in MongoDB.

These databases, especially when used for reporting, need a large number of lookups. Some suggest that document databases should denormalize their data to eliminate the need for lookups.

There is an argument that this doesn’t matter, as long as you supply and maintain summary collections, also known as aggregates or pre-aggregates, and update them from a thread separate from the application as the underlying data changes.

A conversion of SQL Server’s traditional practice database, AdventureWorks, to MongoDB is used as the test bed. It is selected because it requires several lookups to generate reports.

It also raises some awkward migration issues that are useful for our purposes, and it allows the two database systems to be compared directly.

Perform SQL JOIN Equivalent Query Without an Index in MongoDB

We’ll begin without any indexes for the example below and then add them later, monitoring the timings to ensure they are used. The MongoDB profiler will also be utilized to double-check the approach.
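
As a quick way to capture those timings (a minimal sketch, not part of the original walkthrough; the 1000 ms threshold is an arbitrary assumption), you can switch the database profiler on before running the query and read back the most recent entry afterwards.

use AdventureWorks;
// Level 1 records every operation slower than the threshold (here 1000 ms)
// in the db.system.profile collection.
db.setProfilingLevel(1, 1000);

// ... run the query ...

// Read the most recent profiled operation; its duration is in the millis field.
db.system.profile.find().sort({ ts: -1 }).limit(1).pretty();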

The following SQL is run against MongoDB using Studio 3T.

Example:

SELECT p.PersonType, Sum(soh.TotalDue), Count(*)
    FROM "Sales.SalesOrderHeader" soh
        INNER JOIN "Sales.Customer" c
            ON soh.CustomerID = c.CustomerID
        INNER JOIN "Person.Person" p
        ON c.PersonID = p.BusinessEntityID
    GROUP BY p.PersonType
--Primary type of person: SC = Store Contact,
--IN = Individual (retail) customer

The result shows how many customers and store contacts there are and the overall worth of their orders.

The only difference from the SQL Server version is the string delimiters around the collection names. The two joins are implemented with lookups.

Now press the button to run it. The query finishes 5 minutes and 17 seconds later.

Shortly after, a concerned but reproachful individual from the Society for the Prevention of Cruelty to Databases calls. She claims that a few indexes would have spared her a lot of grief.

Note that cursor.maxTimeMS() is a cursor method that only works with queries, so it cannot be used to cut this aggregation short. At this stage, it’s advisable to look at the auto-generated code.

use AdventureWorks;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        {
            "$project" : {
                "_id" : NumberInt(0),
                "soh" : "$$ROOT"
            }
        },
        {
            "$lookup" : {
                "localField" : "soh.CustomerID",
                "from" : "Sales.Customer",
                "foreignField" : "_id",
                "as" : "c"
            }
        },
        {
            "$unwind" : {
                "path" : "$c",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$lookup" : {
                "localField" : "c.PersonID",
                "from" : "Person.Person",
                "foreignField" : "BusinessEntityID",
                "as" : "p"
            }
        },
        {
            "$unwind" : {
                "path" : "$p",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$group" : {
                "_id" : {
                    "p᎐PersonType" : "$p.PersonType"
                },
                "SUM(soh᎐TotalDue)" : {
                    "$sum" : "$soh.TotalDue"
                },
                "COUNT(*)" : {
                    "$sum" : NumberInt(1)
                }
            }
        },
        {
            "$project" : {
                "p.PersonType" : "$_id.p᎐PersonType",
                "SUM(soh᎐TotalDue)" : "$SUM(soh᎐TotalDue)",
                "COUNT(*)" : "$COUNT(*)",
                "_id" : NumberInt(0)
            }
        }
    ],
    {
        "allowDiskUse" : true
    }
);

When you perform a lookup in MongoDB, the key field that you specify in the aggregation stage is the field of the documents in the collection that you are looking up. This field defines the documents that you will collect as an array of documents.

The $lookup operation matches the localField of the input documents against the foreignField of the documents in the joined collection.

It is possible that the key field doesn’t exist in a referenced document, in which case it is treated as null. If the foreign field does not have an index, $lookup performs a full collection scan (COLLSCAN) for every document passing through the pipeline.

This becomes prohibitively costly: you need index hits, not collection scans.
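
For reference, here is one of the $lookup stages from the pipeline above with each field annotated (only the comments are added; the stage itself is unchanged):

{
    "$lookup" : {
        "localField" : "c.PersonID",          // field in the documents flowing through the pipeline
        "from" : "Person.Person",             // the collection being joined (looked up)
        "foreignField" : "BusinessEntityID",  // key field in that collection; this is the field to index
        "as" : "p"                            // output array that will hold the matching documents
    }
}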

Indexes in MongoDB

If you only require a few fields from a collection, it is significantly more efficient to create a covering index that contains these fields together with the fields of the actual query criteria. MongoDB can then use the faster approach of returning the result straight from the index without visiting the documents.

This should be done for any query that is expected to be used frequently. Which fields should be indexed?

  1. Any key fields that identify a document in lookups or searches; and
  2. Fields that function as foreign keys.

A compound index should be used when the key has several fields, such as a first name/last name combination. When more than one field is used for sorting, it is worth thinking about how you want reports to be sorted, since that determines the optimum order of the fields in the index (a sketch follows below).
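
A minimal sketch of such an index, assuming the imported Person.Person collection keeps FirstName and LastName as top-level fields (the field choice is only an illustration):

use AdventureWorks;
// Compound index: the field you sort or filter on most often goes first.
// Adding the few fields a report reads can also turn this into a covering index.
db.getCollection("Person.Person").createIndex({ "LastName" : 1, "FirstName" : 1 });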

Perform SQL JOIN Equivalent Query With the Created Index in MongoDB

The _id field is populated during the import whenever the relational table has a single column as its primary key. These unique _id fields function much like clustered indexes.

They must be named _id to be adopted as the clustered index. So that queries don’t break, the value is also kept in a field under its original name.

You will still have to create an index on every other field used by a lookup, that is, on the fields in the from collections that are referenced as the foreignField.

These are the equivalents of the fields in a JOIN’s ON clause. The first lookup can already use the indexed _id of Sales.Customer, which holds the same values as CustomerID, so it matches soh.CustomerID against _id; the remaining foreignField indexes are sketched below.
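
A sketch of those index creations, using the fields that appear as foreignField in the pipelines in this article (assuming a straight field-for-column import):

use AdventureWorks;
// foreignField of the lookup into Person.Person
db.getCollection("Person.Person").createIndex({ "BusinessEntityID" : 1 });
// foreignField of the lookup into Sales.SalesOrderHeader, used once the join order is rearranged
db.getCollection("Sales.SalesOrderHeader").createIndex({ "CustomerID" : 1 });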

Retest it, and the response time has dropped to 6.6 seconds. This is faster than the 5 minutes and 17 seconds it took without an index, but it still falls well short of the original SQL Server database.

SQL Server manages to do the same aggregation in 160 milliseconds on the same server as MongoDB.

Unfortunately, the MongoDB profiler reports nothing beyond the fact that a COLLSCAN was used. This is unavoidable: although the individual lookups quietly use an index, an index cannot serve the aggregation as a whole unless the pipeline starts with a $match stage.

Suppose you rearrange the joins in the SQL query in Studio 3T. SQL Server would still use the same plan as before: a hash-match inner join of the Customer and Person tables, using clustered index scans on both tables, with the result then inner-joined to SalesOrderHeader.

The Studio 3T version is given below.

Example:

SELECT p.PersonType, Sum(soh.TotalDue), Count(*)
    FROM "Sales.Customer" c
        INNER JOIN "Person.Person" p
            ON c.PersonID = p.BusinessEntityID
        INNER JOIN  "Sales.SalesOrderHeader" soh
            ON soh.CustomerID = c.CustomerID
    GROUP BY p.PersonType
--Primary type of person: SC = Store Contact,
--IN = Individual (retail) customer

In Studio 3T, however, the order of the aggregation stages mirrors the order of the joins, so the execution order is different, and better, at 4.2 seconds. Hand-optimizing the aggregation script in the Aggregation Editor makes little further difference, taking it down to just over three seconds.

The optimizations merely reduced the fields taken through the pipeline to just the essential ones.

use AdventureWorks2016;
db.getCollection("Sales.Customer").aggregate(
    [
        {
            "$project" : {
                "_id" : NumberInt(0),
                "CustomerID" : 1.0,
                "PersonID" : 1.0
            }
        },
        {
            "$lookup" : {
                "localField" : "PersonID",
                "from" : "Person.Person",
                "foreignField" : "BusinessEntityID",
                "as" : "p"
            }
        },
        {
            "$unwind" : {
                "path" : "$p",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$project" : {
                "CustomerID" : 1.0,
                "PersonID" : 1.0,
                "PersonType" : "$p.PersonType"
            }
        },
        {
            "$lookup" : {
                "localField" : "CustomerID",
                "from" : "Sales.SalesOrderHeader",
                "foreignField" : "CustomerID",
                "as" : "soh"
            }
        },
        {
            "$unwind" : {
                "path" : "$soh",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$project" : {
                "CustomerID" : 1.0,
                "PersonID" : 1.0,
                "PersonType" : 1.0,
                "TotalDue" : "$soh.TotalDue"
            }
        },
        {
            "$group" : {
                "_id" : {
                    "PersonType" : "$PersonType"
                },
                "SUM(TotalDue)" : {
                    "$sum" : "$TotalDue"
                },
                "COUNT(*)" : {
                    "$sum" : NumberInt(1)
                }
            }
        },
        {
            "$project" : {
                "PersonType" : "$_id.PersonType",
                "Total" : "$SUM(TotalDue)",
                "Transactions" : "$COUNT(*)",
                "_id" : NumberInt(0)
            }
        }
    ],
    {
        "allowDiskUse" : false
    }
);

If you continue along this path, you will have to devote significant effort to optimizing each query.

Imagine that you have a slew of managers hounding you for a stack of income reports. What to do now?

Use Pre-aggregation Collections to Simplify Reporting in MongoDB

It helps to create an aggregate collection at the smallest granularity that you are likely to report on. This is the equivalent of an OLAP cube.

In this scenario, you are working with trade records extracted from invoices. These are not going to change, and there is a good reason for that.

If an intermediate collection like this is used to pre-aggregate the joined data, the reporting queries only need to group and summarize documents that have already been joined, which is what makes the dramatic speed-up described below possible.

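A hedged sketch of how such a cube might be materialized (this is not the article’s exact collection; the name Sales.CustomerTotals is an assumption, and $merge needs MongoDB 4.2 or later, with $out as the older alternative): run the join once, group to the finest granularity you expect to report on, and write the result to its own collection. Reports then read from that small collection instead of re-joining the source collections every time.

use AdventureWorks2016;
db.getCollection("Sales.Customer").aggregate(
    [
        { "$project" : { "_id" : NumberInt(0), "CustomerID" : 1, "PersonID" : 1 } },
        { "$lookup" : { "localField" : "PersonID", "from" : "Person.Person",
                        "foreignField" : "BusinessEntityID", "as" : "p" } },
        { "$unwind" : "$p" },
        { "$lookup" : { "localField" : "CustomerID", "from" : "Sales.SalesOrderHeader",
                        "foreignField" : "CustomerID", "as" : "soh" } },
        { "$unwind" : "$soh" },
        // Grain of the cube: one document per customer, person type and salesperson.
        { "$group" : {
            "_id" : { "CustomerID" : "$CustomerID",
                      "PersonType" : "$p.PersonType",
                      "SalesPersonID" : "$soh.SalesPersonID" },
            "TotalDue" : { "$sum" : "$soh.TotalDue" },
            "Transactions" : { "$sum" : NumberInt(1) }
        } },
        // Write the cube out to its own collection; reporting queries read this instead.
        { "$merge" : { "into" : "Sales.CustomerTotals" } }
    ],
    { "allowDiskUse" : true }
);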

You will not want to store such a specialized aggregation collection in practice. Instead, you should segment the more general data into periods, such as weeks, months, or years, so that you can graph sales over time.

You should also provide the salesperson’s and store’s IDs so that someone is given credit for the transaction.

The aggregate is sketched out below in SQL. However, things such as the date calculations and the output stage are left out, because the SQL dialect is restricted in what it can accomplish.

SELECT c.PersonID, p.PersonType, soh.SalesPersonID, psp.Name, psp.CountryRegionCode,
    Sum(soh.TotalDue), Count(*)
    --,
    --Year(soh.OrderDate) AS year, Month(soh.OrderDate) AS month,
    --DatePart(WEEK, soh.OrderDate) AS week
    FROM "Sales.SalesOrderHeader" AS soh
        INNER JOIN "Sales.Customer" AS c
            ON c.CustomerID = soh.CustomerID
        INNER JOIN "Person.Person" AS p
            ON p.BusinessEntityID = c.PersonID
        INNER JOIN "Person.Address" AS pa
            ON pa.AddressID = soh.BillToAddressID
        INNER JOIN "Person.StateProvince" AS psp
            ON psp.StateProvinceID = pa.StateProvinceID
    GROUP BY c.PersonID, p.PersonType, soh.SalesPersonID, psp.Name,
    psp.CountryRegionCode
    --, Year(soh.OrderDate), Month(soh.OrderDate),
    --DatePart(WEEK, soh.OrderDate);

Ensure Correct Sequence and Close Up Loose Ends

After that, you copy the mongo shell query code and put it into Studio 3T’s MongoDB aggregate query builder, the Aggregation Editor.

The aggregate should then be fine-tuned. After that, you can run reports straight from the SQL Query tab in Studio 3T.

A query can be cut from nearly five minutes to roughly 100 milliseconds simply by thinking it through. Put common-sense indexes on keys and foreign-key references, then experiment with covering and intersecting indexes, to get past the initial misery of waiting minutes for a result.

Then, check whether you’re unnecessarily scanning ancient or non-changing data. Unfortunately, this is such a prevalent blunder that it’s nearly epidemic.

This post shows how a cube may help speed up the design and production of many reports using the same core data.

Finally, getting the sequence of the stages in an aggregation pipeline right is critical. Lookups, like sorts, should be saved until you only have the documents you need for the final report.

Matching and projecting should be done early.

When you perform the grouping is more of a tactical choice, although it is not a slow operation in MongoDB. It makes sense to keep the pipeline slim, carrying only the data you need in each document as it passes through.

But this is best regarded as part of the final clean-up, and while it will speed things up, it will not provide significant advantages.
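
As a rough sketch of that ordering (the date filter here is an illustrative assumption, not from the article): filter and trim the documents first, and only then join and group, so that the lookup touches as few documents as possible.

use AdventureWorks2016;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        // 1. Match early: an index on OrderDate could be used here.
        { "$match" : { "OrderDate" : { "$gte" : ISODate("2013-01-01T00:00:00Z") } } },
        // 2. Project early: carry only the fields the report needs.
        { "$project" : { "_id" : NumberInt(0), "CustomerID" : 1, "TotalDue" : 1 } },
        // 3. Join late, on the reduced set of documents.
        { "$lookup" : { "localField" : "CustomerID", "from" : "Sales.Customer",
                        "foreignField" : "CustomerID", "as" : "c" } },
        { "$unwind" : "$c" },
        // 4. Group last.
        { "$group" : { "_id" : "$c.PersonID",
                       "Total" : { "$sum" : "$TotalDue" },
                       "Transactions" : { "$sum" : NumberInt(1) } } }
    ]
);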

On the other hand, current transactional information can never be handled this way.

For example, you would never want out-of-date information about recent trades. However, because this is a modest amount of data, it is unlikely to cause problems with lookups.
