Travis Ennis

Mongo and Case-Insensitive Queries

I recently had to analyze an incident where our Mongo servers became completely overloaded and simply could not process requests in a timely-manner.

I began my analysis by downloading the logs from mongodb. Since we use Mongo hosted, I had to get the logs from Atlas: View and Download MongoDB Logs — MongoDB Atlas. I decided to get an entire 24 hours worth of logs for our 8 server mongo cluster, which turned out to be a huge amount of data. Once unzipped, I had nearly 3GB of log data to sift through. Fortunately, I was made aware of mtools, which really helped find turn the raw data into something useful. This command would rather quickly parse through the log file and create a table of the queries along with some useful statistics

mloginfo mongo.log --queries > queries.log

The output would look something like this:

QUERIES
namespace                                            operation    pattern                                    count    min (ms)    max (ms)    95%-ile (ms)    sum (ms)      mean (ms)    allowDiskUse

e3c3fbe7-04d0-430a-9b65-0c527e2240d4-ls.students     find         {"$or": [{"email": 1}, {"userName": 1}]}   94004         136      148299         77347.7    1158787373      12327.0    None
e3c3fbe7-04d0-430a-9b65-0c527e2240d4-ls.clients      find         {"loginName": 1}                           15967         200      231710         75783.8     214490232      13433.3    None

The namespace is comprised of the database and the collection. The pattern describes in general terms what the query that is being made looks like. Familiarity with your code base will help you locate the places where you make these queries. One thing that is not immediately obvious with these queries are what values we are trying to match against. Examining the code will help answer that. Finally we have the number of times the query was made under count and then some statistics of how long these queries took. As you can see, during our outage, some of these queries were taking an unacceptable amount of time, which some taking nearly 4 minuts to complete.

Also, it is not obvious whether or not the fields being searched are indexed. Hopefully they are, but do some research to see if you are querying against indexed fields. I used Compass to check the collections and fields and found that we did have these fields indexed, so I next turned to looking at the queries themselves and I found the following usage:

{ "loginName": { "$regex": paginationParams.searchQuery, "$options": "i" } }

The use of regexes caught my eye, so I went to do some research on how a regex query behaved with regards to the index and found this following tidbit.

If the sole reason for using a regex is case insensitivity, use a case-insensitive index instead, as those are faster.

source: Performance Best Practices: Indexing | MongoDB

Case-insensitive queries are best accomplished in Mongo with collation indexes, which gives us the case-insensitive index we need to optimize this use case:

Collation allows users to specify language-specific rules for string comparison, such as rules for lettercase and accent marks.

source: Indexes — MongoDB Manual

You can use a regex query in mongo and still take advantage of the index, but you must pin the regex to the beginning of the string like so and it works better if it not a case-insensitive query:

{ "loginName": { "$regex": /^test/ } }

The case-insensitive indexes are still the preferred solution for case-insensitive searches, but it is good to know that a regex query can be acceptable query from a performance standpoint, because it does give the ability to do partial matches, so if the database collection contained:

{
    loginName: "testUser"
}

then this query would find it:

{ "loginName": { "$regex": /^test/ } }

In a small index, a case-insensitive regex might still do an index scan, but this behavior becomes unreliable on larger collection. But I digress. If your queries are being constructed simply to take advantage of case-insensitivity, then a collation index is best.

To construct a collation index with Mongoose, you would make the following call on your database schema:

clientSchema.index({
    "loginName": 1
},
{
    name: "loginName_collation_1",
    collation: {
        locale: "en_US",
        strength: 1
    }
});

source: Mongoose v5.11.17: Schemas

This creates a collation index for loginName with a locale of en_US and strength of 1, which is the

Primary level of comparison. Collation performs comparisons of the base characters only, ignoring other differences such as diacritics and case.

See Collation — MongoDB Manual for other strength levels.

With this index in place, you can then make queries against the collection:

db.getCollection('clients').find({ loginName: test })
    .collation({ locale: "en_US", strength: 1 })

Appending collation to query informs the database that you would like to use the collation index that was configured with those parameters with this call. You have to explicitly call the collation you want to use as Mongo allows you to construct different collation indexes for different locales and with different strengths depending on your needs and use-cases. If you didn't specify the collation then Mongo would either use a regular index, if available, or the query wouldn't use an index at all.

Give this a try if you need to make case-insensitive queries on your data. I think you'll find that using collation indexes will result in much more reliable and performant queries, especially on larger collections. Just be aware of the drawbacks. Moving from a regex queries to one that specifies a collation will mean that you lose any kind partial matching, but if partial matching is a requirement for you, then you probably need a full-blown text search index like Lucene, but that is a topic for another article.

You've found yourself on the site of Travis Ennis, a software engineer who lives in Indiana. If you'd like, you can contact me.