Finding duplicates in MongoDB via the shell

I thought this was an interesting question to answer on StackOverflow (summarized here):

I’m trying to create an index, but an error is returned that duplicates exist for the field I want to index. What should I do?

I answered with one possibility.

The summary is that you can use the power of MongoDB’s aggregation framework to search and return the duplicates. It’s really quite slick.

For example, in the question, Wall documents had a field called event_time. Here’s one approach:

db.Wall.aggregate([
       {$group : { _id: "$event_time" ,  count : { $sum: 1}}},
       {$match : { count : { $gt : 1 } }} ])

The trick is to use the $group pipeline operator to select and count each unique event_time. Then, match on only those groups that contained more than one match.

While it’s not necessarily as readable as the equivalent SQL statement potentially, it’s still easy to read. The only really odd thing is the mapping of the event_time into the _id. As all documents pass through the pipeline, the event_time is used as the new aggregate document key. The $ sign is used as the field reference to a property of the document in the pipeline (a Wall document). Remember that the _id field of a MongoDB document must be unique (and this is how the $group pipeline operator does its magic).

So, if the following event_times were in the documents:

event_time
4:00am
5:00am
4:00am
6:00pm
7:00a

It would results in a aggregate set of documents:

_id count
4:00am 2
5:00am 1
6:00pm 1
7:00am 1

Notice how the _id is the event_time. The aggregate results would look like this:

{
        "result" : [
                {
                        "_id" : "4:00am",
                        "count" : 2
                }
        ],
        "ok" : 1
}