{"id":1857,"date":"2013-03-10T13:45:41","date_gmt":"2013-03-10T18:45:41","guid":{"rendered":"http:\/\/www.wiredprairie.us\/blog\/?p=1857"},"modified":"2013-03-10T13:45:44","modified_gmt":"2013-03-10T18:45:44","slug":"finding-duplicates-in-mongodb-via-the-shell","status":"publish","type":"post","link":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1857","title":{"rendered":"Finding duplicates in MongoDB via the shell"},"content":{"rendered":"

I thought this was an interesting question to answer on StackOverflow (summarized here):<\/p>\n

\n

I\u2019m trying to create an index, but an error is returned that duplicates exist for the field I want to index. What should I do?<\/p>\n<\/blockquote>\n

I answered with one<\/a> possibility.<\/p>\n

The summary is that you can use the power of MongoDB\u2019s aggregation<\/a> framework to search and return the duplicates. It\u2019s really quite slick.<\/p>\n

For example, in the question, Wall <\/strong>documents had a field called event_time<\/strong>. Here\u2019s one approach:<\/p>\n

db.Wall.aggregate([\n       {$group : { _id: "$event_time" ,  count : { $sum: 1}}},\n       {$match : { count : { $gt : 1 } }} ])<\/code><\/pre>\n

The trick is to use the $group pipeline operator to select and count each unique event_time. Then, match on only those groups that contained more than one match. <\/p>\n

While it\u2019s not necessarily as readable as the equivalent SQL statement potentially, it\u2019s still easy to read. The only really odd thing is the mapping of the event_time<\/strong> into the _id<\/strong>. As all documents pass through the pipeline, the event_time<\/strong> is used as the new aggregate document key. The $ sign is used as the field reference to a property of the document in the pipeline (a Wall<\/strong> document). Remember that the _id<\/strong> field of a MongoDB document must be unique (and this is how the $group pipeline operator does its magic).<\/p>\n

So, if the following event_time<\/strong>s were in the documents:<\/p>\n\n\n\n\n\n\n\n\n
event_time<\/strong><\/td>\n<\/tr>\n
4:00am<\/td>\n<\/tr>\n
5:00am<\/td>\n<\/tr>\n
4:00am<\/td>\n<\/tr>\n
6:00pm<\/td>\n<\/tr>\n
7:00a<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

It would results in a aggregate set of documents:<\/p>\n\n\n\n\n\n\n\n
_id<\/strong><\/td>\ncount<\/strong><\/td>\n<\/tr>\n
4:00am<\/td>\n2<\/td>\n<\/tr>\n
5:00am<\/td>\n1<\/td>\n<\/tr>\n
6:00pm<\/td>\n1<\/td>\n<\/tr>\n
7:00am<\/td>\n1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Notice how the _id is the event_time. <\/strong>The aggregate results would look like this:<\/p>\n

{\n        "result"<\/span> : [\n                {\n                        "_id"<\/span> : "4:00am"<\/span>,\n                        "count"<\/span> : 2\n                }\n        ],\n        "ok"<\/span> : 1\n}<\/pre>\n