{"id":1857,"date":"2013-03-10T13:45:41","date_gmt":"2013-03-10T18:45:41","guid":{"rendered":"http:\/\/www.wiredprairie.us\/blog\/?p=1857"},"modified":"2013-03-10T13:45:44","modified_gmt":"2013-03-10T18:45:44","slug":"finding-duplicates-in-mongodb-via-the-shell","status":"publish","type":"post","link":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1857","title":{"rendered":"Finding duplicates in MongoDB via the shell"},"content":{"rendered":"<p>I thought this was an interesting question to answer on StackOverflow (summarized here):<\/p>\n<blockquote>\n<p>I\u2019m trying to create an index, but an error is returned that duplicates exist for the field I want to index. What should I do?<\/p>\n<\/blockquote>\n<p>I answered with <a href=\"http:\/\/stackoverflow.com\/a\/15323839\/95190\">one<\/a> possibility.<\/p>\n<p>The summary is that you can use the power of MongoDB\u2019s <a href=\"http:\/\/docs.mongodb.org\/manual\/applications\/aggregation\/\">aggregation<\/a> framework to search and return the duplicates. It\u2019s really quite slick.<\/p>\n<p>For example, in the question, <strong>Wall<\/strong> documents had a field called <strong>event_time<\/strong>. Here\u2019s one approach:<\/p>\n<pre><code>db.Wall.aggregate([\n    \/\/ Group documents by event_time and count the members of each group\n    { $group: { _id: &quot;$event_time&quot;, count: { $sum: 1 } } },\n    \/\/ Keep only the groups that contain more than one document\n    { $match: { count: { $gt: 1 } } }\n])<\/code><\/pre>\n<p>The trick is to use the $group pipeline operator to group the documents by each unique event_time and count them. Then, $match keeps only those groups that contain more than one document.<\/p>\n<p>While it\u2019s not necessarily as readable as the equivalent SQL statement, it\u2019s still easy to follow. The only really odd thing is the mapping of the <strong>event_time<\/strong> into the <strong>_id<\/strong>. As each document passes through the pipeline, its <strong>event_time<\/strong> is used as the key of the new aggregate document. 
The $ prefix in &quot;$event_time&quot; marks a reference to a field of the document currently in the pipeline (a <strong>Wall<\/strong> document). Remember that the <strong>_id<\/strong> field of a MongoDB document must be unique, so $group produces exactly one output document for each distinct <strong>event_time<\/strong>.<\/p>\n<p>So, if the documents contained the following <strong>event_time<\/strong> values:<\/p>\n<table cellspacing=\"0\" cellpadding=\"2\" width=\"400\" border=\"1\">\n<tbody>\n<tr>\n<td valign=\"top\" width=\"400\"><strong>event_time<\/strong><\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"400\">4:00am<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"400\">5:00am<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"400\">4:00am<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"400\">6:00pm<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"400\">7:00am<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The $group stage would produce this set of aggregate documents:<\/p>\n<table cellspacing=\"0\" cellpadding=\"2\" width=\"400\" border=\"1\">\n<tbody>\n<tr>\n<td valign=\"top\" width=\"200\"><strong>_id<\/strong><\/td>\n<td valign=\"top\" width=\"200\"><strong>count<\/strong><\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"200\">4:00am<\/td>\n<td valign=\"top\" width=\"200\">2<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"200\">5:00am<\/td>\n<td valign=\"top\" width=\"200\">1<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"200\">6:00pm<\/td>\n<td valign=\"top\" width=\"200\">1<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\" width=\"200\">7:00am<\/td>\n<td valign=\"top\" width=\"200\">1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Notice how the _id is the <strong>event_time. 
<\/strong>The aggregate results would look like this:<\/p>\n<pre class=\"csharpcode\">{\n        <span class=\"str\">&quot;result&quot;<\/span> : [\n                {\n                        <span class=\"str\">&quot;_id&quot;<\/span> : <span class=\"str\">&quot;4:00am&quot;<\/span>,\n                        <span class=\"str\">&quot;count&quot;<\/span> : 2\n                }\n        ],\n        <span class=\"str\">&quot;ok&quot;<\/span> : 1\n}<\/pre>\n<style type=\"text\/css\">\n.csharpcode, .csharpcode pre\n{\n\tfont-size: small;\n\tcolor: black;\n\tfont-family: consolas, \"Courier New\", courier, monospace;\n\tbackground-color: #ffffff;\n\t\/*white-space: pre;*\/\n}\n.csharpcode pre { margin: 0em; }\n.csharpcode .rem { color: #008000; }\n.csharpcode .kwrd { color: #0000ff; }\n.csharpcode .str { color: #006080; }\n.csharpcode .op { color: #0000c0; }\n.csharpcode .preproc { color: #cc6633; }\n.csharpcode .asp { background-color: #ffff00; }\n.csharpcode .html { color: #800000; }\n.csharpcode .attr { color: #ff0000; }\n.csharpcode .alt \n{\n\tbackground-color: #f4f4f4;\n\twidth: 100%;\n\tmargin: 0em;\n}\n.csharpcode .lnum { color: #606060; }<\/style>\n","protected":false},"excerpt":{"rendered":"<p>I thought this was an interesting question to answer on StackOverflow (summarized here): I\u2019m trying to create an index, but an error is returned that duplicates exist for the field I want to index. What should I do? I answered with one possibility. 
The summary is that you can use the power of MongoDB\u2019s aggregation [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[4],"tags":[129],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pd5QIe-tX","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":1895,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1895","url_meta":{"origin":1857,"position":0},"title":"Using $inc to increment a field in a sub-document in an array and a field in main document","date":"July 14, 2013","format":false,"excerpt":"(Blog post inspired by question I answered on StackOverflow) Lets say you have a schema in MongoDB that looks something like this: { '_id' : 'star_wars', 'count' : 1234, 'spellings' : [ { spelling: 'Star wars', total: 10}, { spelling: 'Star Wars', total : 15}, { spelling: 'sTaR WaRs', total\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1835,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1835","url_meta":{"origin":1857,"position":1},"title":"How to rewrite a MongoDB C# LINQ with a Projection Requirement using a MongoCursor","date":"January 26, 2013","format":false,"excerpt":"The LINQ Provider for MongoDB does not currently take into account data projections efficiently when returning data. This could mean that you\u2019re unnecessarily returning more data from the database than is needed. 
So, I\u2019m going to show you the pattern I applied as a replacement for the LINQ queries when\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1833,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1833","url_meta":{"origin":1857,"position":2},"title":"How to view the MongoDB Query when using the C# LINQ Provider","date":"January 26, 2013","format":false,"excerpt":"If you\u2019re using the Official MongoDB C# Driver from 10gen, you may want to occasionally verify that the generated query matches your LINQ query (or at least that it\u2019s building something efficient). Take for example this query: var query = (from r in DataLayer.Database.GetCollection<Research>().AsQueryable<Research>() where !r.Deleted select new { Id\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1563,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1563","url_meta":{"origin":1857,"position":3},"title":"Knockout.JS: AsDictionary","date":"March 9, 2012","format":false,"excerpt":"I frequently find that I have an array of objects in JavaScript that I want to display in a particular order and also have the ability to quickly locate an object by an ID or a key (and not use the indexOf function). 
As my recent project is using Knockout.JS,\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.wiredprairie.us\/blog\/wp-content\/uploads\/2012\/03\/image3.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1508,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1508","url_meta":{"origin":1857,"position":4},"title":"Nest Thermostat Review, Update #9","date":"January 22, 2012","format":false,"excerpt":"Summary\/Index When I woke up this morning, I decided that I\u2019d use the remote features of my Nest Thermostat to increase the temperature of the first floor as the normal schedule hadn\u2019t started yet. Here\u2019s what I saw on my iPad: Basement: ? First Floor: ? When I tapped the\u2026","rel":"","context":"In &quot;General&quot;","img":{"alt_text":"image","src":"https:\/\/i0.wp.com\/www.wiredprairie.us\/blog\/wp-content\/uploads\/2012\/01\/image23.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1214,"url":"https:\/\/www.wiredprairie.us\/blog\/index.php\/archives\/1214","url_meta":{"origin":1857,"position":5},"title":"Setup for the Asante VoyagerIP Cameras: Wireless Woes","date":"June 13, 2011","format":false,"excerpt":"I recently purchased two new IP cameras from Amazon. The Asante Voyager I and Asante Voyager II. They\u2019re both good cameras with lots of bells and whistles, and a decent amount of configuration options that should satisfy both the geeks and a non-geek. 
The reason I\u2019m posting this is to\u2026","rel":"","context":"In &quot;General&quot;","img":{"alt_text":"image","src":"https:\/\/i0.wp.com\/www.wiredprairie.us\/blog\/wp-content\/uploads\/2011\/06\/image1.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/posts\/1857"}],"collection":[{"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/comments?post=1857"}],"version-history":[{"count":0,"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/posts\/1857\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/media?parent=1857"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/categories?post=1857"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wiredprairie.us\/blog\/index.php\/wpjson\/wp\/v2\/tags?post=1857"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}