pymongo去除重复数据

唯一索引
1
db.things.ensureIndex({'key' : 1}, {unique : true, dropDups : true})
但是dropDups is not supported by MongoDB 2.7.5 or newer所以这个方法只能在2.7.5版本以下才行

用aggreate找出重复的数据，然后再一个一个删除(效率比较低)，python代码

#先找到重复的数据
deleteData=collection.aggregate([
{'$group': { 
    '_id': { 'firstField': "$area", 'secondField': "$time_point" }, 
    'uniqueIds': { '$addToSet': "$_id" },
    'count': { '$sum': 1 } 
  }}, 
  { '$match': { 
    'count': { '$gt': 1 } 
  }}
]);
first=True
for d in deleteData:
    first=True
    for did in d['uniqueIds']:
        if !first:    #第一个不删除
            collection.delete_one({'_id':did});
        first=False

参考1
参考2

第二种方法当数据量很大的时候，需要把数据写入表中。aggregate的pipeline中要加上out项，同时由于aggregate只接受两个参数，self是默认的，所以要用allowDiskUse=True这种形式添加参数

# 找出重复的放入result表中
def findDuplicate():
    deleteData=collection.aggregate([
        {'$group': {
            '_id': { 'firstField': "$mid", 'secondField': "$created_at" },
            'uniqueIds': { '$addToSet': "$_id" },
            'count': { '$sum': 1 }
            }
        },
        { '$match': {
            'count': { '$gt': 1 }
            }
        },{'$out':'result'}
    ],allowDiskUse=True); 

def deleteDup():
    deleteData=db.result.find()
    first=True
    for d in deleteData:
        first=True
        for did in d['uniqueIds']:
            if first==False:
                collection.delete_one({'_id':did});
            first=False