Compute each file's MD5 hash and compare it against the hashes already seen: if it matches, the file is a duplicate and is deleted; otherwise it is kept.
With a little tweaking this could dedupe an entire disk and replace each duplicate with a symlink to the first copy (a rough sketch of that variant follows the code below).
import os
import hashlib

def file_md5(filepath):
    """Return the MD5 hex digest of a file, reading in chunks to bound memory use."""
    md5obj = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5obj.update(chunk)
    return md5obj.hexdigest()


def file_dedup(dirpath):
    """Delete files in dirpath whose content duplicates an earlier file."""
    hashpool = set()                      # hashes seen so far; set gives O(1) lookup
    print("dedup files in", dirpath)
    for name in os.listdir(dirpath):      # all entries (files and dirs) in the directory
        filepath = os.path.join(dirpath, name)
        print(name, "check")
        if os.path.isdir(filepath):       # skip subdirectories
            print(filepath, "is a directory")
            continue
        filehash = file_md5(filepath)
        if filehash in hashpool:          # content seen before: remove the duplicate
            print("exists! delete")
            os.remove(filepath)
        else:
            print("new file")
            hashpool.add(filehash)


if __name__ == "__main__":
    dirpath = "E:/files"
    file_dedup(dirpath)
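A minimal sketch of the whole-disk variant mentioned above, assuming os.walk for the recursive traversal and that replacing each duplicate with a symbolic link to the first-seen copy is acceptable; dedup_to_symlinks and the "E:/files" root are illustrative names, not part of the original:

import os
import hashlib

def file_md5(filepath):
    md5obj = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5obj.update(chunk)
    return md5obj.hexdigest()

def dedup_to_symlinks(root):
    """Walk root recursively; replace duplicate files with symlinks to the first copy."""
    first_seen = {}                           # hash -> path of first file with that content
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            if os.path.islink(filepath):      # skip links, including ones we just created
                continue
            filehash = file_md5(filepath)
            if filehash in first_seen:
                os.remove(filepath)           # drop the duplicate...
                os.symlink(first_seen[filehash], filepath)  # ...and point it at the original
            else:
                first_seen[filehash] = filepath

if __name__ == "__main__":
    dedup_to_symlinks("E:/files")

Note that on Windows, os.symlink generally requires administrator privileges or developer mode to be enabled.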
This is file-level dedup, though, and the drawback is that once you modify one file, everything linked to it changes along with it.
I wasn't after real dedup; I just wanted to delete the many duplicate files left over from earlier crawls.