I run a small image-hosting service and I realized there is a lot of duplicate content. I want to eliminate this problem going forward by using a checksum or hash: each newly uploaded file will be hashed and compared against a database of existing image hashes; if it already exists, the upload is deleted and the user is presented with a link to the existing image. All on one instance.
My setup is barebones: Node.js + jQuery File Upload + two directories (one for forum uploads, another one for direct web uploads).
What is the best (fast & reliable) hash and database setup to do this, given that there might be thousands or millions of files in each directory? I think MD5 or SHA-1 is overkill and might take a lot of resources. I would like to know if there is any simpler solution.
Statistics:
- ~1,000 images uploaded every day
- ~400 KB average image size
- ~35,000 images on the server
- ~30% duplicate content (tested using MD5)
MD5 is actually quite fast, more than fast enough for your use case. One anecdotal benchmark puts it at roughly 400 megabytes per second on a single CPU (source). It wouldn't be the bottleneck in your server processing, and it is a reliable way to check for duplicate files. MD5 is vulnerable to collision attacks, but those must be painstakingly prepared; accidental collisions are statistically negligible. It sounds like collisions wouldn't be too great of a problem in your application (but make sure you handle them anyway).
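For concreteness, here is a minimal sketch of that flow in Node.js using the built-in crypto module. The upload directory, the JSON hash index, and names like `dedupeUpload` are hypothetical placeholders; at ~35,000 images a JSON file still works, but a database table with a unique index on the hash column is the natural next step.

```js
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

const UPLOAD_DIR = '/var/www/uploads';                 // hypothetical path
const INDEX_PATH = path.join(UPLOAD_DIR, 'hashes.json');

// hash -> stored filename, persisted as a plain JSON file
const index = fs.existsSync(INDEX_PATH)
  ? JSON.parse(fs.readFileSync(INDEX_PATH, 'utf8'))
  : {};

// Stream the file through MD5 so large images never sit fully in memory.
function md5File(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(filePath)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// On upload: hash the temp file; if the hash is known, delete the
// duplicate and hand back the existing file, otherwise keep the new one.
async function dedupeUpload(tempPath, storedName) {
  const digest = await md5File(tempPath);
  if (index[digest]) {
    fs.unlinkSync(tempPath);                           // drop the duplicate
    return { duplicate: true, file: index[digest] };
  }
  fs.renameSync(tempPath, path.join(UPLOAD_DIR, storedName));
  index[digest] = storedName;
  fs.writeFileSync(INDEX_PATH, JSON.stringify(index));
  return { duplicate: false, file: storedName };
}
```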
If you truly just want speed at the expense of reliability, you could go with a CRC. It's not intended to be a true hash, just to detect errors in a byte stream. A 32-bit CRC has only about four billion possible values, so with tens of thousands of files the birthday effect makes an accidental collision a realistic risk rather than a statistical curiosity. However, it's blazing fast; it's simple enough to be implemented in hardware on routers.
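A quick sketch, assuming the third-party crc-32 npm package (`npm install crc-32`); the file path is just an example:

```js
const CRC32 = require('crc-32');
const fs = require('fs');

function crc32File(filePath) {
  // At ~400 KB per image, reading the whole file into memory is fine.
  const data = fs.readFileSync(filePath);
  // CRC32.buf returns a signed 32-bit integer; >>> 0 makes it unsigned.
  return (CRC32.buf(data) >>> 0).toString(16);
}

console.log(crc32File('/tmp/upload.jpg')); // e.g. "9a3c51b2"
```

If you go this route, treat a CRC match only as a duplicate candidate and confirm with a byte-for-byte comparison (or an MD5) before deleting anything.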
How about the following approach:
- Store each image under its content hash, e.g. store/<md5-of-content>, so two identical uploads map to the same file name.
- Keep the user-visible file names (in your forum and web upload directories) as symlinks pointing at the hash-named file; a duplicate upload then just becomes one more symlink instead of another copy.
For converting the existing images into that structure, I'm sure a fairly simple shell script using `md5sum`, `mv` and `ln -s` would do the trick.
One other possibility is to use something like MongoDB to store the images in a DB, which may well be easier to cluster.
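If you do go down that road, the usual mechanism is GridFS. A minimal sketch with the official mongodb Node driver (`npm install mongodb`); the connection string, database name and bucket name are hypothetical:

```js
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function storeImage(filePath, filename) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('imagehost'), {
    bucketName: 'images',
  });

  // Stream the file into GridFS; its chunks replicate along with the DB.
  await new Promise((resolve, reject) => {
    fs.createReadStream(filePath)
      .pipe(bucket.openUploadStream(filename))
      .on('finish', resolve)
      .on('error', reject);
  });

  await client.close();
}
```

You would still keep a hash per file (e.g. in the GridFS file metadata, with a unique index on it) so duplicate detection stays a single indexed lookup.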