For quite a while I have been using the following code in a C++ project for easily comparing files for any changes:

CkZipCrc zipCrc;

// Get CRC of binary file 1
unsigned long crc1 = zipCrc.FileCrc("binary file 1");

// Get CRC of binary file 2
unsigned long crc2 = zipCrc.FileCrc("binary file 2");

if (crc1 != crc2)
...

I am wondering whether this is considered a safe comparison for any type of file. I know there are many opinions about this, and that the file sizes and the number of files matter a great deal. It comes down to the likelihood of false positives (two different files producing the same CRC), but should I rewrite the code to be on the safe side?

The above code is used in an automated process that typically compares up to 1000 files at a time (two directories with the same file names, where some files may have changed and dates/attributes cannot be trusted). The files are typically less than 5 MB each, but since I have no upper limit they could potentially be any size, though I encourage my users to keep files small for various reasons.

It is critical that no errors occur as a result of an invalid comparison, but speed also matters a lot here. I chose CRC32 because it was fast and well integrated, but the question is whether I should switch to SHA1 or similar.

In case SHA1 is the recommended solution, would the following code be the best way to do it, or is there a built-in method to compare two binary files? (If not, that would be a nice feature.)

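CkCrypt2 crypt;
crypt.UnlockComponent("...");

crypt.put_HashAlgorithm("sha1");
crypt.put_EncodingMode("hex");

// Get hash of binary files. Copy each result into a std::string (needs <string>)
// so the comparison below is by content; guard against a null return on failure.
const char *p1 = crypt.hashFileENC("binary file 1");
std::string hash1 = p1 ? p1 : "";
const char *p2 = crypt.hashFileENC("binary file 2");
std::string hash2 = p2 ? p2 : "";

// Compare SHA1 hash values. Note that comparing two const char * values
// with != would only compare pointer addresses, not the hash strings.
if (hash1 != hash2)
...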

asked Nov 21 '12 at 11:29

roan98dk

I forgot to add that the hex encoding is not necessary for SHA1 comparison, but would that affect comparison speed?

(Nov 21 '12 at 11:54) roan98dk

I think both CRC32 and SHA-1 are adequate and of similar performance. With CRC32, the chance of a false positive (two different files producing the same checksum) is one in MAXINT, where MAXINT is the largest value a 32-bit unsigned int can hold, so about 1 in 4 billion per comparison. I'm personally OK with those odds. A SHA-1 hash is 20 bytes (160 bits) as opposed to the 4-byte CRC32 value, so its chance of collision is astronomically lower.
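To put those odds next to the batch size mentioned in the question (up to 1000 files per run), here is a rough sketch of the arithmetic. It assumes checksums are uniformly distributed and that the files are not crafted to collide, which is an idealization; `falseMatchProbability` is a hypothetical helper name, not a Chilkat function.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of `comparisons` independent pairs of
// *different* files produces identical checksums, assuming an ideal,
// uniformly distributed checksum of `bits` bits:
//   1 - (1 - 2^-bits)^comparisons
double falseMatchProbability(int bits, double comparisons) {
    assert(bits > 0 && comparisons >= 0.0);
    const double perPair = std::ldexp(1.0, -bits);  // 2^-bits
    // expm1/log1p keep precision when perPair is extremely small.
    return -std::expm1(comparisons * std::log1p(-perPair));
}
```

Under these assumptions, `falseMatchProbability(32, 1000.0)` is roughly 1000 / 2^32 (a few in ten million per run), while `falseMatchProbability(160, 1000.0)` is many orders of magnitude smaller. The estimate says nothing about deliberate tampering, where CRC32 offers no protection.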

answered Nov 21 '12 at 12:38

chilkat ♦♦

Thanks, that was my understanding as well. Would you consider adding an integrated method that does the compare and simply returns an integer value for equality (only to simplify code and avoid returning temporary objects)?

E.g. CkZipCrc::FileCompare(const char *file1, const char *file2) 
or 
CkCrypt2::HashFileCompare(const char *file1, const char *file2, const char *hashAlgorithm)

answered Nov 21 '12 at 13:20

roan98dk

I would compare file sizes first to potentially short-circuit the SHA1 calls, which will save processing time. If the files are different sizes, then obviously the contents are not equal.
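A minimal sketch of that short circuit, using only the C++17 standard library so that it is self-contained. `sameContent` is a hypothetical helper, not a Chilkat method, and the streamed byte-for-byte comparison is a stand-in for the step where the CRC32 or SHA1 comparison from the question would go.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: true when both files are readable and have the same content.
bool sameContent(const fs::path& a, const fs::path& b) {
    // Cheap check first: different sizes mean different contents.
    std::error_code ecA, ecB;
    const auto sizeA = fs::file_size(a, ecA);
    const auto sizeB = fs::file_size(b, ecB);
    if (ecA || ecB) return false;      // treat unreadable files as "not equal"
    if (sizeA != sizeB) return false;  // short circuit: no need to read the data

    // Sizes match, so the contents must be examined. A hash comparison
    // (CRC32 or SHA1) could be substituted for this streamed comparison.
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    if (!fa || !fb) return false;
    std::vector<char> bufA(64 * 1024), bufB(64 * 1024);
    while (true) {
        fa.read(bufA.data(), static_cast<std::streamsize>(bufA.size()));
        fb.read(bufB.data(), static_cast<std::streamsize>(bufB.size()));
        const std::streamsize nA = fa.gcount(), nB = fb.gcount();
        if (nA != nB) return false;
        if (nA == 0) return true;      // both files reached the end together
        if (!std::equal(bufA.begin(), bufA.begin() + nA, bufB.begin())) return false;
    }
}
```

The size lookup only touches file metadata, so pairs that differ in size are rejected without reading any file data.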

If you are worried about collisions, you could also do a binary compare of a few random internal chunks of bytes from the files. This might save time before calculating the SHA1 as well, but it would need to be benchmarked (if you are dealing with lots of large files, it certainly would).
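One way the chunk spot check might look, again with the standard library only. `sampledChunksMatch`, the chunk size, the sample count and the fixed seed are all illustrative assumptions; the caller is assumed to have already verified that both files have the same size and to pass that size in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <random>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns false as soon as a sampled chunk differs.
// A result of true only means no difference was found in the samples,
// so a full hash or byte comparison is still needed afterwards.
bool sampledChunksMatch(const fs::path& a, const fs::path& b,
                        std::uint64_t fileSize, std::size_t chunkSize = 4096,
                        int samples = 8, unsigned seed = 12345) {
    assert(chunkSize > 0 && samples >= 0);
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    if (!fa || !fb) return false;
    if (fileSize == 0) return true;    // nothing to sample

    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint64_t> pick(0, fileSize - 1);
    std::vector<char> bufA(chunkSize), bufB(chunkSize);
    for (int i = 0; i < samples; ++i) {
        const auto offset = static_cast<std::streamoff>(pick(rng));
        fa.clear(); fb.clear();        // reset EOF state left by a short read
        fa.seekg(offset, std::ios::beg);
        fb.seekg(offset, std::ios::beg);
        fa.read(bufA.data(), static_cast<std::streamsize>(chunkSize));
        fb.read(bufB.data(), static_cast<std::streamsize>(chunkSize));
        const std::streamsize nA = fa.gcount(), nB = fb.gcount();
        if (nA != nB || !std::equal(bufA.begin(), bufA.begin() + nA, bufB.begin()))
            return false;              // a differing chunk proves the files differ
    }
    return true;
}
```

A mismatch here is conclusive, so the expensive hash can be skipped for that pair; a match is not, which is why this is only a pre-filter.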

I ran some quick tests on a few hundred files, and just performing the size check optimization reduced the total run time from around 6 seconds to around 200 milliseconds.

answered Nov 21 '12 at 14:07

jpbro ♦

edited Nov 21 '12 at 14:10

Thanks, I never thought of that :-) This is one more argument for having an integrated method for comparing files. Sure, it is easy to write this myself, but it may help many users to have such a method built into Chilkat, e.g. a method that takes your suggestions into consideration and perhaps even automatically changes behavior for different file sizes. If hashAlgorithm in my suggestion were optional, the method could perhaps pick CRC32 for "smaller" files and SHA1 for "larger" files.

(Nov 21 '12 at 14:41) roan98dk

That's a great answer, thanks!

(Nov 22 '12 at 21:21) chilkat ♦♦

In the end I decided to go with a quick initial check for a file size difference, followed by a SHA1 check. I could also stay with CRC32, but since stability and trust are the key issues here, I prefer to be on the safe side. I suddenly recalled that I have previously experimented with tampering with files while manually adjusting the data so that the CRC32 still matched (very easy). With CRC32 such tampering would have gone undetected; with SHA1 that is no longer possible (at least not easily).

(Nov 23 '12 at 05:57) roan98dk

Asked: Nov 21 '12 at 11:29

Seen: 2,464 times

Last updated: Nov 23 '12 at 05:57