Naze-Nani MD5, from #spectre-anime ==> The How and Why of MD5 (written by KitLaan) (The text in this document should not be taken as fact. You should use this information with a grain of salt, and a dish of distrust. It should be correct, but you never know. Sorry for the long read. When I write, I write... And typos and grammar mistakes are purely intentional.) Summary: MD5 good. CRC-32 not as good. SHA1 even better. MD5 checker is usually called "md5sum". Search on Google. WinMD5 is also pretty decent. Search on Google. (No source code available) The Actual Thing: So, you see this MD5 thing. What is it? Why aren't we using CRC32 or SFV or whatnot? Well, this document should explain some things. First, lets explain what checksums are. Checksums, or hashes, are the results from hashing algorithms. Essentially, the algorithm takes some data (in this case, a file) and converts it into a string of data. This resulting piece of data has the property that you cannot use it to generate the original data. (The algorithm describes a non-reversible function.) It's like a fingerprint for data. The usual uses of these checksums are for checking the validity of a password, as the only way of breaking it is via a brute-force approach. (i.e. try all the different inputs to try to produce the wanted output.) Needless to say, with a "secure enough" hashing algorithm, this could take forever (or near enough). For example, most password files (shadowed passwords, I think) on Unix (or Unix-based) machines have the passwords hashed. In cryptography (where most of these algorithms are used), the most widely used algorithms are MD4, MD5, SHA, and SHA1. (MD refers to Message Digest, and SHA refers to Secure Hash Algorithm.) In #spectre-anime, we use the MD5 algorithm. More on how it works and how it compares to others later. To get a program that generates MD5 digests (outputs), the easiest method is to search on Google. (For Windows, search "md5sum windows". It's one of the first five results, probably. For Unix/Linux, it's already installed probably.) Now, you may be thinking, "what does cryptography have to do with us?" Well, an alternate use for these secure hashes, is to validate the "correctness" of a piece of data (aka, your file). The same principle applies. Depending on the hash algorithm you use, you are guaranteed that the output is unique to the input (sorta; more on this later). Therefore, by using MD5 (or any hash algorithm), you can make sure your file exactly matches the original file by running the algorithm and making sure your output matches the one from the original. So, if you take the original file (FILE_A), and your copy of it (FILE_B). Now run the hash algorithm on the files, and compare the outputs. (Random garbage used for example.) FILE_A : hash(FILE_A) ==> ABCDEF FILE_B : hash(FILE_B) ==> DEFACD Given these results, what do you know about the two files? They are not the same. Someone may have tampered with them, or more likely, something got corrupted during transmission. So, you have to go re-download the file. *sigh* (Warning. Math-heavy stuff starts to occur after this point.) The next question, "what types of checksum algorithms are used normally?" From my experience (which is probably the same, or very different from other people...) I see MD5 used more. Here's a list of some you may have seen used: - CRC-32 (Cyclic Redundancy Check 32) (CRC-32 is used in SFV, RAR, and others.) CRCs are based on polynomial arithmetic. CRC-32 uses a 32-bit polynomial, and detects stuff based on 32 bits. Glossing over the details, the algorithm may generate a collision with the probability of 1 in 4 billion. This may seem big, but it all depends on the data. (This number is small enough that someone could create/modify a file to match the original CRC). - MD5 (Message Digest 5) MD5 uses a 128-bit output value. This is much longer than others (aka CRC32, which only has 32-bits). By using a long value, the possibilty of data having the same value is greatly reduced. The algorithm was created in 1992 by Ron Rivest, based on MD4. - SHA1 SHA guarantees (I think) that a message (stream of data) that is less than 2^64 bits in length will provide a 160-bit digest, that will take a very long time (almost impossible) to brute-force find (given current-day technology). [I think 2^64 bits is around 2^31 gigabytes. Yes that's big.] - Others. (Yes, I'm lazy.) So, SHA1 is better than MD5 is better than CRC-32. But you sacrifice speed as the bitlength goes up. But at the rate of computing speed increases, it isn't as important. Another thing you may be asking is about bit-length. Essentially, the number of bits tells you how many possible outputs (digests) a hash function can generate. Therefore, a 32-bit algorithm generates 2^32 unique digests, while a 128-bit algorithm generates 2^128 unique digests. You can see what a difference this makes. (Infinite possible inputs, XXX possible outputs. You can see the ratio of the probabilities of overlap.) In other words, CRC-32 has a higher chance of two files (even minute differences) to have a duplicate digest value. MD5 is lower... If you want a good and simple read on probabilities of collision, read the thing on "Taylor's Law of Programming Probability", listed in the References section, below. As to how it works, go read the algorithm or the various pages describing it (listed below). It's too math heavy for me to regurgitate. =P Lastly, use MD5, or whatever else that is even better. Security and paranoia is good. And lastly, go use PGP. :) -KitLaan (kitlaan@twinaxis.com) April 24, 2002 [ blatent promotion: md5check thingy available for mIRC: ] [ http://kitlaan.twinaxis.com/projects/mirc.html ] References: - MD5 Algorithm (RFC1321) http://sunsite.doc.ic.ac.uk/rfc/rfc1321.txt - MD4 Algorithm (RFC1320) http://sunsite.doc.ic.ac.uk/rfc/rfc1320.txt - Kerberos Network Authentication Service (RFC1510) (Describes various hashes and their "reliability") http://sunsite.doc.ic.ac.uk/rfc/rfc1510.txt - CRC (ISO 3309) - SHA1 Algorithm (RFC3174) http://www.ietf.org/rfc/rfc3174.txt - Brute-Forcing MD5? http://www.stack.nl/~galactus/remailers/attack-3.html - Statistical Analysis of MD5 http://www.spiderline.com/data/001/md5/index.html - Taylor's Law of Programming Probability http://www.miketaylor.org.uk/tech/law.html - Cryptography http://www.checksum.org/download/cryptography.pdf