Recently, a post to the full-disclosuremailing list described an update to the well known MD5 collisionproblem. The authors - Marc Stevens, Arjen K. Lenstra, and Benne deWeger - provided a method whereby they can append only a few thousandbytes to two arbitrary files, with the result that both files have thesame MD5 value. This is known as a "chosen prefix collision." Not onlythat, but they produced their proof-of-concept files using one machinein less than two days. If you distribute the work, you can make it go faster.
While what they have achieved is not the same as producing anidentical MD5 for an existing file, it's still not a good thing. Inparticular it causes serious trouble for application white-listingimplementations. Why? Imagine this scenario:
- malware author creates a harmless application.
- malware author creates a malicious application.
- malware author uses the chosen prefix collision method to alter these two applications to produce an identical MD5 value.
- malware author uploads the harmless application to a Web site thathosts a drive-by download exploit. The drive-by download installs theharmless application everywhere.
- harmless application is submitted to a classification server, which white-lists the file because it is harmless.
- malware author uploads the malicious application to a Web site thathosts a drive-by download exploit. The drive-by download installs themalicious application everywhere.
- since the MD5 values are the same, the malicious application nowlooks identical to the harmless application and is not blocked.
And we wouldn't even know that it's happening. Oh.
Suddenly (or not - it has been expected for some time), MD5 isn'tlooking so good anymore. But wait, there's more: the harmlessapplication can even be digitally signed after alteration,making it look even more harmless. And so can the maliciousapplication. That's because plenty of digital signatures are verifiedby using MD5.
How about SHA-1? No, the same kind of problem exists there.
Multiple checksums? Useless if the second algorithm is weaker thanthe first one (MD5 and CRC32, for example). If the second algorithm isstrong, then it is sufficient on its own and the MD5 calculationbecomes a waste of time.
Can we be suspicious about the extra bytes? Not reliably - it mightlook like compressed data in an installer application, or some kind ofsignature.
SHA-2 is looking good.