Fuzzy hashing, also known as similarity hashing, is a technique for detecting data that is similar, but not exactly the same, as other data. This is in contrast to

cryptographic hash functions A cryptographic hash function (CHF) is a hash algorithm (a map of an arbitrary binary string to a binary string with a fixed size of n bits) that has special properties desirable for a cryptographic application: * the probability of a particula ...

, which are designed to have significantly different hashes for even minor differences. Fuzzy hashing has been used to identify malware and has potential for other applications, like data loss prevention and detecting multiple versions of code.

Background

hash function A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...

is a mathematical algorithm which maps arbitrary-sized data to a fixed size output. Many solutions use

SHA-256 SHA-2 (Secure Hash Algorithm 2) is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compressi ...

to detect duplicates or check for known files within large collection of files. However, cryptographic hash functions cannot be used for determining if a file is ''similar'' to a known file, because one of the requirements of a cryptographic hash function is that a small change to the input should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (

avalanche effect In cryptography, the avalanche effect is the desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions, wherein if an input is changed slightly (for example, flipping a single bit), the output changes ...

) Fuzzy hashing exists to solve this problem of detecting data that is similar, but not exactly the same, as other data. Fuzzy hashing algorithms specifically use algorithms in which two similar inputs will generate two similar hash values. This property is the exact opposite of the avalanche effect desired in cryptographic hash functions. Fuzzy hashing can also be used to detect when one object is contained within another.

Approaches for fuzzy hashing

There are a few approaches used for building fuzzy hash algorithms: * Context Triggered Piecewise Hashing (CTPH), which constructs a hash by splitting the input into multiple pieces, calculating traditional hashes for each piece, and then combining those traditional hashes into a single string. * Locality Sensitive Hashing places similar input items into the same "buckets", which can be used for

data clustering Cluster analysis or clustering is the data analyzing technique in which task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some specific sense defined by the analyst) to each o ...

and nearest neighbor searches

Notable fuzzy hashing tools and algorithms

* spamsum is a tool written by

Andrew Tridgell Andrew "Tridge" Tridgell (born 28 February 1967) is an Australian computer programmer. He is the author of and a contributor to the Samba (software), Samba file server, and co-inventor of the rsync algorithm. He has analysed complex proprieta ...

that uses fuzzy hashing to determine whether an email is similar to known spam. It operates by generating a fuzzy hash for an email that it compares against the fuzzy hashes from known spam emails to generate a match result between 0 (complete mismatch) to 100 (perfect match). If the match result is high enough, the email is classified as spam. * Nilsimsa Hash is an anti-spam focused

locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Si ...

algorithm. * ssdeep is a fuzzy hashing tool based on context-piecewise triggered hashing to compare files. * sdhash is a fuzzy hashing tool based on using

bloom filters In computing, a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element (mathematics), element is a member of a set (computer science), set. Type I and type ...

to determine whether one file is contained within another or how similar two files are to each other. * TLSH is a locality sensitive hashing scheme for comparing whether files are similar to each other and has been used for malware clustering.

References

{{reflist, refs= {{cite journal, date=May 2014, title=NIST Special Publication 800-168, url=https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-168.pdf, access-date=January 11, 2023, website=NIST Publications, doi=10.6028/NIST.SP.800-168 , last1=Breitinger, first1=Frank {{cite journal , last1=Kornblum , first1=Jesse , title=Identifying almost identical files using context triggered piecewise hashing , url=https://www.sciencedirect.com/science/article/pii/S1742287606000764 , journal=Digital Investigation , year=2006 , volume=3, Supplement , issue=September 2006 , pages=91–97 , doi=10.1016/j.diin.2006.06.015 , access-date=June 30, 2022, doi-access=free {{cite book , last=Roussev , first=Vassil , title=Advances in Digital Forensics VI , chapter=Data Fingerprinting with Similarity Digests , series=IFIP Advances in Information and Communication Technology , publisher=Springer Berlin Heidelberg , publication-place=Berlin, Heidelberg , year=2010 , volume=337 , pages=207–226 , isbn=978-3-642-15505-5 , issn=1868-4238 , doi=10.1007/978-3-642-15506-2_15 {{cite conference , last1=Oliver , first1=Jonathan , last2=Cheng , first2=Chun , last3=Chen , first3=Yanggui , title=2013 Fourth Cybercrime and Trustworthy Computing Workshop , chapter=TLSH -- A Locality Sensitive Hash , publisher=IEEE , year=2013 , pages=7–13 , doi=10.1109/ctc.2013.9 , isbn=978-1-4799-3076-0 , chapter-url=https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf, access-date=December 12, 2022 {{cite conference , last1=Oliver , first1=Jonathan , last2=Hagen , first2=Josiah , title=2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC) , chapter=Designing the Elements of a Fuzzy Hashing Scheme , publisher=IEEE , year=2021 , pages=1–6 , doi=10.1109/euc53437.2021.00028, isbn=978-1-6654-0036-7 , chapter-url=https://tlsh.org/papersDir/Design_TLSH_2021.pdf , access-date=14 April 2021, archive-date=14 April 2021, archive-url=https://web.archive.org/web/20210414044249/http://tlsh.org/papersDir/Design_TLSH_2021.pdf {{cite web , url=https://www.samba.org/ftp/unpacked/junkcode/spamsum/README , title=spamsum README , website=samba.org , access-date=December 11, 2022 {{cite web , url=https://download.samba.org/pub/unpacked/junkcode/spamsum/spamsum.c , title=spamsum.c , website=samba.org , access-date=December 11, 2022 {{cite conference , last1=Pagani , first1=Fabio , last2=Dell'Amico , first2=Matteo , last3=Balzarotti , first3=Davide , title=Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy , chapter=Beyond Precision and Recall , publisher=ACM , publication-place=New York, NY, USA , date=2018-03-13 , pages=354–365 , doi=10.1145/3176258.3176306, isbn=9781450356329 , chapter-url=https://pagabuc.me/docs/codaspy18_pagani.pdf, access-date=December 12, 2022 {{cite journal, last1=Al-Kuwari , first1=Saif , last2=Davenport , first2=James H., last3=Bradford, first3=Russell J., date=2011, title=Cryptographic Hash Functions: Recent Design Trends and Security Notions, journal=Cryptology ePrint Archive, id=Report 2011/565 , url=https://eprint.iacr.org/2011/565 {{cite book, last1=Sarantinos , first1=Nikolaos , last2=Benzaïd, first2=Chafika, last3=Arabiat, first3=Omar, title=2016 IEEE Trustcom/BigDataSE/ISPA , chapter=Forensic Malware Analysis: The Value of Fuzzy Hashing Algorithms in Identifying Similarities , date=2016, pages=1782–1787 , doi=10.1109/TrustCom.2016.0274 , isbn=978-1-5090-3205-1 , s2cid=32568938 , url=http://roar.uel.ac.uk/5710/1/Forensic%20Malware%20Analysis.pdf , id=10.1109/TrustCom.2016.0274 , chapter-url=https://ieeexplore.ieee.org/document/7847157 {{cite web , url=https://tlsh.org/papersDir/n21_opt_cluster.pdf , title=Fast Clustering of High Dimensional Data Clustering the Malware Bazaar Dataset , website=tlsh.org , access-date=December 11, 2022 {{cite web , url=https://github.com/trendmicro/tlsh/blob/master/TLSH_Introduction.pdf , title=Open Source Similarity Digests DFRWS August 2016 , website=tlsh.org , access-date=December 11, 2022 Anti-spam Computer security Digital forensics Hashing

Background

Approaches for fuzzy hashing

Notable fuzzy hashing tools and algorithms

See also

References