Generally speaking, the Hamming distance between two strings of the same length is the number of positions in which the corresponding symbols are different. This metric, proposed by Richard W. Hamming in his seminal paper “Error Detecting and Error Correcting Codes”, Bell System Technical Journal 26(2):147-160 (1950), also expresses the least number of (intended or unintended) substitutions needed to change one string to another.
Besides being used in computer- and communications-related fields such as information theory, coding theory, and cryptography, the Hamming distance concept has also found its way into genomics for the comparison of genomic sequences. Its importance can also be judged by the fact that modern microprocessors have a specific instruction (POPCNT, “population count”) for counting the number of bits set to 1.
Assuming two bit strings of equal length, x and y, the “obvious” algorithm to calculate the Hamming distance between them is to count the number of “1” bits in the result of the expression “x xor y”, as shown in the following Python code:
def hamming1(x,y): """Calculate the Hamming distance between two bit strings""" assert len(x) == len(y) count,z = 0,int(x,2)^int(y,2) while z: if z&1: count += 1 z >>= 1 return count
While searching the web the other day, I have come across a very elegant algorithm for the same purpose, which can be coded in Python as follows:
def hamming2(x,y): """Calculate the Hamming distance between two bit strings""" assert len(x) == len(y) count,z = 0,int(x,2)^int(y,2) while z: count += 1 z &= z-1 # magic! return count
Apart from being 50% faster than the first one, its simplicity is awesome! After being counted, the lowest-order nonzero bit is cleared like magic! When I started a new search later, this time looking for the algorithm’s authorship, I finally learned that it was proposed by Peter Wegner in 1960 (“A technique for counting ones in a binary computer“, Communications of the ACM 3 (5): 322).
Where did I find this information? The answer was, talking again about “obvious” things, in the very same Wikipedia’s entry for “Hamming distance“! This little gem has been there for at least a year (according to the entry’s history), just waiting to be discovered! Quoting Antoine de Saint Exupéry, “You know you have achieved perfection in design, not when you have nothing more to add, but when you have nothing more to take away.”
Caro João Henrique,
Realmente é uma solução bastante engenhosa e, à primeira vista, cheguei a pensar que se tratava mesmo de “mágica”. Após colocar meus dois neurônios para trabalhar em paralelo, vi então que é uma solução muito engenhosa, mesmo. Quando você subtrai 1 de um número binário, você inverte todos os bits ‘0’ menos significativos, e também inverte o primeiro bit ‘1’ que aparecer (efeito do carry). Fazendo um AND com o valor original, os zeros que foram invertidos não têm efeito, pois são ‘0’ no original, assim permanecendo, já o bit ‘1’ de menor ordem, que virou ‘0’ na subtração, este será zerado pelo AND, gerando uma contagem de 1, que incrementa o contador.
Muito bom! Experimentei em C e PASCAL, com a mesma simplicidade e eficiência. Fiz o programa imprimir, a cada loop, o valor binário de ‘z’, e podemos observar os bits ‘1’ serem “apagados” da direita para a esquerda. Ficou um efeito visual muito bonito.
Muito boa descoberta!
This algorithm works for bits (e.g. 0’s and 1’s) but not for anything else (e.g. strings)
You’re right. Thank you.
I think a lookup table would be faster. For brevity, you could use a lookup table of size 16 for each nibble.
Pingback: python - Distanza di Hamming tra due stringhe binarie non funziona