Reference: Boss 1 | Boss 2
Main purpose: used for text duplication checking and document similarity comparison.
The principle of estimation: the probability that the two minimum hash values ??obtained after random permutation and conversion of two sets are equal is equal to the jaccard similarity of the two sets (for example, there is a in the two sets n times random The sub-minimum hash values ??are equal, the probability of equality = a/n) The key to the minimum hash will not be proved at present
The jaccard similarity we use here is used to compare the similarity between limited sample sets and difference.
Given two sets A and B, the jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union. The larger the coefficient, the higher the similarity.
When A=?, B=?, jaccard=1.
Minimum hashing first targets the characteristic matrix.
For example, the complete set {a, b, c, d, e}.S1={a, d}, S2={c}, S3={b, d, e}, S4={a, c, d}
This is the characteristic matrix.
PS: Assuming that the document is ABCD, 2 can be used as the cutting feature, AB, BC, CD.
The minimum hash is to perform a row transformation on the feature matrix, and the minimum hash value is the first row number of 1 in the row sorting after the transformation.
For example, now h(S1)=a, h(S2)=c, h(S3)=b, h(S4)=a.
Specifically use the minimum hash algorithm , compress a feature matrix, that is, construct a signature matrix. Each column of the signature matrix is ??the value of n hash functions, and approximately estimates the jaccard value of the original data.
Next, we realize pseudo-row sorting by mapping out the row numbers. It is also a kind of continuous disruption of the order.
Simulate the method of calculating the signature matrix:
Consider line 0 first: you can get:
Calculate sequentially until the last line, only taking the minimum value each time (find the corresponding line and compare the current minimum hash value and the size of the mapped line number ):
The original jaccard similarity can be estimated through the signature matrix.
SIM(S1, S4)=1, while the actual similarity is 4/6=2/3. The above is just an estimate. Under large-scale data [multiple hashes], the estimated value is close to the true value.
I don’t know why the matrices and formulas I drew are gone. Let’s make up for it another day