How are near duplicates determined?
Briefly, the process measures how similar each pair of documents is. Each document is broken into “shingles,” where a shingle is a sequence of consecutive words. For example, if shingles are three words long, then words 1-3 form the first shingle, words 2-4 the second, and so forth. The similarity of two documents is the number of distinct shingles they share divided by the total number of distinct shingles that appear in either document. The score therefore ranges from 0 (no shingles in common) to 1 (identical sets of shingles), and documents that share a larger proportion of their shingles are more similar.
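The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the function names `shingles` and `similarity`, the lowercasing, and the whitespace-based word splitting are all assumptions made for the example. The ratio it computes (shared shingles over all distinct shingles) is commonly called the Jaccard similarity.

```python
def shingles(text, k=3):
    """Return the set of distinct k-word shingles in text.

    Assumption for this sketch: words are lowercased and split on whitespace.
    A text with fewer than k words yields an empty set.
    """
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(doc_a, doc_b, k=3):
    """Shared shingles divided by all distinct shingles in the two documents."""
    a = shingles(doc_a, k)
    b = shingles(doc_b, k)
    union = a | b
    if not union:
        # Neither document is long enough to form a shingle.
        return 0.0
    return len(a & b) / len(union)


# The two sentences share 2 of their 4 distinct three-word shingles.
print(similarity("the quick brown fox jumps", "the quick brown fox sleeps"))  # 0.5
```

Comparing every pair this way grows quadratically with the number of documents, so a system handling large collections would likely rely on an approximation rather than this direct comparison.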