In Irudiko, a Locality-Sensitive Hashing sketch is basically a condensed representation of a document whose size is fixed, rather than depending on the size of the document. By comparing two sketches, it is possible to estimate in linear time, with a fairly good approximation, how similar the two documents are.
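
To make the comparison step concrete, here is a minimal sketch of how two sketches could be compared in a single linear pass. It is not Irudiko's actual API: the `Sketch` type (a sorted vector of distinct 64-bit hash values) and the `estimate_similarity` function are hypothetical names, and the similarity measure is assumed to be the usual resemblance |A ∩ B| / |A ∪ B|.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch type (an assumption, not Irudiko's own):
// a sorted vector of distinct hash values.
using Sketch = std::vector<std::uint64_t>;

// Estimates the resemblance |A ∩ B| / |A ∪ B| of two sketches.
// Both inputs must be sorted and free of duplicates, so one merge-style
// pass is enough: the cost is linear in the total sketch size.
double estimate_similarity(const Sketch& a, const Sketch& b) {
    std::size_t i = 0, j = 0, common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) {
            ++common;
            ++i;
            ++j;
        } else if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    const std::size_t total = a.size() + b.size() - common;
    // Two empty sketches are treated as identical here (a design choice).
    return total == 0 ? 1.0 : static_cast<double>(common) / total;
}
```

For example, the sketches {1, 2, 3, 4} and {3, 4, 5, 6} share 2 values out of 6 distinct ones, so the estimate is 2/6.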

Irudiko is a C++ library that transforms a given document into a smaller sketch. It also implements several ways to further optimize a document:
  1. Removal of HTML tags and useless content (e.g. JavaScript, CSS, comments) from HTML pages;
  2. Native support for documents written in English and Italian (at least for now), so that they can be optimized through stopword removal and stemming;
  3. Generation of a set of shingles from a given page, in order to estimate similarity more accurately;
  4. Dimensionality reduction, using the two most widely used functions in the area, min and mod (I suggest using mod, btw ;);
  5. Recognition and removal of so-called layout information, i.e., all the web content shared by pages coming from the same website.
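
Steps 3 and 4 can be illustrated with a small, self-contained sketch. None of the names below come from Irudiko: `fnv1a`, `shingle_hashes`, `reduce_min` and `reduce_mod` are hypothetical helpers, FNV-1a is used only because it is a simple deterministic hash, and the two reductions follow the definitions commonly given in the literature (min keeps the s smallest hash values, mod keeps the hash values divisible by m), which I assume match what is meant here.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// FNV-1a, used here only as a simple deterministic 64-bit hash.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hashes every run of w consecutive words (a w-shingle) of the text.
// Duplicated shingles collapse, because the result is a set.
std::set<std::uint64_t> shingle_hashes(const std::string& text, std::size_t w) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string word; in >> word;) words.push_back(word);

    std::set<std::uint64_t> hashes;
    if (w == 0) return hashes;
    for (std::size_t i = 0; i + w <= words.size(); ++i) {
        std::string shingle;
        for (std::size_t k = 0; k < w; ++k) {
            if (k > 0) shingle += ' ';
            shingle += words[i + k];
        }
        hashes.insert(fnv1a(shingle));
    }
    return hashes;
}

// "min" reduction: keep the s smallest hash values (fixed-size result).
std::vector<std::uint64_t> reduce_min(const std::set<std::uint64_t>& h,
                                      std::size_t s) {
    std::vector<std::uint64_t> out;
    for (auto it = h.begin(); it != h.end() && out.size() < s; ++it)
        out.push_back(*it);
    return out;
}

// "mod" reduction: keep the hash values divisible by m (m must be > 0).
std::vector<std::uint64_t> reduce_mod(const std::set<std::uint64_t>& h,
                                      std::uint64_t m) {
    std::vector<std::uint64_t> out;
    for (std::uint64_t v : h)
        if (v % m == 0) out.push_back(v);
    return out;
}
```

Both reductions return sorted vectors, because `std::set` iterates in ascending order. As a quick check, the text "a rose is a rose is a rose" with w = 4 contains five shingles, of which only three are distinct.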

For more information: the approach used in Irudiko is roughly described in a presentation of mine, available in PDF format here. It is mainly covered by a number of research articles, which are cited in my final MSc dissertation (available in Italian only, sorry).