In Irudiko, a Locality-Sensitive Hashing sketch is a condensed representation of a document whose size is fixed rather than dependent on the size of the document itself. By comparing two sketches, it is possible to estimate in linear time, with a fairly good approximation, how similar the two documents are.
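As a rough illustration of why comparing two sketches takes linear time, here is a minimal sketch that assumes each sketch is stored as a sorted, duplicate-free vector of 64-bit hash values and that similarity is estimated as the overlap (Jaccard ratio) between the two. The function name `estimate_similarity` and the storage layout are assumptions made for this example; they are not taken from Irudiko's actual API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Irudiko's API.
// Estimates similarity as |A ∩ B| / |A ∪ B| over two sketches.
// Both inputs must be sorted in ascending order and contain no duplicates.
double estimate_similarity(const std::vector<std::uint64_t>& a,
                           const std::vector<std::uint64_t>& b) {
    std::size_t i = 0, j = 0, common = 0;
    // Single merge-style pass: each element is visited at most once,
    // so the cost is linear in the combined sketch size.
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) {
            ++common;
            ++i;
            ++j;
        } else if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    const std::size_t total = a.size() + b.size() - common;
    // Two empty sketches give nothing to compare; report 0 by convention.
    if (total == 0) return 0.0;
    return static_cast<double>(common) / static_cast<double>(total);
}
```

Because the sketches have a fixed size, the cost of this comparison does not grow with the length of the original documents.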
Irudiko is a C++ library that transforms a given document into a smaller sketch. It also implements several ways to further optimize a document:
- Removal of HTML tags and useless content (e.g. JavaScript, CSS, comments) from HTML pages;
- Native support for documents written in English and Italian (at least for now), which are optimized through stopword removal and stemming;
- Generation of a set of shingles from a given page, in order to estimate similarity more accurately;
- Dimensionality reduction, using the two most commonly used functions in the area, min and mod (I suggest using mod, by the way ;);
- Recognition and removal of so-called layout information, i.e., all the web content shared by pages coming from the same website.
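The shingling and dimensionality reduction steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not Irudiko's actual code: the function names (`tokenize`, `shingle_hashes`, `reduce_min`, `reduce_mod`), the use of word-level shingles, and the FNV-1a hash are all choices made for this example. Here "min" is taken to mean keeping the k smallest shingle hashes, and "mod" to mean keeping the hashes divisible by m, which is how these two functions are usually defined in the shingling literature.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// FNV-1a 64-bit hash, chosen here only because it is short and deterministic.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Split text into whitespace-separated tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// Hash every run of w consecutive tokens (a "shingle").
// std::set removes duplicates and keeps the hashes sorted.
std::set<std::uint64_t> shingle_hashes(const std::vector<std::string>& tokens,
                                       std::size_t w) {
    std::set<std::uint64_t> out;
    if (w == 0 || tokens.size() < w) return out;
    for (std::size_t i = 0; i + w <= tokens.size(); ++i) {
        std::string shingle;
        for (std::size_t j = 0; j < w; ++j) {
            if (j > 0) shingle += ' ';
            shingle += tokens[i + j];
        }
        out.insert(fnv1a(shingle));
    }
    return out;
}

// "min" reduction: keep the k smallest hashes.
std::vector<std::uint64_t> reduce_min(const std::set<std::uint64_t>& hashes,
                                      std::size_t k) {
    std::vector<std::uint64_t> sketch;
    for (std::uint64_t h : hashes) {
        if (sketch.size() == k) break;
        sketch.push_back(h);
    }
    return sketch;
}

// "mod" reduction: keep only the hashes divisible by m.
std::vector<std::uint64_t> reduce_mod(const std::set<std::uint64_t>& hashes,
                                      std::uint64_t m) {
    std::vector<std::uint64_t> sketch;
    if (m == 0) return sketch;
    for (std::uint64_t h : hashes) {
        if (h % m == 0) sketch.push_back(h);
    }
    return sketch;
}
```

A typical call sequence would be `tokenize` on the cleaned text, then `shingle_hashes` with a small window such as 3 or 4 words, then one of the two reductions. Both reductions return their hashes in ascending order, which makes the resulting sketches convenient to compare with a single merge pass.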
For more information, the approach used in Irudiko is roughly described in a presentation of mine, available in PDF format here. It is mainly based on a number of research articles, which are cited in my final MSc dissertation (available in Italian only, sorry).
- How it works
- Download
- About the author (Angelo Romano):