SHANTI - Sciences, Humanities, and Arts Network of Technological Initiatives

PAIR / Profile

Home Page: http://code.google.com/p/text-pair/

Download Page: http://code.google.com/p/text-pair/downloads/list

PAIR (Pairwise Alignment for Intertextual Relations) is a simple implementation of a sequence alignment algorithm for humanities text analysis. It is designed to identify “similar passages” in large collections of texts, which may include direct quotations, plagiarism and other forms of borrowing, commonplace expressions, and the like. We are developing two currently distinct streams:

PhiloLine (for PhiloLogic alignment) is the experimental model, written largely by Mark (in very old-fashioned Perl), designed to perform all-against-all comparisons of documents loaded into a PhiloLogic database. An entire corpus is indexed and compared against itself, or against another database, to find text reuse.

Text::Pair is a generalized Perl module version of PAIR, without specific bindings to PhiloLogic, supporting one-against-many comparisons. A corpus is indexed, and incoming texts are compared against the entire corpus to find text reuse. We have prepared a simple demonstration of Text::Pair that lets you submit passages to be aligned against 12,000 documents from Project Gutenberg.
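
To make the one-against-many idea concrete, here is a minimal Python sketch of indexing a corpus and then checking an incoming passage against it. It is an illustration only: it is not PAIR's alignment algorithm and not the Text::Pair API. It swaps in plain word n-gram ("shingle") matching for sequence alignment, and every name in it (shingles, build_index, find_reuse, the n and min_shared parameters, the toy corpus) is a hypothetical chosen for this example.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the lowercase word n-grams (shingles) of a text, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(corpus, n=3):
    """Index the corpus once: map each shingle to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for sh in shingles(text, n):
            index[sh].add(doc_id)
    return index


def find_reuse(index, query, n=3, min_shared=2):
    """Compare one incoming text against the whole index.

    Counts distinct shared shingles per document and keeps documents
    that share at least min_shared of them with the query.
    """
    counts = defaultdict(int)
    for sh in set(shingles(query, n)):
        for doc_id in index.get(sh, ()):
            counts[doc_id] += 1
    return {doc_id: c for doc_id, c in counts.items() if c >= min_shared}


# Toy corpus (hypothetical data, punctuation already removed).
corpus = {
    "doc_a": "it was the best of times it was the worst of times",
    "doc_b": "call me ishmael some years ago never mind how long",
}
index = build_index(corpus)
matches = find_reuse(index, "he wrote that it was the best of times indeed")
print(matches)
```

In this toy run the query shares four distinct three-word shingles with doc_a and none with doc_b. A real system would also normalize punctuation and spelling and tolerate small gaps or variations between passages, which is the part a sequence alignment approach is meant to handle.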

Tool Type

Data Mining

Interface Languages

English