text_indexing 0.0.1
Dart library for creating an inverted index on a collection of text documents.
text_indexing #
THIS PACKAGE IS PRE-RELEASE AND SUBJECT TO DAILY BREAKING CHANGES.
Objective #
This package provides an interface and implementation classes that build and maintain two structures: a term dictionary, which holds the vocabulary of terms and the frequency of occurrence of each term in the corpus, and a postings map, which holds references to the documents that contain each term. In this implementation the postings also include the positions of each term in each document, so that search algorithms can derive relevance on a per-document basis.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents. In this implementation, Dictionary is a type definition for a hashmap with the vocabulary as key and the document frequency as the values.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the document to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
- vocabulary - the collection of terms/words indexed from the corpus.
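To make these definitions concrete, the sketch below builds a dictionary and positional postings for a two-document corpus in plain Dart. It does not use this package's classes: the tokenize, buildPostings and buildDictionary functions, the map shapes and the sample documents are all assumptions made for the example, and the tokenizer only lower-cases and splits on non-alphanumeric characters (no stemming or lemmatization).

```dart
// Illustrative only: plain Dart, not the text_indexing API.

/// Hypothetical tokenizer: lower-cases [text] and splits on anything that is
/// not a letter or digit. The position of a term is its index in the result.
List<String> tokenize(String text) => text
    .toLowerCase()
    .split(RegExp(r'[^a-z0-9]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Builds postings (term -> docId -> positions) for a corpus of docId -> text.
Map<String, Map<String, List<int>>> buildPostings(Map<String, String> corpus) {
  final postings = <String, Map<String, List<int>>>{};
  corpus.forEach((docId, text) {
    final terms = tokenize(text);
    for (var position = 0; position < terms.length; position++) {
      postings
          .putIfAbsent(terms[position], () => {})
          .putIfAbsent(docId, () => [])
          .add(position);
    }
  });
  return postings;
}

/// Derives the dictionary (term -> document frequency) from the postings.
Map<String, int> buildDictionary(
  Map<String, Map<String, List<int>>> postings,
) =>
    postings.map((term, docs) => MapEntry(term, docs.length));

void main() {
  final corpus = {
    'doc1': 'The quick brown fox',
    'doc2': 'The lazy dog and the fox',
  };
  final postings = buildPostings(corpus);
  final dictionary = buildDictionary(postings);
  print(dictionary['the']); // 2
  print(postings['the']); // {doc1: [0], doc2: [0, 4]}
}
```

Storing positions per document, rather than only a document reference, is what makes the index positional: a search algorithm can later check whether query terms occur close together in a document.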
Interface #
The text indexing classes (indexers) in this library inherit from TextIndexer, an interface intended for information retrieval software applications. The TextIndexer interface is consistent with information retrieval theory.
The inverted index comprises two artifacts:
- a Dictionary, a hashmap of DictionaryEntrys with the vocabulary as key and the document frequency as the values; and
- a Postings, a hashmap of PostingsEntrys with the vocabulary as key and the postings lists for the linked documents as values.
The Dictionary and Postings can be asynchronous data sources or in-memory hashmaps. The TextIndexer reads from and writes to these artifacts using the asynchronous loadTerms, updateDictionary, loadTermPostings and upsertTermPostings methods.
The index method indexes text from a document and returns a list of PostingsList. Before returning, index calls emit with that list, and emit adds it as an event to postingsStream.
Listen to postingsStream to update your dictionary and postings map.
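The listening pattern can be sketched with dart:async alone. The stand-in below is not the package's TextIndexer: SketchIndexer, TermPositions and the signature of index are hypothetical, chosen only to show a listener keeping a dictionary and postings map up to date as documents are indexed.

```dart
import 'dart:async';

// Illustrative only: a stand-in for an indexer that exposes a postings stream.
// The class names and record shape are assumptions, not the package's types.

/// Hypothetical postings list: positions of one term in one document.
class TermPositions {
  const TermPositions(this.term, this.docId, this.positions);
  final String term;
  final String docId;
  final List<int> positions;
}

class SketchIndexer {
  final _controller = StreamController<List<TermPositions>>.broadcast();

  /// Emits one event per indexed document.
  Stream<List<TermPositions>> get postingsStream => _controller.stream;

  /// Splits [text] on whitespace, records term positions and emits them.
  Future<List<TermPositions>> index(String docId, String text) async {
    final positions = <String, List<int>>{};
    final terms = text.toLowerCase().split(RegExp(r'\s+'));
    for (var i = 0; i < terms.length; i++) {
      if (terms[i].isEmpty) continue;
      positions.putIfAbsent(terms[i], () => []).add(i);
    }
    final event = [
      for (final entry in positions.entries)
        TermPositions(entry.key, docId, entry.value),
    ];
    _controller.add(event);
    return event;
  }

  Future<void> close() => _controller.close();
}

Future<void> main() async {
  final indexer = SketchIndexer();
  final dictionary = <String, int>{}; // term -> document frequency
  final postings = <String, Map<String, List<int>>>{}; // term -> docId -> positions

  final subscription = indexer.postingsStream.listen((event) {
    for (final item in event) {
      final docs = postings.putIfAbsent(item.term, () => {});
      docs[item.docId] = item.positions;
      dictionary[item.term] = docs.length;
    }
  });

  await indexer.index('doc1', 'the quick brown fox');
  await indexer.index('doc2', 'the lazy dog');
  await indexer.close(); // completes once the listener has received the done event
  await subscription.cancel();

  print(dictionary);
  print(postings['the']);
}
```

A broadcast controller is used here so that more than one listener could subscribe; whether the package's stream is broadcast or single-subscription is not stated in this document.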
Implementing classes override the following fields:
- Tokenizer is the Tokenizer instance used by the indexer to parse documents to tokens; and
- postingsStream emits a list of PostingsList instances whenever a document is indexed.
Implementing classes override the following asynchronous methods:
- index indexes text from a document, returning a list of PostingsList and adding it to the postingsStream by calling emit;
- emit is called by index, and adds an event to the postingsStream after updating the Dictionary and Postings;
- loadTerms returns a Dictionary for a vocabulary from a Dictionary;
- updateDictionary passes new or updated DictionaryEntry instances for persisting to a Dictionary;
- loadTermPostings returns PostingsEntry entities for a vocabulary from Postings; and
- upsertTermPostings passes new or updated PostingsEntry instances for upserting to Postings.
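The load-then-upsert cycle described by these methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's signatures: FakePostingsStore, its load and upsert methods and mergeDocument are hypothetical names, and an in-memory map stands in for an asynchronous data source.

```dart
// Illustrative only: the names and shapes below are assumptions.

/// Hypothetical asynchronous data source for postings:
/// term -> docId -> positions.
class FakePostingsStore {
  final _rows = <String, Map<String, List<int>>>{};

  /// Returns the stored postings for the requested [terms] only.
  Future<Map<String, Map<String, List<int>>>> load(
    Iterable<String> terms,
  ) async {
    return {
      for (final term in terms)
        if (_rows.containsKey(term)) term: Map.of(_rows[term]!),
    };
  }

  /// Inserts new entries and overwrites existing ones.
  Future<void> upsert(Map<String, Map<String, List<int>>> entries) async {
    _rows.addAll(entries);
  }
}

/// Merges the postings of one freshly indexed document into the store:
/// load the affected terms, merge in memory, then write the result back.
Future<void> mergeDocument(
  FakePostingsStore store,
  String docId,
  Map<String, List<int>> termPositions,
) async {
  final existing = await store.load(termPositions.keys);
  termPositions.forEach((term, positions) {
    existing.putIfAbsent(term, () => {})[docId] = positions;
  });
  await store.upsert(existing);
}

Future<void> main() async {
  final store = FakePostingsStore();
  await mergeDocument(store, 'doc1', {'fox': [3], 'the': [0]});
  await mergeDocument(store, 'doc2', {'the': [0, 4]});
  print(await store.load(['the'])); // {the: {doc1: [0], doc2: [0, 4]}}
}
```

Loading only the terms that occur in the new document keeps each round trip proportional to the document's vocabulary rather than to the size of the whole index.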
Install #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_indexing: ^0.0.1
In your code file add the following import:
import 'package:text_indexing/text_indexing.dart';
Usage #
Examples are provided for the InMemoryIndexer and PersistedIndexer, two implementations of the TextIndexer interface that inherit from TextIndexerBase.
TextIndexerBase Class #
The TextIndexerBase is an abstract base class that implements the TextIndexer.index and TextIndexer.emit methods.
Subclasses of TextIndexerBase may override the TextIndexerBase.emit method to perform additional actions whenever a document is indexed.
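The override pattern looks roughly like the sketch below. The classes are stand-ins written for this example (SketchIndexerBase and LoggingIndexer are hypothetical, and the real emit signature may differ); the point is only that the subclass calls the base emit and then adds its own action.

```dart
// Illustrative only: stand-in classes, not the package's TextIndexerBase.

abstract class SketchIndexerBase {
  /// Indexes [text] and hands the resulting terms to [emit].
  Future<List<String>> index(String docId, String text) async {
    final terms = text.toLowerCase().split(' ');
    await emit(docId, terms);
    return terms;
  }

  /// Default behaviour: no additional action.
  Future<void> emit(String docId, List<String> terms) async {}
}

class LoggingIndexer extends SketchIndexerBase {
  final log = <String>[];

  @override
  Future<void> emit(String docId, List<String> terms) async {
    await super.emit(docId, terms); // keep the base behaviour
    log.add('$docId: ${terms.length} terms'); // the additional action
  }
}

Future<void> main() async {
  final indexer = LoggingIndexer();
  await indexer.index('doc1', 'the quick brown fox');
  print(indexer.log); // [doc1: 4 terms]
}
```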
InMemoryIndexer Class #
The InMemoryIndexer is a subclass of TextIndexerBase that builds and maintains in-memory Dictionary and PostingMap hashmaps. These hashmaps are updated whenever InMemoryIndexer.emit is called at the end of the InMemoryIndexer.index method, so awaiting a call to InMemoryIndexer.index will provide access to the updated InMemoryIndexer.dictionary and InMemoryIndexer.postings collections.
The InMemoryIndexer is suitable for indexing a smaller corpus. An example of the use of InMemoryIndexer is included in the examples.
PersistedIndexer Class #
The PersistedIndexer is a subclass of TextIndexerBase that asynchronously reads and writes dictionary and postings data sources. These data sources are asynchronously updated whenever PersistedIndexer.emit is called by the PersistedIndexer.index method.
The PersistedIndexer is suitable for indexing and searching a large corpus but may incur some latency penalty and processing overhead. An example of the use of PersistedIndexer is included in the package examples.
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.