text_indexing 0.12.0
text_indexing: ^0.12.0 copied to clipboard
Dart library for creating an inverted index on a collection of text documents.
text_indexing #
Dart library for creating an inverted index on a collection of text documents.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Skip to section:
Overview #
This library provides an interface and implementation classes that build and maintain an (inverted, positional, zoned) index for a collection of documents or corpus (see definitions).

The TextIndexer constructs two artifacts:
- a
dictionarythat holds thevocabularyoftermsand the frequency of occurrence for eachtermin thecorpus; and - a
postingsmap that holds a list of references to thedocumentsfor eachterm(thepostings list).
In this implementation, our postings list is a hashmap of the document id (docId) to maps that point to positions of the term in the document's zones. This allows query algorithms to score and rank search results based on the position(s) of a term in document fields, applying different weights to the zones.

Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your flutter project, add the text_indexing dependency.
dependencies:
text_indexing: <latest version>
In your code file add the text_indexing import.
import 'package:text_indexing/text_indexing.dart';
For small collections, instantiate a TextIndexer.inMemory, (optionally passing empty Dictionary and Postings hashmaps), then iterate over a collection of documents to add them to the index.
// - initialize a in-memory [TextIndexer] with defaults for all parameters
final indexer =TextIndexer.inMemory();
// - iterate through the sample data
await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
// - index each document
await indexer.index(doc.key, doc.value);
});
The examples demonstrate the use of the TextIndexer.inMemory and TextIndexer.async factories.
API #
The API exposes the TextIndexer interface that builds and maintain an index for a collection of documents.
Three implementations of the TextIndexer interface are provided:
- the TextIndexerBase abstract base class implements the
TextIndexer.index,TextIndexer.indexJsonandTextIndexer.emitmethods; - the InMemoryIndexer class is for fast indexing of a smaller corpus using in-memory dictionary and postings hashmaps; and
- the AsyncIndexer class, aimed at working with a larger corpus and asynchronous dictionaries and postings.
To maximise performance of the indexers the API manipulates nested hashmaps of DART core types int and String rather than defining strongly typed object models. To improve code legibility and maintainability the API makes use of type aliases throughout.
Type Aliases #
Dictionaryis an alias forMap<Term, Ft>, a hashmap ofTermtoFt.DictionaryEntryis an alias forMapEntry<Term, Ft>, an entry in aDictionary.DocIdis an alias forString, used whenever a document id is referenced.DocumentPostingsis an alias forMap<DocId, ZonePostings>, a hashmap of document ids toZonePostings.DocumentPostingsEntryis an alias forMapEntry<DocId, ZonePostings>, an entry in aDocumentPostingshashmap.ZonePostingsis an alias forMap<Zone, TermPositions>, a hashmap ofZones toTermPositionsin theZone.FieldPostingsEntryis an alias forMapEntry<Zone, TermPositions>, an entry in aZonePostingshashmap.Ftis an lias forintand denotes the frequency of aTermin an index or indexed object (the term frequency).FtdPostingsis an alias for forMap<String, Map<String, int>>, a hashmap of vocabulary to hashmaps of document id to term frequency in the document.KGramIndexis an alias for for forMap<String, Set<String>>;KGramIndexLoaderis a callback function that aynchronously retrieves a [KGramIndex] subset for a collection of terms from a [KGramIndex] data source, usually persisted storag;KGramIndexUpdateris a callback function that aynchronously updates aKGramIndexrepository;IdFtis an alias fordouble, where it represents the inverse document frequency of a term, defined as idft = log (N / dft), where N is the total number of terms in the index and dft is the document frequency of the term (number of documents that contain the term).IdFtIndexis an alias forMap<String, double>, a hashmap of the vocabulary to the inverse document frequency (Idft) of the term.JSONis an alias forMap<String, dynamic>, a hashmap known as"Java Script Object Notation" (JSON), a common format for persisting data.JsonCollectionis an alias forMap<String, Map<String, dynamic>>, a hashmap ofDocIdtoJSONdocuments.Ptis an alias forint, used to denote the position of aTerminSourceTextindexed object (the term position).TermPositionsis an alias forList<Pt>, an orderedSetof unique zero-basedTermpositions inSourceText, sorted in ascending order.ZoneWeightMapis a map of the zone names to their relative weighting in search results, used by scoring and ranking algorithms in information retrieval systems.
InvertedIndex Interface #
The InvertedIndex is an interface for an inverted, positional zoned index on a collection of documents.
The InvertedIndex exposes the following fields:
ITextAnalyzer analyzeris the text analyser that extracts tokens from text;ZoneWeightMap zonesmaps collection zone/field names to their relative weight in the index;int phraseLengthis the maximum length of phrases in the index vocabulary.phraseLengthmust be greater than 0 and increases the size of the index by a factor equal tophraseLength, so its value must be kept as small as is consistent with efficient and accurate retrieval;int kis the length of k-gram entries in the k-gram index; andFuture<int> vocabularyLengthasynchronoulsy returns the number of entries in the index [Dictionary].
The InvertedIndex exposes the following methods:
getDictionaryAsynchronously retrieves aDictionaryfor a collection ofTerms from aDictionaryrepository;upsertDictionaryinserts entries into aDictionaryrepository, overwriting any existing entries;getKGramIndexAsynchronously retrieves aKGramIndexfor a collection ofKGrams from aKGramIndexrepository;upsertKGramIndexinserts entries into aKGramIndexrepository, overwriting any existing entries;getPostingsasynchronously retrievesPostingsfor a collection ofTerms from aPostingsrepository;upsertPostingsinserts entries into aPostingsrepository, overwriting any existing entries;
getTfIndexreturns hashmap ofTermtoFtfor a collection ofTerms, whereFtis the number of times each of the terms occurs in thecorpus;getFtdPostingsreturn aFtdPostingsfor a collection ofTerms from thePostings, optionally filtered by minimum term frequency; andgetIdFtIndexreturns aIdFtIndexfor a collection ofTerms from theDictionary.
The InvertedIndexMixin implements the InvertedIndex.getTfIndex, InvertedIndex.getFtdPostings and InvertedIndex.getIdFtIndex methods.
TextIndexer Interface #
The text indexing classes (indexers) in this library implement TextIndexer, an interface intended for information retrieval software applications. The design of the TextIndexer interface is consistent with information retrieval theory and is intended to construct and/or maintain two artifacts:
- a hashmap with the vocabulary as key and the document frequency as the values (the
dictionary); and - another hashmap with the vocabulary as key and the postings lists for the linked
documentsas values (thepostings).
The dictionary and postings can be asynchronous data sources or in-memory hashmaps. The TextIndexer reads and writes to/from these artifacts using the TextIndexer.index.
Text or documents can be indexed by calling the following methods:
TextIndexer.indexTextindexes text from a text document;TextIndexer.indexJsonindexes the fields in aJSONdocument; andTextIndexer.indexCollectionindexes the fields of all the documents in s JSON document collection.
Alternatively, pass a stream of documents TextIndexer.documentStream or TextIndexer.collectionStream for indexing whenever either of these streams emit (a) document(s).
Implementing classes override the following asynchronous methods for interacting with the TextIndexer.index:
TextIndexer.indexTextindexes a text document;TextIndexer.indexJsonindexes the fields in a JSON document;TextIndexer.indexCollectionindexes the fields of all the documents in a JSON document collection; andTextIndexer.updateIndexesmethod updates theTextIndexer.indexwith new postings and k-gram entries.
Use one of three factory constructors to instantiate a TextIndexer:
Alternatively, roll your own custom TextIndexer by extending TextIndexerBase.
TextIndexer.index
The TextIndexer.index factory constructor returns a TextIndexer instance, using a InvertedIndex instance passed in as a parameter at instantiation.
TextIndexer.inMemory
The TextIndexer.inMemory factory constructor returns a TextIndexer instance with in-memory Dictionary and Postings maps:
- pass a
analyzertext analyser that extracts tokens from text; - pass an in-memory
dictionaryinstance, otherwise an emptyDictionarywill be initialized; - pass an in-memorykGramIndexinstance, otherwise an emptyKGramIndexwill be initialized;and - pass an in-memory
postingsinstance, otherwise an emptyPostingswill be initialized.
The InMemoryIndexer is suitable for indexing a smaller corpus. The InMemoryIndexer may have latency and processing overhead for large indexes or queries with more than a few terms. Consider running InMemoryIndexer in an isolate to avoid slowing down the main thread.
An example of the use of the TextIndexer.inMemory factory is included in the examples.
TextIndexer.async
The TextIndexer.async factory constructor returns a TextIndexer instance that uses
asynchronous callback functions to access Dictionary and Postings
repositories:
- pass a
analyzertext analyser that extracts tokens from text; dictionaryLoadersynchronously retrieves aDictionaryfor a vocabulary from a data source;dictionaryLengthLoaderasynchronously retrieves the number of terms in the vocabulary (N);dictionaryUpdateris callback that passes aDictionarysubset for persisting to a datastore;kGramIndexLoaderasynchronously retrieves aKGramIndexfor a vocabulary from a data source;kGramIndexUpdateris callback that passes aKGramIndexsubset for persisting to a datastore;postingsLoaderasynchronously retrieves aPostingsfor a vocabulary from a data source; andpostingsUpdaterpasses aPostingssubset for persisting to a datastore.
The AsyncIndexer is suitable for indexing a large corpus but may have latency and processing overhead. Consider running AsyncIndexer in an isolate to avoid slowing down the main thread.
An example of the use of the TextIndexer.async factory is included in the examples.
TextIndexerBase Class #
The TextIndexerBase is an abstract base class that implements the TextIndexer.indexText, TextIndexer.indexJson, TextIndexer.indexCollection and TextIndexer.emit methods and the TextIndexer.postingsStream field. TextIndexerBase also initializes listeners to TextIndexer.documentStream and TextIndexer.collectionStream at instantiation.
The TextIndexerBase.index is updated whenever TextIndexerBase.updateIndexes is called.
Subclasses of TextIndexerBase must implement:
TextIndexer.index; andTextIndexerBase.controller, aBehaviorSubject<Postings>that controls theTextIndex.postingsStream.
InMemoryIndex Class #
The InMemoryIndex is a InvertedIndex interface implementation with in-memory Dictionary and Postings hashmaps:
InMemoryIndex.analyzeris theITextAnalyzerused to tokenize text for theInMemoryIndex;InMemoryIndex.dictionaryis the in-memory term dictionary for the indexer. Pass adictionaryinstance at instantiation, otherwise an emptyDictionarywill be initialized; andInMemoryIndex.postingsis the in-memory postings hashmap for the indexer. Pass apostingsinstance at instantiation, otherwise an emptyPostingswill be initialized.
The InMemoryIndex mixes in InMemoryIndexMixin that can be used in custom indexer classes that use in-memory hashmaps for Postings and Dictionary. InMemoryIndexMixin implements:
vocabularyLengthreturns the number of entries indictionary.getDictionaryretrieves aDictionaryfor a collection ofTerms from the in-memorydictionaryhashmap;upsertDictionaryinserts entries into the in-memorydictionaryhashmap, overwriting any existing entries;getPostingsretrievesPostingsfor a collection ofTerms from the in-memorypostingshashmap;upsertPostingsinserts entries into the in-memorypostingshashmap, overwriting any existing entries;
AsyncCallbackIndex Class #
The AsyncCallbackIndex is a InvertedIndex implementation class
that uses asynchronous callbacks to perform read and write operations on Dictionary and Postings repositories:
AsyncCallbackIndex.analyzeris theITextAnalyzerused to tokenize text for theAsyncCallbackIndex;AsyncCallbackIndex.dictionaryLoadersynchronously retrieves aDictionaryfor a vocabulary from a data source;AsyncCallbackIndex.dictionaryUpdateris callback that passes aDictionarysubset for persisting toDictionaryrepository;AsyncCallbackIndex.postingsLoaderasynchronously retrieves aPostingsfor a vocabulary from a data source; andAsyncCallbackIndex.postingsUpdaterpasses aPostingssubset for persisting to aPostingsrepository.
The AsyncCallbackIndex mixes in AsyncCallbackIndexMixin that can be used in custom indexer classes that use asynchronous repositiories for Postings and Dictionary. The AsyncCallbackIndexMixin exposes five callback function fields that must be overriden:
dictionaryLengthLoaderasynchronously retrieves the number of terms in the vocabulary (N);dictionaryLoaderasynchronously retrieves aDictionaryfor a vocabulary from a data source;dictionaryUpdateris callback that passes aDictionarysubset for persisting to a datastore;postingsLoaderasynchronously retrieves aPostingsfor a vocabulary from a data source; andpostingsUpdaterpasses aPostingssubset for persisting to a datastore.
The AsyncCallbackIndexMixin implements the following methods for operations on asynchronous Dictionary and Postings repositories:
vocabularyLengthcallsdictionaryLengthLoader;getDictionarycallsdictionaryLoader;upsertDictionarycallsdictionaryUpdater;getPostingscallspostingsLoader; andupsertPostingscallspostingsUpdater.
Definitions #
The following definitions are used throughout the documentation:
corpus- the collection ofdocumentsfor which anindexis maintained.dictionary- is a hash ofterms(vocabulary) to the frequency of occurence in thecorpusdocuments.document- a record in thecorpus, that has a unique identifier (docId) in thecorpus's primary key and that contains one or more text fields that are indexed.index- an inverted index used to look updocumentreferences from thecorpusagainst avocabularyofterms. The implementation in this package builds and maintains a positional inverted index, that also includes the positions of the indexedtermin eachdocument's zones.document frequency (dFt)is number of documents in thecorpusthat contain a term.index-elimination- selecting a subset of the entries in an index where thetermis in the collection oftermsin a search phrase.inverse document frequencyoriDftis equal to log (N /dft), where N is the total number of terms in the index. TheIdFtof a rare term is high, whereas the [IdFt] of a frequent term is likely to be low.JSONis an acronym for"Java Script Object Notation", a common format for persisting data.lemmatizer- lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).postings- a separate index that records whichdocumentsthevocabularyoccurs in. In this implementation we also record the positions of eachtermin thetextto create a positional invertedindex.postings list- a record of the positions of atermin adocument. A position of atermrefers to the index of thetermin an array that contains all thetermsin thetext.term- a word or phrase that is indexed from thecorpus. Thetermmay differ from the actual word used in the corpus depending on thetokenizerused.term filter- filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking astemmerand / orlemmatizer.stemmer- stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form (from Wikipedia).stopwords- common words in a language that are excluded from indexing.term frequency (Ft)is the frequency of atermin an index or indexed object.term positionis the zero-based index of atermin an ordered array oftermstokenized from thecorpus.text- the indexable content of adocument.token- representation of atermin a text source returned by atokenizer. The token may include information about thetermsuch as its position(s) (term position) in the text or frequency of occurrence (term frequency).tokenizer- a function that returns a collection oftokens fromtext, after applying a character filter,termfilter, stemmer and / or lemmatizer.vocabulary- the collection oftermsindexed from thecorpus.
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please fill an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.