text_indexing 0.13.0
Dart library for creating an inverted index on a collection of text documents.
text_indexing #
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This library provides an interface and implementation classes that build and maintain an (inverted, positional, zoned) index for a collection of documents or corpus (see definitions).

The indexer constructs three inverted index artifacts:
- the dictionary that holds the vocabulary of terms and the frequency of occurrence for each term in the corpus;
- the k-gram index that maps k-grams to terms in the dictionary; and
- the postings index that holds a list of references to the documents for each term (the postings list).
In this implementation, a postings list is a hashmap of the document id (docId) to maps that point to positions of the term in the document's zones (fields). This allows query algorithms to score and rank search results based on the position(s) of a term in document fields, applying different weights to the zones.
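The nested structure described above can be sketched with plain Dart maps. This is a minimal illustration, not the package's own code: the aliases `ZonePositions`, `PostingsList` and `PostingsMap`, the helpers `addPosting` and `scoreTerm`, and the weighting scheme are all hypothetical names chosen for this example (non-function type aliases assume Dart 2.13 or later).

```dart
// Hypothetical aliases for illustration; the package's own names may differ.
typedef ZonePositions = Map<String, List<int>>; // zone name -> term positions
typedef PostingsList = Map<String, ZonePositions>; // docId -> zone positions
typedef PostingsMap = Map<String, PostingsList>; // term -> postings list

/// Records that [term] occurs at [position] in [zone] of document [docId].
void addPosting(PostingsMap postings, String term, String docId, String zone,
    int position) {
  postings
      .putIfAbsent(term, () => {})
      .putIfAbsent(docId, () => {})
      .putIfAbsent(zone, () => [])
      .add(position);
}

/// Scores each document for [term] as the weighted count of its positions.
/// Zones without an entry in [zoneWeights] default to a weight of 1.0.
Map<String, double> scoreTerm(
    PostingsMap postings, String term, Map<String, double> zoneWeights) {
  final scores = <String, double>{};
  final postingsList = postings[term] ?? const <String, ZonePositions>{};
  postingsList.forEach((docId, zones) {
    var score = 0.0;
    zones.forEach((zone, positions) {
      score += (zoneWeights[zone] ?? 1.0) * positions.length;
    });
    scores[docId] = score;
  });
  return scores;
}

void main() {
  final postings = <String, PostingsList>{};
  addPosting(postings, 'castle', 'doc1', 'title', 0);
  addPosting(postings, 'castle', 'doc1', 'body', 4);
  addPosting(postings, 'castle', 'doc2', 'body', 7);
  // A hit in the title counts double in this toy weighting.
  print(scoreTerm(postings, 'castle', {'title': 2.0, 'body': 1.0}));
}
```

Here doc1 outranks doc2 because one of its hits is in the more heavily weighted title zone.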

Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Flutter project, add the text_indexing dependency.
dependencies:
  text_indexing: <latest version>
In your code file add the text_indexing import.
import 'package:text_indexing/text_indexing.dart';
For small collections, instantiate a TextIndexer.inMemory (optionally passing empty Dictionary and Postings hashmaps), then iterate over a collection of documents to add them to the index.
// - initialize an in-memory [TextIndexer] with defaults for all parameters
final indexer = TextIndexer.inMemory();
// - iterate through the sample data (here `documents` is a Map<String, String>)
await Future.forEach(documents.entries, (MapEntry<String, String> doc) async {
  // - index each document
  await indexer.index(doc.key, doc.value);
});
The examples demonstrate the use of the TextIndexer.inMemory and TextIndexer.async factories.
API #
The API exposes the [TextIndexer] interface that builds and maintains an [InvertedIndex] for a collection of documents.
To maximise the performance of the indexers, the API performs lookups on nested hashmaps of Dart core types rather than defining strongly typed object models. To improve code legibility, the API makes use of type aliases throughout.
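As a rough illustration of this design choice, the sketch below aliases core types and operates on the underlying hashmap directly. The alias names `Term`, `Frequency` and `DictionaryMap` and the function `countTerms` are hypothetical and are not taken from the package.

```dart
// Hypothetical aliases, named for illustration only.
typedef Term = String;
typedef Frequency = int;
typedef DictionaryMap = Map<Term, Frequency>;

/// Adds one occurrence to [dictionary] for every term in [terms].
void countTerms(DictionaryMap dictionary, Iterable<Term> terms) {
  for (final term in terms) {
    dictionary.update(term, (count) => count + 1, ifAbsent: () => 1);
  }
}

void main() {
  final DictionaryMap dictionary = {};
  countTerms(dictionary, ['castle', 'moat', 'castle']);
  print(dictionary); // {castle: 2, moat: 1}
}
```

Because the aliases are only alternative names for `String`, `int` and `Map`, values pass to and from core library functions without conversion, while signatures still read in the vocabulary of the domain.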
InvertedIndex #
The [InvertedIndexMixin] implements the [InvertedIndex.getTfIndex], [InvertedIndex.getFtdPostings] and [InvertedIndex.getIdFtIndex] methods, while three implementation classes are provided:
- the [InMemoryIndex] class is intended for fast indexing of a smaller corpus using in-memory dictionary, k-gram and postings hashmaps;
- the [AsyncCallbackIndex] is intended for working with a larger corpus and an asynchronous index repository in persisted storage. It uses asynchronous callbacks to perform read and write operations on dictionary, k-gram and postings repositories; and
- the [CachedIndex] is intended for working with a larger corpus with an asynchronous index repository in persisted storage. It uses asynchronous callbacks to perform read and write operations on [Dictionary], [KGramIndex] and [Postings] repositories, but keeps a cache of the most popular terms and k-grams in memory for faster indexing and searching.
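The idea of reading through an asynchronous callback while caching popular terms can be sketched as follows. This is not the package's [CachedIndex]: the class `CachedDictionary`, the callback type `DictionaryLoader`, the `cacheLimit` parameter and the eviction rule (keep the terms with the highest frequency) are all assumptions made for this example.

```dart
// Hypothetical callback type: reads term frequencies from persisted storage.
typedef DictionaryLoader = Future<Map<String, int>> Function(
    Iterable<String> terms);

/// A read-through cache in front of an asynchronous dictionary repository.
class CachedDictionary {
  CachedDictionary(this.loader, {this.cacheLimit = 1000});

  final DictionaryLoader loader;
  final int cacheLimit;
  final Map<String, int> _cache = {};

  /// Returns the frequencies of [terms], calling [loader] only for misses.
  Future<Map<String, int>> getTerms(Iterable<String> terms) async {
    final result = <String, int>{};
    final missing = <String>{};
    for (final term in terms) {
      final cached = _cache[term];
      if (cached != null) {
        result[term] = cached;
      } else {
        missing.add(term);
      }
    }
    if (missing.isNotEmpty) {
      final loaded = await loader(missing);
      result.addAll(loaded);
      _cache.addAll(loaded);
      _evict();
    }
    return result;
  }

  /// Keeps only the [cacheLimit] most frequent terms in the cache.
  void _evict() {
    if (_cache.length <= cacheLimit) return;
    final entries = _cache.entries.toList()
      ..sort((a, b) => b.value.compareTo(a.value));
    _cache
      ..clear()
      ..addEntries(entries.take(cacheLimit));
  }
}

Future<void> main() async {
  final store = {'castle': 2, 'moat': 1}; // stands in for persisted storage
  var reads = 0;
  final dictionary = CachedDictionary((terms) async {
    reads++;
    return {
      for (final t in terms)
        if (store.containsKey(t)) t: store[t]!
    };
  });
  await dictionary.getTerms(['castle']);
  await dictionary.getTerms(['castle']); // served from the cache
  print(reads); // 1
}
```

The same pattern would apply to k-gram and postings lookups; only the value type changes.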
TextIndexer #
The [TextIndexer] is an interface for classes that construct and maintain a dictionary, inverted, positional, zoned index and k-gram index for a corpus.
Text or documents can be indexed by calling the following methods:
- [TextIndexer.indexText] indexes text from a text document;
- [TextIndexer.indexJson] indexes the fields in a JSON document; and
- [TextIndexer.indexCollection] indexes the fields of all the documents in a JSON document collection.
Use the factory constructor to instantiate a [TextIndexer] with the index of your choice or extend [TextIndexerBase].
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation for tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
- document frequency (dFt) - the number of documents in the corpus that contain a term.
- index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
- inverse document frequency (iDft) - equal to log (N / dFt), where N is the total number of terms in the index. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
- JSON - an acronym for "JavaScript Object Notation", a common format for persisting data.
- k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- term frequency (Ft) - the frequency of a term in an index or indexed object.
- term position - the zero-based index of a term in an ordered array of terms tokenized from the corpus.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
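The k-gram definition can be made concrete with a short sketch that reproduces the "castle" example and builds a small k-gram index. The functions `kGrams` and `buildKGramIndex` are hypothetical helpers written for this illustration and are not part of the package API. The sketch assumes every term is padded with "$" at both ends.

```dart
/// Returns the set of k-grams for [term], padding it with '$' at both ends.
Set<String> kGrams(String term, {int k = 3}) {
  final padded = '\$$term\$';
  final grams = <String>{};
  for (var i = 0; i + k <= padded.length; i++) {
    grams.add(padded.substring(i, i + k));
  }
  return grams;
}

/// Builds a k-gram index that maps each k-gram to the terms that contain it.
Map<String, Set<String>> buildKGramIndex(Iterable<String> terms, {int k = 3}) {
  final index = <String, Set<String>>{};
  for (final term in terms) {
    for (final gram in kGrams(term, k: k)) {
      index.putIfAbsent(gram, () => {}).add(term);
    }
  }
  return index;
}

void main() {
  print(kGrams('castle')); // {$ca, cas, ast, stl, tle, le$}
  final index = buildKGramIndex(['castle', 'cast']);
  print(index['cas']); // {castle, cast}
}
```

Looking up the k-grams of a misspelt or partial query term in such an index yields candidate terms from the dictionary that share character sequences with it.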
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.