text_analysis 0.7.0
text_analysis #
Text analyzer that extracts tokens from text for use in full-text search queries and indexes.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Overview #
This library tokenizes text in preparation for constructing a dictionary from a corpus of documents in an information retrieval system.
The tokenization process comprises the following steps:
- a term splitter splits text to a list of terms at appropriate places like white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and/or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The term filter can also filter out stopwords; and
- the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
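The steps above can be sketched as a simple pipeline. This is a conceptual illustration only: the splitter, character filter and stopword set below are simplified stand-ins, not the package's implementation.

```dart
/// Conceptual sketch of the tokenization pipeline: term splitter,
/// character filter, term filter (stopwords) and tokenizer.
const stopWords = {'the', 'a', 'of'}; // simplified stand-in list

List<String> splitTerms(String text) =>
    text.split(RegExp(r'\s+')).where((t) => t.isNotEmpty).toList();

String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9]'), '');

void main() {
  const text = 'The quick, brown fox.';
  final terms = splitTerms(text);
  for (var position = 0; position < terms.length; position++) {
    final filtered = filterCharacters(terms[position]);
    if (filtered.isEmpty || stopWords.contains(filtered)) continue;
    // A token pairs the term with its position in the source text.
    print('$filtered @ $position'); // quick @ 1, brown @ 2, fox @ 3
  }
}
```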

Refer to the references to learn more about information retrieval systems and the theory behind this library.
Usage #
In the pubspec.yaml of your Dart or Flutter project, add the following dependency:

```yaml
dependencies:
  text_analysis: <latest version>
```

In your code file add the following import:

```dart
import 'package:text_analysis/text_analysis.dart';
```
Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:

```dart
/// Use a TextAnalyzer instance to tokenize the [text] using the default
/// [English] configuration.
final document = await TextAnalyzer().tokenize(text);
```
For more complex text analysis:
- implement a TextAnalyzerConfiguration for a different language or for tokenizing non-language documents;
- implement a custom ITextAnalyzer or extend TextAnalyzerBase; and/or
- pass in a TokenFilter function to a TextAnalyzer to manipulate the tokens after tokenization, as shown in the examples.
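For example, a token filter can drop tokens below a minimum length. The snippet below is a sketch only: the exact TokenFilter signature (synchronous or Future-based) and the Token member names should be checked against the API documentation.

```dart
import 'package:text_analysis/text_analysis.dart';

/// Sketch of a custom token filter that keeps only tokens whose term
/// is at least three characters long (assumes Token exposes a `term`
/// field, as described in the API section below).
Future<List<Token>> minLengthFilter(List<Token> tokens) async =>
    tokens.where((token) => token.term.length >= 3).toList();

void main() async {
  final document = await TextAnalyzer(tokenFilter: minLengthFilter)
      .tokenize('tokenize me, but drop the short words');
}
```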
API #
The key members of the text_analysis library are briefly described in this section. Please refer to the documentation for details.
Type definitions #
The API uses the following function type definitions and type aliases to improve code readability:
- CharacterFilter is a function that manipulates terms prior to stemming and tokenization (e.g. changing case and/or removing non-word characters);
- Ft is an alias for int and denotes the frequency of a Term in an index or indexed object (the term frequency);
- IdFt is an alias for double and denotes the inverse document frequency of a term, defined as idft = log (N / dft), where N is the total number of terms in the index and dft is the document frequency of the term (the number of documents that contain the term);
- JsonTokenizer is a function that returns a Token collection from the fields in a JSON document hashmap of Zone to value;
- SentenceSplitter is a function that returns a list of sentences from SourceText. In English, the SourceText is split at sentence-ending marks such as periods, question marks and exclamation marks;
- SourceText, Zone and Term are all aliases for the Dart core type String when used in different contexts;
- StopWords is an alias for Set<String>;
- TermFilter is a function that manipulates a Term collection by splitting compound or hyphenated terms or applying stemming and lemmatization. The TermFilter can also filter out stopwords;
- TermSplitter is a function that splits SourceText into an ordered list of Term at appropriate places like white-space and mid-sentence punctuation;
- TokenFilter is a function that returns a subset of a Token collection, preserving its sort order; and
- Tokenizer is a function that converts SourceText to a Token collection, preserving the order of the Term instances.
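The inverse document frequency defined above can be computed directly with dart:math. A minimal sketch; the function name idFt here is illustrative, not a package member:

```dart
import 'dart:math';

/// Inverse document frequency: idft = log(N / dft), where N is the
/// total number of items in the index and dft is the number of
/// documents that contain the term.
double idFt(int n, int dFt) => log(n / dFt);

void main() {
  // A rarer term earns a higher weight than a common one.
  print(idFt(1000, 10) > idFt(1000, 500)); // true
}
```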
Object models #
The text_analysis library includes the following object-model classes:
- a Token represents a Term present in a TextSource, with its position and optional zone name;
- a Sentence represents text from a TextSource that does not contain sentence-ending punctuation such as periods, question marks and exclamation marks, except where used in tokens, identifiers or other terms; and
- a TextSource represents source text that has been analyzed to enumerate its Sentence and Token collections.
Interfaces #
The text_analysis library exposes two interfaces:
- the TextAnalyzerConfiguration interface; and
- the ITextAnalyzer interface.
TextAnalyzerConfiguration Interface
The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:
- TextAnalyzerConfiguration.sentenceSplitter splits the text into sentences at sentence endings such as periods, exclamation marks and question marks, or at line endings;
- TextAnalyzerConfiguration.termSplitter splits the text into terms;
- TextAnalyzerConfiguration.characterFilter removes non-word characters; and
- TextAnalyzerConfiguration.termFilter applies a stemmer/lemmatizer or a stopword list.
ITextAnalyzer Interface
The ITextAnalyzer is an interface for a text analyzer class that extracts tokens from text for use in full-text search queries and indexes:
- ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text;
- provide an ITextAnalyzer.tokenFilter to manipulate tokens or to restrict tokenization to tokens that meet criteria for either index or count;
- the ITextAnalyzer.tokenize method tokenizes text to a TextSource object that contains all the Tokens in the text; and
- the ITextAnalyzer.tokenizeJson method tokenizes a JSON hashmap to a TextSource object that contains all the Tokens in the document.
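Tokenizing a JSON document might look like the sketch below; the zone names are illustrative and the exact parameters of tokenizeJson should be checked against the API documentation.

```dart
import 'package:text_analysis/text_analysis.dart';

void main() async {
  // Tokenize the 'title' and 'body' zones of a JSON document hashmap.
  final document = await TextAnalyzer().tokenizeJson({
    'title': 'Text analysis',
    'body': 'Extracts tokens from text for use in full-text search.',
  });
}
```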
Implementation classes #
The latest version provides the following implementation classes:
- the English class implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language;
- the TextAnalyzerBase abstract class implements ITextAnalyzer.tokenize; and
- the TextAnalyzer class extends TextAnalyzerBase and implements ITextAnalyzer.tokenFilter and ITextAnalyzer.configuration as final fields, with their values passed in as (optional) parameters (with defaults) at initialization.
English class
A basic TextAnalyzerConfiguration implementation for English language analysis.
The termFilter applies the following algorithm:
- apply the characterFilter to the term;
- if the resulting term is empty or contained in kStopWords, return an empty collection; else
- insert the filtered term in the return value;
- split the term at commas, periods, hyphens and apostrophes, unless these are preceded and followed by a number; and
- if the term can be split, add the split terms to the return value, unless the (split) terms are in kStopWords or are empty strings.
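As a rough sketch of this algorithm (the stopword set, character filter and splitting regex below are simplified stand-ins for the package's kStopWords, characterFilter and splitting logic):

```dart
const kStopWordsStandIn = {'a', 'an', 'and', 'the'};

String characterFilterStandIn(String term) => term.toLowerCase().trim();

/// Sketch of the termFilter algorithm described above: returns the
/// filtered term plus any sub-terms obtained by splitting it.
Iterable<String> termFilter(String term) {
  final filtered = characterFilterStandIn(term);
  if (filtered.isEmpty || kStopWordsStandIn.contains(filtered)) {
    return const [];
  }
  final result = <String>{filtered};
  // Split at commas, periods, hyphens and apostrophes. (The real
  // implementation keeps these characters between two numbers.)
  result.addAll(filtered
      .split(RegExp(r"[,.\-']+"))
      .where((t) => t.isNotEmpty && !kStopWordsStandIn.contains(t)));
  return result;
}

void main() {
  print(termFilter('mother-in-law').toList());
  // [mother-in-law, mother, in, law]
}
```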
The characterFilter function:
- returns the term unchanged if it can be parsed as a number; else
- converts the term to lower-case;
- changes all quote marks to the single apostrophe U+0027;
- removes enclosing quote marks;
- changes all dashes to the single standard hyphen;
- removes all non-word characters from the term; and
- removes all characters except letters and numbers from the end of the term.
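A simplified sketch of these steps; the regular expressions are illustrative approximations, not the package's own patterns:

```dart
/// Simplified sketch of the characterFilter steps described above.
String characterFilter(String term) {
  // 1. Return the term unchanged if it parses as a number.
  if (num.tryParse(term) != null) return term;
  return term
      // 2. Convert to lower-case.
      .toLowerCase()
      // 3. Normalize quote marks to the apostrophe U+0027.
      .replaceAll(RegExp('[\u2018\u2019\u201C\u201D"`]'), "'")
      // 4. Remove enclosing quote marks.
      .replaceAll(RegExp(r"^'+|'+$"), '')
      // 5. Normalize dashes to the standard hyphen.
      .replaceAll(RegExp('[\u2012\u2013\u2014\u2015]'), '-')
      // 6. Remove non-word characters (keeping hyphens and apostrophes).
      .replaceAll(RegExp(r"[^a-z0-9\-' ]"), '')
      // 7. Strip trailing characters that are not letters or numbers.
      .replaceAll(RegExp(r'[^a-z0-9]+$'), '');
}

void main() {
  print(characterFilter("'Dash\u2014it!'")); // dash-it
}
```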
The sentenceSplitter inserts a _kSentenceDelimiter at sentence breaks and then splits the source text into a list of sentence strings (sentence breaks are characters that match English.reLineEndingSelector or English.reSentenceEndingSelector). Empty sentences are removed.
TextAnalyzerBase Class
The TextAnalyzerBase class implements the ITextAnalyzer.tokenize method:
- tokenizes the source text using the configuration;
- manipulates the output by applying the tokenFilter; and, finally
- returns a TextSource enumerating the source text and its Sentence and Token collections.
TextAnalyzer Class
The TextAnalyzer class extends TextAnalyzerBase:
- implements configuration and tokenFilter as final fields, passed in as optional parameters at instantiation;
- configuration is used by the TextAnalyzer to tokenize source text and defaults to English.configuration; and
- provide a nullable tokenFilter function if you want to manipulate tokens or restrict tokenization to tokens that meet specific criteria. The default is TextAnalyzer.defaultTokenFilter, which applies the Porter2Stemmer.
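Putting this together, a minimal end-to-end sketch (this assumes TextSource exposes its token collection as tokens and Token exposes term and position, as described in the object-model section; check the API documentation for the exact member names):

```dart
import 'package:text_analysis/text_analysis.dart';

void main() async {
  // Default English configuration and defaultTokenFilter
  // (Porter2 stemming).
  final analyzer = TextAnalyzer();
  final document =
      await analyzer.tokenize('The TextAnalyzer extends TextAnalyzerBase.');
  // Inspect the extracted tokens and their positions.
  for (final token in document.tokens) {
    print('${token.term} @ ${token.position}');
  }
}
```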
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation of tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text zones/fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index that also includes the positions of the indexed term in each document.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and/or manipulates terms by invoking a stemmer and/or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
- vocabulary - the collection of terms indexed from the corpus.
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, 2016 "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
Issues #
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.