text_analysis 0.2.0+1
text_analysis #
Text analyzer that extracts tokens from text for use in full-text search queries and indexes.
THIS PACKAGE IS PRE-RELEASE, IN ACTIVE DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Objective #
To tokenize text in preparation for constructing a dictionary from a corpus of documents in an information retrieval system.
The tokenization process comprises the following steps:
- a term splitter splits text to a list of terms at appropriate places like white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
- the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
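The four steps above can be sketched in plain Dart. This is a minimal illustration, not the package's implementation: the Token class, the function names, the regular expressions and the tiny stopword set are all assumptions made for the example, stemming and lemmatization are left out, and a term's position is taken to be its index in the list returned by the term splitter.

```dart
/// Hypothetical token type: a term plus the index of the source term it
/// came from in the list produced by the term splitter.
class Token {
  final String term;
  final int position;
  const Token(this.term, this.position);

  @override
  String toString() => '$term@$position';
}

/// Illustrative stopword list (an assumption, far from complete).
const stopwords = {'a', 'an', 'and', 'of', 'the', 'to'};

/// Step 1: term splitter. Splits at white-space and common punctuation,
/// leaving hyphens in place for the term filter.
List<String> splitTerms(String text) => text
    .split(RegExp(r'[\s.,;:!?"()]+'))
    .where((term) => term.isNotEmpty)
    .toList();

/// Step 2: character filter. Lower-cases the term and removes characters
/// that are not ASCII letters, digits or hyphens.
String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9-]'), '');

/// Step 3: term filter. Splits hyphenated terms and drops stopwords, so it
/// returns zero or more terms for each input term.
List<String> filterTerm(String term) => term
    .split('-')
    .where((part) => part.isNotEmpty && !stopwords.contains(part))
    .toList();

/// Step 4: tokenizer. Runs steps 1 to 3 and records each term's position.
List<Token> tokenize(String text) {
  final terms = splitTerms(text);
  final tokens = <Token>[];
  for (var position = 0; position < terms.length; position++) {
    final filtered = filterCharacters(terms[position]);
    for (final term in filterTerm(filtered)) {
      tokens.add(Token(term, position));
    }
  }
  return tokens;
}

void main() {
  // prints [quick@1, witted@1, fox@2, jumped@3]
  print(tokenize('The quick-witted fox jumped!'));
}
```

In this sketch the stopword "The" is dropped but still occupies position 0, and both halves of "quick-witted" share position 1. Other conventions are equally possible.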

The design of the text analyzer is consistent with information retrieval theory.
Definitions #
The following definitions are used throughout the documentation:
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation for tokenization.
- dictionary - a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, that also includes the positions of the indexed term in each document.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term position in the source text.
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from the terms in a text source after applying a character filter and term filter.
- vocabulary - the collection of terms indexed from the corpus.
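To make the dictionary, postings and postings list definitions concrete, here is a small sketch of a positional inverted index held in memory. The PositionalIndex class and its members are hypothetical names for this example only, the term splitting is deliberately naive, and the dictionary is assumed to count every occurrence of a term across the corpus (a per-document count would be an equally valid reading of the definition).

```dart
/// Hypothetical in-memory positional inverted index.
class PositionalIndex {
  /// dictionary: term -> number of occurrences in the corpus.
  final Map<String, int> dictionary = {};

  /// postings: term -> (docId -> postings list of term positions).
  final Map<String, Map<String, List<int>>> postings = {};

  /// Indexes the [text] of the document identified by [docId].
  void addDocument(String docId, String text) {
    // Naive analysis: lower-case and split on runs of non-word characters.
    final terms = text
        .toLowerCase()
        .split(RegExp(r'\W+'))
        .where((term) => term.isNotEmpty)
        .toList();
    for (var position = 0; position < terms.length; position++) {
      final term = terms[position];
      dictionary[term] = (dictionary[term] ?? 0) + 1;
      postings
          .putIfAbsent(term, () => <String, List<int>>{})
          .putIfAbsent(docId, () => <int>[])
          .add(position);
    }
  }
}

void main() {
  final index = PositionalIndex()
    ..addDocument('doc1', 'The cat sat on the mat')
    ..addDocument('doc2', 'The dog sat');
  print(index.dictionary['the']); // prints 3
  print(index.postings['the']); // prints {doc1: [0, 4], doc2: [0]}
}
```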
Interfaces #
The package exposes two interfaces:
- the ITextAnalyzer interface; and
- the TextAnalyzerConfiguration interface.
Interface ITextAnalyzer #
The ITextAnalyzer is an interface for a text analyzer class that extracts tokens from text for use in full-text search queries and indexes.
ITextAnalyzer.configuration is a TextAnalyzerConfiguration used by the ITextAnalyzer to tokenize source text.
Provide an ITextAnalyzer.tokenFilter to manipulate tokens or restrict tokenization to tokens that meet criteria for either index or count.
The tokenize function tokenizes source text using the ITextAnalyzer.configuration and then manipulates the output by applying ITextAnalyzer.tokenFilter.
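The relationship between tokenizing and token filtering might look like the sketch below. The Token class, the TokenFilter typedef and both functions are stand-ins written for this example, so the package's real signatures may well differ. The sketch is synchronous for brevity, whereas the package's tokenize is awaited in the usage example.

```dart
/// Hypothetical token type used only in this sketch.
class Token {
  final String term;
  final int position;
  const Token(this.term, this.position);
}

/// Assumed filter signature: receives the tokenizer output and returns the
/// tokens to keep.
typedef TokenFilter = List<Token> Function(List<Token> tokens);

/// Example filter: keeps tokens whose term has at least three characters.
List<Token> minLength3(List<Token> tokens) =>
    tokens.where((token) => token.term.length >= 3).toList();

/// Tokenizes [text] naively, then applies [tokenFilter] if one is given.
List<Token> tokenizeWithFilter(String text, {TokenFilter? tokenFilter}) {
  final terms = text
      .toLowerCase()
      .split(RegExp(r'\W+'))
      .where((term) => term.isNotEmpty)
      .toList();
  final tokens = [
    for (var i = 0; i < terms.length; i++) Token(terms[i], i),
  ];
  return tokenFilter == null ? tokens : tokenFilter(tokens);
}

void main() {
  final kept = tokenizeWithFilter('An ox is in the barn',
      tokenFilter: minLength3);
  // prints [the@4, barn@5]
  print(kept.map((t) => '${t.term}@${t.position}').toList());
}
```

Because the filter runs after positions are assigned, the surviving tokens keep their original positions.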
Interface TextAnalyzerConfiguration #
The TextAnalyzerConfiguration interface exposes language-specific properties and methods used in text analysis:
- a TextAnalyzerConfiguration.sentenceSplitter splits the text at sentence endings such as periods, exclamations and question marks or line endings;
- a TextAnalyzerConfiguration.termSplitter to split the text into terms;
- a TextAnalyzerConfiguration.characterFilter to remove non-word characters; and
- a TextAnalyzerConfiguration.termFilter to apply a stemmer/lemmatizer or stopword list.
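As a rough picture of how a language configuration could bundle these four members, consider the sketch below. The member names follow the list above, but AnalyzerConfiguration, SimpleEnglish, analyze and every signature are assumptions made for illustration and are not taken from the package. The stopword set is a toy and no stemmer or lemmatizer is applied.

```dart
/// Hypothetical configuration interface; signatures are assumptions.
abstract class AnalyzerConfiguration {
  List<String> sentenceSplitter(String text);
  List<String> termSplitter(String sentence);
  String characterFilter(String term);
  List<String> termFilter(String term);
}

/// Toy English configuration for the sketch.
class SimpleEnglish implements AnalyzerConfiguration {
  static const _stopwords = {'a', 'an', 'and', 'is', 'the'};

  /// Splits at periods, exclamation marks, question marks and line endings.
  @override
  List<String> sentenceSplitter(String text) => text
      .split(RegExp(r'[.!?\n]+'))
      .map((sentence) => sentence.trim())
      .where((sentence) => sentence.isNotEmpty)
      .toList();

  /// Splits a sentence at white-space and mid-sentence punctuation.
  @override
  List<String> termSplitter(String sentence) => sentence
      .split(RegExp(r'[\s,;:]+'))
      .where((term) => term.isNotEmpty)
      .toList();

  /// Lower-cases the term and removes non-alphanumeric characters.
  @override
  String characterFilter(String term) =>
      term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9]'), '');

  /// Drops empty terms and stopwords.
  @override
  List<String> termFilter(String term) =>
      term.isEmpty || _stopwords.contains(term) ? <String>[] : [term];
}

/// Runs the four configuration members in sequence over [text].
List<String> analyze(String text, AnalyzerConfiguration config) => [
      for (final sentence in config.sentenceSplitter(text))
        for (final term in config.termSplitter(sentence))
          ...config.termFilter(config.characterFilter(term)),
    ];

void main() {
  // prints [cat, sat, it, mat]
  print(analyze('The cat sat. Is it a mat?', SimpleEnglish()));
}
```

Keeping the language-specific rules behind one interface means another language can be supported by supplying a different implementation, without touching the analysis loop.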
Implementations #
The latest version provides the following implementation classes:
- implementation class English, implements TextAnalyzerConfiguration and provides text analysis configuration properties for the English language;
- the TextAnalyzerBase abstract class implements ITextAnalyzer.tokenize; and
- the TextAnalyzer class extends TextAnalyzerBase and implements ITextAnalyzer.tokenFilter and ITextAnalyzer.configuration as final fields with their values passed in as (optional) parameters (with defaults) at initialization.
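The pattern described here, an abstract base class that implements tokenize on top of abstract members and a concrete class that supplies those members as final fields with defaults, can be sketched as follows. AnalyzerBase, Analyzer, the typedefs and splitOnWhitespace are hypothetical names that mirror the structure only. They are not the package's classes, and tokens are plain strings to keep the example short.

```dart
typedef TermSplitter = List<String> Function(String text);
typedef TokenFilter = List<String> Function(List<String> tokens);

/// Default splitter used when none is passed in (an assumption).
List<String> splitOnWhitespace(String text) =>
    text.split(RegExp(r'\s+')).where((term) => term.isNotEmpty).toList();

/// The abstract base implements tokenize in terms of abstract getters.
abstract class AnalyzerBase {
  TermSplitter get termSplitter;
  TokenFilter? get tokenFilter;

  List<String> tokenize(String text) {
    final tokens = termSplitter(text);
    final filter = tokenFilter;
    return filter == null ? tokens : filter(tokens);
  }
}

/// The concrete class satisfies the getters with final fields whose values
/// are optional named parameters with defaults.
class Analyzer extends AnalyzerBase {
  @override
  final TermSplitter termSplitter;

  @override
  final TokenFilter? tokenFilter;

  Analyzer({this.termSplitter = splitOnWhitespace, this.tokenFilter});
}

void main() {
  // Default configuration, no token filter.
  print(Analyzer().tokenize('Hello wide world')); // prints [Hello, wide, world]

  // Custom token filter that keeps terms of at most four characters.
  final shortOnly = Analyzer(
    tokenFilter: (tokens) => tokens.where((t) => t.length <= 4).toList(),
  );
  print(shortOnly.tokenize('Hello wide world')); // prints [wide]
}
```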
Refer to the package API reference for more details.
Usage #
Basic English text analysis can be performed by using a TextAnalyzer instance with the default configuration and no token filter:
/// Use a TextAnalyzer instance to tokenize the [text] using the default
/// English configuration.
final document = await TextAnalyzer().tokenize(text);
For more complex requirements, override TextAnalyzerConfiguration and/or pass in a TokenFilter function to manipulate the tokens after tokenization as shown in the examples.
Install #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the following import:
import 'package:text_analysis/text_analysis.dart';
Examples #
Examples are provided.
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- Cummins, Ronan, "Information Retrieval", course notes, University of Cambridge, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia