text_analysis 0.15.0
Text analyzer that tokenizes text, computes readability scores for a document and evaluates similarity of terms.
Tokenize text, compute document readability and compare terms. #
THIS PACKAGE IS PRE-RELEASE, and SUBJECT TO DAILY BREAKING CHANGES.
Overview #
The text_analysis package provides methods to tokenize text, compute readability scores for a document and evaluate similarity of terms. It is intended to be used in Natural Language Processing (NLP) as part of an information retrieval system.
It is split into four (4) libraries:
- text_analysis is the core library that exports the tokenization, analysis and string similarity functions;
- extensions exports extension methods also provided as static methods of the TermSimilarity class;
- package_exports exports the porter_2_stemmer package; and
- type_definitions exports all the typedefs used in this package.
Refer to the references to learn more about information retrieval systems and the theory behind this library.
Tokenization
Tokenization comprises the following steps:
- a term splitter splits text to a list of terms at appropriate places like white-space and mid-sentence punctuation;
- a character filter manipulates terms prior to stemming and tokenization (e.g. changing case and / or removing non-word characters);
- a term filter manipulates the terms by splitting compound or hyphenated terms or applying stemming and lemmatization. The termFilter can also filter out stopwords; and
- the tokenizer converts the resulting terms to a collection of tokens that contain the term and a pointer to the position of the term in the source text.
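The steps above can be sketched in plain Dart. This is an illustration only, not the package's implementation: the names SimpleToken, stopWords, splitTerms, filterCharacters, filterTerm and tokenizeSketch are hypothetical, the stopword list is invented for the example, and the splitter only splits at white-space.

```dart
/// Hypothetical token: a term plus the zero-based position of the source
/// term in the list of terms split from the text.
class SimpleToken {
  final String term;
  final int termPosition;
  const SimpleToken(this.term, this.termPosition);

  @override
  String toString() => '$term@$termPosition';
}

/// Hypothetical stopword list, for illustration only.
const stopWords = {'a', 'and', 'is', 'of', 'the'};

/// Term splitter: splits text at white-space (a simplification).
List<String> splitTerms(String text) =>
    text.split(RegExp(r'\s+')).where((t) => t.isNotEmpty).toList();

/// Character filter: lower-cases the term and removes every character
/// that is not a-z, 0-9 or a hyphen.
String filterCharacters(String term) =>
    term.toLowerCase().replaceAll(RegExp(r'[^a-z0-9\-]'), '');

/// Term filter: returns an empty list for stopwords and splits
/// hyphenated terms into their parts.
List<String> filterTerm(String term) {
  if (term.isEmpty || stopWords.contains(term)) return const [];
  return term.split('-').where((t) => t.isNotEmpty).toList();
}

/// Tokenizer: runs the three steps and records each term's position.
List<SimpleToken> tokenizeSketch(String text) {
  final tokens = <SimpleToken>[];
  final terms = splitTerms(text);
  for (var i = 0; i < terms.length; i++) {
    for (final term in filterTerm(filterCharacters(terms[i]))) {
      tokens.add(SimpleToken(term, i));
    }
  }
  return tokens;
}
```

Calling tokenizeSketch('The well-known Platypus!') drops the stopword, splits the hyphenated term into two tokens that share a position, and strips the trailing punctuation.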
A String extension method, Set<KGram> kGrams([int k = 2]), parses a set of k-grams of length k from a term. The default k-gram length is 3 (tri-gram).
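As a rough illustration of what a k-gram set looks like, the sketch below pads a term with "$" at each end and collects every substring of length k, following the k-gram definition in the Definitions section. The function name kGramsOf is hypothetical, it returns plain strings rather than the package's KGram type, and the handling of very short terms is an assumption.

```dart
/// Returns the set of k-grams of [term], using "$" to mark the start
/// and end of the term. Hypothetical helper, not the package's API.
Set<String> kGramsOf(String term, int k) {
  if (term.isEmpty || k < 1) return <String>{};
  final padded = '\$$term\$';
  // Assumption: a padded term shorter than k yields itself as one k-gram.
  if (padded.length <= k) return {padded};
  final grams = <String>{};
  for (var i = 0; i + k <= padded.length; i++) {
    grams.add(padded.substring(i, i + k));
  }
  return grams;
}
```

For the term "castle" and k = 3, this yields the six 3-grams listed in the Definitions section.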

Readability #
The TextDocument enumerates a text document's paragraphs, sentences, terms and tokens and computes readability measures:
- the average number of words in each sentence;
- the average number of syllables for words;
- the Flesch reading ease score, a readability measure calculated from sentence length and word length on a 100-point scale; and
- the Flesch-Kincaid grade level, a readability measure relative to U.S. school grade level.
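For reference, the published Flesch formulas (see Wikipedia (6) in the references) are shown below. Whether the package rounds or clamps the results is not stated here, so treat this as the textbook definition rather than a description of the implementation.

```latex
\begin{align*}
\text{Flesch reading ease} &= 206.835
  - 1.015\,\frac{\text{total words}}{\text{total sentences}}
  - 84.6\,\frac{\text{total syllables}}{\text{total words}} \\
\text{Flesch-Kincaid grade level} &= 0.39\,\frac{\text{total words}}{\text{total sentences}}
  + 11.8\,\frac{\text{total syllables}}{\text{total words}}
  - 15.59
\end{align*}
```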
String Comparison #
The following measures of term similarity are provided:
- Damerau–Levenshtein distance is the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
- edit similarity is a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
- length distance returns the absolute value of the difference in length between two terms;
- length similarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths);
- Jaccard similarity measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets; and
- termSimilarity returns a similarity index value between 0.0 and 1.0, the product of edit similarity, Jaccard similarity and length similarity. A term similarity of 1.0 means the two terms are identical in all respects.
Functions that return the term similarity measures are provided by static methods of the TermSimilarity class.
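To make the edit distance and edit similarity definitions concrete, here is a minimal sketch in plain Dart. It uses the optimal string alignment variant of Damerau–Levenshtein distance (adjacent transpositions only), which may differ from the algorithm the package uses. The function names are hypothetical, and lower-casing the inputs is an assumption based on the statement that comparisons are case-insensitive.

```dart
import 'dart:math' as math;

/// Optimal-string-alignment edit distance between [a] and [b].
/// Counts insertions, deletions, substitutions and adjacent transpositions.
int editDistanceSketch(String a, String b) {
  final s = a.toLowerCase();
  final t = b.toLowerCase();
  // d[i][j] is the distance between the first i chars of s and first j of t.
  final d = List.generate(
      s.length + 1, (_) => List<int>.filled(t.length + 1, 0));
  for (var i = 0; i <= s.length; i++) {
    d[i][0] = i;
  }
  for (var j = 0; j <= t.length; j++) {
    d[0][j] = j;
  }
  for (var i = 1; i <= s.length; i++) {
    for (var j = 1; j <= t.length; j++) {
      final cost = s[i - 1] == t[j - 1] ? 0 : 1;
      var best = math.min(
          math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
          d[i - 1][j - 1] + cost);
      // Transposition of two adjacent characters.
      if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]) {
        best = math.min(best, d[i - 2][j - 2] + 1);
      }
      d[i][j] = best;
    }
  }
  return d[s.length][t.length];
}

/// (maximum edit distance - edit distance) / maximum edit distance,
/// where the maximum is the sum of the two term lengths.
double editSimilaritySketch(String a, String b) {
  final maxDistance = a.length + b.length;
  if (maxDistance == 0) return 1.0; // assumption: two empty terms are identical
  return (maxDistance - editDistanceSketch(a, b)) / maxDistance;
}
```

With these definitions, 'bodrer' and 'border' are one transposition apart, so their edit similarity is (12 - 1) / 12.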
Usage #
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  text_analysis: <latest version>
In your code file add the text_analysis library import. This will also import the Porter2Stemmer class from the porter_2_stemmer package.
import 'package:text_analysis/text_analysis.dart';
To use the package's extensions and/or type definitions, also add any of the following imports:
import 'package:text_analysis/extensions.dart';
import 'package:text_analysis/type_definitions.dart';
Basic English tokenization can be performed by using a TextTokenizer instance with the default text analyzer and no token filter:
// Use a TextTokenizer instance to tokenize the text using the default
// English analyzer.
final document = await TextTokenizer().tokenize(text);
To analyze text or a document, hydrate a TextDocument to obtain the text statistics and readability scores:
// get some sample text
final sample =
    'The Australian platypus is seemingly a hybrid of a mammal and reptilian creature.';
// hydrate the TextDocument
final textDoc = await TextDocument.analyze(sourceText: sample);
// print the `Flesch reading ease score`
print(
    'Flesch Reading Ease: ${textDoc.fleschReadingEaseScore().toStringAsFixed(1)}');
// prints "Flesch Reading Ease: 37.5"
For more complex text analysis:
- implement a TextAnalyzer for a different language or non-language documents;
- implement a custom TextTokenizer or extend TextTokenizerBase;
- pass in a TokenFilter function to a TextTokenizer to manipulate the tokens after tokenization, as shown in the examples; and/or
- extend TextDocumentBase.
To compare terms, call the required extension on the term, or the static method from the TermSimilarity class:
// define a misspelt term
const term = 'bodrer';
// a collection of auto-correct options
const candidates = [
  'bord',
  'board',
  'broad',
  'boarder',
  'border',
  'brother',
  'bored'
];
// get a list of the terms ordered by descending similarity
final matches = term.matches(candidates);
// same as TermSimilarity.matches(term, candidates)
// print matches
print('Ranked matches: $matches');
// prints:
// Ranked matches: [border, boarder, bored, brother, board, bord, broad]
Please see the examples for more details.
API #
The key interfaces of the text_analysis library are briefly described in this section. Please refer to the documentation for details.
The API contains a fair amount of boiler-plate, but we aim to make the code as readable, extendable and re-usable as possible:
- We use an interface > implementation mixin > base-class > implementation class pattern:
  - the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
  - the implementation mixin implements the interface class methods, but not the input fields; and
  - the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor.
- To maximise performance of the indexers the API performs lookups in nested hashmaps of Dart core types. To improve code legibility the API makes use of type aliases, callback function definitions and extensions. The typedefs and extensions are not exported by the text_analysis library, but can be found in the type_definitions and extensions mini-libraries. Import these libraries separately if needed.
TermSimilarity
The TermSimilarity class provides the following static methods used for (case-insensitive) comparison of terms:
- editDistance returns the Damerau–Levenshtein distance, the minimum number of single-character edits (transpositions, insertions, deletions or substitutions) required to change one term into another;
- editSimilarity returns a normalized measure of Damerau–Levenshtein distance on a scale of 0.0 to 1.0, calculated by dividing the difference between the maximum edit distance (sum of the length of the two terms) and the computed editDistance, by the maximum edit distance;
- lengthDistance returns the absolute value of the difference in length between two terms;
- lengthSimilarity returns the similarity in length between two terms on a scale of 0.0 to 1.0 on a log scale (1 - the log of the ratio of the term lengths);
- jaccardSimilarity returns the Jaccard Similarity Index of two terms; and
- termSimilarity returns a similarity index value between 0.0 and 1.0, the product of editSimilarity, jaccardSimilarity and lengthSimilarity. A term similarity of 1.0 means the two terms are identical in all respects, except case.
To compare one term with a collection of other terms, the following methods are also provided:
- editDistanceMap returns a hashmap of terms to their editDistance with a term;
- editSimilarityMap returns a hashmap of terms to their editSimilarity with a term;
- lengthSimilarityMap returns a hashmap of terms to their lengthSimilarity with a term;
- jaccardSimilarityMap returns a hashmap of terms to their Jaccard Similarity Index with a term;
- termSimilarityMap returns a hashmap of terms to their termSimilarity with a term; and
- matches returns the best matches from terms for a term, in descending order of term similarity (best match first).
Term comparisons are NOT case-sensitive.
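The sketch below illustrates the Jaccard similarity index and a simple ranking of candidates by it. It is not the package's code: it assumes the sample sets are character bigrams of the lower-cased terms padded with "$", the names bigramsOf, jaccardSketch and rankByJaccard are hypothetical, and it ranks by Jaccard similarity alone rather than by the combined term similarity that matches uses.

```dart
/// Character bigrams of the lower-cased [term], padded with "$".
/// Assumption: the sample sets compared are k-grams with k = 2.
Set<String> bigramsOf(String term) {
  final padded = '\$${term.toLowerCase()}\$';
  return {
    for (var i = 0; i + 2 <= padded.length; i++) padded.substring(i, i + 2)
  };
}

/// Size of the intersection divided by the size of the union.
double jaccardSketch(String a, String b) {
  final x = bigramsOf(a);
  final y = bigramsOf(b);
  final unionSize = x.union(y).length;
  return unionSize == 0 ? 1.0 : x.intersection(y).length / unionSize;
}

/// Returns [candidates] ordered by descending Jaccard similarity to [term].
List<String> rankByJaccard(String term, Iterable<String> candidates) {
  final scores = {for (final c in candidates) c: jaccardSketch(term, c)};
  return scores.keys.toList()
    ..sort((a, b) => scores[b]!.compareTo(scores[a]!));
}
```

Because the terms are lower-cased before the bigrams are taken, 'Border' and 'border' score 1.0, while terms with no bigrams in common score 0.0.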
The TermSimilarity class uses extension methods that can be imported from the extensions library.
TextAnalyzer
The TextAnalyzer interface exposes language-specific properties and methods used in text analysis:
- characterFilter is a function that manipulates text prior to stemming and tokenization;
- termFilter is a filter function that returns a collection of terms from a term. It returns an empty collection if the term is to be excluded from analysis, multiple terms if the term is split (e.g. at hyphens), and / or modified terms (e.g. after applying a stemmer algorithm);
- termSplitter returns a list of terms from text;
- sentenceSplitter splits text into a list of sentences at sentence and line endings;
- paragraphSplitter splits text into a list of paragraphs at line endings;
- stemmer is a language-specific function that returns the stem of a term;
- lemmatizer is a language-specific function that returns the lemma of a term;
- termExceptions is a hashmap of words to token terms for special words that should not be re-capitalized, stemmed or lemmatized;
- stopWords are terms that commonly occur in a language and that do not add material value to the analysis of text; and
- syllableCounter returns the number of syllables in a word or text.
The English implementation of TextAnalyzer is included in this library.
TextTokenizer
The TextTokenizer extracts tokens from text for use in full-text search queries and indexes. It uses a TextAnalyzer and token filter in the tokenize and tokenizeJson methods that return a list of tokens from text or a document.
The unnamed factory constructor hydrates an implementation class. Alternatively, extend TextTokenizerBase.
TextDocument #
The TextDocument object model enumerates a text document's paragraphs, sentences, terms and tokens and provides functions that return text analysis measures:
- averageSentenceLength is the average number of words in sentences;
- averageSyllableCount is the average number of syllables per word in terms;
- wordCount is the total number of words in the sourceText;
- fleschReadingEaseScore is a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document; and
- fleschKincaidGradeLevel is a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length.
The TextDocumentMixin implements the TextDocument.averageSentenceLength, TextDocument.averageSyllableCount, TextDocument.wordCount, TextDocument.fleschReadingEaseScore and TextDocument.fleschKincaidGradeLevel methods.
A TextDocument can be hydrated with the unnamed factory constructor or using the TextDocument.analyze or TextDocument.analyzeJson static methods. Alternatively, extend the TextDocumentBase class.
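As a rough guide to how the two readability scores relate to the word, sentence and syllable counts, the sketch below applies the published Flesch coefficients to raw counts. The function names and the choice to throw on empty input are assumptions for illustration; the package's own methods take no arguments and may round or clamp differently.

```dart
/// Published Flesch reading ease formula applied to raw counts.
double fleschReadingEaseSketch(
    {required int words, required int sentences, required int syllables}) {
  if (words <= 0 || sentences <= 0) {
    throw ArgumentError('words and sentences must be positive');
  }
  return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
}

/// Published Flesch-Kincaid grade level formula applied to raw counts.
double fleschKincaidGradeSketch(
    {required int words, required int sentences, required int syllables}) {
  if (words <= 0 || sentences <= 0) {
    throw ArgumentError('words and sentences must be positive');
  }
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}
```

For a text of 100 words in 5 sentences with 150 syllables, the average sentence length is 20 words and the average syllable count is 1.5, which gives a reading ease of about 59.6 and a grade level of about 9.9.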
Definitions #
- corpus - the collection of documents for which an index is maintained.
- character filter - filters characters from text in preparation of tokenization.
- Damerau–Levenshtein distance - a metric for measuring the edit distance between two terms by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one term into the other.
- dictionary - is a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document - a record in the corpus, that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- document frequency (dFt) - the number of documents in the corpus that contain a term.
- edit distance - a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (Wikipedia (7)).
- Flesch reading ease score - a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (Wikipedia (6)).
- Flesch-Kincaid grade level - a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length (Wikipedia (6)).
- index - an inverted index used to look up document references from the corpus against a vocabulary of terms.
- index-elimination - selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
- inverse document frequency (iDft) - is a normalized measure of how rare a term is in the corpus. It is defined as log (N / dft), where N is the total number of documents in the corpus. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
- Jaccard index measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia).
- JSON is an acronym for "JavaScript Object Notation", a common format for persisting data.
- k-gram - a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
- lemmatizer - lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
- Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).
- postings - a separate index that records which documents the vocabulary occurs in. In a positional index, the postings also record the positions of each term in the text to create a positional inverted index.
- postings list - a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text. In a zoned index, the postings lists record the positions of each term in the text of a zone.
- term - a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- term filter - filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
- stemmer - stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (from Wikipedia).
- stopwords - common words in a language that are excluded from indexing.
- term frequency (Ft) is the frequency of a term in an index or indexed object.
- term position is the zero-based index of a term in an ordered array of terms tokenized from the corpus.
- text - the indexable content of a document.
- token - representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
- token filter - returns a subset of tokens from the tokenizer output.
- tokenizer - a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary - the collection of terms indexed from the corpus.
- zone is the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather than the body of a document).
References #
- Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008
- University of Cambridge, "Information Retrieval", course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), "Inverted Index", from Wikipedia, the free encyclopedia
- Wikipedia (2), "Lemmatisation", from Wikipedia, the free encyclopedia
- Wikipedia (3), "Stemming", from Wikipedia, the free encyclopedia
- Wikipedia (4), "Synonym", from Wikipedia, the free encyclopedia
- Wikipedia (5), "Jaccard Index", from Wikipedia, the free encyclopedia
- Wikipedia (6), "Flesch–Kincaid readability tests", from Wikipedia, the free encyclopedia
- Wikipedia (7), "Edit distance", from Wikipedia, the free encyclopedia
- Wikipedia (8), "Damerau–Levenshtein distance", from Wikipedia, the free encyclopedia
- Wikipedia (9), "Natural language processing", from Wikipedia, the free encyclopedia
Issues #
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
