dart_wordpiece

A pure Dart implementation of the BERT-compatible WordPiece tokenizer.

Converts raw text into the three integer sequences expected by BERT-style ONNX models — input_ids, attention_mask, and token_type_ids — with zero dependencies and no Flutter requirement.

Features

✅ WordPiece algorithm — greedy longest-match-first subword encoding
✅ Single & pair encoding — encode() and encodePair() for QA / NLI tasks
✅ Batch encoding — encodeAll(List<String>)
✅ Text normalization — lowercase, punctuation removal, stopword filtering
✅ Three vocab loading strategies — from File, from String, from Map
✅ Configurable — max length, special tokens, stopwords, normalization toggle
✅ ONNX-ready — Int64List getters for direct tensor creation
✅ Pure Dart — works in Flutter, CLI, and server-side Dart

Getting started

Add to your pubspec.yaml:

dependencies:
  dart_wordpiece: ^1.1.0

Usage

1. Load vocabulary

import 'dart:io'; // File, used only by the file-based loader
import 'package:flutter/services.dart' show rootBundle; // Flutter assets only
import 'package:dart_wordpiece/dart_wordpiece.dart';

// Pick ONE of the three loaders below.

// From a file (dart:io)
final vocab = await VocabLoader.fromFile(File('/path/to/vocab.txt'));

// From a Flutter asset string
final raw = await rootBundle.loadString('assets/vocab.txt');
final vocab = VocabLoader.fromString(raw);

// From an in-memory map (useful for tests)
final vocab = VocabLoader.fromMap({'[PAD]': 0, '[UNK]': 1, ...});
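
For orientation, the HuggingFace vocab.txt format is one token per line, with the zero-based line index as the token ID. The sketch below shows that mapping in isolation. `parseVocab` is a hypothetical helper written for illustration, not part of this package's API, and it is an assumption that `VocabLoader.fromString` behaves similarly.

```dart
import 'dart:convert';

/// Hypothetical helper: parses vocab.txt content into a token -> ID map.
/// The ID of a token is its zero-based line index.
Map<String, int> parseVocab(String content) {
  final vocab = <String, int>{};
  // LineSplitter handles both "\n" and "\r\n" line endings.
  final lines = const LineSplitter().convert(content);
  for (var i = 0; i < lines.length; i++) {
    final token = lines[i];
    if (token.isEmpty) continue; // blank line: skip it, but keep counting IDs
    vocab[token] = i;
  }
  return vocab;
}
```

A map built this way could then be handed to `VocabLoader.fromMap`, which is convenient in unit tests that only need a handful of tokens.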

2. Create tokenizer

// Default BERT configuration (maxLength=64, no stopwords)
final tokenizer = WordPieceTokenizer(vocab: vocab);

// Custom configuration
final tokenizer = WordPieceTokenizer(
  vocab: vocab,
  config: TokenizerConfig(
    maxLength: 128,
    stopwords: {'what', 'is', 'the', 'a', 'an'},
    normalizeText: true,   // lowercase + remove punctuation
  ),
);

3. Encode a single sequence

final output = tokenizer.encode('What is Flutter?');

print(output.inputIds);      // [101, 2054, 2003, 14246, 2102, 1029, 102, 0, …]
print(output.attentionMask); // [1, 1, 1, 1, 1, 1, 1, 0, …]
print(output.tokenTypeIds);  // [0, 0, 0, 0, 0, 0, 0, 0, …]
print(output.realLength);    // 7  (non-padding positions)
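
The three lists always have the same length, and the padding positions line up across them. A minimal sketch of how such an output can be assembled from already-looked-up token IDs follows. `Encoding` and `buildEncoding` are hypothetical names, the default IDs (101, 102, 0) are the bert-base-uncased values for [CLS], [SEP], and [PAD], and truncating from the right is an assumption rather than documented package behavior.

```dart
/// Hypothetical container for the three parallel sequences.
class Encoding {
  Encoding(this.inputIds, this.attentionMask, this.tokenTypeIds);
  final List<int> inputIds;
  final List<int> attentionMask;
  final List<int> tokenTypeIds;
}

/// Wraps [tokenIds] in [CLS] ... [SEP], truncates, and pads to [maxLength].
/// Assumes maxLength >= 2 so both special tokens always fit.
Encoding buildEncoding(
  List<int> tokenIds, {
  int maxLength = 64,
  int clsId = 101,
  int sepId = 102,
  int padId = 0,
}) {
  // Reserve two positions for [CLS] and [SEP]; drop tokens that do not fit.
  final body = tokenIds.take(maxLength - 2).toList();
  final real = [clsId, ...body, sepId];
  final padCount = maxLength - real.length;
  return Encoding(
    [...real, ...List.filled(padCount, padId)],
    // 1 marks a real token, 0 marks padding.
    [...List.filled(real.length, 1), ...List.filled(padCount, 0)],
    // A single sequence belongs entirely to segment A.
    List.filled(maxLength, 0),
  );
}
```

In this sketch, the count of 1s in `attentionMask` plays the role of `realLength`.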

4. Encode a sentence pair

Use for BERT-based question answering or natural language inference:

final output = tokenizer.encodePair(
  'Flutter is a cross-platform UI toolkit.',  // segment A
  'What is Flutter?',                          // segment B
);

// Format:  [CLS] <A> [SEP] <B> [SEP] [PAD]…
// typeIds:   0    0    0    1    1    0…
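
The segment layout in the comment above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `pairTokenTypeIds` is a hypothetical helper, and it throws when the pair does not fit because the source does not say how the package truncates long pairs.

```dart
/// Hypothetical helper: token type IDs for [CLS] A [SEP] B [SEP] [PAD]...
/// [lenA] and [lenB] are token counts without special tokens.
List<int> pairTokenTypeIds(int lenA, int lenB, int maxLength) {
  final segmentA = 1 + lenA + 1; // [CLS] + A + first [SEP]
  final segmentB = lenB + 1; // B + final [SEP]
  final padCount = maxLength - segmentA - segmentB;
  if (padCount < 0) {
    // Truncation strategy is not covered by this sketch.
    throw ArgumentError('pair needs ${segmentA + segmentB} positions, '
        'but maxLength is $maxLength');
  }
  return [
    ...List.filled(segmentA, 0),
    ...List.filled(segmentB, 1),
    ...List.filled(padCount, 0), // padding is labelled 0, like segment A
  ];
}
```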

5. Batch encoding

final outputs = tokenizer.encodeAll([
  'What is Flutter?',
  'Dart is fast',
  'How to use isolates?',
]);
// Each output has the same length (maxLength) → stack into a batch tensor.
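
Because every row has the same length, stacking amounts to copying the rows back to back into one flat buffer and describing it with the shape [batchSize, maxLength]. A minimal sketch, assuming a hypothetical helper named `stackBatch` that is not part of the package:

```dart
import 'dart:typed_data';

/// Hypothetical helper: flattens equal-length rows into one row-major buffer.
Int64List stackBatch(List<List<int>> rows) {
  if (rows.isEmpty) return Int64List(0);
  final width = rows.first.length;
  final flat = Int64List(rows.length * width);
  for (var r = 0; r < rows.length; r++) {
    if (rows[r].length != width) {
      throw ArgumentError(
          'row $r has length ${rows[r].length}, expected $width');
    }
    // Row r occupies positions [r * width, (r + 1) * width).
    flat.setRange(r * width, (r + 1) * width, rows[r]);
  }
  return flat;
}
```

With the outputs above, `stackBatch(outputs.map((o) => o.inputIds).toList())` would give a buffer suited to a tensor of shape `[outputs.length, outputs.first.length]`.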

6. Inspect raw token strings

tokenizer.tokenize('unaffable');
// → ['[CLS]', 'un', '##aff', '##able', '[SEP]']
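
When debugging, it can help to glue the pieces back into readable words. The sketch below does that by dropping special tokens and attaching every `##` continuation to the piece before it. `joinWordPieces` is a hypothetical helper for illustration and is not part of the package.

```dart
/// Hypothetical helper: merges WordPiece tokens back into plain words.
String joinWordPieces(
  List<String> tokens, {
  Set<String> special = const {'[CLS]', '[SEP]', '[PAD]'},
}) {
  final buffer = StringBuffer();
  for (final token in tokens) {
    if (special.contains(token)) continue; // markers carry no text
    if (token.startsWith('##')) {
      // Continuation piece: append directly to the previous piece.
      buffer.write(token.substring(2));
    } else {
      // New word: separate it from the previous one with a space.
      if (buffer.isNotEmpty) buffer.write(' ');
      buffer.write(token);
    }
  }
  return buffer.toString();
}
```

The result reflects the normalized text, so casing and punctuation removed before tokenization are not restored.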

7. Feed directly to an ONNX model

// package:onnxruntime integration
final out = tokenizer.encode(query);
final inputs = {
  'input_ids':      OrtValueTensor.createTensorWithDataList(out.inputIdsInt64,       [1, out.length]),
  'attention_mask': OrtValueTensor.createTensorWithDataList(out.attentionMaskInt64,  [1, out.length]),
  'token_type_ids': OrtValueTensor.createTensorWithDataList(out.tokenTypeIdsInt64,   [1, out.length]),
};
final results = session.run(OrtRunOptions(), inputs);

API reference

`WordPieceTokenizer`

Member	Description
`WordPieceTokenizer({vocab, config})`	Main constructor
`WordPieceTokenizer.fromFile(file, {config})`	Async factory from `dart:io` File
`WordPieceTokenizer.fromString(content, {config})`	Sync factory from String
`encode(text)` → `TokenizerOutput`	Encode single sequence
`encodePair(textA, textB)` → `TokenizerOutput`	Encode sentence pair
`encodeAll(texts)` → `List<TokenizerOutput>`	Batch encode
`tokenize(text)` → `List<String>`	Raw token strings (no padding)
`tokenToId(token)` → `int?`	Look up token ID
`idToToken(id)` → `String?`	Look up token string — O(1)
`vocabSize`	Number of tokens in vocabulary

`TokenizerOutput`

Member	Type	Description
`inputIds`	`List<int>`	Vocabulary IDs
`attentionMask`	`List<int>`	1 = real token, 0 = padding
`tokenTypeIds`	`List<int>`	0 = segment A, 1 = segment B
`length`	`int`	Always equals `maxLength`
`realLength`	`int`	Non-padding positions
`inputIdsInt64`	`Int64List`	Ready for ONNX tensor
`attentionMaskInt64`	`Int64List`	Ready for ONNX tensor
`tokenTypeIdsInt64`	`Int64List`	Ready for ONNX tensor

`TokenizerConfig`

Parameter	Default	Description
`maxLength`	`64`	Output sequence length (includes special tokens)
`specialTokens`	`SpecialTokens.bert()`	`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, `##`
`stopwords`	`{}`	Words removed before tokenization
`normalizeText`	`true`	Lowercase + remove punctuation

`VocabLoader`

Method	Description
`VocabLoader.fromFile(File)`	Async load from file
`VocabLoader.fromString(String)`	Sync parse from vocab text
`VocabLoader.fromMap(Map)`	Wrap pre-built map

WordPiece algorithm

Split text into whitespace-delimited words.
For each word, find the longest prefix that is present in the vocabulary.
Emit that prefix as a token; prepend ## to the remaining suffix and repeat on it.
If at any point no prefix of the remainder, down to a single character, is in the vocabulary, emit [UNK] for the whole word.

"unaffable"  →  ["un", "##aff", "##able"]
"playing"    →  ["play", "##ing"]
"xyz123"     →  ["[UNK]"]   (if none of the sub-strings are in vocab)

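The steps above can be sketched for a single word as follows. This is a minimal illustration of greedy longest-match-first matching, not the package's actual implementation. `wordPiece` is a hypothetical function, the vocabulary is passed as a plain set, and the string is indexed by UTF-16 code unit, which is a simplification for characters outside the Basic Multilingual Plane.

```dart
/// Hypothetical sketch: greedy longest-match-first WordPiece for one word.
List<String> wordPiece(
  String word,
  Set<String> vocab, {
  String unk = '[UNK]',
  String prefix = '##',
}) {
  final pieces = <String>[];
  var start = 0;
  while (start < word.length) {
    var end = word.length;
    String? match;
    // Shrink the candidate from the right until the vocabulary contains it.
    while (end > start) {
      final sub = word.substring(start, end);
      // Pieces after the first one carry the continuation prefix.
      final candidate = start == 0 ? sub : '$prefix$sub';
      if (vocab.contains(candidate)) {
        match = candidate;
        break;
      }
      end--;
    }
    // Nothing matched at this position: the whole word becomes [UNK].
    if (match == null) return [unk];
    pieces.add(match);
    start = end; // continue with the unmatched remainder
  }
  return pieces;
}
```

Each position can scan the whole remainder, so the worst case is quadratic in the word length. Many implementations also map very long words straight to [UNK]; whether this package applies such a cap is not stated here.
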
Where to get `vocab.txt`

Download pre-trained BERT vocabularies from HuggingFace:

Model	Link	Language	Size
BERT Base Uncased	vocab.txt	English	~30K tokens
BERT Base Cased	vocab.txt	English (cased)	~28K tokens
BERT Multilingual	vocab.txt	104 languages	~120K tokens
RuBERT (Russian)	vocab.txt	Russian	~119K tokens
DistilBERT	vocab.txt	English	~30K tokens

Quick download (in a terminal):

curl -L -o vocab.txt https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt

For Flutter apps, place vocab.txt in assets/ and update pubspec.yaml:

flutter:
  assets:
    - assets/vocab.txt

Multilingual support

The tokenizer works with any language — just use a multilingual vocab:

// Load multilingual BERT vocab
final raw = await rootBundle.loadString('assets/multilingual_vocab.txt');
final vocab = VocabLoader.fromString(raw);
final tokenizer = WordPieceTokenizer(vocab: vocab);

// Works with English, Chinese, Russian, Arabic, etc.
tokenizer.encode('Hello World');  // English
tokenizer.encode('你好世界');      // Chinese
tokenizer.encode('Привет мир');  // Russian
tokenizer.encode('مرحبا بالعالم');  // Arabic

The TextNormalizer uses Unicode-aware regex (\p{L}, \p{N}), so punctuation removal and letter detection are not limited to ASCII and also apply to non-Latin scripts such as Chinese, Cyrillic, and Arabic.
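
To illustrate what a Unicode-aware pattern looks like in Dart, here is a minimal normalization sketch. `normalize` is a hypothetical function and the exact pattern is an assumption, so the package's TextNormalizer may differ in its details. Dart only recognizes `\p{...}` when the RegExp is created with `unicode: true`.

```dart
/// Hypothetical sketch: lowercase, then replace everything that is not a
/// letter, a digit, or whitespace with a space, and collapse runs of spaces.
String normalize(String text) {
  // \p{L} = any letter, \p{N} = any number; both need unicode: true.
  final notLetterDigitSpace = RegExp(r'[^\p{L}\p{N}\s]', unicode: true);
  return text
      .toLowerCase()
      .replaceAll(notLetterDigitSpace, ' ')
      .replaceAll(RegExp(r'\s+'), ' ')
      .trim();
}
```

One caveat of this particular pattern: combining marks belong to \p{M}, not \p{L}, so this sketch would strip them from scripts that rely on them, such as Devanagari vowel signs or Arabic diacritics.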

Compatibility

Compatible with vocabularies from:

bert-base-uncased
bert-base-cased
distilbert-base-uncased
bert-base-multilingual-cased
cointegrated/rubert-tiny2
Any model that follows the HuggingFace vocab.txt format

Performance

Single tokenization: ~1–5 ms (on modern hardware)
Batch of 100 queries: ~50–200 ms
Memory footprint: Vocab size × ~8 bytes (e.g., 30K tokens = ~240 KB)
No external dependencies — pure Dart with optimized regex patterns
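
These figures depend on hardware, vocabulary size, and input length, so it is worth measuring on the target device. A minimal timing sketch using `Stopwatch` from dart:core follows. `averageMicros` is a hypothetical helper, and the warm-up count is an arbitrary choice.

```dart
/// Hypothetical helper: average wall-clock time of [body] in microseconds.
double averageMicros(
  void Function() body, {
  int runs = 100,
  int warmup = 10,
}) {
  // Untimed warm-up runs so JIT compilation does not skew the result.
  for (var i = 0; i < warmup; i++) {
    body();
  }
  final watch = Stopwatch()..start();
  for (var i = 0; i < runs; i++) {
    body();
  }
  watch.stop();
  return watch.elapsedMicroseconds / runs;
}
```

For example, `averageMicros(() => tokenizer.encode('What is Flutter?'))` would report the mean cost of one `encode` call for that input.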

Contributing

Issues and pull requests are welcome! Please:

Run dart analyze and ensure no warnings
Run dart test and ensure all tests pass
Follow the Dart style guide

License

MIT
