dart_sentencepiece_tokenizer 1.3.0
A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports the BPE (Gemma) and Unigram (Llama) algorithms.
Changelog #
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.3.0 - 2026-02-02 #
Added #
- Streaming API (HuggingFace TextStreamer compatible)
  - `BaseStreamer` - Abstract interface for streaming token decoders with `put()` and `end()` methods
  - `TextStreamer` - HuggingFace TextStreamer-compatible class for real-time LLM token decoding
    - `put(int tokenId)` - Add tokens as they are generated
    - `end()` - Signal end of generation and flush remaining content
    - `onFinalizedText` callback for custom text handling
    - `skipSpecialTokens` option to filter BOS/EOS/PAD tokens
    - `skipPrompt` option to skip initial prompt tokens
    - `promptLength` option to skip multiple prompt tokens
  - Word boundary heuristics for clean text emission (newlines, CJK, spaces)
- `SentencePieceTokenizer.createTextStreamer()` - Factory for `TextStreamer`
- `SentencePieceTokenizer.decodeStream()` - Stream-based token decoding
- `SentencePieceTokenizer.decodeWithCallback()` - Callback-based token decoding
Usage Examples #
TextStreamer (HuggingFace-compatible):
```dart
final streamer = tokenizer.createTextStreamer();
for (final id in llmOutput) {
  streamer.put(id);
}
streamer.end();
```

With a custom callback:

```dart
final streamer = tokenizer.createTextStreamer(
  onFinalizedText: (text, {required streamEnd}) {
    myTextController.append(text);
    if (streamEnd) myTextController.complete();
  },
);
```
Stream-based decoding:
```dart
final textStream = tokenizer.decodeStream(llmTokenStream);
await for (final chunk in textStream) {
  stdout.write(chunk);
}
```
Callback-based decoding:
```dart
tokenizer.decodeWithCallback(
  tokenIds,
  (chunk) => stdout.write(chunk),
);
```
1.2.2 - 2025-01-28 #
Changed #
- Extracted duplicate surrogate pair decoding logic in `Trie` into a shared `_decodeCodePoint` helper
- Cached computed `sequenceIds` in `Encoding` to avoid O(n) recomputation on repeated access
- Added a merge cache size limit (10,000 entries) to `BpeAlgorithm` and `BpeAlgorithmOptimized` to prevent unbounded memory growth
- Replaced manual loops with `fillRange` for padding initialization in `Encoding.withPadding()`
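The `fillRange` change can be illustrated with a standalone sketch (not the package's actual code, just the pattern it describes):

```dart
import 'dart:typed_data';

void main() {
  // Initialize the padded tail of a typed ID buffer with fillRange
  // instead of a manual for-loop.
  const padId = 2;
  final ids = Int32List(8);
  ids.setRange(0, 3, [5, 12, 99]); // pretend these are real token IDs
  ids.fillRange(3, ids.length, padId); // pad the tail in one call
  print(ids); // [5, 12, 99, 2, 2, 2, 2, 2]
}
```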
1.2.1 - 2025-01-28 #
Changed #
- Optimized batch `addTokens()` to use a single typed array allocation instead of per-token expansion (O(N) instead of O(N²))
- Added input validation and defensive error handling in JSON deserialization (`TokenizerJsonLoader`)
- Consolidated duplicate `_kMaxInputLength` constant declarations
1.2.0 - 2025-01-17 #
Added #
- JSON Serialization - HuggingFace-compatible tokenizer.json format
  - `toJson()` - Serialize tokenizer to JSON string
  - `saveToJson()` / `saveToJsonSync()` - Save to file
  - `TokenizerJsonLoader.fromJsonString()` - Load from JSON string
  - `TokenizerJsonLoader.fromJsonFile()` / `fromJsonFileSync()` - Load from file
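A minimal round-trip sketch using the serialization methods above. The method names come from this changelog; exact signatures and async behavior are assumptions:

```dart
import 'dart:io';

// Hypothetical round trip: serialize, persist, and reload a tokenizer.
Future<void> roundTrip(SentencePieceTokenizer tokenizer) async {
  // Serialize to a HuggingFace-style tokenizer.json string.
  final jsonString = tokenizer.toJson();

  // Persist to disk (saveToJson/saveToJsonSync also exist per the changelog).
  await File('tokenizer.json').writeAsString(jsonString);

  // Reload from the JSON string.
  final restored = TokenizerJsonLoader.fromJsonString(jsonString);
}
```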
- Dynamic Token Addition API
  - `addTokens(List<String>)` - Add new tokens to vocabulary
  - `addSpecialTokens(Map<String, String>)` - Add special tokens (pad, mask, etc.)
  - `getAddedVocab()` - Get map of dynamically added tokens
  - `isAddedToken(String)` - Check if token was added dynamically
  - `getVocab({withAddedTokens})` - Get full vocabulary as `Map<String, int>`
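A sketch of the dynamic-token API; argument shapes follow the signatures listed above, but the concrete token strings and the keys used for special tokens are illustrative assumptions:

```dart
void extendVocabulary(SentencePieceTokenizer tokenizer) {
  // Add regular tokens to the vocabulary.
  tokenizer.addTokens(['<issue_start>', '<issue_end>']);

  // Add special tokens keyed by role (e.g. pad, mask).
  tokenizer.addSpecialTokens({'pad': '<pad>', 'mask': '<mask>'});

  // Inspect what was added dynamically.
  final added = tokenizer.getAddedVocab();
  assert(tokenizer.isAddedToken('<pad>'));

  // Full vocabulary including the dynamic additions.
  final vocab = tokenizer.getVocab(withAddedTokens: true);
}
```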
- HuggingFace-compatible Methods
  - `tokenize(String)` - Returns `List<String>` of token strings
  - `tokenizeBatch(List<String>)` - Batch tokenization
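A short sketch contrasting `tokenize()` (token strings, as in HuggingFace) with `encode()` (full `Encoding` with IDs); the input text is illustrative:

```dart
void showTokens(SentencePieceTokenizer tokenizer) {
  // Token strings only, mirroring HuggingFace's tokenize().
  final tokens = tokenizer.tokenize('Hello world');

  // Batch variant over multiple inputs.
  final batched = tokenizer.tokenizeBatch(['Hello world', 'Goodbye world']);

  // encode() remains the path that produces token IDs.
  final encoding = tokenizer.encode('Hello world');
  assert(encoding.ids.length == tokens.length || encoding.ids.isNotEmpty);
}
```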
- Optimized BPE Algorithm (`BpeAlgorithmOptimized`)
  - O(n log n) complexity using a priority queue (heap)
  - ~35% faster than the original algorithm on medium-length text
Changed #
- `SpVocabulary` now uses a growable list to support dynamic token addition
1.1.0 - 2025-01-04 #
1.0.0 - 2025-01-03 #
Added #
- Initial release of dart_sentencepiece_tokenizer
- Pure Dart implementation with zero external dependencies
- Support for BPE (Byte Pair Encoding) algorithm used by Gemma models
- Support for Unigram algorithm used by Llama models
- Viterbi algorithm implementation for optimal Unigram segmentation
- Byte fallback support for handling unknown characters
- Unicode-aware Trie for efficient vocabulary lookup
- Memory-efficient typed arrays (Int32List, Uint8List) for encodings
Features #
- `SentencePieceTokenizer` - Main tokenizer class
  - `fromBytes()` - Load from protobuf bytes
  - `fromModelFile()` / `fromModelFileSync()` - Load from .model file
  - `encode()` - Encode single text
  - `encodeBatch()` - Encode multiple texts
  - `encodeBatchParallel()` - Parallel batch encoding using Isolates
  - `encodePair()` - Encode text pairs for sequence classification
  - `encodePairBatch()` - Batch encode text pairs
  - `decode()` / `decodeBatch()` - Decode token IDs back to text
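An end-to-end sketch of the core loading and encode/decode flow. It assumes a local SentencePiece `.model` file; the file path and the async loading style are assumptions:

```dart
Future<void> main() async {
  // Load a model file (fromModelFileSync() is the synchronous variant).
  final tokenizer =
      await SentencePieceTokenizer.fromModelFile('tokenizer.model');

  // Encode a single text to an Encoding.
  final encoding = tokenizer.encode('Hello world');
  print(encoding.ids);    // token IDs (Int32List)
  print(encoding.tokens); // token strings

  // Decode the IDs back to text.
  final text = tokenizer.decode(encoding.ids);
  print(text);
}
```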
- `Encoding` class with:
  - `ids` - Token IDs (Int32List)
  - `tokens` - Token strings
  - `typeIds` - Segment type IDs (Uint8List)
  - `attentionMask` - Attention mask (Uint8List)
  - `specialTokensMask` - Special token indicators (Uint8List)
  - `offsets` - Character offsets for each token
  - `withPadding()` / `withTruncation()` - Post-processing methods
  - `truncatePair()` - Static method for pair truncation
- Predefined configurations:
  - `SentencePieceConfig.llama` - Llama-style (BOS only)
  - `SentencePieceConfig.gemma` - Gemma-style (BOS + EOS)
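A sketch of selecting a predefined configuration at load time; whether the loader accepts a `config` named parameter is an assumption, as are the file names:

```dart
// Llama-style: prepend BOS only.
final llamaTokenizer = await SentencePieceTokenizer.fromModelFile(
  'llama.model',
  config: SentencePieceConfig.llama,
);

// Gemma-style: prepend BOS and append EOS.
final gemmaTokenizer = await SentencePieceTokenizer.fromModelFile(
  'gemma.model',
  config: SentencePieceConfig.gemma,
);
```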
- Truncation strategies:
  - `longestFirst` - Truncate the longer sequence first
  - `onlyFirst` - Only truncate the first sequence
  - `onlySecond` - Only truncate the second sequence
  - `doNotTruncate` - No truncation
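A sketch of applying a truncation strategy to an encoded pair; the parameter names (`maxLength`, `strategy`) and the `TruncationStrategy` enum name are assumptions based on the values listed above:

```dart
// Encode a sequence-classification pair, then cap its length,
// trimming the longer sequence first.
final encoding = tokenizer.encodePair('premise text', 'hypothesis text');
final truncated = encoding.withTruncation(
  maxLength: 128,
  strategy: TruncationStrategy.longestFirst,
);
```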
- Padding options:
  - Left/right padding direction
  - Fixed length or pad to longest
  - Pad to multiple of N
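A sketch combining the padding options above on an `Encoding`; the parameter names and the direction value are assumptions:

```dart
// Pad to a fixed length of 128, from the left, rounding the
// final length up to a multiple of 8.
final padded = encoding.withPadding(
  length: 128,
  direction: 'left',
  padToMultipleOf: 8,
);
```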
Performance #
- Efficient Trie-based vocabulary lookup
- Memory-optimized typed arrays reduce memory usage by ~78%
- Parallel batch processing with configurable chunk size
- Lazy evaluation where possible
Compatibility #
- Dart SDK 3.10.7+
- Compatible with Llama, Gemma, and other SentencePiece models
- HuggingFace-compatible API design