dart_sentencepiece_tokenizer 1.1.0
A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports the BPE (Gemma) and Unigram (Llama) algorithms.
Changelog #
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.1.0 - 2025-01-04 #
Changed #
- BPE Algorithm Optimization: Replaced list-based merge operations with doubly-linked list structure for O(1) merge operations
- Merge Cache: Added caching for pair lookups to avoid redundant string concatenation and vocabulary checks
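The linked-list optimization above can be illustrated with a self-contained sketch (illustrative only, not the package's actual source; TokenNode and mergeWithNext are hypothetical names): merging two adjacent tokens becomes O(1) pointer surgery instead of rebuilding a list on every merge.

```dart
// Illustrative sketch of O(1) token merging in a doubly-linked list.
class TokenNode {
  String piece;
  TokenNode? prev;
  TokenNode? next;
  TokenNode(this.piece);
}

/// Merges [node] with its successor in place and returns the merged node.
/// Only a constant number of pointers change, regardless of list length.
TokenNode mergeWithNext(TokenNode node) {
  final next = node.next!;
  node.piece = node.piece + next.piece;
  node.next = next.next;
  next.next?.prev = node;
  return node;
}

void main() {
  // Build the list: "h" <-> "e" <-> "llo"
  final h = TokenNode('h');
  final e = TokenNode('e');
  final llo = TokenNode('llo');
  h.next = e;
  e.prev = h;
  e.next = llo;
  llo.prev = e;

  mergeWithNext(h); // "he" <-> "llo"
  mergeWithNext(h); // "hello"
  print(h.piece); // hello
  print(h.next == null); // true
}
```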
Added #
- Input Size Validation: Added maximum input length limit (500,000 characters) to prevent OOM errors
- encode() and encodePair() now throw ArgumentError for oversized inputs
- Validation applies to all encoding methods, including batch operations
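The guard described above can be sketched roughly as follows (the constant and function names are illustrative, not the package's actual identifiers):

```dart
// Sketch of the input-size guard: encoding methods reject inputs longer
// than 500,000 characters with an ArgumentError instead of risking OOM.
const int maxInputLength = 500000; // limit described in this release

void checkInputLength(String text) {
  if (text.length > maxInputLength) {
    throw ArgumentError.value(text.length, 'text',
        'exceeds maximum input length of $maxInputLength characters');
  }
}

void main() {
  checkInputLength('short input'); // passes silently
  try {
    checkInputLength('a' * (maxInputLength + 1));
  } on ArgumentError catch (e) {
    print('rejected: ${e.message}');
  }
}
```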
Removed #
- Removed unused TokenResult class from tokenization_algorithm.dart
Performance #
- BPE tokenization now uses O(1) node merging instead of O(n) list reconstruction
- Merge pair lookups are cached, reducing redundant vocabulary checks
1.0.0 - 2025-01-04 #
Added #
- Initial release of dart_sentencepiece_tokenizer
- Pure Dart implementation with zero external dependencies
- Support for BPE (Byte Pair Encoding) algorithm used by Gemma models
- Support for Unigram algorithm used by Llama models
- Viterbi algorithm implementation for optimal Unigram segmentation
- Byte fallback support for handling unknown characters
- Unicode-aware Trie for efficient vocabulary lookup
- Memory-efficient typed arrays (Int32List, Uint8List) for encodings
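As one example of how Unigram segmentation works, here is a minimal, self-contained Viterbi sketch over a toy vocabulary (illustrative only; the package's internal implementation may differ): it picks the piece sequence with the highest total log-probability.

```dart
// Minimal Viterbi segmentation sketch for the Unigram algorithm:
// best[i] holds the best score for the prefix text[0..i), and
// backPiece[i] the last piece on that best path.
List<String> viterbiSegment(String text, Map<String, double> logProbs) {
  final n = text.length;
  final best = List<double>.filled(n + 1, double.negativeInfinity);
  final backPiece = List<String?>.filled(n + 1, null);
  best[0] = 0.0;
  for (var end = 1; end <= n; end++) {
    for (var start = 0; start < end; start++) {
      final piece = text.substring(start, end);
      final lp = logProbs[piece];
      if (lp == null) continue; // piece not in vocabulary
      final score = best[start] + lp;
      if (score > best[end]) {
        best[end] = score;
        backPiece[end] = piece;
      }
    }
  }
  // Walk the backpointers to recover the best segmentation.
  final pieces = <String>[];
  var pos = n;
  while (pos > 0) {
    final piece = backPiece[pos]!;
    pieces.insert(0, piece);
    pos -= piece.length;
  }
  return pieces;
}

void main() {
  final vocab = {
    'un': -2.0, 'i': -3.0, 'gram': -2.5,
    'u': -4.0, 'n': -4.0, 'ig': -5.0, 'ram': -4.5,
  };
  print(viterbiSegment('unigram', vocab)); // [un, i, gram]
}
```

A real tokenizer additionally falls back to byte pieces when no vocabulary entry covers a character, which is what the byte fallback bullet above refers to.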
Features #
- SentencePieceTokenizer - Main tokenizer class
  - fromBytes() - Load from protobuf bytes
  - fromModelFile() / fromModelFileSync() - Load from .model file
  - encode() - Encode single text
  - encodeBatch() - Encode multiple texts
  - encodeBatchParallel() - Parallel batch encoding using Isolates
  - encodePair() - Encode text pairs for sequence classification
  - encodePairBatch() - Batch encode text pairs
  - decode() / decodeBatch() - Decode token IDs back to text
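A typical usage flow, pieced together from the method names above (the import path, the model filename, the config parameter, and the exact signatures are assumptions, not verified against the package):

```dart
// Hypothetical usage sketch based on the documented API surface.
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  // Load a SentencePiece model file (path and config argument are examples).
  final tokenizer = await SentencePieceTokenizer.fromModelFile(
    'tokenizer.model',
    config: SentencePieceConfig.gemma,
  );

  // Single text.
  final encoding = tokenizer.encode('Hello, world!');
  print(encoding.ids);

  // Batch encoding; encodeBatchParallel() fans work out over Isolates.
  final batch = tokenizer.encodeBatch(['first text', 'second text']);
  print(batch.length);

  // Round-trip token IDs back to text.
  print(tokenizer.decode(encoding.ids));
}
```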
- Encoding class with:
  - ids - Token IDs (Int32List)
  - tokens - Token strings
  - typeIds - Segment type IDs (Uint8List)
  - attentionMask - Attention mask (Uint8List)
  - specialTokensMask - Special token indicators (Uint8List)
  - offsets - Character offsets for each token
  - withPadding() / withTruncation() - Post-processing methods
  - truncatePair() - Static method for pair truncation
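The typed-array layout of those fields can be illustrated with a self-contained sketch (a simplified stand-in, not the package's actual Encoding class): one 32-bit slot per token ID and one byte per mask entry, instead of boxed integers.

```dart
import 'dart:typed_data';

// Simplified stand-in for the Encoding fields listed above.
class EncodingSketch {
  final Int32List ids; // 4 bytes per token ID
  final Uint8List attentionMask; // 1 byte per mask entry
  EncodingSketch(List<int> tokenIds)
      : ids = Int32List.fromList(tokenIds),
        attentionMask = Uint8List.fromList(List.filled(tokenIds.length, 1));
}

void main() {
  final enc = EncodingSketch([2, 9606, 1185, 3]);
  print(enc.ids); // [2, 9606, 1185, 3]
  print(enc.attentionMask); // [1, 1, 1, 1]
}
```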
- Predefined configurations:
  - SentencePieceConfig.llama - Llama-style (BOS only)
  - SentencePieceConfig.gemma - Gemma-style (BOS + EOS)
- Truncation strategies:
  - longestFirst - Truncate the longer sequence first
  - onlyFirst - Only truncate the first sequence
  - onlySecond - Only truncate the second sequence
  - doNotTruncate - No truncation
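The longestFirst strategy can be sketched in isolation (illustrative, not the package's source): repeatedly trim whichever sequence is currently longer until the combined length fits.

```dart
/// Trims [a] and [b] to a combined [maxLength], always shortening the
/// currently longer sequence first (the longestFirst strategy above).
(List<int>, List<int>) truncateLongestFirst(
    List<int> a, List<int> b, int maxLength) {
  final x = List<int>.from(a);
  final y = List<int>.from(b);
  while (x.length + y.length > maxLength) {
    if (x.length >= y.length) {
      x.removeLast();
    } else {
      y.removeLast();
    }
  }
  return (x, y);
}

void main() {
  final (a, b) = truncateLongestFirst([1, 2, 3, 4, 5], [6, 7], 5);
  print(a); // [1, 2, 3]
  print(b); // [6, 7]
}
```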
- Padding options:
- Left/right padding direction
- Fixed length or pad to longest
- Pad to multiple of N
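Pad-to-multiple-of-N (useful for hardware-friendly tensor shapes) combined with a left/right direction can be sketched as follows (function and parameter names are illustrative, not the package's API):

```dart
/// Pads [ids] with [padId] up to the next multiple of [multiple],
/// on the left or right (a sketch of the padding options above).
List<int> padToMultiple(List<int> ids, int multiple, int padId,
    {bool padLeft = false}) {
  final remainder = ids.length % multiple;
  if (remainder == 0) return List<int>.from(ids); // already aligned
  final padding = List<int>.filled(multiple - remainder, padId);
  return padLeft ? [...padding, ...ids] : [...ids, ...padding];
}

void main() {
  print(padToMultiple([1, 2, 3, 4, 5], 8, 0)); // [1, 2, 3, 4, 5, 0, 0, 0]
  print(padToMultiple([1, 2, 3], 4, 0, padLeft: true)); // [0, 1, 2, 3]
}
```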
Performance #
- Efficient Trie-based vocabulary lookup
- Memory-optimized typed arrays reduce memory usage by ~78%
- Parallel batch processing with configurable chunk size
- Lazy evaluation where possible
Compatibility #
- Dart SDK 3.10.7+
- Compatible with Llama, Gemma, and other SentencePiece models
- HuggingFace-compatible API design