dart_sentencepiece_tokenizer 1.2.0
A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports the BPE (Gemma) and Unigram (Llama) algorithms.
# Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## 1.2.0 - 2025-01-17
### Added
- **JSON Serialization** - HuggingFace-compatible `tokenizer.json` format
  - `toJson()` - Serialize tokenizer to a JSON string
  - `saveToJson()` / `saveToJsonSync()` - Save to file
  - `TokenizerJsonLoader.fromJsonString()` - Load from a JSON string
  - `TokenizerJsonLoader.fromJsonFile()` / `fromJsonFileSync()` - Load from file
- **Dynamic Token Addition API**
  - `addTokens(List<String>)` - Add new tokens to the vocabulary
  - `addSpecialTokens(Map<String, String>)` - Add special tokens (pad, mask, etc.)
  - `getAddedVocab()` - Get a map of dynamically added tokens
  - `isAddedToken(String)` - Check whether a token was added dynamically
  - `getVocab({withAddedTokens})` - Get the full vocabulary as a `Map<String, int>`
- **HuggingFace-compatible Methods**
  - `tokenize(String)` - Returns a `List<String>` of token strings
  - `tokenizeBatch(List<String>)` - Batch tokenization
- **Optimized BPE Algorithm** (`BpeAlgorithmOptimized`)
  - O(n log n) complexity using a priority queue (heap)
  - ~35% faster than the original algorithm on medium-length text
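A minimal sketch of the new serialization and token-addition calls, using only the method names listed above. The model and JSON paths are illustrative, and the assumption that `TokenizerJsonLoader.fromJsonFileSync()` returns a ready-to-use tokenizer is mine, not confirmed by this changelog:

```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

void main() {
  // Load an existing SentencePiece model (path is illustrative).
  final tokenizer = SentencePieceTokenizer.fromModelFileSync('tokenizer.model');

  // Add domain-specific tokens and special tokens at runtime.
  tokenizer.addTokens(['<code>', '</code>']);
  tokenizer.addSpecialTokens({'pad_token': '<pad>', 'mask_token': '<mask>'});

  // Round-trip through the HuggingFace-compatible JSON format.
  tokenizer.saveToJsonSync('tokenizer.json');
  final restored = TokenizerJsonLoader.fromJsonFileSync('tokenizer.json');

  // Dynamically added tokens should survive the round trip.
  print(restored.getAddedVocab());
}
```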
### Changed

- `SpVocabulary` now uses a growable list to support dynamic token addition
## 1.1.0 - 2025-01-04

## 1.0.0 - 2025-01-03

### Added
- Initial release of `dart_sentencepiece_tokenizer`
- Pure Dart implementation with zero external dependencies
- Support for the BPE (Byte Pair Encoding) algorithm used by Gemma models
- Support for the Unigram algorithm used by Llama models
- Viterbi algorithm implementation for optimal Unigram segmentation
- Byte fallback support for handling unknown characters
- Unicode-aware Trie for efficient vocabulary lookup
- Memory-efficient typed arrays (`Int32List`, `Uint8List`) for encodings
### Features
- `SentencePieceTokenizer` - Main tokenizer class
  - `fromBytes()` - Load from protobuf bytes
  - `fromModelFile()` / `fromModelFileSync()` - Load from a `.model` file
  - `encode()` - Encode a single text
  - `encodeBatch()` - Encode multiple texts
  - `encodeBatchParallel()` - Parallel batch encoding using Isolates
  - `encodePair()` - Encode text pairs for sequence classification
  - `encodePairBatch()` - Batch encode text pairs
  - `decode()` / `decodeBatch()` - Decode token IDs back to text
- `Encoding` class with:
  - `ids` - Token IDs (`Int32List`)
  - `tokens` - Token strings
  - `typeIds` - Segment type IDs (`Uint8List`)
  - `attentionMask` - Attention mask (`Uint8List`)
  - `specialTokensMask` - Special token indicators (`Uint8List`)
  - `offsets` - Character offsets for each token
  - `withPadding()` / `withTruncation()` - Post-processing methods
  - `truncatePair()` - Static method for pair truncation
- Predefined configurations:
  - `SentencePieceConfig.llama` - Llama-style (BOS only)
  - `SentencePieceConfig.gemma` - Gemma-style (BOS + EOS)
- Truncation strategies:
  - `longestFirst` - Truncate the longer sequence first
  - `onlyFirst` - Only truncate the first sequence
  - `onlySecond` - Only truncate the second sequence
  - `doNotTruncate` - No truncation
- Padding options:
  - Left/right padding direction
  - Fixed length or pad to longest
  - Pad to a multiple of N
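A sketch of the core encode/decode flow built from the API listed above. The model path and input text are illustrative, and the `config:` and `length:` parameter names are assumptions on my part rather than confirmed signatures:

```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

void main() async {
  // Load a Llama-style .model file with the BOS-only predefined config.
  final tokenizer = await SentencePieceTokenizer.fromModelFile(
    'llama.model',
    config: SentencePieceConfig.llama, // parameter name is an assumption
  );

  // Encode a single text; Encoding exposes typed-array fields.
  final enc = tokenizer.encode('Hello, world!');
  print(enc.ids);           // Int32List of token IDs
  print(enc.attentionMask); // Uint8List attention mask

  // Post-process to a fixed length, then decode back to text.
  final padded = enc.withPadding(length: 16); // parameter name is an assumption
  print(padded.ids.length);
  print(tokenizer.decode(enc.ids));
}
```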
### Performance
- Efficient Trie-based vocabulary lookup
- Memory-optimized typed arrays reduce memory usage by ~78%
- Parallel batch processing with configurable chunk size
- Lazy evaluation where possible
### Compatibility
- Dart SDK 3.0.0+
- Compatible with Llama, Gemma, and other SentencePiece models
- HuggingFace-compatible API design