dart_sentencepiece_tokenizer 1.0.0
dart_sentencepiece_tokenizer: ^1.0.0
A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports the BPE (Gemma) and Unigram (Llama) algorithms.
Changelog #
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.0.0 - 2024-01-03 #
Added #
- Initial release of dart_sentencepiece_tokenizer
- Pure Dart implementation with zero external dependencies
- Support for the BPE (Byte Pair Encoding) algorithm used by Gemma models
- Support for the Unigram algorithm used by Llama models
- Viterbi algorithm implementation for optimal Unigram segmentation
- Byte fallback support for handling unknown characters
- Unicode-aware Trie for efficient vocabulary lookup
- Memory-efficient typed arrays (Int32List, Uint8List) for encodings
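A minimal usage sketch for this release. The method names come from the feature list below; the import path, constructor arguments, and return types shown here are assumptions rather than verified API:

```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  // Load a .model file; whether BPE or Unigram is used is defined by the model itself.
  final tokenizer = await SentencePieceTokenizer.fromModelFile('tokenizer.model');

  // Encode one string and inspect the result.
  final encoding = tokenizer.encode('Hello, world!');
  print(encoding.ids);    // token IDs (Int32List)
  print(encoding.tokens); // token strings

  // Round-trip the IDs back to text.
  print(tokenizer.decode(encoding.ids));
}
```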
Features #
- `SentencePieceTokenizer` - Main tokenizer class (see the combined usage sketch after this list)
  - `fromBytes()` - Load from protobuf bytes
  - `fromModelFile()` / `fromModelFileSync()` - Load from .model file
  - `encode()` - Encode single text
  - `encodeBatch()` - Encode multiple texts
  - `encodeBatchParallel()` - Parallel batch encoding using Isolates
  - `encodePair()` - Encode text pairs for sequence classification
  - `encodePairBatch()` - Batch encode text pairs
  - `decode()` / `decodeBatch()` - Decode token IDs back to text
- `Encoding` class with:
  - `ids` - Token IDs (Int32List)
  - `tokens` - Token strings
  - `typeIds` - Segment type IDs (Uint8List)
  - `attentionMask` - Attention mask (Uint8List)
  - `specialTokensMask` - Special token indicators (Uint8List)
  - `offsets` - Character offsets for each token
  - `withPadding()` / `withTruncation()` - Post-processing methods
  - `truncatePair()` - Static method for pair truncation
- Predefined configurations:
  - `SentencePieceConfig.llama` - Llama-style (BOS only)
  - `SentencePieceConfig.gemma` - Gemma-style (BOS + EOS)
- Truncation strategies:
  - `longestFirst` - Truncate the longer sequence first
  - `onlyFirst` - Only truncate the first sequence
  - `onlySecond` - Only truncate the second sequence
  - `doNotTruncate` - No truncation
- Padding options:
  - Left/right padding direction
  - Fixed length or pad to longest
  - Pad to a multiple of N
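The sketch below strings together the features above: a predefined configuration, pair encoding, truncation, and padding. The named parameters (`config`, `maxLength`, `strategy`, `length`, `padId`) and the `TruncationStrategy` enum name are illustrative assumptions, not confirmed signatures:

```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  // Gemma-style post-processing (BOS + EOS) via a predefined configuration.
  final tokenizer = await SentencePieceTokenizer.fromModelFile(
    'tokenizer.model',
    config: SentencePieceConfig.gemma, // or SentencePieceConfig.llama (BOS only)
  );

  // Encode a text pair for sequence classification; typeIds marks the segments.
  final pair = tokenizer.encodePair('A question?', 'A possible answer.');
  print(pair.typeIds);            // Uint8List: 0 for the first segment, 1 for the second
  print(pair.specialTokensMask);  // Uint8List: 1 where a special token was inserted

  // Post-processing: truncate to a fixed budget, then pad to the same length.
  final processed = pair
      .withTruncation(maxLength: 16, strategy: TruncationStrategy.longestFirst)
      .withPadding(length: 16, padId: 0);
  print(processed.ids);            // Int32List, exactly 16 entries
  print(processed.attentionMask);  // Uint8List: 1 for real tokens, 0 for padding
}
```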
Performance #
- Efficient Trie-based vocabulary lookup
- Memory-optimized typed arrays reduce memory usage by ~78%
- Parallel batch processing with configurable chunk size
- Lazy evaluation where possible
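A sketch of parallel batch encoding; the `chunkSize` parameter name is an assumption based on "configurable chunk size" above:

```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  final tokenizer = await SentencePieceTokenizer.fromModelFile('tokenizer.model');

  final texts = List.generate(10000, (i) => 'Document number $i');

  // Work is split into chunks, and each chunk is encoded on a separate isolate.
  final encodings = await tokenizer.encodeBatchParallel(texts, chunkSize: 512);
  print(encodings.length); // 10000
}
```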
Compatibility #
- Dart SDK 3.0.0+
- Compatible with Llama, Gemma, and other SentencePiece models
- HuggingFace-compatible API design