dart_sentencepiece_tokenizer 1.2.0

A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

1.2.0 - 2025-01-17

Added

  • JSON Serialization - HuggingFace-compatible tokenizer.json format

    • toJson() - Serialize tokenizer to JSON string
    • saveToJson() / saveToJsonSync() - Save to file
    • TokenizerJsonLoader.fromJsonString() - Load from JSON string
    • TokenizerJsonLoader.fromJsonFile() / fromJsonFileSync() - Load from file
  • Dynamic Token Addition API

    • addTokens(List<String>) - Add new tokens to vocabulary
    • addSpecialTokens(Map<String, String>) - Add special tokens (pad, mask, etc.)
    • getAddedVocab() - Get map of dynamically added tokens
    • isAddedToken(String) - Check if token was added dynamically
    • getVocab({withAddedTokens}) - Get full vocabulary as Map<String, int>
  • HuggingFace-compatible Methods

    • tokenize(String) - Returns List<String> of token strings
    • tokenizeBatch(List<String>) - Batch tokenization
  • Optimized BPE Algorithm (BpeAlgorithmOptimized)

    • O(n log n) complexity using priority queue (heap)
    • ~35% faster than the original algorithm on medium-length text
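A usage sketch combining the new serialization and token-addition APIs. The method names come from the list above; the import path, async signatures, and `saveToJson` taking a file path are assumptions about the published API:

```dart
// Hypothetical usage of the 1.2.0 additions. Method names come from the
// changelog; the import path and exact signatures are assumptions.
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  final tokenizer = await SentencePieceTokenizer.fromModelFile('tokenizer.model');

  // Extend the vocabulary at runtime.
  tokenizer.addTokens(['<custom>']);
  tokenizer.addSpecialTokens({'pad_token': '<pad>'});
  print(tokenizer.getAddedVocab());          // map of dynamically added tokens
  print(tokenizer.isAddedToken('<custom>')); // true for tokens added above

  // Round-trip through the HuggingFace-compatible tokenizer.json format.
  await tokenizer.saveToJson('tokenizer.json');
  final restored = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');
  print(restored.getVocab(withAddedTokens: true).length);
}
```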

Changed

  • SpVocabulary now uses growable list for dynamic token addition support

1.1.0 - 2025-01-04

Added

  • Input length validation (max 500,000 characters) to prevent OOM
  • Example usage file (example/example.dart)
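The length guard can be sketched as below; the exact exception type and error wording used by the package are assumptions here, only the 500,000-character limit is documented:

```dart
// Illustrative input guard matching the documented 500,000-character limit.
// The exception type thrown by the actual package is an assumption.
const maxInputLength = 500000;

void validateInput(String text) {
  if (text.length > maxInputLength) {
    throw ArgumentError(
      'Input length ${text.length} exceeds the maximum of '
      '$maxInputLength characters',
    );
  }
}
```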

Changed

  • Improved BPE algorithm efficiency
  • Enhanced error messages for input validation

1.0.0 - 2025-01-03

Added

  • Initial release of dart_sentencepiece_tokenizer
  • Pure Dart implementation with zero external dependencies
  • Support for BPE (Byte Pair Encoding) algorithm used by Gemma models
  • Support for Unigram algorithm used by Llama models
  • Viterbi algorithm implementation for optimal Unigram segmentation
  • Byte fallback support for handling unknown characters
  • Unicode-aware Trie for efficient vocabulary lookup
  • Memory-efficient typed arrays (Int32List, Uint8List) for encodings
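The Viterbi search behind Unigram segmentation can be illustrated in plain Dart: score every vocabulary piece ending at each position, keep the best-scoring path, then walk back-pointers to recover the pieces. This toy sketch with a made-up vocabulary is not the package's internal implementation:

```dart
// Simplified Viterbi segmentation over a toy unigram vocabulary.
// Not the package's internals — an illustration of the algorithm only.
List<String> viterbiSegment(String text, Map<String, double> logProb) {
  final n = text.length;
  final best = List<double>.filled(n + 1, double.negativeInfinity);
  final backPointer = List<int>.filled(n + 1, -1);
  best[0] = 0.0;
  for (var end = 1; end <= n; end++) {
    for (var start = 0; start < end; start++) {
      final piece = text.substring(start, end);
      final lp = logProb[piece];
      if (lp == null) continue; // piece not in vocabulary
      final score = best[start] + lp;
      if (score > best[end]) {
        best[end] = score;
        backPointer[end] = start;
      }
    }
  }
  if (best[n] == double.negativeInfinity) return []; // text not coverable
  // Walk back-pointers to recover the best segmentation.
  final pieces = <String>[];
  var pos = n;
  while (pos > 0) {
    final start = backPointer[pos];
    pieces.insert(0, text.substring(start, pos));
    pos = start;
  }
  return pieces;
}

void main() {
  final vocab = {'un': -2.0, 'i': -3.0, 'gram': -2.5, 'unigram': -9.0,
                 'u': -5.0, 'n': -5.0};
  print(viterbiSegment('unigram', vocab)); // [un, i, gram]
}
```

The real tokenizer uses a Trie so that candidate pieces at each position are enumerated without the O(n²) substring scan shown here.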

Features

  • SentencePieceTokenizer - Main tokenizer class

    • fromBytes() - Load from protobuf bytes
    • fromModelFile() / fromModelFileSync() - Load from .model file
    • encode() - Encode single text
    • encodeBatch() - Encode multiple texts
    • encodeBatchParallel() - Parallel batch encoding using Isolates
    • encodePair() - Encode text pairs for sequence classification
    • encodePairBatch() - Batch encode text pairs
    • decode() / decodeBatch() - Decode token IDs back to text
  • Encoding class with:

    • ids - Token IDs (Int32List)
    • tokens - Token strings
    • typeIds - Segment type IDs (Uint8List)
    • attentionMask - Attention mask (Uint8List)
    • specialTokensMask - Special token indicators (Uint8List)
    • offsets - Character offsets for each token
    • withPadding() / withTruncation() - Post-processing methods
    • truncatePair() - Static method for pair truncation
  • Predefined configurations:

    • SentencePieceConfig.llama - Llama-style (BOS only)
    • SentencePieceConfig.gemma - Gemma-style (BOS + EOS)
  • Truncation strategies:

    • longestFirst - Truncate longer sequence first
    • onlyFirst - Only truncate first sequence
    • onlySecond - Only truncate second sequence
    • doNotTruncate - No truncation
  • Padding options:

    • Left/right padding direction
    • Fixed length or pad to longest
    • Pad to multiple of N
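A sketch of the core encoding API, using the class and method names listed above; the named parameters (`config:`, `length:`) and the import path are assumptions about the exact signatures:

```dart
// Hypothetical usage of the encoding API. Class and method names come from
// the feature list; the named parameters used here are assumptions.
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  // Load a Llama-style model (BOS token only).
  final tokenizer = await SentencePieceTokenizer.fromModelFile(
    'tokenizer.model',
    config: SentencePieceConfig.llama,
  );

  final encoding = tokenizer.encode('Hello, world!');
  print(encoding.ids);           // Int32List of token IDs
  print(encoding.attentionMask); // Uint8List, 1 for each real token

  // Post-processing: pad on the right to a fixed length (assumed signature).
  final padded = encoding.withPadding(length: 16);
  print(padded.ids.length);

  // Batch variants; encodeBatchParallel fans work out across Isolates.
  final batch = tokenizer.encodeBatch(['first text', 'second text']);
  final parallel = await tokenizer.encodeBatchParallel(
    List<String>.filled(1000, 'some text'),
  );
  print('${batch.length} / ${parallel.length}');
}
```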

Performance

  • Efficient Trie-based vocabulary lookup
  • Memory-optimized typed arrays reduce memory usage by ~78%
  • Parallel batch processing with configurable chunk size
  • Lazy evaluation where possible

Compatibility

  • Dart SDK 3.0.0+
  • Compatible with Llama, Gemma, and other SentencePiece models
  • HuggingFace-compatible API design
Publisher

verified publisher brodykim.work

Topics

#nlp #sentencepiece #tokenizer #machine-learning #llm

License

unknown
