dart_sentencepiece_tokenizer 1.1.0

A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports the BPE (Gemma) and Unigram (Llama) algorithms.

Changelog #

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

1.1.0 - 2025-01-04 #

Changed #

  • BPE Algorithm Optimization: Replaced list-based merge operations with doubly-linked list structure for O(1) merge operations
  • Merge Cache: Added caching for pair lookups to avoid redundant string concatenation and vocabulary checks
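The linked-list idea can be sketched in a few lines of self-contained Dart. This is an illustration of the data structure, not the package's internals; `_Node` and `bpeMerge` are invented names, and a production version would track candidate pairs in a priority queue (and cache pair ranks, as noted above) rather than rescanning the list for the best rank:

```dart
// Sketch: greedy BPE over a doubly-linked list of symbol nodes.
class _Node {
  String piece;
  _Node? prev, next;
  _Node(this.piece);
}

/// Merges adjacent pairs in [symbols] greedily, lowest rank first,
/// using [ranks] as the merge table (lower rank = merged earlier).
List<String> bpeMerge(List<String> symbols, Map<String, int> ranks) {
  if (symbols.isEmpty) return [];
  // Build the doubly-linked list.
  final head = _Node(symbols.first);
  var tail = head;
  for (final s in symbols.skip(1)) {
    final n = _Node(s)..prev = tail;
    tail.next = n;
    tail = n;
  }
  const noRank = 1 << 30;
  while (true) {
    // Find the best-ranked adjacent pair (a real impl avoids this scan).
    _Node? best;
    var bestRank = noRank;
    for (_Node? n = head; n?.next != null; n = n!.next) {
      final rank = ranks['${n!.piece}${n.next!.piece}'] ?? noRank;
      if (rank < bestRank) {
        bestRank = rank;
        best = n;
      }
    }
    if (best == null) break;
    // O(1) splice: absorb the right node into the left one.
    best.piece = '${best.piece}${best.next!.piece}';
    best.next = best.next!.next;
    best.next?.prev = best;
  }
  final out = <String>[];
  for (_Node? n = head; n != null; n = n.next) {
    out.add(n.piece);
  }
  return out;
}
```

The key property is the splice: applying a merge mutates two pointers and one string, so the surrounding sequence is never rebuilt, unlike a list-based approach that reconstructs the token list on every merge.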

Added #

  • Input Size Validation: Added maximum input length limit (500,000 characters) to prevent OOM errors
    • encode() and encodePair() now throw ArgumentError for oversized inputs
    • Validation applies to all encoding methods including batch operations
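A minimal sketch of such a guard, assuming the 500,000-character limit described above (the constant name, function name, and error message here are hypothetical, not the package's actual code):

```dart
// Hypothetical input-length guard mirroring the behavior described above.
const kMaxInputLength = 500000;

void checkInputLength(String text) {
  if (text.length > kMaxInputLength) {
    throw ArgumentError(
      'Input length ${text.length} exceeds maximum of '
      '$kMaxInputLength characters',
    );
  }
}
```

Callers feeding untrusted or very long text can pre-check lengths or wrap encode calls in a `try`/`on ArgumentError` block.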

Removed #

  • Removed unused TokenResult class from tokenization_algorithm.dart

Performance #

  • BPE tokenization now uses O(1) node merging instead of O(n) list reconstruction
  • Merge pair lookups are cached, reducing redundant vocabulary checks

1.0.0 - 2025-01-04 #

Added #

  • Initial release of dart_sentencepiece_tokenizer
  • Pure Dart implementation with zero external dependencies
  • Support for BPE (Byte Pair Encoding) algorithm used by Gemma models
  • Support for Unigram algorithm used by Llama models
  • Viterbi algorithm implementation for optimal Unigram segmentation
  • Byte fallback support for handling unknown characters
  • Unicode-aware Trie for efficient vocabulary lookup
  • Memory-efficient typed arrays (Int32List, Uint8List) for encodings
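For readers unfamiliar with Viterbi segmentation, the idea can be sketched in self-contained Dart: dynamic programming over end positions, keeping the best-scoring split of each prefix. This is illustrative only; `viterbiSegment` is an invented name, and the real implementation enumerates candidate pieces through its Trie instead of trying every substring:

```dart
/// Finds the split of [text] into vocabulary pieces that maximizes the
/// sum of piece log-probabilities (the Unigram objective). Sketch only.
List<String> viterbiSegment(String text, Map<String, double> logProbs) {
  final n = text.length;
  final bestScore = List<double>.filled(n + 1, double.negativeInfinity);
  final bestStart = List<int>.filled(n + 1, -1);
  bestScore[0] = 0.0;
  for (var end = 1; end <= n; end++) {
    for (var start = 0; start < end; start++) {
      if (bestScore[start] == double.negativeInfinity) continue;
      final lp = logProbs[text.substring(start, end)];
      if (lp == null) continue;
      final score = bestScore[start] + lp;
      if (score > bestScore[end]) {
        bestScore[end] = score;
        bestStart[end] = start;
      }
    }
  }
  if (bestStart[n] == -1 && n > 0) return []; // no full segmentation
  final pieces = <String>[];
  for (var end = n; end > 0; end = bestStart[end]) {
    pieces.add(text.substring(bestStart[end], end));
  }
  return pieces.reversed.toList();
}
```

Byte fallback fits on top of this: when no segmentation covers a span, the uncovered characters are re-expressed as their individual UTF-8 bytes, each of which has a dedicated vocabulary entry.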

Features #

  • SentencePieceTokenizer - Main tokenizer class

    • fromBytes() - Load from protobuf bytes
    • fromModelFile() / fromModelFileSync() - Load from .model file
    • encode() - Encode single text
    • encodeBatch() - Encode multiple texts
    • encodeBatchParallel() - Parallel batch encoding using Isolates
    • encodePair() - Encode text pairs for sequence classification
    • encodePairBatch() - Batch encode text pairs
    • decode() / decodeBatch() - Decode token IDs back to text
  • Encoding class with:

    • ids - Token IDs (Int32List)
    • tokens - Token strings
    • typeIds - Segment type IDs (Uint8List)
    • attentionMask - Attention mask (Uint8List)
    • specialTokensMask - Special token indicators (Uint8List)
    • offsets - Character offsets for each token
    • withPadding() / withTruncation() - Post-processing methods
    • truncatePair() - Static method for pair truncation
  • Predefined configurations:

    • SentencePieceConfig.llama - Llama-style (BOS only)
    • SentencePieceConfig.gemma - Gemma-style (BOS + EOS)
  • Truncation strategies:

    • longestFirst - Truncate longer sequence first
    • onlyFirst - Only truncate first sequence
    • onlySecond - Only truncate second sequence
    • doNotTruncate - No truncation
  • Padding options:

    • Left/right padding direction
    • Fixed length or pad to longest
    • Pad to multiple of N
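To make the truncation and padding semantics concrete, here is a sketch of `longestFirst` pair truncation and fixed-length right padding in plain Dart. The helper names are invented, and the package's `withPadding()` / `withTruncation()` / `truncatePair()` may differ in details:

```dart
/// `longestFirst` sketch: repeatedly drop one token from whichever
/// sequence is currently longer until the combined length fits.
(List<int>, List<int>) truncateLongestFirst(
    List<int> a, List<int> b, int maxLength) {
  var lenA = a.length, lenB = b.length;
  while (lenA + lenB > maxLength) {
    if (lenA >= lenB) {
      lenA--;
    } else {
      lenB--;
    }
  }
  return (a.sublist(0, lenA), b.sublist(0, lenB));
}

/// Right-pads [ids] with [padId] up to [length] (fixed-length padding).
List<int> padRight(List<int> ids, int length, int padId) =>
    ids.length >= length
        ? ids
        : [...ids, ...List<int>.filled(length - ids.length, padId)];
```

Left padding and pad-to-multiple-of-N follow the same pattern, only changing where the pad IDs go and how the target length is rounded.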

Performance #

  • Efficient Trie-based vocabulary lookup
  • Memory-optimized typed arrays reduce memory usage by ~78%
  • Parallel batch processing with configurable chunk size
  • Lazy evaluation where possible
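The typed-array saving is easy to see in isolation: a growable `List<int>` stores one machine word per element (8 bytes on a 64-bit VM), while `Int32List` uses 4 bytes per token ID and `Uint8List` a single byte per mask value. A small `dart:typed_data` sketch (helper names invented here):

```dart
import 'dart:typed_data';

// Token IDs fit comfortably in 32 bits, and attention-mask values are
// only ever 0 or 1, so one byte each is enough.
Int32List toTokenIds(List<int> ids) => Int32List.fromList(ids);

Uint8List onesMask(int length) =>
    Uint8List(length)..fillRange(0, length, 1);
```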

Compatibility #

  • Dart SDK 3.10.7+
  • Compatible with Llama, Gemma, and other SentencePiece models
  • HuggingFace-compatible API design

Publisher: brodykim.work (verified publisher)

Repository: GitHub

Topics: #nlp #sentencepiece #tokenizer #machine-learning #llm

License: MIT
