
A Flutter plugin for running Gemma and other LLMs locally on Android, iOS, Web, and Desktop. Supports multimodal vision, audio, function calling, thinking mode, GPU acceleration, text embeddings, and on-device RAG.

# Flutter Gemma


The plugin supports not only Gemma but also other models. Here's the full list of supported models: Gemma 4 E2B/E4B, Gemma3n E2B/E4B, FastVLM 0.5B, Gemma 3 1B, Gemma 3 270M, FunctionGemma 270M, Qwen3 0.6B, Qwen 2.5, Phi-4 Mini, DeepSeek R1, and SmolLM 135M.

*Note: The flutter_gemma plugin supports Gemma 4 and Gemma3n (with multimodal vision and audio support), FastVLM (vision), Gemma 3, FunctionGemma, Qwen3, Qwen 2.5, Phi-4, DeepSeek R1, and SmolLM. Desktop platforms (macOS, Windows, Linux) require the .litertlm model format.*

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.


Bring the power of Google's lightweight Gemma language models directly to your Flutter applications. With Flutter Gemma, you can seamlessly incorporate advanced AI capabilities into your Flutter applications, all without relying on external servers.


Features #

  • Local Execution: Run Gemma models directly on user devices for enhanced privacy and offline functionality.
  • Platform Support: Compatible with iOS, Android, Web, macOS, Windows, and Linux platforms.
  • 🖥️ Desktop Support: Native desktop apps (macOS, Windows, Linux) with GPU acceleration via LiteRT-LM, called directly from Dart through dart:ffi with no JVM/JRE bundling. See DESKTOP_SUPPORT.md for details.
  • 🖼️ Multimodal Support: Text + image input with Gemma3n vision models
  • 🎙️ Audio Input: Record and send audio messages with Gemma3n E2B/E4B models (Android, iOS device, Desktop)
  • 🛠️ Function Calling: Enable your models to call external functions and integrate with other services (supported by select models)
  • 🧠 Thinking Mode: View the reasoning process of DeepSeek and Gemma 4 models with thinking blocks
  • 🛑 Stop Generation: Cancel text generation mid-process on Android, iOS, Web, and Desktop
  • ⚙️ Backend Switching: Choose between CPU and GPU backends for each model individually in the example app
  • 🔍 Advanced Model Filtering: Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
  • 📊 Model Sorting: Sort models alphabetically, by size, or use default order in the example app
  • LoRA Support: Efficient fine-tuning and integration of LoRA (Low-Rank Adaptation) weights for tailored AI behavior.
  • 📥 Enhanced Downloads: Smart retry logic with exponential backoff for reliable model downloads
  • 🔧 Download Reliability: Automatic restart logic for interrupted downloads (resume not supported by the HuggingFace CDN)
  • 📱 Android Foreground Service: Large downloads (>500MB) automatically use a foreground service to bypass the 9-minute timeout
  • 🔧 Model Replace Policy: Configurable model replacement system (keep/replace) with automatic model switching
  • 📊 Text Embeddings: Generate vector embeddings from text using EmbeddingGemma and Gecko models
  • 🔧 Unified Model Management: Single system for managing both inference and embedding models with automatic validation
  • 💾 Web Persistent Caching: Models persist across browser restarts using the Cache API (Web only)

What's new in 0.14.1 #

  • 🛠️ Gemma 4 native function calling: ModelType.gemma4 routes tool definitions through the LiteRT-LM SDK's chat-template path (minja). The SDK renders native <|tool>declaration:...<tool|> tokens, the model emits <|tool_call>...<tool_call|>, and the SDK parses the response into structured tool_calls JSON. flutter_gemma surfaces it as FunctionCallResponse, so no Dart-side prompt engineering is required.

What's new in 0.14.0 #

  • 🖥️ Desktop rewritten on dart:ffi: no JVM, no gRPC, no separate server. Native libs are auto-fetched at build time.
  • 🍎 iOS Metal GPU for .litertlm models on physical devices via FFI.
  • 🐧 Linux GPU (Vulkan/WebGPU) and 🪟 Windows GPU (DirectX 12) ready out of the box.
  • 🤖 Android: Kotlin LiteRtLm dependency removed; FFI is used exclusively for .litertlm.

See CHANGELOG.md for the full release history.

Model File Types #

Flutter Gemma supports different model file formats, which are grouped into two types based on how chat templates are handled:

Type 1: MediaPipe-Managed Templates #

  • .task files: MediaPipe-optimized format for mobile (Android/iOS)
  • .litertlm files: LiteRT-LM format for Android, iOS, and Desktop platforms

Both formats behave identically; MediaPipe handles chat templates internally.

Type 2: Manual Template Formatting #

  • .bin files: Standard binary format
  • .tflite files: LiteRT format (formerly TensorFlow Lite)

Both formats require manual chat template formatting in your code.

Note: The plugin automatically detects the file extension and applies appropriate formatting. When specifying ModelFileType in your code:

  • Use ModelFileType.task for .task and .litertlm files (same behavior)
  • Use ModelFileType.binary for .bin and .tflite files (same behavior)
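
For Type 2 formats you build the chat template yourself. Below is a minimal sketch for Gemma-family models using the published Gemma turn markers (this helper is illustrative, not part of the plugin API; other model families use different markers):

// Builds a Gemma-style prompt for .bin/.tflite models, where the
// plugin does not apply a chat template for you.
String formatGemmaPrompt(String userText, {String history = ''}) {
  return '$history'
      '<start_of_turn>user\n'
      '$userText<end_of_turn>\n'
      '<start_of_turn>model\n';
}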

Format by Platform #

| Format | Android | iOS | Web | Desktop | Use Case |
|---|---|---|---|---|---|
| .task | ✅ | ✅ | ✅ | ❌ | Older models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4) |
| .litertlm | ✅ | ✅ ¹ | ❌ | ✅ | Newer models (Gemma 4, Qwen3, FastVLM) + desktop for all |
| -web.task | ❌ | ❌ | ✅ | ❌ | Web-specific builds (e.g. Gemma 4, Gemma3n) |
| .bin | ✅ | ✅ | ✅ | ❌ | Manual chat template formatting required |
| .tflite | ✅ | ✅ | ✅ | ✅ | Embeddings only (EmbeddingGemma, Gecko) |

¹ iOS .litertlm runs on the FFI engine; vision and audio are supported on physical devices. The Simulator stays CPU-only because the Metal simulator has a 256 MB single-allocation cap.

Model Capabilities #

The example app offers a curated list of models, each suited for different tasks. Here's a breakdown of the models available and their capabilities:

| Model Family | Best For | Function Calling | Thinking Mode | Vision | Languages | Size |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 2.4GB |
| Gemma 4 E4B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 4.3GB |
| Gemma3n | On-device multimodal chat and image analysis | ✅ | ❌ | ✅ | Multilingual | 3-6GB |
| FastVLM 0.5B | Fast vision-language inference | ❌ | ❌ | ✅ | Multilingual | 0.5GB |
| Phi-4 Mini | Advanced reasoning and instruction following | ✅ | ❌ | ❌ | Multilingual | 3.9GB |
| DeepSeek R1 | High-performance reasoning and code generation | ✅ | ✅ | ❌ | Multilingual | 1.7GB |
| Qwen3 0.6B | Compact multilingual chat with function calling | ✅ | ✅ | ❌ | Multilingual | 586MB |
| Qwen 2.5 | Strong multilingual chat and instruction following | ✅ | ❌ | ❌ | Multilingual | 0.5-1.6GB |
| Gemma 3 1B | Balanced and efficient text generation | ✅ | ❌ | ❌ | Multilingual | 0.5GB |
| Gemma 3 270M | Ideal for fine-tuning (LoRA) for specific tasks | ❌ | ❌ | ❌ | Multilingual | 0.3GB |
| FunctionGemma 270M | Specialized for on-device function calling | ✅ | ❌ | ❌ | Multilingual | 284MB |
| SmolLM 135M | Ultra-compact, resource-constrained devices | ❌ | ❌ | ❌ | English | 135MB |

ModelType Reference #

When installing models, you need to specify the correct ModelType. Use this table to find the right type for your model:

| Model Family | ModelType | Examples |
|---|---|---|
| Gemma 4 | ModelType.gemma4 | Gemma 4 E2B, Gemma 4 E4B (native function-call tokens) |
| Gemma 3 / Gemma3n | ModelType.gemmaIt | Gemma 3 1B, Gemma 3 270M, Gemma3n E2B/E4B |
| DeepSeek | ModelType.deepSeek | DeepSeek R1 |
| Qwen 2.5 | ModelType.qwen | Qwen 2.5 1.5B, Qwen 2.5 0.5B |
| Qwen 3 | ModelType.qwen3 | Qwen3 0.6B |
| FunctionGemma | ModelType.functionGemma | FunctionGemma 270M IT |
| Phi | ModelType.phi | Phi-4 Mini |
| General | ModelType.general | FastVLM 0.5B, SmolLM 135M |

Note: Gemma 4 uses ModelType.gemma4 (introduced in 0.14.1) so its native <|tool_call>...<tool_call|> tokens are routed through the LiteRT-LM SDK's chat-template path. For Gemma 3 and earlier, keep ModelType.gemmaIt.

Usage Example:

// Gemma models
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// DeepSeek models
await FlutterGemma.installModel(modelType: ModelType.deepSeek)
  .fromNetwork(url).install();

// Phi-4 models
await FlutterGemma.installModel(modelType: ModelType.phi)
  .fromNetwork(url).install();
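
// Gemma 4 models (0.14.1+): routes native function-call tokens
// through the LiteRT-LM chat-template path (see the note above)
await FlutterGemma.installModel(modelType: ModelType.gemma4)
  .fromNetwork(url).install();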

Installation #

  1. Add flutter_gemma to your pubspec.yaml:

    dependencies:
      flutter_gemma: latest_version
    
  2. Run flutter pub get to install.

Setup #

โš ๏ธ Important: Complete platform-specific setup before using the plugin.

  1. Download Model and optionally LoRA Weights: Obtain a model from the Supported Models section or HuggingFace
  2. Platform-specific setup:

iOS

  • Set minimum iOS version in Podfile:
platform :ios, '16.0'  # Required for MediaPipe GenAI
  • Enable file sharing in Info.plist:
<key>UIFileSharingEnabled</key>
<true/>
  • Add network access description in Info.plist (for development):
<key>NSLocalNetworkUsageDescription</key>
<string>This app requires local network access for model inference services.</string>
  • Enable performance optimization in Info.plist (optional):
<key>CADisableMinimumFrameDurationOnPhone</key>
<true/>
  • Add memory entitlements in Runner.entitlements (for large models):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.developer.kernel.extended-virtual-addressing</key>
	<true/>
	<key>com.apple.developer.kernel.increased-memory-limit</key>
	<true/>
	<key>com.apple.developer.kernel.increased-debugging-memory-limit</key>
	<true/>
</dict>
</plist>
  • Change the linking type of pods to static in Podfile:
use_frameworks! :linkage => :static
  • Setup LiteRT-LM dylib symlinks in ios/Podfile post_install block. LiteRT-LM's gpu_registry calls dlopen("libLiteRtMetalAccelerator.dylib") by basename at runtime. Native Assets bundles the dylibs as .frameworks, so each framework also needs a flat lib*.dylib symlink alongside it (required for GPU on physical iOS devices):
post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_ios_build_settings(target)
  end

  # flutter_gemma: create lib*.dylib symlinks next to the bundled
  # .framework so LiteRT-LM's gpu_registry can dlopen by basename.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      phase_name = '[flutter_gemma] Setup LiteRT-LM iOS'
      existing = user_target.shell_script_build_phases.find { |p| p.name == phase_name }
      phase = existing || user_target.new_shell_script_build_phase(phase_name)
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          echo "[flutter_gemma] no Frameworks/ in ${PRODUCT_NAME}.app - skipping"
          exit 0
        fi
        for base in LiteRtMetalAccelerator LiteRtTopKMetalSampler GemmaModelConstraintProvider; do
          src="${base}.framework/${base}"
          if [ ! -e "${FRAMEWORKS}/${src}" ]; then
            echo "[flutter_gemma] ${FRAMEWORKS}/${src} missing - Native Assets did not bundle it"
            continue
          fi
          dst="${FRAMEWORKS}/lib${base}.dylib"
          if [ ! -e "${dst}" ]; then
            ln -sf "${src}" "${dst}"
            echo "[flutter_gemma] symlinked lib${base}.dylib -> ${src}"
          fi
        done
      SHELL
    end
  end
end

Android

  • If you want to run models on the GPU, you need to declare OpenCL support in AndroidManifest.xml. If you plan to use only the CPU, you can skip this step.

Add the following to AndroidManifest.xml, above the closing </application> tag:

<uses-native-library
    android:name="libOpenCL.so"
    android:required="false"/>
<uses-native-library
    android:name="libOpenCL-car.so"
    android:required="false"/>
<uses-native-library
    android:name="libOpenCL-pixel.so"
    android:required="false"/>
  • For release builds with ProGuard/R8 enabled, the plugin automatically includes necessary ProGuard rules. If you encounter issues with UnsatisfiedLinkError or missing classes in release builds, ensure your proguard-rules.pro includes:
# MediaPipe
-keep class com.google.mediapipe.** { *; }
-dontwarn com.google.mediapipe.**

# Protocol Buffers
-keep class com.google.protobuf.** { *; }
-dontwarn com.google.protobuf.**

# RAG functionality
-keep class com.google.ai.edge.localagents.** { *; }
-dontwarn com.google.ai.edge.localagents.**

Web

  • Web currently supports only GPU backend models; CPU backend models are not yet supported by MediaPipe

  • Model compatibility: Mobile .task models often don't work on web. Use web-specific variants: -web.task or .litertlm files. Check model repository for web-compatible versions.

  • Add dependencies to index.html file in web folder

  <script type="module">
  import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27';
  window.FilesetResolver = FilesetResolver;
  window.LlmInference = LlmInference;
  </script>

Desktop (macOS, Windows, Linux)

โš ๏ธ Desktop Model Format

Desktop platforms use LiteRT-LM format only (.litertlm files). MediaPipe .task and .bin models used on mobile/web are NOT compatible with desktop.

Since 0.14.0 desktop inference and embeddings both use the LiteRT-LM C API via dart:ffi directly in the Dart process โ€” no JVM, no gRPC, no separate server. Native libraries are downloaded by hook/build.dart (Native Assets) at build time and bundled into the app automatically.

| Platform | Architecture | GPU Acceleration | Status |
|---|---|---|---|
| macOS | arm64 (Apple Silicon) | Metal | ✅ Ready |
| macOS | x86_64 (Intel) | - | ❌ Not supported |
| Windows | x86_64 | DirectX 12 | ✅ Ready |
| Windows | arm64 | - | ❌ Not supported |
| Linux | x86_64 | Vulkan | ✅ Ready |
| Linux | arm64 | Vulkan | ✅ Ready |

macOS Setup:

The plugin uses Flutter Native Assets to bundle LiteRT-LM dylibs as .frameworks. The LiteRT-LM runtime, however, calls dlopen("libLiteRtMetalAccelerator.dylib") by basename at runtime, so each framework also needs a flat lib*.dylib symlink alongside it. Add this to your macos/Podfile post_install block:

post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
  end

  # flutter_gemma: create lib*.dylib symlinks next to the bundled
  # .framework so LiteRT-LM's gpu_registry can dlopen by basename.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      phase_name = '[flutter_gemma] Setup LiteRT-LM macOS'
      existing = user_target.shell_script_build_phases.find { |p| p.name == phase_name }
      phase = existing || user_target.new_shell_script_build_phase(phase_name)
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Contents/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          echo "[flutter_gemma] no Contents/Frameworks/ in ${PRODUCT_NAME}.app - skipping"
          exit 0
        fi
        for base in LiteRtMetalAccelerator LiteRtTopKMetalSampler GemmaModelConstraintProvider; do
          src="${base}.framework/Versions/Current/${base}"
          if [ ! -e "${FRAMEWORKS}/${src}" ]; then
            echo "[flutter_gemma] ${FRAMEWORKS}/${src} missing - Native Assets did not bundle it"
            continue
          fi
          dst="${FRAMEWORKS}/lib${base}.dylib"
          if [ ! -e "${dst}" ]; then
            ln -sf "${src}" "${dst}"
            echo "[flutter_gemma] symlinked lib${base}.dylib -> ${src}"
          fi
        done
      SHELL
    end
  end
end

Add to macos/Runner/DebugProfile.entitlements and Release.entitlements:

<key>com.apple.security.cs.disable-library-validation</key>
<true/>

Windows Setup:

No additional configuration required. hook/build.dart (Native Assets) downloads LiteRtLm.dll + companion DLLs + the DXC runtime (dxil.dll, dxcompiler.dll v1.9.2602) from the GitHub release on first build, verifies them via SHA256, and bundles them next to your app.exe. End users need the Microsoft Visual C++ Redistributable 2019+ (download); most modern Windows 10/11 systems already have it.

Linux Setup:

No additional configuration required. Build dependencies:

sudo apt install clang cmake ninja-build libgtk-3-dev

For GPU acceleration, ensure Vulkan drivers are installed:

sudo apt install vulkan-tools libvulkan1

📚 Full Desktop Documentation →

Quick Start #

โš ๏ธ Important: Complete platform setup before running this code.

1. Install a Model (One Time) #

import 'package:flutter_gemma/flutter_gemma.dart';

// Install model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
).fromNetwork(
  'https://huggingface.co/google/gemma-3-2b-it/resolve/main/gemma-3-2b-it-gpu-int8.task',
  token: 'your_hf_token',
).withProgress((progress) {
  print('Downloading: $progress%');
}).install();

2. Create and Use Model (Multiple Times) #

// Create model with specific configuration
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

// Use model
final chat = await model.createChat();
await chat.addQueryChunk(Message.text(
  text: 'Explain quantum computing',
  isUser: true,
));
final response = await chat.generateChatResponse();

// Cleanup
await model.close();
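
For token-by-token output instead of a single awaited response, the same chat exposes a streaming variant (the response types it emits are covered in Response Types below):

// Stream tokens as they are generated
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    print(response.token); // append to your UI incrementally
  }
});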

System Instructions #

Control model behavior with a system-level instruction:

final chat = await model.createChat(
  systemInstruction: 'You are a concise assistant. Always respond in bullet points.',
);

Platform support:

  • Android .litertlm / Desktop: Passed natively via ConversationConfig.systemInstruction
  • Android .task / iOS / Web: Prepended to first user message as fallback

3. Multiple Instances from Same Model #

// Install once
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// Create multiple instances
final quickModel = await FlutterGemma.getActiveModel(maxTokens: 512);
final deepModel = await FlutterGemma.getActiveModel(maxTokens: 4096);
// Both use the SAME model file!

Installation Sources #

// Network
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork('https://example.com/model.task', token: 'optional')
  .install();

// Flutter assets
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromAsset('assets/models/model.task')
  .install();

// Native bundle
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromBundled('model.task')
  .install();

// External file
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromFile('/path/to/model.task')
  .install();

Modern API vs Legacy API #

Benefits of the Modern API:

  • ✅ Cleaner, more intuitive
  • ✅ Type-safe ModelSource
  • ✅ Automatic active model management
  • ✅ Install once, create many instances

Usage:

await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
final model = await FlutterGemma.getActiveModel(maxTokens: 2048);

Legacy API ⚠️ Deprecated #

⚠️ DEPRECATED: This API is maintained for backwards compatibility only. New projects should use the Modern API above.

Still works but requires manual ModelType specification:

final model = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,  // Must specify every time
  maxTokens: 2048,
);

Initialize Flutter Gemma #

Add to your main.dart:

import 'package:flutter_gemma/core/api/flutter_gemma.dart';

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Optional: Initialize with HuggingFace token for gated models
  FlutterGemma.initialize(
    huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
    maxDownloadRetries: 10,
  );

  runApp(MyApp());
}

Configuration Options:

  • huggingFaceToken: Authentication token for gated models (Gemma3n, EmbeddingGemma)
  • maxDownloadRetries: Number of retry attempts for failed downloads (default: 10)
  • webStorageMode: (Web only) Storage strategy for model files (default: cacheApi)
    • WebStorageMode.cacheApi: Cache API with Blob URLs (for models <2GB)
    • WebStorageMode.streaming: OPFS streaming (for large models >2GB like E4B, 7B)
    • WebStorageMode.none: No caching (ephemeral mode for testing)

Example:

FlutterGemma.initialize(
  huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
  maxDownloadRetries: 10,
  webStorageMode: WebStorageMode.streaming,  // For large models (>2GB)
);


HuggingFace Authentication 🔐 #

Many models require authentication to download from HuggingFace. Never commit tokens to version control.

Recommended approach: a config.json file passed via --dart-define-from-file. This is the most secure way to handle tokens in development and production.

Step 1: Create config template file config.json.example:

{
  "HUGGINGFACE_TOKEN": ""
}

Step 2: Copy and add your token:

cp config.json.example config.json
# Edit config.json and add your token from https://huggingface.co/settings/tokens

Step 3: Add to .gitignore:

# Never commit tokens!
config.json

Step 4: Run with config:

flutter run --dart-define-from-file=config.json

Step 5: Access in code:

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Read from environment (populated by --dart-define-from-file)
  const token = String.fromEnvironment('HUGGINGFACE_TOKEN');

  // Initialize with token (optional if all models are public)
  FlutterGemma.initialize(
    huggingFaceToken: token.isNotEmpty ? token : null,
  );

  runApp(MyApp());
}

Alternative: Environment Variables #

export HUGGINGFACE_TOKEN=hf_your_token_here
flutter run --dart-define=HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN

Alternative: Per-Download Token #

// Pass token directly for specific downloads
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/gemma-3n-E2B-it-int4.task',
    token: 'hf_your_token_here',  // ⚠️ Not recommended - use config.json
  )
  .install();

Which Models Require Authentication? #

Common gated models (auth required):

  • Gemma3n (E2B, E4B) - google/ repos are gated
  • Gemma 3 1B - litert-community/ requires access
  • Gemma 3 270M - litert-community/ requires access
  • EmbeddingGemma - litert-community/ requires access

Public models (no auth needed):

  • DeepSeek, Qwen3, Qwen 2.5, SmolLM, Phi-4, FastVLM - public repos

Get your token: https://huggingface.co/settings/tokens

Grant access to gated repos: visit the model page and click the "Request Access" button.

Model Sources 📦 #

Flutter Gemma supports multiple model sources with different capabilities:

| Source Type | Platform | Progress | Resume | Authentication | Use Case |
|---|---|---|---|---|---|
| NetworkSource | All | ✅ Detailed | ⚠️ Server-dependent | ✅ Supported | HuggingFace, CDNs, private servers |
| AssetSource | All | ⚠️ End only | ❌ No | ❌ N/A | Models bundled in app assets |
| BundledSource | All | ⚠️ End only | ❌ No | ❌ N/A | Native platform resources |
| FileSource | Mobile only | ⚠️ End only | ❌ No | ❌ N/A | User-selected files (file picker) |

NetworkSource - Internet Downloads #

Downloads models from HTTP/HTTPS URLs with full progress tracking and authentication.

Features:

  • ✅ Progress tracking (0-100%)
  • ⚠️ Resume after interruption (server-dependent; not supported by the HuggingFace CDN)
  • ✅ HuggingFace authentication
  • ✅ Smart retry logic with exponential backoff
  • ✅ Background downloads on mobile
  • ✅ Cancellable downloads with CancelToken
  • ✅ Android foreground service for large downloads (>500MB)

Example:

// Public model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork('https://example.com/model.bin')
  .withProgress((progress) => print('$progress%'))
  .install();

// Private model with authentication
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/model.task',
    token: 'hf_...',  // Or use FlutterGemma.initialize(huggingFaceToken: ...)
  )
  .withProgress((progress) => setState(() => _progress = progress))
  .install();

Android Foreground Service (Large Downloads):

Android has a 9-minute background execution limit. For large models (>500MB), you can use foreground service mode which shows a notification but bypasses this timeout:

// Auto-detect based on file size (>500MB = foreground) - DEFAULT
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url)  // foreground: null (auto-detect)
  .install();

// Force foreground mode (always show notification)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: true)
  .install();

// Force background mode (may fail for large files)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: false)
  .install();

Foreground Parameter:

  • null (default): Auto-detect based on file size. Files >500MB use foreground service.
  • true: Always use foreground service (shows notification, no timeout)
  • false: Never use foreground service (subject to 9-minute timeout)

Note: iOS uses the native URLSession, which handles long downloads automatically, so no foreground service is needed.

Cancelling Downloads:

Use CancelToken to cancel downloads in progress:

import 'package:flutter_gemma/core/model_management/cancel_token.dart';

// Create cancel token
final cancelToken = CancelToken();

// Start download with cancel token
final future = FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(url)
  .withCancelToken(cancelToken)  // ← Pass cancel token via builder
  .withProgress((progress) => print('Progress: $progress%'))
  .install();

// Cancel download from another part of your code
// (e.g., user pressed cancel button)
cancelToken.cancel('User cancelled download');

// Handle cancellation
try {
  await future;
  print('Download completed');
} catch (e) {
  if (CancelToken.isCancel(e)) {
    print('Download was cancelled by user');
  } else {
    print('Download failed: $e');
  }
}

// Check if cancelled
if (cancelToken.isCancelled) {
  print('Reason: ${cancelToken.cancelReason}');
}

CancelToken Features:

  • ✅ Non-breaking: Optional parameter; existing code works without changes
  • ✅ Works with network downloads (inference + embedding models)
  • ✅ Cancels ALL files in multi-file downloads (embedding: model + tokenizer)
  • ✅ Platform-independent (Mobile + Web)
  • ✅ Throws DownloadCancelledException for proper error handling
  • ✅ Thread-safe cancellation

AssetSource - Flutter Assets #

Copies models from Flutter assets (declared in pubspec.yaml).

Features:

  • ✅ No network required
  • ✅ Fast installation (local copy)
  • ⚠️ Increases app size significantly
  • ✅ Works offline

Example:

// 1. Add to pubspec.yaml
// assets:
//   - models/gemma-2b-it.bin

// 2. Install from asset
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('models/gemma-2b-it.bin')
  .install();

BundledSource - Native Resources #

Production-Ready Offline Models: Include small models directly in your app bundle for instant availability without downloads.

Use Cases:

  • ✅ Offline-first applications (works without internet from first launch)
  • ✅ Small models (Gemma 3 270M, ~300MB)
  • ✅ Core features requiring guaranteed availability
  • ⚠️ Not for large models (increases app size significantly)

Platform Setup:

Android (android/app/src/main/assets/models/)

# Place your model file
android/app/src/main/assets/models/gemma-3-270m-it.task

iOS (Add to Xcode project)

  1. Drag model file into Xcode project
  2. Check "Copy items if needed"
  3. Add to target membership

Web (Static files in web/ directory)

# Place model files in web/ directory
example/web/gemma-3-270m-it.task

# Files are automatically copied to build/web/ during production build
flutter build web

โš ๏ธ Web Platform Limitation:

  • Production only: Bundled resources work ONLY in production builds (flutter build web)
  • Debug mode: Files in web/ are NOT served by flutter run dev server
  • For development: Use NetworkSource or AssetSource instead

Features:

  • ✅ Zero network dependency
  • ✅ No installation delay
  • ✅ No storage permission needed
  • ✅ Direct path usage (no file copying)

Example:

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromBundled('gemma-3-270m-it.task')
  .install();

App Size Impact:

  • SmolLM 135M: ~135MB
  • Gemma 3 270M: ~300MB
  • Qwen3 0.6B: ~586MB
  • Consider hosting large models for download instead

FileSource - External Files (Mobile Only) #

References external files (e.g., user-selected via file picker).

Features:

  • ✅ No copying (references the original file)
  • ✅ Protected from cleanup
  • ❌ Web not supported (no local file system)

Example:

// Mobile only - after user selects file with file_picker
final path = '/data/user/0/com.app/files/model.task';
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromFile(path)
  .install();

Important: On web, FileSource only works with URLs or asset paths, not local file system paths.
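
A typical flow with the community file_picker package might look like the sketch below (file_picker is a separate dependency, not part of flutter_gemma):

import 'package:file_picker/file_picker.dart';
import 'package:flutter_gemma/flutter_gemma.dart';

Future<void> installPickedModel() async {
  // Let the user pick a model file from device storage
  final result = await FilePicker.platform.pickFiles();
  final path = result?.files.single.path;
  if (path == null) return; // user cancelled the picker

  await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
      .fromFile(path)
      .install();
}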

Migration from Legacy to Modern API 🔄 #

If you're upgrading from the Legacy API, here are common migration patterns:

Installing Models #

Legacy API (network download):

final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: 'https://example.com/model.bin',
);

FlutterGemmaPlugin.instance.modelManager
  .downloadModelWithProgress(spec, token: token)
  .listen((progress) {
    print('${progress.overallProgress}%');
  });

Modern API (network download):

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://example.com/model.bin',
    token: token,
  )
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

Legacy API (from assets):

modelManager.installModelFromAssetWithProgress(
  'model.bin',
  loraPath: 'lora.bin',
).listen((progress) {
  print('$progress%');
});

Modern API (from assets):

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

// LoRA weights can be installed with the model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withLoraFromAsset('lora.bin')
  .install();

Checking Model Installation #

Legacy API:

final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: url,
);

final isInstalled = await FlutterGemmaPlugin
  .instance.modelManager
  .isModelInstalled(spec);

Modern API:

final isInstalled = await FlutterGemma
  .isModelInstalled('model.bin');

Key Migration Notes #

  • ✅ Simpler imports: Use package:flutter_gemma/core/api/flutter_gemma.dart
  • ✅ Builder pattern: Chain methods for cleaner code
  • ✅ Callback-based progress: Simpler than streams for most cases
  • ✅ Type-safe sources: Compile-time validation of source types
  • ⚠️ Breaking change: Progress values are now int (0-100) instead of a DownloadProgress object
  • ⚠️ Separate files: Model and LoRA weights are installed independently

Model Creation and Inference #

Modern API (Recommended):

// Create model with runtime configuration
final inferenceModel = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Legacy API (Still supported):

// Works with both Legacy and Modern installation methods
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,
  preferredBackend: PreferredBackend.gpu,
  maxTokens: 2048,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Usage (Legacy API) ⚠️ DEPRECATED #

The pre-Modern stream-based API (FlutterGemmaPlugin.instance.modelManager, installModelFromAsset, downloadModelFromNetworkWithProgress, etc.) is still supported but deprecated. New projects should use the Modern API above.

📚 Full Legacy API reference: docs/LEGACY_API.md

๐Ÿ–ผ๏ธ Message Types #

The plugin now supports different types of messages:

// Text only
final textMessage = Message.text(text: "Hello!", isUser: true);

// Text + Image
final multimodalMessage = Message.withImage(
  text: "What's in this image?",
  imageBytes: imageBytes,
  isUser: true,
);

// Image only
final imageMessage = Message.imageOnly(imageBytes: imageBytes, isUser: true);

// Tool response (for function calling)
final toolMessage = Message.toolResponse(
  toolName: 'change_background_color',
  response: {'status': 'success', 'color': 'blue'},
);

// System information message
final systemMessage = Message.systemInfo(text: "Function completed successfully");

// Thinking content (for DeepSeek models)
final thinkingMessage = Message.thinking(text: "Let me analyze this problem...");

// Check if message contains image
if (message.hasImage) {
  print('This message contains an image');
}

// Create a copy of message
final copiedMessage = message.copyWith(text: "Updated text");

💬 Response Types #

The model can return different types of responses depending on capabilities:

// Handle different response types
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text token: ${response.token}');
    // Use response.token to update your UI incrementally
    
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function (see Model Function Calling Support below)
    print('Function: ${response.name}');
    print('Arguments: ${response.args}');
    
    // Execute the function and send response back
    _handleFunctionCall(response);
  } else if (response is ThinkingResponse) {
    // Model's reasoning process (DeepSeek and Gemma 4 thinking models)
    print('Thinking: ${response.content}');
    
    // Show thinking process in UI
    _showThinkingBubble(response.content);
  }
});

Response Types:

  • TextResponse: Contains a text token (response.token) for regular model output
  • FunctionCallResponse: Contains function name (response.name) and arguments (response.args) when the model wants to call a function
  • ThinkingResponse: Contains the model's reasoning process (response.content) for thinking-capable models (DeepSeek, Gemma 4) with thinking mode enabled
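
To close the loop on a FunctionCallResponse, execute the function in your app and feed the result back as a tool response, then let the model continue. Below is a sketch of the _handleFunctionCall referenced above, assuming a chat instance in scope as in the earlier examples (executeMyFunction is a hypothetical placeholder for your own dispatch logic):

Future<void> _handleFunctionCall(FunctionCallResponse response) async {
  // Run your app-side implementation of the requested function.
  final result = await executeMyFunction(response.name, response.args);

  // Send the result back to the model as a tool response...
  await chat.addQueryChunk(Message.toolResponse(
    toolName: response.name,
    response: result, // e.g. {'status': 'success'}
  ));

  // ...and resume generation so the model can use it.
  final followUp = await chat.generateChatResponse();
  print(followUp);
}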

🎯 Supported Models #

Platform Support #

| Model | Size | Desktop | Mobile | Web |
|---|---|---|---|---|
| Gemma 4 E2B | 2.4GB | ✅ | ✅ | ✅ |
| Gemma 4 E4B | 4.3GB | ✅ | ✅ | ✅ |
| Gemma3n E2B | 3.1GB | ✅ | ✅ | ✅ |
| Gemma3n E4B | 6.5GB | ✅ | ✅ | ✅ |
| FastVLM 0.5B | 0.5GB | ✅ | ❌ | ❌ |
| Gemma 3 1B | 0.5GB | ✅ | ✅ | ✅ |
| Gemma 3 270M | 0.3GB | ✅ | ✅ | ✅ |
| FunctionGemma 270M | 284MB | ✅ | ✅ | ❌ |
| Qwen3 0.6B | 586MB | ✅ | ✅ | ✅ |
| Qwen 2.5 1.5B | 1.6GB | ✅ | ✅ | ❌ |
| Qwen 2.5 0.5B | 0.5GB | ❌ | ✅ | ❌ |
| SmolLM 135M | 135MB | ❌ | ✅ | ❌ |
| Phi-4 Mini | 3.9GB | ✅ | ✅ | ✅ |
| DeepSeek R1 | 1.7GB | ❌ | ✅ | ❌ |

📊 Text Embedding Models #

All embedding models generate 768-dimensional vectors. The numbers in names (64/256/512/1024/2048) indicate maximum input sequence length in tokens, not embedding dimension.

| Model | Parameters | Dimensions | Max Seq Length | Size | Best For | Auth Required |
|---|---|---|---|---|---|---|
| Gecko 64 | 110M | 768D | 64 tokens | 110MB | Short queries, real-time search | ❌ |
| Gecko 256 | 110M | 768D | 256 tokens | 114MB | Balanced speed/accuracy | ❌ |
| Gecko 512 | 110M | 768D | 512 tokens | 116MB | Medium context documents | ❌ |
| EmbeddingGemma 256 | 300M | 768D | 256 tokens | 179MB | High accuracy, short context | ✅ |
| EmbeddingGemma 512 | 300M | 768D | 512 tokens | 179MB | High accuracy, medium context | ✅ |
| EmbeddingGemma 1024 | 300M | 768D | 1024 tokens | 183MB | Long documents, detailed content | ✅ |
| EmbeddingGemma 2048 | 300M | 768D | 2048 tokens | 196MB | Very long documents | ✅ |

Performance Comparison (Android Pixel 8 with GPU acceleration):

  • Gecko 64: ~109ms/doc embedding, 130ms search (⚡ fastest; 2.6x faster than EmbeddingGemma)
  • EmbeddingGemma 256: ~286ms/doc embedding, 342ms search (🎯 more accurate; 300M vs 110M params)

Use Cases:

  • ✅ Gecko 64: Real-time search, mobile apps, short queries (≤64 tokens), fast inference
  • ✅ Gecko 256/512: Balanced use cases, general-purpose embeddings, good speed/quality tradeoff
  • ✅ EmbeddingGemma 256/512: High-quality embeddings, semantic search, better accuracy
  • ✅ EmbeddingGemma 1024/2048: Long documents, detailed content, research papers, articles
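
All of these models emit 768-dimensional vectors, so comparing two embeddings is plain vector math. A self-contained cosine-similarity helper in Dart:

import 'dart:math' as math;

/// Cosine similarity between two equal-length embedding vectors.
/// Returns a value in [-1, 1]; values near 1 mean the texts are similar.
double cosineSimilarity(List<double> a, List<double> b) {
  assert(a.length == b.length);
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}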

๐Ÿ› ๏ธ Model Function Calling Support #

Function calling is currently supported by the following models:

✅ Models with Function Calling Support #

  • Gemma 4 (E2B, E4B) - Full function calling support
  • Gemma3n (E2B, E4B) - Full function calling support
  • Gemma 3 1B - Function calling support
  • FunctionGemma 270M - Google's specialized function calling model
  • DeepSeek R1 - Function calling + thinking mode support
  • Qwen models (0.5B, 0.6B, 1.5B) - Full function calling support
  • Phi-4 Mini - Advanced reasoning with function calling support

โŒ Models WITHOUT Function Calling Support #

  • Gemma 3 270M - Text generation only
  • SmolLM 135M - Text generation only
  • FastVLM 0.5B - Vision model, no function calling

Important Notes:

  • When using unsupported models with tools, the plugin will log a warning and ignore the tools
  • Models will work normally for text generation even if function calling is not supported
  • Check the supportsFunctionCalls property in your model configuration
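
For supported models, tools are declared when the chat is created. Below is a sketch following the example app's pattern; the exact Tool shape and the supportsFunctionCalls flag are assumptions here, so check the API reference for your plugin version:

final chat = await model.createChat(
  supportsFunctionCalls: true, // flag name assumed from the example app
  tools: [
    const Tool(
      name: 'change_background_color',
      description: 'Changes the app background color',
      parameters: {
        'type': 'object',
        'properties': {
          'color': {'type': 'string', 'description': 'Color name'},
        },
        'required': ['color'],
      },
    ),
  ],
);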

Platform Support Details 🌍 #

Feature Comparison #

| Feature | Android | iOS | Web | Desktop | Notes |
|---|---|---|---|---|---|
| Text Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | All models supported |
| Image Input (Multimodal) | ✅ Full | ✅ Full | ✅ Full | ⚠️ Broken (#684) | macOS: model hallucinates |
| Audio Input | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | Gemma3n E2B/E4B |
| Function Calling | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| Thinking Mode | ✅ Full | ✅ Full | ✅ Full | ✅ Full | DeepSeek & Gemma 4 |
| Stop Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Cancel mid-process |
| GPU Acceleration | ✅ Full | ✅ Full | ✅ Full | ⚠️ Partial | macOS GPU broken |
| NPU Acceleration | ✅ Full | ❌ Not supported | ❌ Not supported | ❌ Not supported | Android only (.litertlm) |
| CPU Backend | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | MediaPipe limitation |
| Streaming Responses | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Real-time generation |
| LoRA Support | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| Text Embeddings | ✅ Full | ✅ Full | ✅ Full | ✅ Full | EmbeddingGemma, Gecko |
| VectorStore (RAG) | ✅ SQLite | ✅ SQLite | ✅ SQLite WASM | ✅ SQLite | Semantic search, RAG |
| File Downloads | ✅ Background | ✅ Background | ✅ In-memory | ✅ Background | Platform-specific |
| Asset Loading | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Flutter assets N/A |
| Bundled Resources | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Native bundles only |
| External Files (FileSource) | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | No local FS on web |

Web Platform Specifics #

Authentication

  • Required for gated models: Gemma3n, Gemma 3 1B/270M, EmbeddingGemma
  • Configuration: Use FlutterGemma.initialize(huggingFaceToken: '...') or pass token per-download
  • Storage: Tokens stored in browser memory (not localStorage)

File Handling

  • Downloads: Creates blob URLs in browser memory (no actual files)
  • Storage: IndexedDB via WebFileSystemService
  • FileSource: Only works with HTTP/HTTPS URLs or assets/ paths
  • Local file paths: ❌ Not supported (browser security restriction)

Web Storage Modes (v0.12.1+)

Three Storage Modes:

1. Cache API Mode (default, WebStorageMode.cacheApi):

  • Uses browser Cache API with Blob URLs
  • Models persist across browser restarts
  • Best for models <2GB

2. Streaming Mode (WebStorageMode.streaming):

  • Uses OPFS with ReadableStream
  • Bypasses browser 2GB ArrayBuffer limit
  • Required for large models (E4B 4GB+, 7B, 27B)
  • Requires Chrome 86+, Edge 86+, Safari 15.2+

3. Ephemeral Mode (WebStorageMode.none):

  • Models stored in memory only
  • Cleared when browser closes
  • For testing/demos

// Default: Cache API for small models
FlutterGemma.initialize(webStorageMode: WebStorageMode.cacheApi);

// Streaming for large models (>2GB)
FlutterGemma.initialize(webStorageMode: WebStorageMode.streaming);

// Check if streaming is supported
final supported = await FlutterGemma.isStreamingSupported();

Backend Support

  • GPU only: Web runs models on the GPU backend; the CPU backend is not yet supported by MediaPipe on web

CORS Configuration

  • Required for custom servers: Enable CORS headers on your model hosting server
  • Firebase Storage: See CORS configuration docs
  • HuggingFace: CORS already configured correctly

Memory Limitations

  • Large models: May hit browser memory limits (2GB typical)
  • Recommended: Use smaller models (1B-2B) for web platform
  • Best models for web:
    • Gemma 3 270M (300MB)
    • Gemma 3 1B (500MB-1GB)
    • Gemma3n E2B (3GB) - requires 6GB+ device RAM

Browser Cache Storage Limits

| Browser | Max Model Size | Notes |
|---|---|---|
| Chrome/Firefox | ~2 GB | ArrayBuffer limit |
| Safari | ~50 MB | ⚠️ Not suitable |

Mobile Platform Specifics #

Android

  • GPU Support: Requires OpenGL libraries in AndroidManifest.xml
  • ProGuard: Automatic rules included for release builds
  • Storage: Local file system in app documents directory

iOS

  • Minimum version: iOS 16.0 required for MediaPipe GenAI
  • Memory entitlements: Required for large models (see Setup section)
  • Linking: Static linking required (use_frameworks! :linkage => :static)
  • Storage: Local file system in app documents directory
  • Embedding models: Supported via TensorFlowLiteC; no extra Podfile configuration needed

You can find the full, complete example in the example folder.

Important Considerations #

  • Model Size: Larger models (such as 7b and 7b-it) might be too resource-intensive for on-device inference.
  • Function Calling Support: See the support list above; models without function calling ignore tools and log a warning.
  • Thinking Mode: DeepSeek and Gemma 4 models support thinking mode. Enable it with isThinking: true and the matching modelType (see the sketch after this list).
  • Multimodal Models: Gemma3n models with vision support require more memory and are recommended for devices with 8GB+ RAM.
  • iOS Memory Requirements: Large models require memory entitlements in Runner.entitlements and a minimum of iOS 16.0.
  • LoRA Weights: They provide efficient customization without the need for full model retraining.
  • Development vs. Production: For production apps, do not embed the model or LoRA weights in your assets. Instead, download them once and store them securely on the device.
  • Web Models: Currently, Web support is available only for GPU backend models. Multimodal support is fully implemented.
  • Image Formats: The plugin automatically handles common image formats (JPEG, PNG, etc.) when using Message.withImage().
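
A sketch of enabling thinking mode for a DeepSeek model, using the isThinking and modelType parameters named above (their chat-level placement is an assumption; verify against the API reference for your plugin version):

final chat = await model.createChat(
  isThinking: true,              // emit ThinkingResponse blocks
  modelType: ModelType.deepSeek, // thinking-capable model family
);

await chat.addQueryChunk(Message.text(text: 'Why is the sky blue?', isUser: true));
chat.generateChatResponseAsync().listen((response) {
  if (response is ThinkingResponse) {
    print('Reasoning: ${response.content}');
  } else if (response is TextResponse) {
    print(response.token);
  }
});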

🛟 Troubleshooting #

Multimodal Issues:

  • Ensure you're using a multimodal model (Gemma 4 or Gemma3n E2B/E4B)
  • Set supportImage: true when creating model and chat
  • Check device memory - multimodal models require more RAM

Performance:

  • Use GPU backend for better performance with multimodal models
  • Consider using CPU backend for text-only models on lower-end devices

Memory Issues:

  • iOS: Ensure Runner.entitlements contains memory entitlements (see iOS setup)
  • iOS: Set minimum platform to iOS 16.0 in Podfile
  • Reduce maxTokens if experiencing memory issues
  • Use smaller models (1B-2B parameters) for devices with <6GB RAM
  • Close sessions and models when not needed
  • Monitor token usage with sizeInTokens() (see the sketch below)
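
A sketch of checking a prompt against the token budget before sending it, assuming the session-level sizeInTokens() mentioned above (session creation shown here follows the plugin's session API; verify names against the API reference):

final session = await model.createSession();
final tokens = await session.sizeInTokens('Your long prompt here');
print('Prompt uses $tokens of the maxTokens budget');
await session.close(); // free native resources when done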

iOS Build Issues:

  • Ensure minimum iOS version is set to 16.0 in Podfile
  • Use static linking: use_frameworks! :linkage => :static
  • Clean and reinstall pods: cd ios && pod install --repo-update
  • Check that all required entitlements are in Runner.entitlements

Advanced Usage #

ModelThinkingFilter (Advanced) #

For advanced users who need to manually process model responses, the ModelThinkingFilter class provides utilities for cleaning model outputs:

import 'package:flutter_gemma/core/extensions.dart';

// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse,
  ModelType.deepSeek
);

// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - <think>...</think> blocks (DeepSeek)
// - <|channel>thought\n...<channel|> blocks (Gemma 4 E2B/E4B)
// - Extra whitespace and formatting

This is automatically handled by the chat API, but can be useful for custom inference implementations.

☕ Support the Project #

If you find Flutter Gemma useful and want to support its development, consider buying me a coffee! Your support helps me:

  • 🔧 Maintain and improve the plugin
  • 📚 Keep documentation up-to-date
  • 🐛 Fix bugs and resolve issues faster
  • ✨ Add new features and model support
  • 🧪 Test on more devices and platforms


Every contribution, no matter how small, makes a difference. Thank you for your support! 💙
