Multimodal Tutorial
Step-by-step guide to using RAG-lite TS with mixed text and image content
This tutorial walks you through setting up and using the Chameleon Multimodal Architecture to search across both text documents and images using natural language queries.
Table of Contents
- Overview
- Quick Start
- Setting Up Multimodal Mode
- Processing Mixed Content
- Advanced Reranking Strategies
- Performance Optimization
- Troubleshooting
Overview
The simplified multimodal architecture enables RAG-lite TS to:
- Process mixed content: Text documents (Markdown, PDF, DOCX) and images (JPG, PNG, GIF, WebP)
- Cross-modal search: Find images using text queries and text using image descriptions
- Automatic mode detection: Set mode once during ingestion, automatically detected during search
- Intelligent reranking: Multiple strategies for optimal search results across content types
- Seamless experience: Same CLI commands work for both text-only and multimodal content
- Reliable operation: No fallback mechanisms - each mode works predictably or fails clearly
Supported Content Types
Text Content:
- Markdown files (.md, .mdx)
- Text files (.txt)
- PDF documents (.pdf)
- Word documents (.docx)
Image Content (Multimodal Mode):
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- GIF images (.gif)
- WebP images (.webp)
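If you want to pre-filter a directory before ingestion, the extension lists above translate directly into a small helper. This is an illustrative sketch, not part of the rag-lite-ts API:
import { extname } from 'node:path';
// Extension sets mirroring the supported content types listed above
const TEXT_EXTENSIONS = new Set(['.md', '.mdx', '.txt', '.pdf', '.docx']);
const IMAGE_EXTENSIONS = new Set(['.jpg', '.jpeg', '.png', '.gif', '.webp']);
// Illustrative helper (not part of rag-lite-ts): classify a file by extension
function classifyFile(path: string): 'text' | 'image' | 'unsupported' {
  const ext = extname(path).toLowerCase();
  if (TEXT_EXTENSIONS.has(ext)) return 'text';
  if (IMAGE_EXTENSIONS.has(ext)) return 'image';
  return 'unsupported';
}
console.log(classifyFile('./docs/api-guide.md'));    // 'text'
console.log(classifyFile('./images/ui-mockup.jpg')); // 'image'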
Quick Start
1. Basic Multimodal Setup
# Ingest mixed content in multimodal mode
raglite ingest ./mixed-content/ --mode multimodal
# Search across both text and images (mode auto-detected)
raglite search "architecture diagram"
raglite search "user interface mockup"
raglite search "API documentation"
2. Programmatic Usage
import { IngestionPipeline, SearchEngine } from 'rag-lite-ts';
// 1. Ingest mixed content
const ingestion = new IngestionPipeline('./db.sqlite', './index.bin', {
  mode: 'multimodal',
  embeddingModel: 'Xenova/clip-vit-base-patch32',
  rerankingStrategy: 'text-derived'
});
await ingestion.ingestDirectory('./mixed-content/');
await ingestion.cleanup();
// 2. Search (mode auto-detected from database)
const search = new SearchEngine('./index.bin', './db.sqlite');
const results = await search.search('system architecture');
// Results include both text and image matches
for (const result of results) {
  console.log(`${result.contentType}: ${result.document.source}`);
  console.log(`Score: ${result.score.toFixed(2)}`);
  if (result.metadata?.dimensions) {
    console.log(`Image: ${result.metadata.dimensions.width}x${result.metadata.dimensions.height}`);
  }
}
await search.cleanup();
Setting Up Multimodal Mode
CLI Configuration
# Basic multimodal ingestion
raglite ingest ./docs/ --mode multimodal
# With a specific model
# (reranking automatically uses the text-derived strategy in multimodal mode)
raglite ingest ./docs/ --mode multimodal \
  --model Xenova/clip-vit-base-patch32
# Search automatically detects multimodal mode
raglite search "flowchart showing process"
Environment Variables
# Set multimodal defaults
export RAG_EMBEDDING_MODEL="Xenova/clip-vit-base-patch32"
export RAG_BATCH_SIZE="8"
export RAG_CHUNK_SIZE="300"
# Ingest and search; the env vars above supply the model default
raglite ingest ./content/ --mode multimodal
raglite search "diagram"
Processing Mixed Content
Directory Structure Example
content/
├── docs/
│ ├── api-guide.md
│ ├── user-manual.pdf
│ └── setup-instructions.docx
├── images/
│ ├── architecture-diagram.png
│ ├── ui-mockup.jpg
│ └── flowchart.gif
└── mixed/
├── tutorial.md
├── screenshot1.png
└── screenshot2.png
Processing Workflow
# Ingest entire directory structure
raglite ingest ./content/ --mode multimodal
# The system will:
# 1. Process text files normally (chunking, embedding)
# 2. Generate descriptions for images using image-to-text models
# 3. Extract image metadata (dimensions, file size, format)
# 4. Store everything with appropriate content types
What Happens During Processing
Text Files:
- Chunked into optimal sizes
- Embedded using CLIP text encoder
- Stored with content_type='text'
Image Files:
- Described using Xenova/vit-gpt2-image-captioning (sketched below)
- Metadata extracted using Sharp
- Embedded as single chunks with content_type='image'
- Original image path preserved in metadata
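For intuition, the image path can be reproduced with the same open-source pieces: @xenova/transformers for captioning and sharp for metadata. This sketch only illustrates the idea; it is not the rag-lite-ts internals:
import { pipeline } from '@xenova/transformers';
import sharp from 'sharp';
import { statSync } from 'node:fs';
// Generate a description with the same captioning model
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');
const [caption] = await captioner('./images/architecture-diagram.png');
console.log(caption.generated_text); // e.g. "a diagram showing ..."
// Extract dimensions and format with Sharp; file size via fs
const meta = await sharp('./images/architecture-diagram.png').metadata();
const { size } = statSync('./images/architecture-diagram.png');
console.log({ width: meta.width, height: meta.height, format: meta.format, bytes: size });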
Example Processing Output
raglite ingest ./content/ --mode multimodal
# Output:
# Processing text files...
# - docs/api-guide.md: 15 chunks created
# - docs/user-manual.pdf: 23 chunks created
#
# Processing images...
# - images/architecture-diagram.png: "A diagram showing system architecture with connected components"
# - images/ui-mockup.jpg: "A user interface mockup showing login form and navigation"
#
# Summary:
# - 45 text chunks processed
# - 8 images processed with descriptions
# - Mode 'multimodal' stored in database
Advanced Reranking Strategies
Multimodal mode supports several reranking strategies to optimize search results:
Text-Derived Reranking (Default)
Converts images to text descriptions, then applies cross-encoder reranking:
raglite ingest ./content/ --mode multimodal
How it works:
- Images converted to text using Xenova/vit-gpt2-image-captioning
- Cross-encoder reranking applied to all text (including image descriptions)
- Original image content restored in results
Best for: High-quality semantic matching across text and images
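Conceptually the strategy reduces to the sketch below. rerankPairs stands in for whatever cross-encoder scoring the library uses; both it and the SearchResult shape are hypothetical names for illustration:
// Hypothetical types and scoring function, for illustration only
interface SearchResult { content: string; contentType: 'text' | 'image'; score: number; }
declare function rerankPairs(query: string, texts: string[]): Promise<number[]>;
async function textDerivedRerank(query: string, results: SearchResult[]): Promise<SearchResult[]> {
  // Every candidate is already text: images were captioned at ingestion,
  // so result.content holds the generated description for image hits
  const texts = results.map(r => r.content);
  // Score each (query, text) pair with the cross-encoder
  const scores = await rerankPairs(query, texts);
  // Re-sort by the new scores; image results keep their original content and metadata
  return results
    .map((r, i) => ({ ...r, score: scores[i] }))
    .sort((a, b) => b.score - a.score);
}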
Disabled Reranking
Uses only vector similarity scores:
raglite ingest ./content/ --mode multimodal --no-rerank
Best for: Maximum speed when reranking quality isn't critical
Comparing Strategies
import { IngestionPipeline, SearchEngine } from 'rag-lite-ts';
// Test different strategies
// Test different strategies (as const keeps the literal types)
const strategies = ['text-derived', 'disabled'] as const;
for (const strategy of strategies) {
  const ingestion = new IngestionPipeline(
    `./db-${strategy}.sqlite`,
    `./index-${strategy}.bin`,
    {
      mode: 'multimodal',
      rerankingStrategy: strategy
    }
  );
  await ingestion.ingestDirectory('./content/');
  await ingestion.cleanup();
  const search = new SearchEngine(`./index-${strategy}.bin`, `./db-${strategy}.sqlite`);
  const results = await search.search('system architecture diagram');
  console.log(`\n${strategy} strategy results:`);
  results.slice(0, 3).forEach((r, i) => {
    console.log(`${i + 1}. ${r.document.source} (${r.score.toFixed(2)})`);
  });
  await search.cleanup();
}
Performance Optimization
Model Selection
For Speed:
# Use CLIP patch32 (faster, good quality)
raglite ingest ./content/ --mode multimodal --model Xenova/clip-vit-base-patch32
For Quality:
# Use CLIP patch16 (slower, better quality)
raglite ingest ./content/ --mode multimodal --model Xenova/clip-vit-base-patch16
Batch Processing
# Optimize batch sizes for your system
export RAG_BATCH_SIZE="8" # Conservative for multimodal
export RAG_BATCH_SIZE="16" # Aggressive if you have enough RAM
Memory Management
import { IngestionPipeline } from 'rag-lite-ts';
// Process large collections in batches
const directories = ['./batch1/', './batch2/', './batch3/'];
for (const dir of directories) {
  const ingestion = new IngestionPipeline('./db.sqlite', './index.bin', {
    mode: 'multimodal',
    batchSize: 8 // Conservative batch size
  });
  await ingestion.ingestDirectory(dir);
  await ingestion.cleanup(); // Important: cleanup between batches
}
Content Organization Tips
For optimal performance, organize your content thoughtfully:
# Process text-heavy directories first
raglite ingest ./docs/ --mode multimodal
# Then add image collections separately
raglite ingest ./images/ --mode multimodal
# Use smaller batch sizes for memory-intensive processing
export RAG_BATCH_SIZE="4"
raglite ingest ./large-images/ --mode multimodal
Troubleshooting
Common Issues
"Image processing failed"
Cause: Image file corrupted or unsupported format
Solution:
# Check image files
file ./images/*.png
# Try with known good images
raglite ingest ./test-data/images/ --mode multimodal
# Check supported formats: JPG, PNG, GIF, WebP
"Out of memory during image processing"
Cause: Large images or too many processed simultaneously
Solution:
# Reduce batch size
export RAG_BATCH_SIZE="4"
# Process smaller directories
raglite ingest ./images/batch1/ --mode multimodal
raglite ingest ./images/batch2/ --mode multimodal
"Mode mismatch detected"
Cause: Trying to search multimodal content with text-only index
Solution:
# Rebuild with multimodal mode
raglite ingest ./content/ --mode multimodal --rebuild-if-needed
"No image descriptions generated"
Cause: Image-to-text model failed to load
Solution:
# Clear model cache and retry
rm -rf ~/.raglite/models/
raglite ingest ./content/ --mode multimodal
# Check internet connection for model download
ping huggingface.co
Debug Mode
# Enable debug logging for multimodal processing
DEBUG=1 raglite ingest ./content/ --mode multimodal
# Look for these patterns:
# "Processing image: ./image.png"
# "Generated description: A diagram showing..."
# "Extracted metadata: {width: 1920, height: 1080}"
Performance Monitoring
# Monitor memory usage during processing
top -p $(pgrep -f raglite)
# Time multimodal operations
time raglite ingest ./content/ --mode multimodal
time raglite search "architecture diagram"
Validation
# Verify multimodal content was processed
raglite search "test" --top-k 20
# Check for both text and image results
# Look for content_type in results
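The same validation can be scripted. This assumes the result fields used elsewhere in this tutorial (contentType on each result):
import { SearchEngine } from 'rag-lite-ts';
// Broad query with a large top_k, then count results per content type
const search = new SearchEngine('./index.bin', './db.sqlite');
const results = await search.search('test', { top_k: 20 });
const counts = new Map<string, number>();
for (const r of results) {
  counts.set(r.contentType, (counts.get(r.contentType) ?? 0) + 1);
}
console.log(counts); // expect both 'text' and 'image' entries for multimodal content
await search.cleanup();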
Cross-Modal Search Examples
Finding Images with Text Queries
The power of multimodal mode is the ability to find images using natural language descriptions:
# Ingest mixed content in multimodal mode
raglite ingest ./content/ --mode multimodal
# Find images using text descriptions
raglite search "red sports car" --content-type image
raglite search "mountain sunset landscape" --content-type image
raglite search "architecture diagram" --content-type image
Image-to-Image Search
You can also search using image files directly to find semantically similar images:
# Find images similar to a reference image
raglite search ./reference-photo.jpg
# Find similar images with custom result count
raglite search ./diagram.png --top-k 10
# Image search with content type filtering
raglite search ./photo.jpg --content-type image
# Image search works with standard options (reranking disabled to preserve visual similarity)
raglite search ./reference.png --content-type image --top-k 10
Key Features:
- Direct image queries: Use image files as search input instead of text
- Find similar images: System finds semantically similar images in your collection
- Cross-modal results: Can return both images and related text content
- All options supported: Works with --top-k, --content-type, --rerank, etc.
Programmatic usage:
import { SearchEngine } from 'rag-lite-ts';
const search = new SearchEngine('./multimodal.bin', './multimodal.sqlite');
// Find images using text query
const results = await search.search('red sports car', { top_k: 5 });
// Filter to only image results
const imageResults = results.filter(r => r.contentType === 'image');
console.log('Images matching "red sports car":');
imageResults.forEach((result, i) => {
  console.log(`${i + 1}. ${result.document.source} (${result.score.toFixed(2)})`);
  console.log(`   Description: ${result.content}`);
  if (result.metadata?.dimensions) {
    console.log(`   Size: ${result.metadata.dimensions.width}x${result.metadata.dimensions.height}`);
  }
});
await search.cleanup();
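For image-to-image queries programmatically, a reasonable sketch follows, assuming search() accepts an image file path the way the CLI does (only the CLI form is shown above):
import { SearchEngine } from 'rag-lite-ts';
const search = new SearchEngine('./multimodal.bin', './multimodal.sqlite');
// Assumption: a file path as the query triggers image-to-image search, as in the CLI
const similar = await search.search('./reference-photo.jpg', { top_k: 10 });
for (const r of similar.filter(r => r.contentType === 'image')) {
  console.log(`${r.document.source} (${r.score.toFixed(2)})`);
}
await search.cleanup();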
Searching Across Both Content Types
Search for content regardless of type, then separate results:
import { SearchEngine } from 'rag-lite-ts';
const search = new SearchEngine('./multimodal.bin', './multimodal.sqlite');
// Search for content about vehicles
const results = await search.search('vehicles and transportation', {
  top_k: 10,
  rerank: true
});
// Separate by content type
const textResults = results.filter(r => r.contentType === 'text');
const imageResults = results.filter(r => r.contentType === 'image');
console.log(`Found ${textResults.length} text documents and ${imageResults.length} images`);
console.log('\nText Documents:');
textResults.forEach((result, i) => {
  console.log(`${i + 1}. ${result.document.source} (${result.score.toFixed(2)})`);
  console.log(`   ${result.content.substring(0, 100)}...`);
});
console.log('\nImages:');
imageResults.forEach((result, i) => {
  console.log(`${i + 1}. ${result.document.source} (${result.score.toFixed(2)})`);
  console.log(`   ${result.content}`);
});
await search.cleanup();
Understanding Semantic Similarity
CLIP models understand semantic concepts across modalities:
# These queries work because CLIP understands visual and semantic concepts
# Color-based search
raglite search "bright red color" --content-type image
# Abstract concepts
raglite search "adventure and exploration"
# Visual attributes
raglite search "modern minimalist design" --content-type image
# Compositional concepts
raglite search "person standing on mountain peak" --content-type image
Advanced Examples
Documentation Site with Screenshots
# Perfect for documentation with UI screenshots
raglite ingest ./docs/ --mode multimodal
# Search for UI elements
raglite search "login button screenshot"
raglite search "navigation menu example"
Technical Diagrams Collection
# Optimize for technical diagrams and flowcharts
raglite ingest ./diagrams/ --mode multimodal
# Search works with semantic understanding
raglite search "database schema diagram"
raglite search "network topology chart"
Mixed Content Knowledge Base
import { IngestionPipeline, SearchEngine } from 'rag-lite-ts';
// Complete knowledge base with text and visual content
const ingestion = new IngestionPipeline('./kb.sqlite', './kb-index.bin', {
  mode: 'multimodal',
  embeddingModel: 'Xenova/clip-vit-base-patch32',
  rerankingStrategy: 'text-derived',
  chunkSize: 300,
  chunkOverlap: 60
});
// Ingest everything
await ingestion.ingestDirectory('./knowledge-base/');
await ingestion.cleanup();
// Intelligent search across all content
const search = new SearchEngine('./kb-index.bin', './kb.sqlite');
// These queries work across both text and images
const queries = [
  'user authentication flow',
  'system architecture overview',
  'API endpoint documentation',
  'error handling examples'
];
for (const query of queries) {
  console.log(`\nSearching: "${query}"`);
  const results = await search.search(query, { top_k: 5 });
  results.forEach((result, i) => {
    console.log(`${i + 1}. [${result.contentType}] ${result.document.source}`);
    console.log(`   Score: ${result.score.toFixed(2)}`);
    console.log(`   Preview: ${result.content.substring(0, 100)}...`);
  });
}
await search.cleanup();
This tutorial covers the complete multimodal workflow. Start with the Quick Start section, then explore the advanced features as needed for your specific use case.