Unified Content System Guide

The unified content system lets RAG-lite handle content from both the filesystem and in-memory buffers, and it adapts the retrieval format to the type of client. This makes MCP (Model Context Protocol) integration possible while keeping RAG-lite's simple, local-first philosophy.

Overview
Memory Ingestion
Content Retrieval
Configuration
Storage Management
Error Handling
Performance Optimization
Troubleshooting
API Reference

Overview

What is the Unified Content System?

The unified content system extends RAG-lite's capabilities to support:

Memory-based ingestion: Process content directly from buffers without temporary files
Format-adaptive retrieval: Serve content as file paths (CLI) or base64 data (MCP)
Dual storage strategy: Keep filesystem content in place, store memory content efficiently
Universal content IDs: Stable identifiers for all content regardless of source
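
This guide does not say how content IDs are derived. As a rough illustration of what "stable regardless of source" could mean, the sketch below hashes the content bytes with SHA-256, so the same bytes always map to the same ID whether they came from a file or a buffer. The function name `hypotheticalContentId` and the hashing scheme are assumptions for illustration, not RAG-lite's actual implementation.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper (not part of rag-lite-ts): derive an ID from the bytes alone,
// so the ID does not depend on where the content came from.
function hypotheticalContentId(content: Uint8Array): string {
  return createHash('sha256').update(content).digest('hex');
}

const idFirst = hypotheticalContentId(Buffer.from('# Machine Learning Guide'));
const idSecond = hypotheticalContentId(Buffer.from('# Machine Learning Guide'));

// Identical bytes produce an identical ID
console.log(idFirst === idSecond);
```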

Key Benefits

MCP Integration: Enables real-time content processing for AI agents
Backward Compatibility: Existing code continues to work unchanged
Simple API: No ceremony; the same patterns as the existing RAG-lite API
Local-First: All content stored locally with no external dependencies

Memory Ingestion

Basic Memory Ingestion

import { IngestionPipeline } from 'rag-lite-ts';

const ingestion = new IngestionPipeline('./db.sqlite', './vector-index.bin');

// Ingest content from memory buffer
const content = Buffer.from('# Machine Learning Guide\n\nThis is a comprehensive guide...');
const contentId = await ingestion.ingestFromMemory(content, {
  displayName: 'ML Guide.md',
  contentType: 'text/markdown'
});

console.log(`Content ingested with ID: ${contentId}`);

Advanced Memory Ingestion

// Ingest with full metadata
const contentId = await ingestion.ingestFromMemory(imageBuffer, {
  displayName: 'architecture-diagram.png',
  contentType: 'image/png',
  originalPath: '/uploads/diagrams/architecture.png' // Optional reference
});

// Ingest JSON configuration
const configBuffer = Buffer.from(JSON.stringify({
  model: 'sentence-transformers/all-MiniLM-L6-v2',
  chunkSize: 400
}));

const configId = await ingestion.ingestFromMemory(configBuffer, {
  displayName: 'config.json',
  contentType: 'application/json'
});

Content Type Detection

The system automatically detects content types using:

Magic number detection (most reliable)
Extension-based detection (from display name)
Content analysis (for text content)

// Automatic detection from buffer content
const pdfContentId = await ingestion.ingestFromMemory(pdfBuffer, {
  displayName: 'document.pdf'
  // contentType automatically detected as 'application/pdf'
});

// Explicit content type (recommended for accuracy)
const htmlContentId = await ingestion.ingestFromMemory(htmlBuffer, {
  displayName: 'page.html',
  contentType: 'text/html'
});
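
To make the three-step detection order concrete, here is a minimal sketch that checks magic numbers first, then the display name's extension, then falls back to a simple text heuristic. The function `detectContentType`, the small lookup tables, and the NUL-byte heuristic are all assumptions for illustration; the library's real detector is likely to cover more formats and may work differently.

```typescript
// Hypothetical magic-number table: leading bytes that identify a format.
const MAGIC_NUMBERS: Array<{ mime: string; bytes: number[] }> = [
  { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47] },       // "\x89PNG"
  { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
  { mime: 'image/gif', bytes: [0x47, 0x49, 0x46, 0x38] },       // "GIF8"
];

// Hypothetical extension table used when no magic number matches.
const EXTENSION_TYPES = new Map<string, string>([
  ['md', 'text/markdown'],
  ['html', 'text/html'],
  ['json', 'application/json'],
  ['csv', 'text/csv'],
  ['txt', 'text/plain'],
]);

function detectContentType(content: Uint8Array, displayName: string): string {
  // 1. Magic numbers: compare the leading bytes against known signatures
  for (const { mime, bytes } of MAGIC_NUMBERS) {
    if (bytes.every((byte, i) => content[i] === byte)) return mime;
  }

  // 2. Extension taken from the display name
  const dot = displayName.lastIndexOf('.');
  const extension = dot >= 0 ? displayName.slice(dot + 1).toLowerCase() : '';
  const byExtension = EXTENSION_TYPES.get(extension);
  if (byExtension !== undefined) return byExtension;

  // 3. Content analysis: a NUL byte in the first 512 bytes suggests binary data
  const sample = content.subarray(0, 512);
  return sample.some((byte) => byte === 0) ? 'application/octet-stream' : 'text/plain';
}
```

Because the magic-number check runs first, a PDF buffer with a misleading display name would still be reported as `application/pdf` in this sketch.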

Supported Content Types

Text Formats:

text/plain, text/markdown, text/html
application/json, application/xml
text/csv, application/javascript

Document Formats:

application/pdf
application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
application/msword (DOC)

Image Formats:

image/jpeg, image/png, image/gif
image/webp, image/bmp, image/tiff
image/svg+xml

Content Retrieval

Basic Content Retrieval

import { SearchEngine } from 'rag-lite-ts';

const search = new SearchEngine('./vector-index.bin', './db.sqlite');

// Search returns content IDs
const results = await search.search('machine learning concepts');

// Retrieve content in different formats
for (const result of results) {
  const contentId = result.document.contentId;
  
  // For CLI clients - get file path
  const filePath = await search.getContent(contentId, 'file');
  console.log(`File available at: ${filePath}`);
  
  // For MCP clients - get base64 data
  const base64Data = await search.getContent(contentId, 'base64');
  console.log(`Base64 length: ${base64Data.length}`);
}
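
A client that receives the 'base64' format usually needs the raw bytes back at some point. The sketch below shows one way to do that with Node's `Buffer`, plus a way to compute the decoded size without decoding. Both helper names are hypothetical, and the size calculation assumes standard padded base64, which this guide does not explicitly confirm.

```typescript
// Decode a base64 payload back into raw bytes.
function decodeBase64Content(base64Data: string): Buffer {
  return Buffer.from(base64Data, 'base64');
}

// Decoded size in bytes, computed from the string alone.
// Assumes standard base64 with '=' padding (4 characters per 3 bytes).
function decodedByteLength(base64Data: string): number {
  const padding = base64Data.endsWith('==') ? 2 : base64Data.endsWith('=') ? 1 : 0;
  return (base64Data.length / 4) * 3 - padding;
}

const encoded = Buffer.from('hello').toString('base64');
console.log(decodeBase64Content(encoded).toString('utf8'));
console.log(decodedByteLength(encoded));
```

Note that the length of the base64 string is roughly a third larger than the content it encodes, so it should not be reported as a byte count.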

Batch Content Retrieval

// Efficient batch retrieval for multiple items
const contentIds = results.map(r => r.document.contentId);

const batchResults = await search.getContentBatch(
  contentIds.map(id => ({ contentId: id, format: 'base64' }))
);

batchResults.forEach(result => {
  if (result.success) {
    console.log(`Content ${result.contentId}: ${result.content?.length} base64 characters`);
  } else {
    console.error(`Failed to retrieve ${result.contentId}: ${result.error}`);
  }
});
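
When a batch mixes successes and failures, it can help to separate them before further processing. The sketch below does that using the `ContentResult` shape from the API Reference; the helper `partitionResults` itself is hypothetical and not part of the library.

```typescript
// Shape as documented in the API Reference section of this guide.
interface ContentResult {
  contentId: string;
  success: boolean;
  content?: string;
  error?: string;
}

// Hypothetical helper: split a batch into loaded content and failures.
function partitionResults(results: ContentResult[]) {
  const loaded = new Map<string, string>();
  const failed: Array<{ contentId: string; error: string }> = [];

  for (const result of results) {
    if (result.success && result.content !== undefined) {
      loaded.set(result.contentId, result.content);
    } else {
      failed.push({ contentId: result.contentId, error: result.error ?? 'unknown error' });
    }
  }
  return { loaded, failed };
}
```

The `failed` list can then drive a re-ingestion or retry step without touching the items that loaded correctly.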

Content Metadata

// Get content information without loading the full content
const metadata = await search.getContentMetadata(contentId);

console.log({
  displayName: metadata.displayName,
  contentType: metadata.contentType,
  fileSize: metadata.fileSize,
  storageType: metadata.storageType,
  createdAt: metadata.createdAt
});

Configuration

Basic Configuration

import { IngestionPipeline, SearchEngine } from 'rag-lite-ts';

// Configure content system during ingestion
const ingestion = new IngestionPipeline('./db.sqlite', './vector-index.bin', {
  contentSystem: {
    maxFileSize: '100MB',           // Individual file limit
    maxContentDirSize: '5GB',       // Total content directory limit
    contentDir: '.raglite/content', // Storage location
    enableDeduplication: true       // Hash-based deduplication
  }
});

Configuration Options

interface ContentSystemConfig {
  // Size limits (accepts strings like "50MB", "2GB" or numbers in bytes)
  maxFileSize: number | string;        // Default: "50MB"
  maxContentDirSize: number | string;  // Default: "2GB"
  
  // Storage settings
  contentDir: string;                  // Default: ".raglite/content"
  enableDeduplication: boolean;        // Default: true
  enableStorageTracking: boolean;      // Default: true
  
  // Storage thresholds (percentages)
  storageWarningThreshold: number;     // Default: 75 (warn at 75% full)
  storageErrorThreshold: number;       // Default: 95 (reject at 95% full)
}
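
The size limits accept either a byte count or a string such as "50MB". The sketch below shows one way such strings could be converted to bytes. It assumes binary units (1 MB = 1024 × 1024 bytes), which is consistent with the storage report later in this guide showing the 2GB default as 2048.0 MB. The `parseSize` name and the accepted unit list are assumptions, not the library's parser.

```typescript
// Hypothetical unit table, assuming binary (1024-based) units.
const UNIT_BYTES = new Map<string, number>([
  ['B', 1],
  ['KB', 1024],
  ['MB', 1024 ** 2],
  ['GB', 1024 ** 3],
  ['TB', 1024 ** 4],
]);

function parseSize(value: number | string): number {
  if (typeof value === 'number') {
    if (!Number.isFinite(value) || value < 0) throw new Error(`Invalid size: ${value}`);
    return value;
  }

  // Accept forms like "50MB", "2 GB", or "1.5kb"
  const match = /^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$/i.exec(value);
  if (!match) throw new Error(`Invalid size: "${value}"`);

  const [, amount = '', unit = ''] = match;
  const multiplier = UNIT_BYTES.get(unit.toUpperCase());
  if (multiplier === undefined) throw new Error(`Unknown size unit in "${value}"`);

  return Math.round(Number(amount) * multiplier);
}

console.log(parseSize('50MB')); // 52428800
console.log(parseSize('2GB'));  // 2147483648
```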

Environment-Specific Configuration

// Development configuration
const devConfig = {
  contentSystem: {
    maxFileSize: '10MB',
    maxContentDirSize: '500MB',
    storageWarningThreshold: 80
  }
};

// Production configuration
const prodConfig = {
  contentSystem: {
    maxFileSize: '100MB',
    maxContentDirSize: '10GB',
    storageWarningThreshold: 70,
    storageErrorThreshold: 90
  }
};
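
One way to choose between such configurations at startup is to merge environment-specific overrides onto the documented defaults. The sketch below illustrates the idea; `resolveContentSystemConfig` and the override table are hypothetical helpers, and only the option names and default values come from this guide.

```typescript
// Subset of the options listed under Configuration Options; all optional
// because each environment overrides only what it needs.
interface ContentSystemOverrides {
  maxFileSize?: number | string;
  maxContentDirSize?: number | string;
  storageWarningThreshold?: number;
  storageErrorThreshold?: number;
}

// Defaults as documented in this guide.
const contentSystemDefaults = {
  maxFileSize: '50MB' as number | string,
  maxContentDirSize: '2GB' as number | string,
  storageWarningThreshold: 75,
  storageErrorThreshold: 95,
};

const overridesByEnvironment = new Map<string, ContentSystemOverrides>([
  ['development', { maxFileSize: '10MB', maxContentDirSize: '500MB', storageWarningThreshold: 80 }],
  ['production', {
    maxFileSize: '100MB',
    maxContentDirSize: '10GB',
    storageWarningThreshold: 70,
    storageErrorThreshold: 90,
  }],
]);

// Hypothetical helper: unknown or missing environments fall back to the defaults.
function resolveContentSystemConfig(environment: string | undefined) {
  const overrides = environment !== undefined ? overridesByEnvironment.get(environment) : undefined;
  return { contentSystem: { ...contentSystemDefaults, ...(overrides ?? {}) } };
}
```

In a Node application this would typically be called as `resolveContentSystemConfig(process.env.NODE_ENV)` and the result passed as the third constructor argument shown under Basic Configuration.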

Storage Management

Storage Statistics

import { ContentManager } from 'rag-lite-ts/core';

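// Note: `db` and `config` are not defined in this guide; presumably they are an
// existing database connection and a content system configuration
const contentManager = new ContentManager(db, config);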

// Get comprehensive storage statistics
const stats = await contentManager.getStorageStats();

console.log({
  contentDirectory: {
    files: stats.contentDirectory.totalFiles,
    sizeMB: stats.contentDirectory.totalSizeMB,
    averageFileSize: stats.contentDirectory.averageFileSize
  },
  filesystemReferences: {
    count: stats.filesystemReferences.totalRefs,
    sizeMB: stats.filesystemReferences.totalSizeMB
  },
  limits: {
    usagePercent: stats.limits.currentUsagePercent,
    remainingMB: stats.limits.remainingSpaceMB
  }
});

Storage Cleanup

// Remove orphaned files (files without metadata references)
const orphanedResult = await contentManager.removeOrphanedFiles();
console.log(`Removed ${orphanedResult.removedFiles.length} orphaned files`);
console.log(`Freed ${orphanedResult.freedSpace} bytes`);

// Remove duplicate content (same hash, different files)
const duplicateResult = await contentManager.removeDuplicateContent();
console.log(`Removed ${duplicateResult.removedFiles.length} duplicate files`);
console.log(`Freed ${duplicateResult.freedSpace} bytes`);
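
To illustrate what "same hash, different files" means for duplicate removal, the sketch below walks a list of stored files, keeps the first file seen for each hash, and reports the rest as removable. It only plans the removal and touches no files. The `StoredFile` shape and `planDuplicateRemoval` are assumptions for illustration; only the `removedFiles` and `freedSpace` field names are taken from the example above.

```typescript
// Hypothetical record describing one file in the content directory.
interface StoredFile {
  path: string;
  hash: string;
  sizeBytes: number;
}

// Keep the first file for each hash; every later file with the same hash is a duplicate.
function planDuplicateRemoval(files: StoredFile[]): { removedFiles: string[]; freedSpace: number } {
  const seenHashes = new Set<string>();
  const removedFiles: string[] = [];
  let freedSpace = 0;

  for (const file of files) {
    if (seenHashes.has(file.hash)) {
      removedFiles.push(file.path);
      freedSpace += file.sizeBytes;
    } else {
      seenHashes.add(file.hash);
    }
  }
  return { removedFiles, freedSpace };
}
```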

Storage Monitoring

// Get storage limit status and recommendations
const status = await contentManager.getStorageLimitStatus();

console.log(`Storage usage: ${status.currentUsagePercent}%`);
console.log(`Can accept new content: ${status.canAcceptContent}`);

if (status.isNearWarningThreshold) {
  console.log('⚠️ Storage recommendations:');
  status.recommendations.forEach(rec => console.log(`  ${rec}`));
}
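
The relationship between usage and the two thresholds can be sketched as a small pure function. The defaults of 75 and 95 come from the Configuration Options listing. The function name, the return shape, and the choice of `>=` for the comparisons are assumptions, so the library's exact boundary behavior may differ.

```typescript
type StorageLevel = 'ok' | 'warning' | 'full';

// Hypothetical helper: classify usage against the warning and error thresholds.
function evaluateStorage(
  usedBytes: number,
  limitBytes: number,
  warningThreshold = 75,
  errorThreshold = 95
): { usagePercent: number; level: StorageLevel; canAcceptContent: boolean } {
  const usagePercent = (usedBytes / limitBytes) * 100;

  let level: StorageLevel = 'ok';
  if (usagePercent >= errorThreshold) level = 'full';
  else if (usagePercent >= warningThreshold) level = 'warning';

  // New content is rejected only once the error threshold is reached
  return { usagePercent, level, canAcceptContent: level !== 'full' };
}

// 156.7 MB used out of a 2048 MB limit
const sample = evaluateStorage(156.7, 2048);
console.log(sample.level, sample.usagePercent.toFixed(1));
```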

Generate Storage Report

// Human-readable storage report
const report = await contentManager.generateStorageReport();
console.log(report);

// Output:
// === RAG-lite Content Storage Report ===
// 
// Content Directory:
//   Files: 42
//   Size: 156.7 MB
//   Average file size: 3.7 MB
// 
// Filesystem References:
//   References: 128
//   Total size: 2.1 GB
// 
// Overall Usage:
//   Total content items: 170
//   Total storage used: 2.3 GB
//   Storage efficiency: 100%
// 
// Storage Limits:
//   Content directory limit: 2048.0 MB
//   Current usage: 7.7%
//   Remaining space: 1891.3 MB

Error Handling

Common Error Scenarios

Content Not Found

try {
  const content = await search.getContent(contentId, 'file');
} catch (error) {
  if (error instanceof Error && error.message.includes('Content not found')) {
    console.log('Content may have been moved or deleted. Please re-ingest.');
    // Handle re-ingestion workflow
  }
}

Storage Limit Exceeded

let contentId: string;
try {
  contentId = await ingestion.ingestFromMemory(largeBuffer, metadata);
} catch (error) {
  if (error instanceof Error && error.message.includes('Storage limit exceeded')) {
    console.log('Storage full. Running cleanup...');
    await contentManager.removeOrphanedFiles();
    await contentManager.removeDuplicateContent();
    
    // Retry ingestion once; a second failure propagates to the caller
    contentId = await ingestion.ingestFromMemory(largeBuffer, metadata);
  } else {
    throw error; // Not a storage problem, so do not swallow it
  }
}
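
The cleanup-then-retry pattern can be factored into a reusable helper so that it is not repeated at every ingestion call site. The sketch below is one possible shape; `withCleanupRetry` is a hypothetical name, and matching on the 'Storage limit exceeded' message simply follows the example above rather than a documented error type.

```typescript
// Hypothetical helper: run an operation, and if it fails because storage is full,
// run the cleanup once and try again. Any other error is re-thrown unchanged.
async function withCleanupRetry<T>(
  operation: () => Promise<T>,
  cleanup: () => Promise<void>
): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    if (!message.includes('Storage limit exceeded')) throw error;

    await cleanup();
    // Single retry: if this also fails, the error propagates to the caller
    return operation();
  }
}
```

With the objects from this guide, the call might look like `withCleanupRetry(() => ingestion.ingestFromMemory(largeBuffer, metadata), async () => { await contentManager.removeOrphanedFiles(); await contentManager.removeDuplicateContent(); })`.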

Invalid Content Format

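try {
  // 'xml' is not a supported format. TypeScript rejects it at compile time,
  // so the cast below only simulates untyped (plain JavaScript) input.
  const content = await search.getContent(contentId, 'xml' as any);
} catch (error) {
  if (error instanceof Error && error.message.includes('Unsupported format')) {
    console.log('Use "file" or "base64" format');
    const content = await search.getContent(contentId, 'base64');
  }
}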

Error Recovery Patterns

// Robust content retrieval with clearer error reporting
async function getContentSafely(contentId: string, preferredFormat: 'file' | 'base64') {
  try {
    return await search.getContent(contentId, preferredFormat);
  } catch (error) {
    if (!(error instanceof Error) || !error.message.includes('Content not found')) {
      throw error; // Re-throw other errors
    }

    // Try to get metadata for better error reporting
    let message = `Content ${contentId} not found. Please re-ingest.`;
    try {
      const metadata = await search.getContentMetadata(contentId);
      message = `Content "${metadata.displayName}" is no longer available. Please re-ingest.`;
    } catch {
      // Metadata is unavailable too; keep the generic message
    }
    // Thrown outside the inner try so the detailed message is not swallowed
    throw new Error(message);
  }
}

Performance Optimization

Memory Management

// The system automatically manages memory for large content
// No special handling needed, but be aware of these optimizations:

// Large files (>10MB) use streaming operations automatically
const contentId = await ingestion.ingestFromMemory(largeBuffer, {
  displayName: 'large-document.pdf'
  // Streaming operations used automatically
});

// Batch operations are optimized for performance
const results = await search.getContentBatch(requests);
// Uses concurrent processing with resource limits
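
"Concurrent processing with resource limits" generally means that only a bounded number of operations are in flight at once. The sketch below shows a generic way to do this with a fixed pool of workers pulling from a shared index. It illustrates the general technique only; `mapWithConcurrency` is a hypothetical helper and says nothing about how `getContentBatch` is implemented internally.

```typescript
// Hypothetical helper: map over items with at most `limit` operations in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain
  async function run(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results; // Results keep the order of the input items
}
```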

Performance Statistics

// Monitor performance for optimization
const perfStats = contentManager.getPerformanceStats();

console.log({
  hashCache: {
    hitRate: perfStats.hashCache.hitRate,
    size: perfStats.hashCache.size
  },
  operations: {
    averageDuration: perfStats.operations.averageDuration,
    averageSpeed: perfStats.operations.averageSpeed,
    errorRate: perfStats.operations.errorRate
  }
});

// Clear caches if needed (e.g., after bulk operations)
contentManager.clearPerformanceCaches();

Optimization Guidelines

Use batch operations for multiple content retrievals
Enable deduplication to save storage space
Monitor storage usage and run cleanup regularly
Use appropriate size limits for your use case
Prefer streaming for large content (handled automatically)

Troubleshooting

Common Issues

High Memory Usage

Symptoms: System becomes slow during content operations
Causes: Large files, many concurrent operations
Solutions:

// Reduce batch size for large operations
const smallBatches = chunkArray(contentIds, 10); // Process 10 at a time
for (const batch of smallBatches) {
  await search.getContentBatch(batch.map(id => ({ contentId: id, format: 'base64' })));
}

// Clear performance caches periodically
contentManager.clearPerformanceCaches();
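
The snippet above calls `chunkArray`, which is not defined anywhere in this guide. A minimal version, assuming it simply splits an array into consecutive groups of at most `size` items, might look like this:

```typescript
// Assumed helper: split an array into consecutive chunks of at most `size` items.
function chunkArray<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError(`Chunk size must be a positive integer, got ${size}`);
  }
  const chunks: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    chunks.push(items.slice(start, start + size));
  }
  return chunks;
}

console.log(chunkArray(['a', 'b', 'c', 'd', 'e'], 2)); // [['a', 'b'], ['c', 'd'], ['e']]
```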

Storage Directory Issues

Symptoms: "Failed to create content directory" errors
Causes: Permission issues, disk space, invalid paths
Solutions:

// Check and fix permissions
import { access, mkdir } from 'fs/promises';
import { constants } from 'fs';

try {
  await access('.raglite/content', constants.W_OK);
} catch {
  await mkdir('.raglite/content', { recursive: true, mode: 0o755 });
}

Content Retrieval Failures

Symptoms: "Content not found" errors for existing content
Causes: Moved files, corrupted metadata, storage issues
Solutions:

// Verify content exists before retrieval
const exists = await search.verifyContentExists(contentId);
if (!exists) {
  console.log('Content needs to be re-ingested');
  // Implement re-ingestion logic
}

// Check storage statistics for issues
const stats = await contentManager.getStorageStats();
if (stats.contentDirectory.totalFiles === 0) {
  console.log('Content directory may be corrupted or moved');
}

Debug Mode

// Enable detailed logging for troubleshooting
process.env.DEBUG = 'rag-lite:content';

// This will log:
// - Content ingestion operations
// - Storage operations
// - Performance metrics
// - Error details

Performance Troubleshooting

// Monitor operation timing
const startTime = Date.now();
const contentId = await ingestion.ingestFromMemory(buffer, metadata);
const duration = Date.now() - startTime;

if (duration > 5000) { // Slow operation (>5s)
  console.log(`Slow ingestion detected: ${duration}ms`);
  console.log(`Buffer size: ${buffer.length} bytes`);
  
  // Check performance stats
  const stats = contentManager.getPerformanceStats();
  console.log(`Cache hit rate: ${stats.hashCache.hitRate}%`);
}
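
If timing is needed in more than one place, the measurement can be wrapped in a small helper instead of repeating the `Date.now()` bookkeeping. The sketch below is one way to do that; `timed` and its return shape are hypothetical, and the 5000 ms threshold in the usage note simply mirrors the example above.

```typescript
// Hypothetical helper: run an async operation and report how long it took.
async function timed<T>(
  operation: () => Promise<T>,
  slowThresholdMs: number
): Promise<{ result: T; durationMs: number; slow: boolean }> {
  const startTime = Date.now();
  const result = await operation();
  const durationMs = Date.now() - startTime;
  return { result, durationMs, slow: durationMs > slowThresholdMs };
}
```

A call such as `timed(() => ingestion.ingestFromMemory(buffer, metadata), 5000)` would then return the content ID together with the measured duration and a `slow` flag to log on.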

API Reference

IngestionPipeline Methods

class IngestionPipeline {
  // Memory ingestion
  async ingestFromMemory(
    content: Buffer, 
    metadata: MemoryContentMetadata
  ): Promise<string>;
  
  // Existing methods (unchanged)
  async ingestDirectory(path: string, options?: IngestionOptions): Promise<IngestionResult>;
  async ingestFile(path: string, options?: IngestionOptions): Promise<IngestionResult>;
}

interface MemoryContentMetadata {
  displayName: string;        // User-friendly name
  contentType?: string;       // MIME type (auto-detected if not provided)
  originalPath?: string;      // Optional reference to original location
}

SearchEngine Methods

class SearchEngine {
  // Content retrieval
  async getContent(contentId: string, format?: 'file' | 'base64'): Promise<string>;
  async getContentBatch(requests: ContentRequest[]): Promise<ContentResult[]>;
  async getContentMetadata(contentId: string): Promise<ContentMetadata>;
  async verifyContentExists(contentId: string): Promise<boolean>;
  
  // Existing methods (unchanged)
  async search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
}

interface ContentRequest {
  contentId: string;
  format: 'file' | 'base64';
}

interface ContentResult {
  contentId: string;
  success: boolean;
  content?: string;
  error?: string;
}

ContentManager Methods (Internal)

class ContentManager {
  // Storage management
  async getStorageStats(): Promise<StorageStats>;
  async getStorageLimitStatus(): Promise<StorageLimitStatus>;
  async generateStorageReport(): Promise<string>;
  
  // Cleanup operations
  async removeOrphanedFiles(): Promise<CleanupResult>;
  async removeDuplicateContent(): Promise<CleanupResult>;
  
  // Performance monitoring
  getPerformanceStats(): PerformanceStats;
  clearPerformanceCaches(): void;
}

Enhanced SearchResult

interface SearchResult {
  text: string;
  score: number;
  document: {
    id: number;
    source: string;        // Display name (backward compatibility)
    title: string;
    contentId: string;     // Universal content identifier (NEW)
  };
}
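
Since each search result carries the `contentId` of its parent document, several results may point at the same content. Assuming that is the case, a caller might deduplicate the IDs before requesting a batch. The helper `uniqueContentIds` below is a hypothetical sketch built on the `SearchResult` shape shown above.

```typescript
// Shape as documented above.
interface SearchResult {
  text: string;
  score: number;
  document: {
    id: number;
    source: string;
    title: string;
    contentId: string;
  };
}

// Hypothetical helper: collect each content ID once, in first-seen order.
function uniqueContentIds(results: SearchResult[]): string[] {
  return Array.from(new Set(results.map((result) => result.document.contentId)));
}
```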

The unified content system provides a powerful foundation for modern AI workflows while maintaining RAG-lite's core philosophy of simplicity and local-first operation. It enables seamless integration with MCP clients while preserving full backward compatibility with existing code.
