Document Preprocessing Guide

Complete guide to document preprocessing in RAG-lite TS, including content type handling, configuration options, and optimization strategies.

Overview

RAG-lite TS includes a preprocessing system that handles content beyond plain Markdown (JSX components, Mermaid diagrams, code blocks, PDF and DOCX files) so that embeddings are generated from clean text and search quality improves.

Why Preprocessing Matters

  • Clean Embeddings: Remove noise that could confuse semantic understanding
  • Content Extraction: Extract meaningful text from complex formats
  • Consistent Processing: Handle different file types uniformly
  • Optimization: Balance between information preservation and embedding quality

Supported File Types

Markdown (.md)

Standard Markdown files with full syntax support.

Processing:

  • Headers, lists, links, emphasis preserved
  • Code blocks handled based on configuration
  • Tables converted to structured text
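
As an illustration of what "converted to structured text" could look like, the sketch below flattens a Markdown pipe table into one line per row, pairing each cell with its column header. This is an assumption about the general approach, not the library's actual implementation; `tableToText` is a hypothetical helper, and escaped pipes inside cells are not handled.

```typescript
// Hypothetical helper: flattens a Markdown pipe table into "Header: value" lines.
function tableToText(table: string): string {
  const rows = table
    .trim()
    .split('\n')
    .map((line) =>
      line
        .trim()
        .replace(/^\||\|$/g, '') // drop the outer pipes
        .split('|')
        .map((cell) => cell.trim()),
    );

  const [headers, , ...body] = rows; // rows[1] is the |---|---| separator row
  return body
    .map((cells) => cells.map((cell, i) => `${headers[i]}: ${cell}`).join(', '))
    .join('\n');
}

const sampleTable = `
| Mode   | Speed   |
|--------|---------|
| Strict | Fastest |
| Rich   | Slowest |
`;

const tableText = tableToText(sampleTable);
// tableText === 'Mode: Strict, Speed: Fastest\nMode: Rich, Speed: Slowest'
```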

Text (.txt)

Plain text files processed as-is.

Processing:

  • Minimal processing
  • Line breaks and paragraphs preserved
  • No special syntax handling

MDX (.mdx)

Markdown with JSX components, common in documentation sites.

Processing:

  • JSX components handled based on configuration
  • Markdown content processed normally
  • Import/export statements filtered

Example input:

# API Documentation

<ApiEndpoint method="POST" url="/auth/login">
Login endpoint for user authentication
</ApiEndpoint>

Regular markdown content here.

<CodeExample language="javascript">
const response = await fetch('/auth/login');
</CodeExample>
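
The import/export filtering mentioned above can be sketched as a line filter. This is a simplified illustration under stated assumptions: `filterMdxModuleLines` is a hypothetical name, only single-line statements are handled, and lines inside fenced code blocks are not treated specially.

```typescript
// Hypothetical helper: drops top-level ESM import/export lines from MDX source.
function filterMdxModuleLines(source: string): string {
  return source
    .split('\n')
    .filter((line) => !/^(import|export)\s/.test(line))
    .join('\n');
}

const mdxSource = [
  "import { ApiEndpoint } from './components';",
  '',
  '# API Documentation',
  '',
  'Regular markdown content here.',
].join('\n');

const filteredMdx = filterMdxModuleLines(mdxSource);
// filteredMdx === '\n# API Documentation\n\nRegular markdown content here.'
```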

PDF (.pdf)

Portable Document Format files.

Processing:

  • Text extraction from PDF content
  • Metadata preservation (title, author)
  • Page breaks handled as paragraph boundaries
  • Images and complex layouts simplified
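
A minimal sketch of treating page breaks as paragraph boundaries, assuming the text extractor yields one string per page (the actual PDF pipeline is not described in this guide). `joinPdfPages` is a hypothetical name.

```typescript
// Hypothetical helper: joins per-page text so each page break becomes a blank line.
function joinPdfPages(pages: string[]): string {
  return pages
    .map((page) => page.trim())
    .filter((page) => page.length > 0) // skip blank pages
    .join('\n\n');
}

const pdfText = joinPdfPages(['Introduction\n', '', 'Chapter 1 begins here.']);
// pdfText === 'Introduction\n\nChapter 1 begins here.'
```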

DOCX (.docx)

Microsoft Word documents.

Processing:

  • Text content extraction
  • Heading structure preserved
  • Tables converted to text
  • Comments and track changes ignored

Preprocessing Modes

Strict Mode

Maximum cleanup for the cleanest possible embeddings.

Characteristics:

  • Removes all JSX components
  • Strips Mermaid diagrams entirely
  • Removes code blocks
  • Minimal noise in embeddings

Best for:

  • Pure content search
  • Non-technical documentation
  • When embedding quality is critical

Configuration:

preprocessing: {
  mode: 'strict',
  overrides: {
    mdx: 'strip', // Remove JSX components
    mermaid: 'strip', // Remove diagrams
    code: 'strip' // Remove code blocks
  }
}

Balanced Mode (Default)

Practical compromise between information and cleanliness.

Characteristics:

  • Replaces JSX with [component removed] placeholders
  • Replaces diagrams with [diagram removed] placeholders
  • Preserves code blocks for technical documentation
  • Good balance for most use cases

Best for:

  • Technical documentation
  • Mixed content types
  • General-purpose search

Configuration:

preprocessing: {
  mode: 'balanced',
  overrides: {
    mdx: 'placeholder', // Replace with placeholders
    mermaid: 'placeholder', // Replace with placeholders
    code: 'keep' // Preserve code blocks
  }
}

Rich Mode

Preserves maximum information for comprehensive search.

Characteristics:

  • Keeps JSX components for potential semantic extraction
  • Extracts Mermaid diagram relationships as text
  • Preserves all code blocks
  • Maximum information retention

Best for:

  • Code documentation
  • Technical specifications
  • When completeness is more important than cleanliness

Configuration:

preprocessing: {
  mode: 'rich',
  overrides: {
    mdx: 'keep', // Keep JSX components
    mermaid: 'extract', // Extract diagram content
    code: 'keep' // Keep all code blocks
  }
}
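
One way to think about how a mode and its overrides combine is as a defaults table with explicit overrides layered on top. The sketch below assumes the per-mode defaults are the values shown in the three configuration examples above. `MODE_DEFAULTS` and `resolveSettings` are hypothetical names, not part of the library's API.

```typescript
type Mode = 'strict' | 'balanced' | 'rich';
type Action = 'strip' | 'keep' | 'placeholder' | 'extract';

interface ContentSettings {
  mdx: Action;
  mermaid: Action;
  code: Action;
}

// Assumed defaults per mode, copied from the three configuration examples.
const MODE_DEFAULTS: Record<Mode, ContentSettings> = {
  strict: { mdx: 'strip', mermaid: 'strip', code: 'strip' },
  balanced: { mdx: 'placeholder', mermaid: 'placeholder', code: 'keep' },
  rich: { mdx: 'keep', mermaid: 'extract', code: 'keep' },
};

// Hypothetical helper: explicit overrides win over the mode's defaults.
function resolveSettings(
  mode: Mode,
  overrides: Partial<ContentSettings> = {},
): ContentSettings {
  return { ...MODE_DEFAULTS[mode], ...overrides };
}

const resolved = resolveSettings('balanced', { mermaid: 'extract' });
// resolved is { mdx: 'placeholder', mermaid: 'extract', code: 'keep' }
```

Under this reading, overrides are only needed when a single content type should deviate from the chosen mode.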

Content Type Processing

JSX Component Handling

Strip Mode:

# Input
<Button onClick={handler}>Click me</Button>

# Output
(component completely removed)

Placeholder Mode:

# Input
<Button onClick={handler}>Click me</Button>

# Output
[component removed]

Keep Mode:

# Input
<Button onClick={handler}>Click me</Button>

# Output
<Button onClick={handler}>Click me</Button>
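
The three JSX behaviors above can be approximated with a single regular expression. This is a simplified sketch, not the library's parser: `handleJsx` and `JSX_PATTERN` are hypothetical names, and the regex does not handle nested components of the same name or attributes containing ">".

```typescript
type JsxAction = 'strip' | 'placeholder' | 'keep';

// Matches <Component ...>...</Component> and self-closing <Component ... />.
// Component names are assumed to start with an uppercase letter.
const JSX_PATTERN = /<([A-Z]\w*)\b[^>]*?(?:\/>|>[\s\S]*?<\/\1>)/g;

function handleJsx(source: string, action: JsxAction): string {
  if (action === 'keep') return source;
  const replacement = action === 'placeholder' ? '[component removed]' : '';
  return source.replace(JSX_PATTERN, replacement);
}

const jsxInput = 'Before <Button onClick={handler}>Click me</Button> after';
const jsxStripped = handleJsx(jsxInput, 'strip');
// jsxStripped === 'Before  after'
const jsxPlaceholder = handleJsx(jsxInput, 'placeholder');
// jsxPlaceholder === 'Before [component removed] after'
const jsxKept = handleJsx(jsxInput, 'keep');
// jsxKept === jsxInput
```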

Mermaid Diagram Processing

Strip Mode:

# Input
```mermaid
graph TD
    A[Start] --> B[Process]
    B --> C[End]
```

# Output
(diagram completely removed)

Placeholder Mode:

# Input
```mermaid
graph TD
    A[Start] --> B[Process]
    B --> C[End]
```

# Output
[diagram removed]

Extract Mode:

# Input
```mermaid
graph TD
    A[Start] --> B[Process]
    B --> C[End]
```

# Output
Diagram: Start connects to Process, Process connects to End
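
To make the extract behavior concrete, here is a minimal sketch that reproduces the output shown above for simple flowcharts. It is an illustration only: `describeMermaid` is a hypothetical helper that recognizes plain `-->` edges with optional `[Label]` nodes, whereas real Mermaid syntax is much richer.

```typescript
// Hypothetical helper: turns simple "A[Label] --> B[Label]" edges into a sentence.
function describeMermaid(diagram: string): string {
  const labels = new Map<string, string>();
  const edges: Array<[string, string]> = [];
  const edgePattern = /^\s*(\w+)(?:\[([^\]]+)\])?\s*-->\s*(\w+)(?:\[([^\]]+)\])?\s*$/;

  for (const line of diagram.split('\n')) {
    const match = edgePattern.exec(line);
    if (!match) continue; // skips "graph TD" and unsupported syntax
    const [, from, fromLabel, to, toLabel] = match;
    if (fromLabel) labels.set(from, fromLabel);
    if (toLabel) labels.set(to, toLabel);
    edges.push([from, to]);
  }

  // Fall back to the node id when no label was ever declared for it.
  const labelOf = (id: string) => labels.get(id) ?? id;
  const parts = edges.map(([from, to]) => `${labelOf(from)} connects to ${labelOf(to)}`);
  return parts.length > 0 ? `Diagram: ${parts.join(', ')}` : '';
}

const diagramText = describeMermaid('graph TD\nA[Start] --> B[Process]\nB --> C[End]');
// diagramText === 'Diagram: Start connects to Process, Process connects to End'
```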


Code Block Processing

Strip Mode:

# Input
```javascript
function hello() {
  console.log('Hello world');
}
```

# Output
(code block completely removed)

Keep Mode:

# Input
```javascript
function hello() {
  console.log('Hello world');
}
```

# Output
```javascript
function hello() {
  console.log('Hello world');
}
```

Placeholder Mode:

# Input
```javascript
function hello() {
  console.log('Hello world');
}
```

# Output
[code block removed]
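
The code block modes can be sketched with a multiline regular expression over fenced blocks. This is an assumption-laden illustration rather than the library's implementation: `handleCodeBlocks` is a hypothetical helper, and only backtick fences of exactly the same length are matched.

```typescript
type CodeAction = 'strip' | 'keep' | 'placeholder';

// Built with repeat() so this example never contains a literal fence line.
const FENCE = '`'.repeat(3);

// Matches an opening fence with optional language, the body, and a closing fence.
const FENCED_BLOCK = new RegExp(`^${FENCE}[^\\n]*\\n[\\s\\S]*?^${FENCE}[ \\t]*$`, 'gm');

function handleCodeBlocks(source: string, action: CodeAction): string {
  if (action === 'keep') return source;
  const replacement = action === 'placeholder' ? '[code block removed]' : '';
  return source.replace(FENCED_BLOCK, replacement);
}

const codeDoc = [
  'Intro text.',
  `${FENCE}javascript`,
  "console.log('Hello world');",
  FENCE,
  'Closing text.',
].join('\n');

const codePlaceholder = handleCodeBlocks(codeDoc, 'placeholder');
// codePlaceholder === 'Intro text.\n[code block removed]\nClosing text.'
const codeStripped = handleCodeBlocks(codeDoc, 'strip');
// codeStripped === 'Intro text.\n\nClosing text.'
```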


Configuration

Configuration File

```javascript
// raglite.config.js
export const config = {
  preprocessing: {
    mode: 'balanced', // 'strict', 'balanced', or 'rich'

    // Override specific content types
    overrides: {
      mdx: 'placeholder', // 'strip', 'keep', or 'placeholder'
      mermaid: 'extract', // 'strip', 'extract', or 'placeholder'
      code: 'keep' // 'strip', 'keep', or 'placeholder'
    }
  }
};
```

Environment Variables

# Set preprocessing mode
export RAG_PREPROCESSING_MODE="balanced"

# Override specific content types
export RAG_PREPROCESSING_MDX="placeholder"
export RAG_PREPROCESSING_MERMAID="extract"
export RAG_PREPROCESSING_CODE="keep"
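
For illustration, the variables above could map onto the configuration shape as follows. This is a sketch under assumptions: `readChoice` and `configFromEnv` are hypothetical helpers, the allowed values are taken from the configuration file comments in this guide, and how the library itself reads or validates these variables is not specified here.

```typescript
type EnvMap = Record<string, string | undefined>;

// Returns the variable's value, or throws if it is set to an unsupported value.
function readChoice(env: EnvMap, name: string, allowed: readonly string[]): string | undefined {
  const value = env[name];
  if (value === undefined) return undefined;
  if (allowed.indexOf(value) === -1) {
    throw new Error(`${name} must be one of: ${allowed.join(', ')}`);
  }
  return value;
}

// Hypothetical helper: pass process.env in Node; a plain object also works.
function configFromEnv(env: EnvMap) {
  return {
    mode: readChoice(env, 'RAG_PREPROCESSING_MODE', ['strict', 'balanced', 'rich']),
    overrides: {
      mdx: readChoice(env, 'RAG_PREPROCESSING_MDX', ['strip', 'keep', 'placeholder']),
      mermaid: readChoice(env, 'RAG_PREPROCESSING_MERMAID', ['strip', 'extract', 'placeholder']),
      code: readChoice(env, 'RAG_PREPROCESSING_CODE', ['strip', 'keep', 'placeholder']),
    },
  };
}

const envConfig = configFromEnv({
  RAG_PREPROCESSING_MODE: 'balanced',
  RAG_PREPROCESSING_MERMAID: 'extract',
});
// envConfig.mode === 'balanced', envConfig.overrides.mermaid === 'extract'
```

Validating early like this surfaces typos such as `RAG_PREPROCESSING_MODE="fast"` before any documents are ingested.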

Programmatic Configuration

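import { readFileSync } from 'node:fs';
import { preprocessDocument } from 'rag-lite-ts';

const config = {
  mode: 'balanced',
  overrides: {
    mdx: 'placeholder',
    mermaid: 'extract',
    code: 'keep'
  }
};

// Illustrative input: substitute the path of any document you want to process
const filePath = './docs/guide.mdx';
const content = readFileSync(filePath, 'utf8');

const processed = preprocessDocument(content, filePath, config);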

Use Cases

Technical Documentation

Scenario: API documentation with code examples and diagrams

Recommended configuration:

preprocessing: {
  mode: 'rich',
  overrides: {
    mdx: 'keep', // Preserve API components
    mermaid: 'extract', // Extract diagram relationships
    code: 'keep' // Keep code examples
  }
}

Benefits:

  • Code examples searchable
  • Diagram relationships indexed
  • API components preserved for context

User Guides

Scenario: End-user documentation with minimal technical content

Recommended configuration:

preprocessing: {
  mode: 'strict',
  overrides: {
    mdx: 'strip', // Remove technical components
    mermaid: 'strip', // Remove complex diagrams
    code: 'strip' // Remove code blocks
  }
}

Benefits:

  • Clean, readable content
  • Focus on user-facing information
  • Better semantic understanding

Mixed Content Collections

Scenario: Documentation with both technical and user content

Recommended configuration:

preprocessing: {
  mode: 'balanced', // Default balanced approach
  overrides: {
    mdx: 'placeholder', // Indicate removed components
    mermaid: 'placeholder', // Indicate removed diagrams
    code: 'keep' // Preserve important code
  }
}

Benefits:

  • Handles diverse content types
  • Maintains searchability
  • Reasonable embedding quality

Code Documentation

Scenario: Internal code documentation and specifications

Recommended configuration:

preprocessing: {
  mode: 'rich',
  overrides: {
    mdx: 'keep', // Keep all components
    mermaid: 'extract', // Extract architecture info
    code: 'keep' // Essential for code docs
  }
}

Benefits:

  • Complete information preservation
  • Code examples fully searchable
  • Architecture diagrams indexed

Advanced Configuration

File-Type Specific Processing

// Different processing for different file types
preprocessing: {
  mode: 'balanced',
  fileTypeOverrides: {
    '.mdx': {
      mdx: 'keep', // Keep JSX in MDX files
      code: 'keep'
    },
    '.md': {
      mdx: 'strip', // No JSX expected in MD files
      code: 'placeholder'
    }
  }
}
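
A sketch of how per-extension settings might be resolved for a given file. The precedence shown (file-type settings layered over the general overrides) is an assumption of this example, and `overridesForFile` is a hypothetical helper rather than a library function.

```typescript
type TypeOverrides = { mdx?: string; mermaid?: string; code?: string };

interface FileTypeConfig {
  overrides?: TypeOverrides;
  fileTypeOverrides?: Record<string, TypeOverrides>;
}

// Hypothetical helper: looks up the file's extension and merges its settings
// over the general overrides (assumed precedence).
function overridesForFile(filePath: string, config: FileTypeConfig): TypeOverrides {
  const fileName = filePath.split('/').pop() ?? filePath;
  const dot = fileName.lastIndexOf('.');
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : '';
  const byType = config.fileTypeOverrides?.[ext] ?? {};
  return { ...config.overrides, ...byType };
}

const fileTypeConfig: FileTypeConfig = {
  overrides: { mermaid: 'placeholder' },
  fileTypeOverrides: {
    '.mdx': { mdx: 'keep', code: 'keep' },
    '.md': { mdx: 'strip', code: 'placeholder' },
  },
};

const mdxOverrides = overridesForFile('docs/api/auth.mdx', fileTypeConfig);
// mdxOverrides is { mermaid: 'placeholder', mdx: 'keep', code: 'keep' }
```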

Content-Based Processing

// Different processing based on patterns (these examples match file paths)
preprocessing: {
  mode: 'balanced',
  contentRules: [
    {
      pattern: /api\/.*\.md$/, // API documentation
      overrides: { code: 'keep', mermaid: 'extract' }
    },
    {
      pattern: /guides\/.*\.md$/, // User guides
      overrides: { code: 'strip', mermaid: 'strip' }
    }
  ]
}
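
Rule lists like this raise the question of which rule applies when several match. The sketch below assumes first-match-wins and assumes the patterns are tested against the file path; both are assumptions of this example. `overridesForPath` is a hypothetical helper.

```typescript
interface ContentRule {
  pattern: RegExp;
  overrides: { mdx?: string; mermaid?: string; code?: string };
}

// Hypothetical helper: returns the overrides of the first matching rule.
// Patterns should not use the "g" flag, since test() is stateful on global regexes.
function overridesForPath(filePath: string, rules: ContentRule[]): ContentRule['overrides'] {
  const rule = rules.find((candidate) => candidate.pattern.test(filePath));
  return rule ? rule.overrides : {};
}

const contentRules: ContentRule[] = [
  { pattern: /api\/.*\.md$/, overrides: { code: 'keep', mermaid: 'extract' } },
  { pattern: /guides\/.*\.md$/, overrides: { code: 'strip', mermaid: 'strip' } },
];

const apiOverrides = overridesForPath('docs/api/auth.md', contentRules);
// apiOverrides is { code: 'keep', mermaid: 'extract' }
const guideOverrides = overridesForPath('docs/guides/intro.md', contentRules);
// guideOverrides is { code: 'strip', mermaid: 'strip' }
```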

Custom Processors

import { registerPreprocessor } from 'rag-lite-ts';

// Register custom processor for specific content
registerPreprocessor('custom-component', {
  pattern: /<CustomComponent[^>]*>.*?<\/CustomComponent>/gs,
  handler: (match) => '[custom component processed]'
});
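
To show what a pattern/handler pair does to a document, here is a stand-in registry that applies each registered processor in turn. It does not use the library: `registerLocal` and `applyProcessors` are hypothetical, and the real `registerPreprocessor` may behave differently. The pattern uses `[\s\S]*?` in place of the `s` flag, which matches the same text.

```typescript
interface CustomProcessor {
  pattern: RegExp;
  handler: (match: string) => string;
}

// Stand-in registry for illustration only.
const localProcessors = new Map<string, CustomProcessor>();

function registerLocal(name: string, processor: CustomProcessor): void {
  localProcessors.set(name, processor);
}

// Runs every registered processor over the text, in registration order.
function applyProcessors(source: string): string {
  let result = source;
  localProcessors.forEach(({ pattern, handler }) => {
    result = result.replace(pattern, (match) => handler(match));
  });
  return result;
}

registerLocal('custom-component', {
  pattern: /<CustomComponent[^>]*>[\s\S]*?<\/CustomComponent>/g,
  handler: () => '[custom component processed]',
});

const customResult = applyProcessors('A <CustomComponent id="1">x\ny</CustomComponent> B');
// customResult === 'A [custom component processed] B'
```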

Performance Impact

Processing Speed by Mode

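| Mode     | Speed   | Quality  | Use Case                |
|----------|---------|----------|-------------------------|
| Strict   | Fastest | Cleanest | Pure content search     |
| Balanced | Medium  | Good     | General purpose         |
| Rich     | Slowest | Complete | Technical documentation |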

Memory Usage

  • Strict: Lowest memory usage (less content to process)
  • Balanced: Moderate memory usage
  • Rich: Highest memory usage (preserves all content)

Embedding Quality

  • Strict: Highest semantic clarity, lowest noise
  • Balanced: Good balance of information and clarity
  • Rich: Maximum information, potential noise from code/markup

Troubleshooting

Common Issues

Problem: Search results missing expected content

// Solution: Use less aggressive preprocessing
preprocessing: {
  mode: 'rich', // or 'balanced' instead of 'strict'
  overrides: {
    code: 'keep' // Ensure code blocks are preserved
  }
}

Problem: Poor search quality with technical content

// Solution: Use stricter preprocessing
preprocessing: {
  mode: 'strict',
  overrides: {
    mdx: 'strip', // Remove noisy components
    code: 'strip' // Remove code noise
  }
}

Problem: JSX components causing parsing errors

// Solution: Strip or replace problematic JSX
preprocessing: {
  mode: 'balanced',
  overrides: {
    mdx: 'placeholder' // Replace with safe placeholders
  }
}

Problem: Mermaid diagrams not processed correctly

# Ensure diagrams are in proper code blocks
```mermaid
graph TD
    A --> B
```

# Not just:
graph TD
    A --> B

Debug Preprocessing

```typescript
import { preprocessDocument, getPreprocessingStats } from 'rag-lite-ts';

const original = '# Title\n<Component>content</Component>';
const processed = preprocessDocument(original, 'test.mdx');
const stats = getPreprocessingStats(original, processed);

console.log('Original length:', stats.originalLength);
console.log('Processed length:', stats.processedLength);
console.log('Reduction:', stats.reductionPercentage + '%');
```
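
For readers who want to see how such statistics could be derived, the following stand-alone sketch computes the same three fields from two strings. It is a hypothetical re-implementation for illustration. The library's `getPreprocessingStats` may measure or round differently, and the processed string here assumes placeholder handling of the component.

```typescript
interface PreprocessingStats {
  originalLength: number;
  processedLength: number;
  reductionPercentage: number;
}

// Hypothetical helper: lengths are in UTF-16 code units, reduction is rounded.
function computeStats(original: string, processed: string): PreprocessingStats {
  const originalLength = original.length;
  const processedLength = processed.length;
  const reductionPercentage =
    originalLength === 0
      ? 0
      : Math.round(((originalLength - processedLength) / originalLength) * 100);
  return { originalLength, processedLength, reductionPercentage };
}

const demoStats = computeStats(
  '# Title\n<Component>content</Component>',
  '# Title\n[component removed]',
);
// demoStats is { originalLength: 38, processedLength: 27, reductionPercentage: 29 }
```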

Validation

# Test preprocessing with debug mode
DEBUG=1 raglite ingest ./docs/

# Check preprocessing output in logs
# Look for "Preprocessed: [filename]" entries

This guide covers all aspects of document preprocessing. Choose the mode and overrides that best match your content type and search requirements.