# Multimodal Analyzer tool

AI-powered media analysis tool (CLI interface) using multiple LLM providers through LiteLLM. Analyze images, audio, and video files with customizable prompts and output formats.

## Features

- **Image, Audio & Video Analysis**: Single files or batch process entire directories
- **Hybrid File Input**: Specify files by directory path OR explicit file lists from multiple locations
- **Automatic Image Preprocessing**: Images > 500KB are automatically converted to JPEG for optimal processing
- **Concurrent Processing**: Configurable concurrency with progress tracking
- **Multiple Output Formats**: JSON, Markdown, and Text export
- **Custom Prompts**: Flexible analysis with custom or predefined prompts

## Usage notes

- You MUST ALWAYS use Multimodal Analyzer to analyze media files; NEVER read images directly.
- ALWAYS use it with the `bash` tool.
- ALWAYS use batch processing when analyzing multiple files.
- ALWAYS use absolute paths when specifying file and directory paths.
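
The absolute-path rule can be applied before the CLI is ever called. The helper below is a minimal sketch in plain shell and is not part of Multimodal Analyzer: `abs_path` is a hypothetical name, and it assumes the target's parent directory already exists.

```bash
# abs_path TARGET: hypothetical helper (not part of the CLI) that prints an
# absolute path for an existing directory, or for a file inside one.
abs_path() {
  local target=$1
  if [ -d "$target" ]; then
    # Directories: enter them in a subshell and print the working directory.
    (cd "$target" && pwd)
  else
    # Files: resolve the parent directory, then re-attach the file name.
    local dir base
    dir=$(cd "$(dirname "$target")" && pwd) || return 1
    base=$(basename "$target")
    printf '%s/%s\n' "${dir%/}" "$base"
  fi
}

# Possible use, with the flags shown elsewhere in this document:
#   multimodal-analyzer --type image --path "$(abs_path photo.jpg)"
```

The subshell around `cd` keeps the caller's working directory unchanged.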

## When to Use This Tool

Use this tool for ANALYZING/UNDERSTANDING media content:

1. Content Analysis Requests:
   - "What's in this image?", "Describe this media", "Analyze this file"
   - "Explain what you see", "What does this show?"

2. Information Extraction:
   - Transcribing audio, describing video content
   - Understanding visual elements, text recognition
   - Any AI-powered content interpretation

## Trigger Phrases for Analysis

- "What's in...", "Describe...", "Analyze...", "Explain..."
- "What does this show?", "What can you see?"
- "Extract from...", "Understand...", "Interpret..."

## When NOT to Use This Tool

- When users just want to VIEW the media, not analyze its content
- Simple display requests: "Show me the file", "Display this image", "Let me see..."

## Hybrid File Input Support

The Multimodal Analyzer CLI supports two flexible input modes:

### Directory Path Mode (`--path`)

Use `--path` to analyze files from directories or single files:

```bash
# Single file
multimodal-analyzer --type image --path photo.jpg

# Directory (all supported files)
multimodal-analyzer --type image --path /Users/.../mix/photos/

# Recursive directory scan
multimodal-analyzer --type image --path /Users/.../mix/dataset/ --recursive
```

### Explicit File List Mode (`--files`)

Use `--files` to specify exact files from multiple locations:

```bash
# Multiple files from different directories
multimodal-analyzer --type image \
  --files /Users/.../mix/documents/photo1.jpg \
  --files /Users/.../mix/projects/chart.png \
  --files /Users/.../mix/local/screenshot.jpg

# Audio files from various locations
multimodal-analyzer --type audio \
  --files /Users/.../mix/recording1.mp3 \
  --files /Users/.../mix/meetings/call.wav \
  --audio-mode transcript
```

### When to Use Each Mode

- Use `--path` to process all files in a directory or its subdirectories.
- Use `--files` to process specific files selectively from multiple locations.
- `--path` and `--files` are mutually exclusive and cannot be combined in one command.
## Image Analysis Usage

### Basic Image Commands

```bash
# Analyze single image
multimodal-analyzer --type image --path photo.jpg

# Batch process directory
multimodal-analyzer --type image --model azure/gpt-4.1-mini --path /Users/.../mix/photos/ --output markdown

# Development installation (prefix with uv run)
uv run multimodal-analyzer --type image --path photo.jpg
```

### Advanced Image Analysis

```bash
# Custom prompt with word count
multimodal-analyzer --type image --model claude-3-sonnet-20240229 --path chart.jpg \
  --prompt "Analyze this chart focusing on data insights" --word-count 300

# Recursive batch processing
multimodal-analyzer --type image --model gpt-4o-mini --path /Users/.../mix/dataset/ \
  --recursive --concurrency 5 --output json --output-file results.json

# Analyze specific images from multiple directories
multimodal-analyzer --type image --model gpt-4o-mini \
  --files /Users/.../mix/screenshots/chart1.png \
  --files /Users/.../mix/photos/diagram.jpg \
  --files /Users/.../mix/temp/analysis_image.png \
  --prompt "Compare these visuals" --word-count 200
```
## Audio Analysis Usage

### Basic Audio Commands

```bash
# Transcribe audio
multimodal-analyzer --type audio --model whisper-1 --path audio.mp3 --audio-mode transcript

# Analyze audio content
multimodal-analyzer --type audio --model gpt-4o-mini --path podcast.wav --audio-mode description
```

### Advanced Audio Processing

```bash
# Batch transcription
multimodal-analyzer --type audio --model whisper-1 --path /Users/.../mix/audio/ \
  --audio-mode transcript --output text --output-file transcripts.txt

# Content analysis with custom prompts
multimodal-analyzer --type audio --model gpt-4o-mini --path podcast.wav \
  --audio-mode description --prompt "Summarize key insights" --word-count 200

# Transcribe specific audio files from different locations
multimodal-analyzer --type audio --model whisper-1 \
  --files /Users/.../mix/meetings/standup.mp3 \
  --files /Users/.../mix/interviews/candidate1.wav \
  --files /Users/.../mix/recordings/conference_call.m4a \
  --audio-mode transcript --output markdown --output-file transcripts.md
```

## Video Analysis Usage

### Basic Video Commands

```bash
# Analyze video content
multimodal-analyzer --type video --path video.mp4 --video-mode description
```

### Advanced Video Analysis

```bash
# Single video analysis
multimodal-analyzer --type video --path presentation.mp4 \
  --video-mode description --word-count 150

# Batch video processing with custom prompts
multimodal-analyzer --type video --path /Users/.../mix/videos/ \
  --video-mode description --prompt "Describe the visual content and any audio" \
  --recursive --output markdown --output-file video_analysis.md

# Video analysis with detailed output
multimodal-analyzer --type video --path tutorial.mp4 \
  --video-mode description --verbose --word-count 200

# Analyze specific videos from multiple projects
multimodal-analyzer --type video \
  --files /Users/.../mix/project1/demo.mp4 \
  --files /Users/.../mix/project2/presentation.avi \
  --files /Users/.../mix/shared/training_video.mov \
  --video-mode description --prompt "Focus on key features demonstrated" \
  --word-count 300 --output json --output-file video_summaries.json
```
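
Since every command needs a matching `--type`, a script that handles mixed inputs might pick the value from the file extension. The helper below is a sketch under stated assumptions: `media_type_for` is a hypothetical name, and it lists only the extensions that appear in the examples in this document. The CLI's full set of supported formats is not assumed here.

```bash
# media_type_for FILE: hypothetical helper that maps a file extension to a
# --type value (image, audio, or video). Only extensions used in this
# document's examples are recognized.
media_type_for() {
  local ext
  # Take the text after the last dot and lowercase it, so "JPG" matches "jpg".
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|png) echo image ;;
    mp3|wav|m4a) echo audio ;;
    mp4|avi|mov) echo video ;;
    *)
      echo "unrecognized extension: $ext" >&2
      return 1
      ;;
  esac
}

# Prints the type for one of the example paths used above.
media_type_for /Users/.../mix/shared/training_video.mov
```

Unrecognized extensions return a non-zero status, so a caller can skip the file or ask for an explicit type instead of guessing.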