1 unstable release
| 0.1.2 | Mar 17, 2026 |
|---|
#5 in Accessibility
Used in oximedia
320KB
6.5K
SLoC
oximedia-caption-gen
Advanced caption and subtitle generation — speech alignment, line breaking, WCAG compliance, and speaker diarization for OxiMedia
Part of the OxiMedia sovereign media framework.
Features
- Frame-accurate speech-to-caption alignment with word-level and segment-level timestamps
- Automatic segment merging (short segments) and splitting (long segments at sentence/word boundaries)
- Caption block construction with configurable max lines and characters per line
- Greedy and optimal (Knuth-Plass DP) line-breaking algorithms minimizing raggedness
- Reading speed (CPS) computation, validation, and duration adjustment
- WCAG 2.1 compliance checking: caption coverage (1.2.2), live latency (1.2.4), reading speed, minimum duration, gap detection
- Speaker diarization: turn merging, per-speaker statistics, dominant speaker detection, crosstalk detection
- Speaker-to-caption block assignment based on temporal overlap
- Voice activity ratio computation with interval union
- Pure Rust with zero C/Fortran dependencies
Quick Start
[dependencies]
oximedia-caption-gen = "0.1"
Speech Alignment and Caption Blocks
use oximedia_caption_gen::alignment::{
TranscriptSegment, WordTimestamp, align_to_frames, build_caption_blocks,
};
let mut segment = TranscriptSegment {
text: "Hello world".to_string(),
start_ms: 0,
end_ms: 2000,
speaker_id: None,
words: vec![
WordTimestamp { word: "Hello".to_string(), start_ms: 0, end_ms: 1000, confidence: 0.95 },
WordTimestamp { word: "world".to_string(), start_ms: 1000, end_ms: 2000, confidence: 0.92 },
],
};
// Map words to frame numbers at 25 fps
let frames = align_to_frames(&segment, 25.0).unwrap();
assert_eq!(frames[0].0, 0); // "Hello" at frame 0
assert_eq!(frames[1].0, 25); // "world" at frame 25
// Build caption blocks with 2 lines, 42 chars per line
let blocks = build_caption_blocks(&[segment], 2, 42);
assert_eq!(blocks[0].id, 1);
Line Breaking
use oximedia_caption_gen::line_breaking::{greedy_break, optimal_break, LineBalance};
let text = "This is a sample caption text for demonstration";
// Greedy: break at last space before max width
let greedy = greedy_break(text, 20);
// Optimal (Knuth-Plass DP): minimize squared slack for balanced lines
let optimal = optimal_break(text, 20);
// Optimal produces better-balanced lines
let opt_balance = LineBalance::balance_factor(&optimal);
let greed_balance = LineBalance::balance_factor(&greedy);
assert!(opt_balance <= greed_balance + 0.01);
WCAG 2.1 Compliance
use oximedia_caption_gen::wcag::{run_all_checks, compliance_score, WcagLevel};
use oximedia_caption_gen::alignment::{CaptionBlock, CaptionPosition};
let blocks = vec![
CaptionBlock {
id: 1, start_ms: 0, end_ms: 2000,
lines: vec!["Hello world".to_string()],
speaker_id: None, position: CaptionPosition::Bottom,
},
CaptionBlock {
id: 2, start_ms: 2000, end_ms: 4000,
lines: vec!["How are you".to_string()],
speaker_id: None, position: CaptionPosition::Bottom,
},
];
let violations = run_all_checks(&blocks, 4000, WcagLevel::AA);
let score = compliance_score(&violations); // 100.0 if no violations
Speaker Diarization
use oximedia_caption_gen::diarization::{
DiarizationResult, Speaker, SpeakerTurn,
speaker_stats, dominant_speaker, assign_speakers_to_blocks,
CrosstalkDetector, voice_activity_ratio,
};
let mut result = DiarizationResult::new();
result.speakers.insert(1, Speaker {
id: 1, name: Some("Alice".to_string()), gender: None, language: None,
});
result.turns = vec![
SpeakerTurn { speaker_id: 1, start_ms: 0, end_ms: 5000 },
SpeakerTurn { speaker_id: 2, start_ms: 5000, end_ms: 10000 },
];
let stats = speaker_stats(&result);
let dominant = dominant_speaker(&result); // Some(1) or Some(2)
let var = voice_activity_ratio(&result, 12000); // fraction of content with speech
let overlaps = CrosstalkDetector::find_overlapping_turns(&result);
Modules
alignment
Core types WordTimestamp (word text, start/end ms, ASR confidence) and TranscriptSegment (text, timing, optional speaker, word list). align_to_frames maps segments to (frame_number, subtitle_line) pairs at a given FPS, supporting both word-level and segment-level alignment. merge_short_segments absorbs segments shorter than a threshold into adjacent segments. split_long_segments breaks oversized segments at sentence then word boundaries with proportional timestamp redistribution. build_caption_blocks wraps segments into CaptionBlock values with greedy line wrapping and configurable max lines. CaptionPosition supports Bottom, Top, and Custom(x%, y%) placement.
line_breaking
greedy_break wraps text at the last space before max_width. optimal_break uses a Knuth-Plass-inspired dynamic programming algorithm minimizing sum((max_width - line_width)^2) for more balanced output. LineBreakConfig holds broadcast-standard defaults (42 chars/line, 17 CPS, 2 lines, 80ms gap). compute_cps and reading_speed_ok validate reading speed in characters per second. adjust_duration_for_reading computes the minimum display time for a given CPS limit. LineBalance::balance_factor scores line balance from 0.0 (perfect) to 1.0 (maximally unbalanced). rebalance_lines attempts to improve balance by re-running the optimal algorithm.
wcag
WCAG 2.1 accessibility compliance checks organized by success criteria:
check_caption_coverage(1.2.2, Level A) -- detects gaps exceeding 2 seconds between caption blockscheck_live_latency(1.2.4, Level AA) -- validates live caption latency is under 3 secondscheck_sign_language(1.2.6, Level AAA) -- placeholder (not machine-checkable)check_cps-- validates reading speed against BBC/Netflix 17 CPS guidelinecheck_min_duration-- enforces minimum 1-second display time per blockcheck_gap_duration-- finds all inter-block gaps exceeding a threshold
run_all_checks executes all checks appropriate for a target WcagLevel. compliance_score computes a 0-100 score with configurable penalties per severity level. WcagViolation provides rule ID, message, severity level, and optional timestamp.
diarization
Speaker metadata (id, name, gender, language) and SpeakerTurn (speaker id, start/end ms) with overlap detection. DiarizationResult aggregates speakers and turns. merge_consecutive_turns joins same-speaker turns separated by less than 500ms. speaker_stats computes per-speaker total time, turn count, and average turn duration. dominant_speaker identifies the speaker with the most airtime. assign_speakers_to_blocks maps speakers to caption blocks by maximum temporal overlap. CrosstalkDetector::find_overlapping_turns detects simultaneous speech. voice_activity_ratio computes the fraction of content with active speech using interval union. format_speaker_label generates display names.
Architecture
The caption generation pipeline flows from raw transcript segments through alignment, merging/splitting, line breaking, and finally caption block construction. WCAG checks can be run as a post-processing validation step on the generated blocks. Diarization is an orthogonal pipeline that can be integrated at the block level via assign_speakers_to_blocks. All functions are stateless and operate on slices, making them composable in streaming or batch workflows. The only external dependency is thiserror.
License
Licensed under the terms specified in the workspace root.
Copyright (c) COOLJAPAN OU (Team Kitasan)
Dependencies
~135–510KB
~12K SLoC