3 releases (breaking)
| 0.4.0 | Dec 8, 2025 |
|---|---|
| 0.3.0 | Dec 2, 2025 |
| 0.2.0 | Dec 2, 2025 |
#218 in Concurrency
100KB
2.5K
SLoC
avila-parallel
A zero-dependency parallel computation library for Rust with true parallel execution and advanced performance features.
๐ Documentation
- Quick Start - Get started in 5 minutes
- API Documentation - Full API reference
- Optimization Guide - Performance tuning tips
- Contributing - How to contribute
- Changelog - Version history
โจ Features
Core Features
- ๐ True Parallel Execution: Real multi-threaded processing using
std::thread::scope - ๐ฆ Zero Dependencies: Only uses Rust standard library (
std::thread,std::sync) - ๐ Thread Safe: All operations use proper synchronization primitives
- ๐ Order Preservation: Results maintain original element order
- โก Smart Optimization: Automatically falls back to sequential for small datasets
- ๐ฏ Rich API: Familiar iterator-style methods
Advanced Features (v0.3.0)
- โ๏ธ Work Stealing Scheduler: Dynamic load balancing across threads
- ๐ข SIMD Operations: Optimized vectorized operations for numeric types
- ๐๏ธ Advanced Configuration: Customize thread pools, chunk sizes, and more
- ๐ Parallel Sorting: High-performance merge sort with custom comparators
- ๐งฉ Element-wise Operations: Zip, chunk, and partition with parallel execution
๐ Revolutionary Features (v0.4.0)
- ๐ Lock-Free Operations: Zero-contention atomic algorithms
- ๐ Pipeline Processing: Functional composition with MapReduce patterns
- ๐ง Adaptive Execution: Self-optimizing algorithms that learn optimal parameters
- ๐พ Memory-Efficient: Zero-copy operations and in-place transformations
๐ Quick Start
Add to your Cargo.toml:
[dependencies]
avila-parallel = "0.4.0"
Basic Usage
use avila_parallel::prelude::*;
fn main() {
// Parallel iteration
let data = vec![1, 2, 3, 4, 5];
let sum: i32 = data.par_iter()
.map(|x| x * 2)
.sum();
println!("Sum: {}", sum); // Sum: 30
// High-performance par_vec API
let results: Vec<i32> = data.par_vec()
.map(|&x| x * x)
.collect();
println!("{:?}", results); // [1, 4, 9, 16, 25]
// Lock-free counting (v0.4.0)
let count = lockfree_count(&data, |x| x > &2);
println!("Count: {}", count); // Count: 3
}
๐ฏ Available Operations
Basic Operations
map- Transform each elementfilter- Keep elements matching predicatecloned- Clone elements (for reference iterators)
Aggregation
sum- Sum all elementsreduce- Reduce with custom operationfold- Fold with identity and operationcount- Count elements matching predicate
Search
find_any- Find any element matching predicateall- Check if all elements matchany- Check if any element matches
Advanced Operations (v0.2.0+)
parallel_sort- Parallel merge sortparallel_sort_by- Sort with custom comparatorparallel_zip- Combine two slices element-wiseparallel_chunks- Process data in fixed-size chunkspartition- Split into two vectors based on predicate
Work Stealing & SIMD (v0.3.0)
work_stealing_map- Map with dynamic load balancingWorkStealingPool- Thread pool with work stealingsimd_sum_*- SIMD-accelerated sum operationssimd_dot_*- SIMD dot productThreadPoolConfig- Advanced thread pool configuration
๐ Lock-Free & Adaptive (v0.4.0)
lockfree_count- Atomic-based counting without lockslockfree_any/lockfree_all- Lock-free search with early exitAdaptiveExecutor- Learning executor that optimizes chunk sizesspeculative_execute- Auto-select parallel vs sequentialcache_aware_map- Cache-line optimized transformationsparallel_transform_inplace- Zero-allocation transformations
๐ Performance
The library automatically:
- Detects CPU core count (default: all available cores)
- Distributes work efficiently with configurable chunk sizes (default: 1024)
- Falls back to sequential execution for small datasets
- Maintains result order with indexed chunks
- Uses work stealing for dynamic load balancing
- NEW: Adapts chunk sizes based on workload characteristics
- NEW: Zero-lock algorithms for maximum concurrency
Benchmark Results (Updated for v0.4.0)
| Operation | Dataset | Sequential | Parallel (v0.3.0) | Parallel (v0.4.0) | Speedup |
|---|---|---|---|---|---|
| Sum | 1M | 2.5ms | 1.1ms | 0.9ms | 2.78x |
| Filter | 1M | 45ms | 15ms | 12ms | 3.75x |
| Count (lock-free) | 1M | 8ms | 4ms | 2.5ms | 3.20x |
| Sort | 1M | 82ms | 25ms | 25ms | 3.28x |
| Complex Compute | 100K | 230ms | 75ms | 65ms | 3.54x |
Note: For simple operations (<100ยตs per element), sequential may be faster due to thread overhead.
๐ง Advanced Usage
Lock-Free Operations (v0.4.0)
use avila_parallel::prelude::*;
let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// Lock-free counting with atomics
let count = lockfree_count(&data, |x| x > &5);
// Lock-free search with early exit
let has_large = lockfree_any(&data, |x| x > &100);
let all_positive = lockfree_all(&data, |x| x > &0);
Adaptive Execution (v0.4.0)
use avila_parallel::adaptive::AdaptiveExecutor;
// Executor learns optimal chunk size over time
let mut executor = AdaptiveExecutor::new();
// First run: learns optimal parameters
let result1 = executor.execute(&data, |x| expensive_op(x));
// Subsequent runs: uses learned optimal chunk size
let result2 = executor.execute(&data, |x| expensive_op(x));
Memory-Efficient Operations (v0.4.0)
use avila_parallel::memory::parallel_transform_inplace;
// Zero-allocation in-place transformation
let mut data = vec![1, 2, 3, 4, 5];
parallel_transform_inplace(&mut data, |x| *x *= 2);
// data is now [2, 4, 6, 8, 10] without any allocations
Work Stealing (v0.3.0)
use avila_parallel::{work_stealing_map, WorkStealingPool};
// Dynamic load balancing
let data = vec![1, 2, 3, 4, 5];
let results = work_stealing_map(&data, |x| expensive_computation(x));
// Custom work stealing pool
let pool = WorkStealingPool::new(4);
pool.execute(tasks);
SIMD Operations (v0.3.0)
use avila_parallel::simd;
let data: Vec<i32> = (1..=1_000_000).collect();
let sum = simd::parallel_simd_sum_i32(&data);
let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = simd::simd_dot_f32(&a, &b);
Thread Pool Configuration (v0.3.0)
use avila_parallel::{ThreadPoolConfig, set_global_config};
let config = ThreadPoolConfig::new()
.num_threads(8)
.min_chunk_size(2048)
.thread_name("my-worker");
set_global_config(config);
Parallel Sorting (v0.2.0+)
use avila_parallel::parallel_sort;
let mut data = vec![5, 2, 8, 1, 9];
parallel_sort(&mut data);
// data is now [1, 2, 5, 8, 9]
Using Executor Functions Directly
use avila_parallel::executor::*;
let data = vec![1, 2, 3, 4, 5];
// Parallel map
let results = parallel_map(&data, |x| x * 2);
// Parallel filter
let evens = parallel_filter(&data, |x| *x % 2 == 0);
// Parallel reduce
let sum = parallel_reduce(&data, |a, b| a + b);
// Parallel partition
let (evens, odds) = parallel_partition(&data, |x| *x % 2 == 0);
// Find first matching
let found = parallel_find(&data, |x| *x > 3);
// Count matching
let count = parallel_count(&data, |x| *x % 2 == 0);
Mutable Iteration
use avila_parallel::prelude::*;
let mut data = vec![1, 2, 3, 4, 5];
data.par_iter_mut()
.for_each(|x| *x *= 2);
println!("{:?}", data); // [2, 4, 6, 8, 10]
๐๏ธ Architecture
Thread Management
- Uses
std::thread::scopefor lifetime-safe thread spawning - Automatic CPU detection via
std::thread::available_parallelism() - Chunk-based work distribution with adaptive sizing
Synchronization
Arc<Mutex<>>for safe result collection- No unsafe code in public API
- Order preservation through indexed chunks
Performance Tuning
Default Configuration:
const MIN_CHUNK_SIZE: usize = 1024; // Optimized based on benchmarks
const MAX_CHUNKS_PER_THREAD: usize = 8;
Environment Variables:
# Customize minimum chunk size (useful for tuning specific workloads)
export AVILA_MIN_CHUNK_SIZE=2048
# Run your program
cargo run --release
When to Adjust:
- Increase (2048+): Very expensive operations (>1ms per element)
- Decrease (512): Light operations but large datasets
- Keep default (1024): Most use cases
๐งช Examples
CPU-Intensive Computation
use avila_parallel::prelude::*;
let data: Vec<i32> = (0..10_000_000).collect();
// Perform expensive computation in parallel
let results = data.par_vec()
.map(|&x| {
// Simulate expensive operation
let mut result = x;
for _ in 0..100 {
result = (result * 13 + 7) % 1_000_000;
}
result
})
.collect();
Data Analysis
use avila_parallel::prelude::*;
let data: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0];
// Calculate statistics in parallel
let sum: f64 = data.par_iter().sum();
let count = data.len();
let mean = sum / count as f64;
let variance = data.par_vec()
.map(|&x| (x - mean).powi(2))
.into_iter()
.sum::<f64>() / count as f64;
๐ When to Use
โ Good Use Cases
- CPU-bound operations (image processing, calculations, etc.)
- Large datasets (>10,000 elements)
- Independent computations per element
- Expensive operations (>100ยตs per element)
โ Not Ideal For
- I/O-bound operations (use async instead)
- Very small datasets (<1,000 elements)
- Simple operations (<10ยตs per element)
- Operations requiring shared mutable state
๐ ๏ธ Building from Source
git clone https://github.com/your-org/avila-parallel
cd avila-parallel
cargo build --release
cargo test
๐ License
MIT License - see LICENSE file for details
๐ค Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
๐ Documentation
Full API documentation is available at docs.rs/avila-parallel
๐ Related Projects
โญ Star History
If you find this project useful, consider giving it a star!