3 stable releases

2.5.1 Apr 1, 2024
2.5.0 Mar 22, 2024
2.4.2 Mar 15, 2024

#1026 in Filesystem

MIT and CC-PDDC licenses

115KB
3K SLoC

scandir-rs

scandir_rs is a directory iteration module like os.walk(), but with more features and higher speed. Depending on the function call, it yields a list of paths, a tuple of lists grouped by entry type, or DirEntry objects that include file type and stat information along with the name. Using scandir_rs is about 2-17 times faster than os.walk() (depending on the platform, file system and file tree structure) because it parallelizes the iteration in the background.

If you are only interested in directory statistics, you can use Count.

scandir_rs contains the following classes:

  • Count for determining statistics of a directory.
  • Walk for getting names of directory entries.
  • Scandir for getting detailed stats of directory entries.

For the API see:

Installation

For building this wheel from source you need the tool maturin.

Install maturin:

cargo install maturin

IMPORTANT: In order to build this project at least Rust version 1.61 is needed!

Build wheel (not on Windows):

maturin build --release --strip

Build wheel on Windows:

maturin build --release --strip --no-sdist

maturin will build the wheels for all Python versions installed on your system.
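The two build commands above differ only in the --no-sdist flag. As a rough illustration, the platform choice could be expressed in Python as follows. This is a sketch, not part of the project: the helper name maturin_build_command is made up, and it only assembles the command line quoted in this section without running it.

```python
import platform


def maturin_build_command(system=None):
    """Return the maturin command line from this section for the given OS name.

    `system` defaults to platform.system(); the helper itself is hypothetical.
    """
    system = system or platform.system()
    cmd = ["maturin", "build", "--release", "--strip"]
    if system == "Windows":
        # Per the instructions above, Windows builds add --no-sdist.
        cmd.append("--no-sdist")
    return cmd


if __name__ == "__main__":
    print(" ".join(maturin_build_command()))
```

The resulting list can be passed to subprocess.run if you prefer to drive the build from a script.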

Building and running tests for different Python versions

To make it easier to build wheels for several different Python versions, the script build_wheels.sh has been added. It creates wheels for Python versions 3.7, 3.8, 3.9, 3.10 and 3.11. In addition, it runs pytest after the successful creation of each wheel.

Instructions on how to install pyenv can be found here.

Examples

Get statistics of a directory:

from scandir_rs import Count, ReturnType

print(Count("/usr", return_type=ReturnType.Ext).collect())

The collect method releases the GIL, so other Python threads can run in parallel.
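Because collect releases the GIL, a blocking call can be moved to a worker thread while the main thread keeps working. The sketch below shows that pattern with the standard library only. Note the assumptions: count_entries is a hypothetical pure-Python stand-in for the blocking call (it is not part of scandir_rs), and scan_in_background is a name invented for this example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def count_entries(root):
    """Hypothetical stand-in for a blocking scan: count dirs and files below root."""
    dirs = files = 0
    for _, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return {"dirs": dirs, "files": files}


def scan_in_background(scan, root):
    """Run scan(root) in a worker thread; return (result, main-thread loop count)."""
    ticks = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(scan, root)
        # The main thread stays free to do other work until the scan finishes.
        while not future.done():
            ticks += 1
            time.sleep(0.001)
        return future.result(), ticks


if __name__ == "__main__":
    result, ticks = scan_in_background(count_entries, tempfile.gettempdir())
    print(result, ticks)
```

With scandir_rs installed, a callable such as lambda root: Count(root).collect() could presumably be passed as scan in place of the stand-in.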

The same, but asynchronously in the background using a class instance:

from scandir_rs import Count, ReturnType

instance = Count("/usr", return_type=ReturnType.Ext)
instance.start()  # Start scanning the directory
...
values = instance.results()  # Returns the current statistics. Can be read at any time
...
if instance.busy():  # Check if the task is still running.
    ...
instance.stop()  # If you want to cancel the task
...
instance.join()  # Wait for the instance to finish.

and with a context manager:

import time

from scandir_rs import Count, ReturnType

with Count("/usr", return_type=ReturnType.Ext) as instance:
    while instance.busy():
        statistics = instance.results()
        # Do something
        time.sleep(0.01)
    print(instance.results())

os.walk() example:

from scandir_rs import Walk

for root, dirs, files in Walk("/usr"):
    pass  # Do something

with extended data:

from scandir_rs import Walk, ReturnType

for root, dirs, files, symlinks, other, errors in Walk("/usr", return_type=ReturnType.Ext):
    pass  # Do something
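To make the grouped result more concrete, here is a pure-Python sketch of how the entries of a single directory could be sorted into dirs, files, symlinks, other and errors with os.scandir. It is an illustration only, not how scandir_rs is implemented, and group_entries is a made-up name.

```python
import os
import tempfile


def group_entries(path):
    """Group the direct children of path into (dirs, files, symlinks, other, errors)."""
    dirs, files, symlinks, other, errors = [], [], [], [], []
    try:
        with os.scandir(path) as it:
            for entry in it:
                # Check symlinks first so a link to a directory is not counted as a directory.
                if entry.is_symlink():
                    symlinks.append(entry.name)
                elif entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.name)
                elif entry.is_file(follow_symlinks=False):
                    files.append(entry.name)
                else:
                    other.append(entry.name)
    except OSError as exc:
        # Unreadable or missing directories are reported instead of raising.
        errors.append((path, str(exc)))
    return sorted(dirs), sorted(files), sorted(symlinks), sorted(other), errors


if __name__ == "__main__":
    print(group_entries(tempfile.gettempdir()))
```

Recursing into each returned directory would give an os.walk-like traversal with the extra categories.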

os.scandir() example:

from scandir_rs import Scandir, ReturnType

for path, entry in Scandir("~/workspace", return_type=ReturnType.Ext):
    pass  # entry is a custom DirEntry object

Benchmarks

See examples/benchmark.py

In the tables below, the Walk.iter row returns results comparable to os.walk.
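For a quick baseline on your own machine, a small timing helper is enough. This sketch is independent of examples/benchmark.py, which is not reproduced here. The names time_call and walk_all are invented for the example, and only os.walk is measured.

```python
import os
import tempfile
import time


def walk_all(root):
    """Exhaust os.walk and return the number of directories visited (root included)."""
    return sum(1 for _ in os.walk(root))


def time_call(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    seconds, visited = time_call(walk_all, tempfile.gettempdir())
    print(f"{seconds:.3f} s for {visited} directories")
```

Taking the best of several runs reduces the influence of a cold file system cache on the first pass.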

Linux with Ryzen 5 2400G and SSD

Directory /usr with

  • 110171 directories
  • 862634 files
  • 47804 symlinks
  • 12275 hardlinks
  • 12 devices
  • 0 pipes
  • 32.7GB size and 34.8GB usage on disk

Time [s]  Method
3.450     os.walk (Python 3.10)
6.021     scantree (Python 3.10)
1.186     Count.collect
1.416     Count(ReturnType=Ext).collect
1.089     Walk.iter
1.350     Walk.collect
1.336     Walk(ReturnType=Ext).collect
2.232     Scandir.collect
1.839     Scandir.iter
2.437     Scandir(ReturnType=Ext).collect

About 3 times faster on Linux (os.walk compared to Walk.iter).

Windows 10 with laptop Core i7-4810MQ @ 2.8GHz, MTF SSD

Directory C:\Windows with

  • 132604 directories
  • 349911 files
  • 44.4GB size and 45.2GB usage on disk

Time [s]  Method
21.779    os.walk (Python 3.10)
13.085    scantree (Python 3.10)
3.257     Count.collect
16.605    Count(ReturnType=Ext).collect
4.102     Walk.iter
4.056     Walk.collect
4.190     Walk(ReturnType=Ext).collect
3.993     Scandir.collect
8.921     Scandir.iter
17.616    Scandir(ReturnType=Ext).collect

About 5.3 times faster on Windows 10 (os.walk compared to Walk.iter).

Directory linux-5.9 with

  • 4711 directories
  • 69973 files
  • 1.08GB size and 1.23GB usage on disk

Time [s]  Method
0.411     os.walk (Python 3.10)
1.203     os.walk (stat)
0.218     scandir.Count()
0.278     scandir.Count(return_type=ReturnType.Ext).collect()
0.227     scandir_rs.Walk().collect()
0.164     scandir.Walk(return_type=scandir.ReturnType.Ext) (iter)
0.204     scandir.Walk(return_type=scandir.ReturnType.Ext) (collect)
0.350     scandir.Scandir(return_type=ReturnType.Base).collect()
0.426     scandir.Scandir(return_type=ReturnType.Ext).collect()

About 2.5 times faster on Linux (os.walk compared to Walk.iter).

Time [s]  Method
1.998     os.walk (Python 3.10)
14.875    os.walk (stat)
0.278     scandir.Count()
2.114     scandir.Count(return_type=ReturnType.Ext).collect()
0.464     scandir_rs.Walk().collect()
0.313     scandir.Walk(return_type=scandir.ReturnType.Ext) (iter)
0.455     scandir.Walk(return_type=scandir.ReturnType.Ext) (collect)
0.624     scandir.Scandir(return_type=ReturnType.Base).collect()
2.409     scandir.Scandir(return_type=ReturnType.Ext).collect()

About 6.4 times faster on Windows 10 (os.walk compared to Walk.iter).
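The speedup factors quoted in this section follow directly from the tables: divide the os.walk time by the Walk iteration time. The short check below uses only numbers from the tables. The dictionary labels are my own shorthand, and pairing the last table with Windows 10 is inferred from the sentence that follows it.

```python
# (os.walk seconds, Walk iteration seconds) taken from the benchmark tables.
timings = {
    "Linux, /usr": (3.450, 1.089),
    "Windows 10, C:\\Windows": (21.779, 4.102),
    "Linux, linux-5.9": (0.411, 0.164),
    "Windows 10, linux-5.9": (1.998, 0.313),
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


# Round to one decimal place, matching the precision used in the text.
speedups = {name: round(speedup(*pair), 1) for name, pair in timings.items()}

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name}: {factor}x")
```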
