A virtio-fs vhost-user device daemon

15 stable releases

1.10.1 Jan 24, 2024
1.9.0 Dec 20, 2023
1.8.0 Sep 13, 2023
1.7.0 Jul 18, 2023
1.0.0 Dec 23, 2021

Apache-2.0 AND BSD-3-Clause

virtiofsd

A virtio-fs vhost-user device daemon written in Rust.

Building from sources

Requirements

This project depends on libcap-ng and libseccomp. You can obtain those dependencies by building them from their respective sources, or by installing the corresponding development packages from your distribution, if available:

  • Fedora/CentOS/RHEL
dnf install libcap-ng-devel libseccomp-devel
  • Debian/Ubuntu
apt install libcap-ng-dev libseccomp-dev

Compiling

virtiofsd is written in Rust and uses cargo to manage the project and its dependencies, so you need a Rust toolchain to compile it. After installing Rust, you can build a release binary by running:

cargo build --release

CI-built binaries

Every time new code is merged, the CI pipeline will upload a debug binary of virtiofsd. It is intended to be an accessible way for anyone to download and test virtiofsd without needing a Rust toolchain installed.

The debug binary is built only for x86_64 Linux-based systems.

Contributing

See CONTRIBUTING.md

Usage

This program must be run as the root user or as a "fake" root inside a user namespace (see Running as non-privileged user).

The program drops privileges where possible during startup, although it must be able to create and access files with any uid/gid:

  • The ability to invoke syscalls is limited using seccomp(2).
  • Linux capabilities(7) are dropped. virtiofsd only retains the following capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID, CAP_SETGID, CAP_SETUID, CAP_MKNOD, CAP_SETFCAP (and CAP_DAC_READ_SEARCH if --inode-file-handles is used).
virtiofsd [FLAGS] [OPTIONS] --fd <fd>|--socket-path <socket-path> --shared-dir <shared-dir>

Flags

-h, --help

Prints help information.

-V, --version

Prints version information.

--syslog

Log to syslog. Default: stderr.

--print-capabilities

Print vhost-user.json backend program capabilities and exit.

--allow-direct-io

Honor the O_DIRECT flag passed down by guest applications.

--announce-submounts

Tell the guest which directories are mount points. virtiofsd passes inode IDs directly to the guest, and such IDs are unique only within a single filesystem, so the guest may encounter duplicate IDs if multiple filesystems are mounted in the shared directory. --announce-submounts solves that problem by reporting a different device number for every submount it encounters.

In addition, when running with --announce-submounts, the client sends one SYNCFS request per submount that is to be synced, so virtiofsd will call syncfs() on each submount. When running without --announce-submounts, the client only sends a SYNCFS request for the root mount, which may lead to data loss/corruption.

--no-killpriv-v2

Disable KILLPRIV V2 support. This is required if the shared directory is an NFS file system. KILLPRIV V2 support is disabled by default.

--killpriv-v2

Enable KILLPRIV V2 support. It is disabled by default.

--no-readdirplus

Disable support for READDIRPLUS operations.

--writeback

Enable writeback cache.

--xattr

Enable support for extended attributes.

--posix-acl

Enable support for POSIX ACLs (implies --xattr).

--security-label

Enable support for security labels (SELinux).

--preserve-noatime

Always preserve O_NOATIME.

By default, virtiofsd implicitly clears O_NOATIME to prevent potential permission errors when it does not have the right capabilities to access all the exported files (typically when running as an unprivileged user with --sandbox none, which means it won't have the CAP_FOWNER capability set).

The option --preserve-noatime can be used to override this behavior and preserve the O_NOATIME flag specified by the client.

Options

--shared-dir <shared-dir>

Shared directory path.

--tag <tag>

The tag that the virtio device advertises.

Setting this option will enable advertising of VHOST_USER_PROTOCOL_F_CONFIG. However, the vhost-user frontend of your hypervisor may not negotiate this feature and/or may ignore this value. Notably, QEMU currently (as of 8.1) ignores the CONFIG feature. QEMU versions from 7.1 to 8.0 will crash while attempting to log a warning about not supporting the feature.

--socket-group <socket-group>

Name of group for the vhost-user socket.

--socket-path <socket-path>

vhost-user socket path.

--fd <fd>

File descriptor for the listening socket.

--log-level <log-level>

Log level (error, warn, info, debug, trace, off).

Default: info.

--thread-pool-size <thread-pool-size>

Maximum thread pool size. A value of "0" disables the pool.

Default: 0.

--rlimit-nofile <rlimit-nofile>

Set maximum number of file descriptors. If the soft limit is greater than 1M or --rlimit-nofile=0 is passed as parameter, the maximum number of file descriptors is not changed.

Default: min(1000000, /proc/sys/fs/nr_open).
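The default above can be previewed before starting the daemon. The following is a minimal sketch, not virtiofsd's own code: default_nofile is a hypothetical helper name, and it only reproduces the documented min(1000000, /proc/sys/fs/nr_open) formula, not the soft-limit check described above.

```shell
# Hypothetical helper: reproduce the documented default, min(1000000, nr_open).
# The single argument is the value read from /proc/sys/fs/nr_open.
default_nofile() {
    nr_open=$1
    if [ "$nr_open" -lt 1000000 ]; then
        echo "$nr_open"
    else
        echo 1000000
    fi
}

# Usage on a Linux host:
#   default_nofile "$(cat /proc/sys/fs/nr_open)"
```

Taking the value as an argument rather than reading the file inside the function keeps the sketch usable on systems where /proc is not mounted.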

--modcaps=<modcaps>

Modify the list of capabilities, e.g., --modcaps=+sys_admin:-chown. Although it is not mandatory, it is recommended to always use the = sign. Otherwise, an invocation such as --modcaps -mknod will fail, because it is interpreted as two separate options instead of the intended --modcaps=-mknod.
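To check what a given value requests before passing it to the daemon, the string can be split by hand. This is only a sketch: explain_modcaps is a hypothetical helper, and the colon-separated "+cap" / "-cap" form is inferred from the example above rather than taken from virtiofsd's parser.

```shell
# Hypothetical helper: list what a --modcaps value asks for, one entry per line.
# Assumes the colon-separated "+cap" / "-cap" form shown in the example above.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
explain_modcaps() (
    IFS=:
    set -f
    for entry in $1; do
        case $entry in
            +?*) printf 'add %s\n' "${entry#+}" ;;
            -?*) printf 'drop %s\n' "${entry#-}" ;;
            *)   printf 'invalid %s\n' "$entry" ;;
        esac
    done
)

# Usage:
#   explain_modcaps '+sys_admin:-chown'
```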

--sandbox <sandbox>

Sandbox mechanism to isolate the daemon process (namespace, chroot, none).

  • namespace: The program switches into a new file system namespace (namespaces(7)) and invokes pivot_root(2) to make the shared directory tree its root. A new mount (mount_namespaces(7)), pid (pid_namespaces(7)) and net namespace (network_namespaces(7)) are also created to isolate the process.

  • chroot: The program invokes chroot(2) to make the shared directory tree its root. This mode is intended for container environments where the container runtime has already set up the namespaces and the program does not have permission to create namespaces itself.

  • none: Do not isolate the daemon (not recommended).

Both namespace and chroot sandbox modes prevent "file system escapes" due to symlinks and other file system objects that might lead to files outside the shared directory.

Default: namespace.

--seccomp <seccomp>

Action to take when seccomp finds a syscall that is not allowed (none, kill, log, trap).

Default: kill.

--cache <cache>

The caching policy the file system should use (auto, always, metadata, never).

Default: auto.

--inode-file-handles=<inode-file-handles>

When to use file handles to reference inodes instead of O_PATH file descriptors (never, prefer, mandatory).

  • never: Never use file handles, always use O_PATH file descriptors.

  • prefer: Attempt to generate file handles, but fall back to O_PATH file descriptors where the underlying filesystem does not support file handles or CAP_DAC_READ_SEARCH is not available. Useful when there are various different filesystems under the shared directory and some of them do not support file handles.

  • mandatory: Always use file handles. It will fail if the underlying filesystem does not support file handles or CAP_DAC_READ_SEARCH is not available.

Using file handles reduces the number of file descriptors virtiofsd keeps open, which is not only helpful with resources, but may also be important in cases where virtiofsd should only have file descriptors open for files that are open in the guest, e.g. to get around bad interactions with NFS's silly renaming (see NFS FAQ, Section D2: "What is a "silly rename"?").

Default: never.

--xattrmap <xattrmap>

Add custom rules for translating extended attributes between host and guest (e.g., :map::user.virtiofs.:). For additional details please see Extended attribute mapping.

--uid-map=:namespace_uid:host_uid:count:

When running virtiofsd as non-root, map a range of UIDs from host to namespace. In order to use this option, the range of subordinate user IDs must have been set up via subuid(5). virtiofsd uses newuidmap(1), which requires a valid subuid, to do the mapping. If this option is not provided, virtiofsd will set up a 1-to-1 mapping for the current UID.

  • namespace_uid: Beginning of the range of UIDs inside the user namespace.

  • host_uid: Beginning of the range of UIDs outside the user namespace.

  • count: Length of the ranges (both inside and outside the user namespace).

For instance, let's assume the invoking UID is 1000 and the content of /etc/subuid is: 1000:100000:65536, which creates 65536 subuids starting at 100000, i.e. the (inclusive) range [100000, 165535], belonging to the actual UID 1000. This range can be mapped to the UIDs [0, 65535] in virtiofsd’s user namespace (i.e. as seen in the guest) via --uid-map=:0:100000:65536:. Alternatively, you can simply map your own UID to a single UID in the namespace. For example, --uid-map=:0:1000:1: would map UID 1000 to root’s UID in the namespace (and thus the guest).
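The arithmetic behind such a mapping can be checked with a few lines of shell. This is an illustrative sketch only: ns_to_host_uid is a hypothetical helper that applies the namespace_uid, host_uid and count fields as defined above, and it does not call newuidmap(1) or touch any real mapping.

```shell
# Hypothetical helper: translate a UID inside the namespace to the host UID
# for a mapping written as --uid-map=:namespace_uid:host_uid:count:.
# Arguments: namespace_uid host_uid count uid
# Prints nothing and returns 1 if the UID falls outside the mapped range.
ns_to_host_uid() {
    ns_start=$1 host_start=$2 count=$3 uid=$4
    if [ "$uid" -lt "$ns_start" ] || [ "$uid" -ge $((ns_start + count)) ]; then
        return 1
    fi
    echo $((host_start + uid - ns_start))
}

# Usage, for --uid-map=:0:100000:65536: and guest UID 0:
#   ns_to_host_uid 0 100000 65536 0
```

With the range mapping from the example, namespace UID 0 corresponds to host UID 100000 and namespace UID 65535 to host UID 165535. Anything above that is unmapped.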

--gid-map=:namespace_gid:host_gid:count:

When running virtiofsd as non-root, map a range of GIDs from host to namespace. In order to use this option, the range of subordinate group IDs must have been set up via subgid(5). virtiofsd uses newgidmap(1), which requires a valid subgid, to do the mapping. If this option is not provided, virtiofsd will set up a 1-to-1 mapping for the current GID.

  • namespace_gid: Beginning of the range of GIDs inside the user namespace.

  • host_gid: Beginning of the range of GIDs outside the user namespace.

  • count: Length of the ranges (both inside and outside the user namespace).

For instance, let's assume the invoking GID is 1000 and the content of /etc/subgid is: 1000:100000:65536, which creates 65536 subgids starting at 100000, i.e. the (inclusive) range [100000, 165535], belonging to the actual GID 1000. This range can be mapped to the GIDs [0, 65535] in virtiofsd’s user namespace (i.e. as seen in the guest) via --gid-map=:0:100000:65536:. Alternatively, you can simply map your own GID to a single GID in the namespace. For example, --gid-map=:0:1000:1: would map GID 1000 to root’s GID in the namespace (and thus the guest).
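Since the --uid-map and --gid-map values in these examples are derived directly from the /etc/subuid and /etc/subgid entries, the option strings can be generated rather than typed by hand. A minimal sketch, assuming an "owner:start:count" entry and a mapping that starts at ID 0 inside the namespace; map_option_from_entry is a hypothetical helper name.

```shell
# Hypothetical helper: turn an /etc/subuid or /etc/subgid style entry
# ("owner:start:count") into the matching map option, placing the range
# at ID 0 inside the namespace.
# Arguments: kind ("uid" or "gid"), entry
map_option_from_entry() {
    kind=$1 entry=$2
    rest=${entry#*:}          # drop the owner field
    start=${rest%%:*}         # first subordinate ID
    count=${rest#*:}          # number of subordinate IDs
    echo "--${kind}-map=:0:${start}:${count}:"
}

# Usage:
#   map_option_from_entry gid 1000:100000:65536
```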

Examples

Export /mnt on vhost-user UNIX domain socket /tmp/vfsd.sock:

host# virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /mnt \
        --announce-submounts --inode-file-handles=mandatory &

host# qemu-system \
        -blockdev file,node-name=hdd,filename=<your image> \
        -device virtio-blk,drive=hdd \
        -chardev socket,id=char0,path=/tmp/vfsd.sock \
        -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
        -object memory-backend-memfd,id=mem,size=4G,share=on \
        -numa node,memdev=mem \
        -accel kvm -m 4G

guest# mount -t virtiofs myfs /mnt

See FAQ for adding virtiofs config to an existing qemu command-line.

Running as non-privileged user

When run without root, virtiofsd requires a user namespace (see user_namespaces(7)) to be able to switch between arbitrary user/group IDs within the guest. virtiofsd will fail in a user namespace where UIDs/GIDs have not been mapped (i.e., uid_map and gid_map files have not been written). There are many options to run virtiofsd inside a user namespace. For instance:

Let's assume the invoking UID and GID are both 1000 and the content of both /etc/subuid and /etc/subgid is:

1000:100000:65536

Using podman-unshare(1) the user namespace will be configured so that the invoking user's UID and primary GID (i.e., 1000) appear to be UID 0 and GID 0, respectively. Any ranges which match that user and group in /etc/subuid and /etc/subgid are also mapped in as themselves with the help of the newuidmap(1) and newgidmap(1) helpers:

host$ podman unshare -- virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /mnt \
        --announce-submounts --sandbox chroot &

Using lxc-usernsexec(1), we could leave the invoking user outside the mapping, having the root user inside the user namespace mapped to the user and group 100000:

host$ lxc-usernsexec -m b:0:100000:65536 -- virtiofsd --socket-path=/tmp/vfsd.sock \
        --shared-dir /mnt --announce-submounts --sandbox chroot &

In order to have the same behavior as podman-unshare(1), we need to run:

host$ lxc-usernsexec -m b:0:1000:1 -m b:1:100000:65536 -- virtiofsd --socket-path=/tmp/vfsd.sock \
        --shared-dir /mnt --announce-submounts --sandbox chroot &

We could also select --sandbox none instead of --sandbox chroot.

Limitations

  • Within the guest, it is not possible to create block or char device nodes in the shared directory.

  • virtiofsd can't use file handles (--inode-file-handles requires CAP_DAC_READ_SEARCH), so a large number of file descriptors is required. Additionally, on NFS, not using file handles may result in a hidden file lingering after some file is deleted (see NFS FAQ, Section D2: "What is a "silly rename"?").

  • virtiofsd will not be able to increase RLIMIT_NOFILE.

FAQ

  • How to share a directory read-only, so that it cannot be modified within the guest? To accomplish this, you need to export a read-only mount point. For instance, to export share:
mkdir ro-share
mount -o bind,ro share ro-share
virtiofsd --shared-dir ro-share ...
  • How to share multiple directories with the same virtiofsd? Currently, virtiofsd only supports sharing a single directory, but it is possible to use submounts to achieve this. For instance, to export share0 and share1:
mkdir -p share/{sh0,sh1}
mount -o bind share0 share/sh0
mount -o bind share1 share/sh1
virtiofsd --announce-submounts --shared-dir share ...

Note the use of --announce-submounts to prevent data loss/corruption.

  • How to add virtiofs devices to an existing qemu command-line:

    If -object memory-backend-memfd,id=mem and either -numa node,memdev=mem or a memory-backend=mem property in the -machine option have not already been added to the command, add them.

    If a different memory backend is already configured then it should be changed to memory-backend-memfd.

    -object memory-backend-memfd must have the option share=on and size= must match the memory size defined by -m.

    For each virtiofs device mount, add a -chardev socket,id=${MATCHING_ID},path=${VIRTIOFSD_SOCKET_PATH} and a -device vhost-user-fs-pci,queue-size=1024,chardev=${MATCHING_ID},tag=${VIRTIOFS_TAG}, substituting appropriate values for the shell-style variables.
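The steps above can be sketched as two small shell helpers that print the arguments to append to an existing command line. Both function names are hypothetical, and the sketch assumes the -numa node,memdev=mem variant rather than the -machine memory-backend=mem property.

```shell
# Hypothetical helper: print the per-device QEMU arguments, one per line.
# Arguments: chardev id, virtiofsd socket path, virtiofs tag.
virtiofs_device_args() {
    id=$1 socket=$2 tag=$3
    printf '%s\n' \
        "-chardev socket,id=${id},path=${socket}" \
        "-device vhost-user-fs-pci,queue-size=1024,chardev=${id},tag=${tag}"
}

# Hypothetical helper: print the shared-memory arguments for a given guest
# memory size. The same value is used for size= and for -m so they match.
virtiofs_memory_args() {
    size=$1
    printf '%s\n' \
        "-object memory-backend-memfd,id=mem,size=${size},share=on" \
        "-numa node,memdev=mem" \
        "-m ${size}"
}

# Usage:
#   virtiofs_memory_args 4G
#   virtiofs_device_args char0 /tmp/vfsd.sock myfs
```

If the existing command already passes -m, drop the last line printed by virtiofs_memory_args instead of specifying the memory size twice.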

SELinux Support

One can enable support for SELinux by running virtiofsd with the option "--security-label". However, this will try to save the guest's security context in the xattr security.selinux on the host, which might fail if the host's SELinux policy does not permit virtiofsd to perform this operation.

Hence, it is recommended to remap the guest's "security.selinux" xattr to, say, "trusted.virtiofs.security.selinux" on the host. Add the following option to the command line:

"--xattrmap=:map:security.selinux:trusted.virtiofs.:"

This ensures that the guest's and the host's SELinux xattrs on the same file remain separate and do not interfere with each other, and it allows both host and guest to implement their own separate SELinux policies.

Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so this capability must be added to the daemon. Add the following option to the command line:

"--modcaps=+sys_admin"

Trusted xattrs are not namespaced, so virtiofsd needs to have CAP_SYS_ADMIN in init_user_ns. In other words, virtiofsd should not be run inside a user namespace and should run with CAP_SYS_ADMIN.

Granting CAP_SYS_ADMIN increases the risk to the system: virtiofsd becomes more powerful and, if it gets compromised, it can do a lot of damage to the host system. Keep this trade-off in mind when making a decision.
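Putting the three options from this section together, a combined invocation might look like the sketch below. selinux_opts is a hypothetical helper that only collects the options already listed above; the socket path and shared directory in the usage comment are borrowed from the Examples section.

```shell
# Hypothetical helper: print the SELinux-related options from this section,
# one per line. Nothing here goes beyond the options quoted above.
selinux_opts() {
    printf '%s\n' \
        "--security-label" \
        "--xattrmap=:map:security.selinux:trusted.virtiofs.:" \
        "--modcaps=+sys_admin"
}

# Usage (as root, outside any user namespace):
#   virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /mnt $(selinux_opts)
```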
