Bash is slow. Supporting a nested inheritance structure on top of bash makes it
even slower. Without metadata caches, processing ebuilds would be
far more painful than it already is. Imagine the extra time it
would take to source all the ebuilds required to run a command such as
emerge -e world before dependency resolution could even begin. Clearly the importance of
using a cache when dealing with a large corpus of highly nested bash code
cannot be overstated.
For ebuild repos, the metadata cache is currently located at metadata/md5-cache relative to the repo root. It is created by sourcing each ebuild with all its eclass dependencies, tracking state across nested inherits for incrementally handled variables, and then writing the relevant metadata key/value mapping to the cache. An existing cache is updated by iterating over its entries, comparing checksums to determine validity, and regenerating any entry that fails.
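The checksum comparison above can be sketched in a few lines. Note that the cache layout and key names here are simplified illustrations, not the exact md5-cache format:

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """MD5 checksum of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def entry_is_valid(cache_entry: dict, ebuild: Path, eclasses: list) -> bool:
    """Compare stored checksums against the current ebuild and its eclasses.

    `cache_entry` is a simplified stand-in for a parsed cache file mapping
    file names to the checksums recorded at generation time.
    """
    if cache_entry.get(ebuild.name) != file_md5(ebuild):
        return False
    return all(cache_entry.get(e.name) == file_md5(e) for e in eclasses)
```

An update pass then iterates over existing entries, keeping those that validate and regenerating the rest.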
While loading package information for dependency resolution or any other tree-wide activity, the cache is consulted first, falling back to generating the values only when it lacks the relevant information. This allows package managers and related tools to avoid sourcing ebuilds when using globally-scoped package metadata, speeding up the process immensely.
Those doing ebuild development should be familiar with either portage’s
emerge --regen or pkgcore’s
pmaint regen commands. Both
provide metadata cache generation functionality for their respective ecosystems
and are generally required when running against ebuild repos lacking bundled metadata caches.
Metadata generation implementations handle verification differently, with both portage and pkgcore treating the data as raw strings and performing few validity checks. Verification requires parsing metadata strings into their relevant data structures, which significantly extends generation time when implemented in slower languages. This is worked around using development tools such as pkgcheck that verify ebuild metadata before commits are pushed.
Conversely, pkgcraft tries to confirm metadata validity as much as possible. Most verification that only requires localized knowledge is performed, but anything requiring repo-wide info, such as dependency visibility, is not. For example, all dependency fields are parsed into their related data structures, catching any invalid formatting.
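As a rough illustration of why parsing catches formatting errors that raw-string handling misses, consider a minimal dependency-atom parser. The grammar below is a drastically simplified stand-in for the real EAPI dependency syntax (no slots, USE deps, or blockers), not pkgcraft's actual implementation:

```python
from __future__ import annotations
import re
from dataclasses import dataclass

# Simplified atom grammar: optional version operator, category/name,
# optional trailing version. The real EAPI syntax is far richer.
_ATOM_RE = re.compile(
    r"^(?P<op>[<>]=?|=|~)?"
    r"(?P<cat>[A-Za-z0-9+_.-]+)/(?P<pkg>[A-Za-z0-9+_-]+)"
    r"(?:-(?P<ver>\d[\d.]*))?$"
)

@dataclass(frozen=True)
class Atom:
    op: str | None
    category: str
    package: str
    version: str | None

def parse_atom(s: str) -> Atom:
    """Parse a dependency atom into a structure, rejecting bad formatting."""
    m = _ATOM_RE.match(s)
    if m is None:
        raise ValueError(f"invalid atom: {s!r}")
    return Atom(m["op"], m["cat"], m["pkg"], m["ver"])
```

Once every dependency field passes through a parser like this, malformed entries surface at generation time instead of silently landing in the cache.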
Pkgcraft provides various command-line tools under the pk command from the pkgcraft-tools crate, with metadata generation support via pk repo metadata.
Install from crates.io:

cargo install pkgcraft-tools

Or install from git:

cargo install pkgcraft-tools --git https://github.com/pkgcraft/pkgcraft.git
Pre-built binaries are also provided for releases on supported platforms.
Incrementally generate metadata for the configured gentoo repo:

pk repo metadata gentoo
Target multiple repos including an external repo:
pk repo metadata repo1 path/to/repo2
Force a full regen:
pk repo metadata -f path/to/repo
Note that at least one repo must be specified, either by the name of a
configured ebuild repo on the system or by a path to an external repo.
Specifying the level of parallelism is supported using the -j option,
which defaults to the system’s number of logical CPU cores when unset.
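The jobs-option behavior can be mirrored in a few lines; the flag matches the -j usage shown in the benchmarks below, while the rest is an illustrative sketch rather than pkgcraft's actual argument handling:

```python
import argparse
import os

parser = argparse.ArgumentParser(prog="pk-repo-metadata-sketch")
# Default to the number of logical CPU cores when -j/--jobs is unset.
parser.add_argument("-j", "--jobs", type=int, default=os.cpu_count())

args = parser.parse_args(["-j", "4"])
print(args.jobs)  # 4
```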
In a technical sense, pkgcraft avoids bash as much as possible. As described in the post on rustifying bash, all functionality specified by PMS is implemented with rust building on top of a bundled version of bash. This allows tracking metadata using more efficient internal state structures rather than bash variables.
Parallelism is handled in a simplistic fashion by running the entire workflow – validity checks, ebuild sourcing, metadata structure creation, and file serialization – in a forked process pool. Each process runs the metadata workflow for a single package, encoding the result and sending it via inter-process communication (IPC) to the main process that unwraps it, handling error logging while tracking overall status. After all packages are processed the main process exits, failing if any errors occurred.
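The per-package workflow maps naturally onto a forked worker pool. A minimal Python analogue is sketched below, where multiprocessing's fork start method stands in for pkgcraft's internal pool and the worker body is a placeholder:

```python
import multiprocessing as mp

def process_pkg(pkg: str):
    """Placeholder for the per-package workflow: validity checks, sourcing,
    metadata structure creation, and serialization all run in the child;
    only the small encoded result crosses the IPC boundary."""
    try:
        if "bad" in pkg:  # stand-in for a sourcing failure
            raise ValueError(f"sourcing failed: {pkg}")
        return (pkg, None)
    except Exception as e:
        return (pkg, str(e))

if __name__ == "__main__":
    pkgs = ["cat/a-1", "cat/bad-2", "cat/c-3"]
    ctx = mp.get_context("fork")  # children inherit state via copy-on-write
    with ctx.Pool(processes=2) as pool:
        results = pool.map(process_pkg, pkgs)
    # the main process unwraps results, logging errors and tracking status
    errors = [err for _, err in results if err is not None]
    for err in errors:
        print(f"error: {err}")
    print("failed" if errors else "ok")
```

The overall run fails if any per-package error was reported, mirroring the exit behavior described above.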
For security purposes (and because sandboxing isn’t supported yet), ebuild
sourcing is run within a restricted shell environment. This rejects a lot of
functionality possible in regular bash including running external commands. To
see the entire list (with minor differences), run
man rbash. In
addition, sourcing functionality is configured to act as if
set -e were
enabled, but triggered by internal bash errors rather than command exit
statuses. Overall, the stricter environment has highlighted many instances of
questionable bash usage in the tree such as using
return 0 or declaring local
variables in global scope.
Rough benchmarks comparing the metadata generation implementations of portage, pkgcore, and pkgcraft on a modest laptop with 8 cores/16 threads (AMD Ryzen 7 5700U) running against a semi-current gentoo tree with no metadata on top of an SSD are as follows:
egencache -j16 – approximately 5m
pmaint regen -t 16 – approximately 1m20s
pk repo metadata -j16 – approximately 30s
For comparative parallel efficiency, pkgcraft achieves the following when using different amounts of jobs:
pk repo metadata -j8 – approximately 40s
pk repo metadata -j4 – approximately 1m
pk repo metadata -j2 – approximately 2m
pk repo metadata -j1 – approximately 4m
For a valid metadata cache requiring no updates:
egencache -j16 – approximately 7s
pmaint regen -t 16 – approximately 14s
pk repo metadata -j16 – approximately 0.2s
Note that these results are approximate averages over multiple runs without flushing memory caches. Initial runs of the same commands will be slower due to additional I/O latency.
From the results above, the effects from one of portage’s design flaws can
clearly be seen. Any time the bash side of portage is used, a new bash instance
is started. The process spawning overhead is so dominant that running
egencache using 16 jobs is roughly equivalent to running
pmaint regen using
2-3 jobs and even slower than
pk repo metadata using a single job. Due to
this, it’s best to avoid portage’s metadata generation support if performance
is a priority, especially on slower hardware.
In contrast, pkgcore performs relatively well due to leveraging a bash daemon that spawns subshells (forked processes) to generate metadata, thus avoiding most of the unnecessary process overhead. This approach could be copied into portage to provide the same benefits, but that would require an extensive rework of its bash functionality and IPC interface.
Pkgcraft is the fastest by a significant margin while also doing the most verification of the three; however, it has the advantage that the underlying support isn’t written in bash and currently limits IPC overhead to the relatively minimal encoding of results, relying on the operating system’s copy-on-write support for forked memory pages to “transfer” data into each process.
While all three projects implement the same specification, none produces matching output. In pkgcraft’s case, metadata fields are parsed internally, reordering and removing duplicates where possible since many of the underlying data structures are ordered sets. It also uses a different ordering for circular eclass inherits.
Beyond file serialization, pkgcraft is stricter in what it allows in global
scope whether in ebuilds or eclasses due to its bash configuration as mentioned
previously. This currently causes a decent amount of ebuilds to fail when
generating metadata due to command failures in global scope. It is unlikely
portage or pkgcore will ever be able to catch the same level of issues since
they are designed to run their metadata generation natively in function scope
which hides many of the issues. One alternative would be to leverage an
external bash parser to flag such issues.
Even with these differences, package managers should be able to use caches created by any implementation as long as they adhere to the specified format. This means it’s possible to generate metadata using the quicker solutions, pkgcore or pkgcraft, and then use the result with portage.
There are potentially a few optimizations that might shave more time off a full tree regen. On the bash side, every time a builtin may be called, a binary search is run across the registered builtins to determine if one exists. Replacing those lookups with hash table queries could speed up the execution pipeline for bash code running large numbers of commands, assuming the hash function runs in less time than the binary search. Including pkgcraft’s builtins, a worst-case binary search takes approximately seven steps, so this work may not be worth it.
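The lookup tradeoff can be sketched directly; the builtin names below are a small illustrative subset, not bash's full registry:

```python
import bisect

# Sorted registry, as bash keeps its builtins (binary search per lookup).
BUILTINS = sorted(["cd", "declare", "echo", "eval", "export", "local",
                   "read", "return", "set", "shift", "source", "unset"])
BUILTIN_SET = set(BUILTINS)  # hash table alternative

def is_builtin_bisect(name: str) -> bool:
    """O(log n) comparisons, mirroring bash's sorted-array search."""
    i = bisect.bisect_left(BUILTINS, name)
    return i < len(BUILTINS) and BUILTINS[i] == name

def is_builtin_hash(name: str) -> bool:
    """One hash computation plus an equality check."""
    return name in BUILTIN_SET
```

With a registry of roughly a hundred names, the binary search is only about seven comparisons deep, which is why the hash table swap may not pay off in practice.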
In addition, initial work has been done to allow bash processes to reset their internal state, enabling reuse rather than leveraging subshells internally or forked processes externally. The idea is that instead of forking a new process per ebuild, a process pool could reuse its processes by resetting their state, saving the underlying operating system’s fork() setup time. However, this capability might not come to fruition as it’s difficult to make bash properly reset its state after errors or signals trigger nonlocal gotos via longjmp().
As mentioned previously, pkgcraft currently parallelizes metadata generation with a simplistic pool iterator using a forked process per package. Technically, only sourcing is required to run in a separate process, so it may be faster to use thread pools for validity checks, structure creation, and file serialization. However, since the dominant runtime factor for full regen runs is bash sourcing, this shouldn’t be significant, though it could speed up repo verification.
Beyond pk repo metadata, language bindings should be able to natively
trigger metadata regen for relevant repos. This mainly entails adding support
to pkgcraft-c and determining what level of runtime feedback it provides,
e.g. it could return an iterator of errors, accept a callback function for
error handling, or just use a simple error status. Once that work is complete,
language bindings can wrap it, allowing native metadata regeneration.
As with all programming languages, it’s easy to write poorly performing code in bash. It’s probably fair to say shell coding exacerbates the situation because most interactions with it are interactive shell usage or one-off scripts, with performance rarely being a priority in either case. This often leads to inadvertently writing slow code, including extensive subshell or external command use, processing large amounts of strings, or poorly written loop structures encompassing the previous two.
To help developers debug potential performance issues, more subcommands for
pk are being considered as well. Currently the idea is that
pk pkg env could be
added to provide source timings for given restrictions, allowing debugging
against a single package, an entire category, or the set of packages using a
specific eclass. Furthermore, environment mangling and dumping would be
supported so that, for example, all packages using a specific eclass could be
sourced and a specific variable dumped from each sourced environment. This kind of
tooling should give developers more insight into the metadata generation
process and how different types of coding structures affect it.
Bash uses FNV-1 for hash table support.
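For reference, 32-bit FNV-1 multiplies by the prime before XORing in each byte (FNV-1a swaps the two steps). A straightforward sketch:

```python
# Standard 32-bit FNV parameters.
FNV_PRIME_32 = 16777619
FNV_OFFSET_32 = 2166136261

def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET_32
    for byte in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep within 32 bits
        h ^= byte
    return h
```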