Bulk Analyze Linux Packages with Linux Package Analyzer
January 09, 2022 at 09:10 PM | categories: packaging, Rust
I've frequently wanted to ask random questions about Linux packages and binaries:
- Which packages provide a file X?
- Which binaries link against a library X?
- Which libraries/packages define/export a symbol X?
- Which binaries reference an undefined symbol X?
- Which features of ELF are in common use?
- Which x86 instructions are in a binary?
- What are the most common ELF section names and their sizes?
- What are the differences between package X in distributions A and B?
- Which ELF binaries have the most relocations?
- And many more.
So, I built Linux Package Analyzer to facilitate answering questions like this.
Linux Package Analyzer is a Rust crate providing the lpa CLI tool. lpa currently
supports importing Debian and RPM package repositories (the most popular Linux
packaging formats) into a local SQLite database so subsequent analysis can be
efficiently performed offline. In essence:
- Run lpa import-debian-repository or lpa import-rpm-repository and point the tool at the base URL of a Linux package repository.
- Package indices are fetched.
- Discovered .deb and .rpm files are downloaded.
- Installed files within each package archive are inspected.
- Binary/ELF files have their executable sections disassembled.
- Results are stored in a local SQLite database for subsequent analysis.
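To make the ELF step above a little more concrete, here is a minimal, std-only sketch of how a tool could decide whether an installed file is an ELF binary and which machine it targets, using only the fixed-offset fields at the start of the file. This is not lpa's code (lpa relies on the object crate for real parsing); the ElfIdent type and the sniff_elf function are hypothetical names invented for this illustration.

```rust
/// A few facts read from the fixed-position fields of an ELF header.
/// (Hypothetical type for illustration; not part of lpa.)
#[derive(Debug, PartialEq)]
struct ElfIdent {
    is_64bit: bool,
    little_endian: bool,
    machine: u16,
}

/// e_machine value for x86-64, as defined by the ELF specification.
const EM_X86_64: u16 = 62;

/// Returns header facts if `data` starts with a plausible ELF header,
/// or `None` for anything else (scripts, text files, truncated input).
fn sniff_elf(data: &[u8]) -> Option<ElfIdent> {
    // The magic is 4 bytes; e_machine lives at bytes 18..20.
    if data.len() < 20 || !data.starts_with(b"\x7fELF") {
        return None;
    }
    // EI_CLASS (byte 4): 1 = 32-bit, 2 = 64-bit.
    let is_64bit = match data[4] {
        1 => false,
        2 => true,
        _ => return None,
    };
    // EI_DATA (byte 5): 1 = little endian, 2 = big endian.
    let little_endian = match data[5] {
        1 => true,
        2 => false,
        _ => return None,
    };
    // e_machine is a 16-bit field stored in the file's own byte order.
    let raw = [data[18], data[19]];
    let machine = if little_endian {
        u16::from_le_bytes(raw)
    } else {
        u16::from_be_bytes(raw)
    };
    Some(ElfIdent { is_64bit, little_endian, machine })
}
```

Fed the leading bytes of an amd64 binary, sniff_elf should report a 64-bit, little-endian file whose machine field equals EM_X86_64, while a shell script or other non-ELF file yields None. A check like this is enough to decide which files deserve the more expensive section, symbol, and disassembly passes.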
The LPA-built database currently stores the following:
- Imported packages (name, version, source URL).
- Installed files within each package (path, size).
- ELF file information (parsed fields from header, number of relocations, important metadata from the .dynamic section, etc).
- ELF sections (number, type, flags, address, file offset, etc).
- Dynamic libraries required by ELF files.
- ELF symbols (name, demangled name, type constant, binding, visibility, version string, etc).
- For x86 architectures, counts of unique instructions in each ELF file and counts of instructions referencing specific registers.
Using a command like
lpa import-debian-repository --components main,multiverse,restricted,universe --architectures amd64 http://us.archive.ubuntu.com/ubuntu impish,
I can import the (currently) ~96 GB of package data from 63,720 packages defining
Ubuntu 21.10 to a local ~12 GB SQLite database and answer tons of random questions.
Interesting insights yielded so far include:
- The entirety of the package ecosystem for amd64 consists of 63,720 packages providing 6,704,222 files (168,730 of them ELF binaries) comprising 355,700,362,973 bytes in total.
- Within the 168,730 ELF binaries are 5,286,210 total sections having 606,175 distinct names. There are also 116,688,943 symbols in symbol tables (debugging info is not included and local symbols not imported or exported are often not present in symbol tables) across 19,085,540 distinct symbol names. The sum of all the unique symbol names is 1,263,441,355 bytes and 4,574,688,289 bytes if you count occurrences across all symbol tables (this might be an over count due to how ELF string tables work).
- The longest demangled ELF symbol is 271,800 characters and is defined in the file usr/lib/x86_64-linux-gnu/libmshr.so.2019.2.0.dev0 provided by the libmshr2019.2 package.
- The longest non-mangled ELF symbol is 5,321 characters and is defined in multiple files/packages, as it is part of a library provided by GCC.
- Only 145 packages have files with indirect functions (IFUNCs). If you discard duplicates (mainly from GCC and glibc), you are left with ~11 packages. This does not appear to be a popular ELF feature!
- With 54,764 references in symbol tables, strlen appears to be the most (recognized) widely used libc symbol. It even bests memcpy (52,726) and free (42,603).
- MOV is the most frequent x86 instruction, followed by CALL. (I could write an entire blog post about observations about x86 instruction use.)
There's a trove of data in the SQLite database and the lpa commands only
expose a fraction of it. I reckon a lot of interesting tweets, blog posts,
research papers, and more could be derived from the data that lpa assembles.
lpa does all of its work in-process using pure Rust. The Debian and RPM
repository interaction is handled via the debian-packaging and rpm-repository
crates (which I wrote). ELF file parsing is handled by the (amazing) object
crate. And x86 disassembly is handled by the iced-x86 crate. Many tools similar
to lpa call out to other processes to interface with .deb/.rpm packages,
parse ELF files, disassemble x86, etc. Doing this in pure Rust makes life so
much simpler as all the functionality is self-contained and I don't have to
worry about run-time dependencies for random tools. This means that lpa
should just work from Windows, macOS, and other non-Linux environments.
Linux Package Analyzer is very much in its infancy. And I don't really have a grand vision for it. (I built it and some of the packaging code it is built on in support of some even grander projects I have cooking.) Please file bugs, feature requests, and pull requests on GitHub. The project is currently part of the PyOxidizer repo (because I like monorepos). But I may pull it and other OS/packaging/toolchain code into a new monorepo since the target audiences are different.
I hope others find this tool useful!
Rust Implementation of Debian Packaging Primitives
January 03, 2022 at 04:00 PM | categories: packaging, Rust
Does your Linux distribution use tools with apt in their name to manage
system packages? If so, your system packages are using Debian packaging.
Most tools interfacing with Debian packages (.deb files) and repositories
use functionality provided by the apt repository. This repository provides
libraries like libapt as well as tools like apt-get and apt. Most of the
functionality is implemented in C++.
I wanted to raise awareness that I've begun implementing Debian packaging
primitives in pure Rust. The debian-packaging crate is published on crates.io.
For now, it is developed inside the PyOxidizer repository (because I
like monorepos).
So far, a handful of useful features are implemented:
- Parsing and serializing control files.
- Reading repository indices files and parsing their content.
- Reading HTTP hosted repositories.
- Publishing repositories to the filesystem and S3.
- Writing changelog files.
- Reading and writing .deb files.
- Copying repositories.
- Creating repositories.
- PGP signing and verification operations.
- Parsing and sorting version strings.
- Dependency syntax parsing.
- Dependency resolution.
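Of the items above, version sorting is the one whose rules are least obvious: a Debian version has the shape [epoch:]upstream[-revision], digit runs compare numerically, and ~ sorts before everything, including the end of the string. As a rough illustration, here is a std-only sketch of that ordering following the algorithm described in the Debian Policy Manual. It is not the debian-packaging crate's implementation or API; compare_versions and the helper names are hypothetical, and the sketch assumes well-formed input and performs no validation.

```rust
use std::cmp::Ordering;

/// Sort weight of one character in a non-digit run. `None` is end of string.
/// `~` sorts before everything (even the end), letters before other symbols.
fn order(c: Option<u8>) -> i32 {
    match c {
        None => 0,
        Some(b) if b.is_ascii_digit() => 0,
        Some(b) if b.is_ascii_alphabetic() => b as i32,
        Some(b'~') => -1,
        Some(b) => b as i32 + 256,
    }
}

/// Compares two upstream-version or revision strings.
fn compare_fragment(a: &str, b: &str) -> Ordering {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let (mut i, mut j) = (0, 0);
    while i < a.len() || j < b.len() {
        // Compare the non-digit runs character by character.
        while (i < a.len() && !a[i].is_ascii_digit())
            || (j < b.len() && !b[j].is_ascii_digit())
        {
            let (ac, bc) = (order(a.get(i).copied()), order(b.get(j).copied()));
            if ac != bc {
                return ac.cmp(&bc);
            }
            i += 1;
            j += 1;
        }
        // Compare the digit runs numerically: skip leading zeros, then the
        // longer run wins; equal lengths fall back to the first differing digit.
        while i < a.len() && a[i] == b'0' {
            i += 1;
        }
        while j < b.len() && b[j] == b'0' {
            j += 1;
        }
        let mut first_diff = Ordering::Equal;
        while i < a.len() && j < b.len() && a[i].is_ascii_digit() && b[j].is_ascii_digit() {
            if first_diff == Ordering::Equal {
                first_diff = a[i].cmp(&b[j]);
            }
            i += 1;
            j += 1;
        }
        if i < a.len() && a[i].is_ascii_digit() {
            return Ordering::Greater;
        }
        if j < b.len() && b[j].is_ascii_digit() {
            return Ordering::Less;
        }
        if first_diff != Ordering::Equal {
            return first_diff;
        }
    }
    Ordering::Equal
}

/// Splits `[epoch:]upstream[-revision]`; a missing epoch counts as 0 and a
/// missing revision as the empty string (which compares equal to "0").
fn split_version(v: &str) -> (u64, &str, &str) {
    let (epoch, rest) = match v.split_once(':') {
        Some((e, rest)) => (e.parse::<u64>().unwrap_or(0), rest),
        None => (0, v),
    };
    let (upstream, revision) = rest.rsplit_once('-').unwrap_or((rest, ""));
    (epoch, upstream, revision)
}

/// Orders two Debian version strings: epoch, then upstream, then revision.
fn compare_versions(a: &str, b: &str) -> Ordering {
    let (ea, ua, ra) = split_version(a);
    let (eb, ub, rb) = split_version(b);
    ea.cmp(&eb)
        .then_with(|| compare_fragment(ua, ub))
        .then_with(|| compare_fragment(ra, rb))
}
```

With this ordering, 1.0~rc1 sorts before 1.0, 1.9 sorts before 1.10, and any version with epoch 2 sorts after any version with epoch 1. Passing compare_versions to sort_by on a vector of version strings gives the order a Debian tool would be expected to produce for well-formed input.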
Hopefully the documentation contains all you would want to know about how to use the crate.
The crate is designed to be used as a library so any Rust program can (hopefully) easily tap the power of the Debian packaging ecosystem.
As with most software, there are likely several bugs and many features not yet
implemented. But I have bulk downloaded the entirety of some distribution's
repositories without running into obvious parse/reading failures. So I'm
reasonably confident that important parts of the code (like control file parsing,
repository indices file handling, and .deb file reading) work as advertised.
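For readers who have never looked inside a control file or a repository Packages index: both are sequences of paragraphs separated by blank lines, each paragraph a set of "Field: value" lines, with continuation lines starting with a space or tab. The following std-only sketch shows the general shape of such a parser. It is a simplification and not the debian-packaging crate's code or API; parse_control and Paragraph are hypothetical names, field names are kept exactly as written, and continuation lines are trimmed rather than preserved verbatim.

```rust
use std::collections::BTreeMap;

/// One paragraph (stanza) of a control file: field name -> value.
/// (Hypothetical type for illustration.)
type Paragraph = BTreeMap<String, String>;

/// Parses control-file text into its paragraphs.
fn parse_control(text: &str) -> Vec<Paragraph> {
    let mut paragraphs = Vec::new();
    let mut current = Paragraph::new();
    let mut last_key: Option<String> = None;

    for line in text.lines() {
        if line.trim().is_empty() {
            // A blank line terminates the current paragraph.
            if !current.is_empty() {
                paragraphs.push(std::mem::take(&mut current));
            }
            last_key = None;
        } else if line.starts_with('#') {
            // Comment lines are skipped.
            continue;
        } else if line.starts_with(' ') || line.starts_with('\t') {
            // Continuation line: append to the most recent field.
            // Simplification: leading whitespace is trimmed away.
            if let Some(key) = &last_key {
                if let Some(value) = current.get_mut(key) {
                    value.push('\n');
                    value.push_str(line.trim());
                }
            }
        } else if let Some((key, value)) = line.split_once(':') {
            // Split at the first colon only, so values such as
            // "1:1.2.11" keep their own colons.
            current.insert(key.to_string(), value.trim().to_string());
            last_key = Some(key.to_string());
        }
    }
    // The final paragraph may not be followed by a blank line.
    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs
}
```

Given a made-up two-stanza input, the result is a vector of two maps from which fields like Package and Version can be looked up by name. A production parser also has to handle signed files, field-name case insensitivity, and exact whitespace in multi-line fields, all of which this sketch ignores.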
Hopefully someone out there finds this work useful!