Bulk Analyze Linux Packages with Linux Package Analyzer

January 09, 2022 at 09:10 PM | categories: packaging, Rust

I've frequently wanted to ask random questions about Linux packages and binaries:

  • Which packages provide a file X?
  • Which binaries link against a library X?
  • Which libraries/packages define/export a symbol X?
  • Which binaries reference an undefined symbol X?
  • Which features of ELF are in common use?
  • Which x86 instructions are in a binary?
  • What are the most common ELF section names and their sizes?
  • What are the differences between package X in distributions A and B?
  • Which ELF binaries have the most relocations?
  • And many more.

So I built Linux Package Analyzer to help answer questions like these.

Linux Package Analyzer is a Rust crate providing the lpa CLI tool. lpa currently supports importing Debian and RPM package repositories (the most popular Linux packaging formats) into a local SQLite database so subsequent analysis can be efficiently performed offline. In essence:

  1. Run lpa import-debian-repository or lpa import-rpm-repository and point the tool at the base URL of a Linux package repository.
  2. Package indices are fetched.
  3. Discovered .deb and .rpm files are downloaded.
  4. Installed files within each package archive are inspected.
  5. Binary/ELF files have their executable sections disassembled.
  6. Results are stored in a local SQLite database for subsequent analysis.
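
Steps 4 and 5 depend on recognizing which installed files are ELF binaries. lpa does this in Rust via the object crate; the sketch below is not lpa's code, just a minimal Python illustration of reading the fixed identification bytes at the start of an ELF file. The function name describe_elf and the shape of the returned dictionary are made up for this example.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# A few well-known e_machine values from the ELF specification.
MACHINES = {3: "x86", 62: "x86-64", 183: "aarch64"}


def describe_elf(data: bytes):
    """Return basic header fields for an ELF image, or None if it is not ELF.

    Hypothetical helper for illustration only; not part of lpa.
    """
    if len(data) < 20 or data[:4] != ELF_MAGIC:
        return None
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None
    endian = "<" if ei_data == 1 else ">"
    # e_type and e_machine are 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(endian + "HH", data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "little_endian": ei_data == 1,
        "type": e_type,
        "machine": MACHINES.get(e_machine, f"unknown ({e_machine})"),
    }
```

Passing the first few dozen bytes of each extracted file to a check like this is enough to decide whether the more expensive section, symbol, and disassembly passes are worth running.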

The LPA-built database currently stores the following:

  • Imported packages (name, version, source URL).
  • Installed files within each package (path, size).
  • ELF file information (parsed fields from header, number of relocations, important metadata from the .dynamic section, etc).
  • ELF sections (number, type, flags, address, file offset, etc).
  • Dynamic libraries required by ELF files.
  • ELF symbols (name, demangled name, type constant, binding, visibility, version string, etc).
  • For x86 architectures, counts of unique instructions in each ELF file and counts of instructions referencing specific registers.
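
To give a feel for how data like this answers the questions at the top of the post, here is a small sketch using Python's sqlite3 module. The table and column names (package, package_file, elf_needed_library) and the sample rows are assumptions invented for illustration; the real lpa schema may look quite different.

```python
import sqlite3

# Hypothetical, simplified schema; the real lpa database layout may differ.
SCHEMA = """
CREATE TABLE package (
    id INTEGER PRIMARY KEY, name TEXT, version TEXT, source_url TEXT);
CREATE TABLE package_file (
    id INTEGER PRIMARY KEY, package_id INTEGER REFERENCES package(id),
    path TEXT, size INTEGER);
CREATE TABLE elf_needed_library (
    file_id INTEGER REFERENCES package_file(id), name TEXT);
"""


def open_demo_db():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO package VALUES (?, ?, ?, ?)",
        [(1, "coreutils", "1.0", "http://example.invalid/coreutils.deb"),
         (2, "zlib1g", "1.0", "http://example.invalid/zlib1g.deb")],
    )
    conn.executemany(
        "INSERT INTO package_file VALUES (?, ?, ?, ?)",
        [(1, 1, "usr/bin/ls", 1000),
         (2, 2, "usr/lib/x86_64-linux-gnu/libz.so.1", 2000)],
    )
    conn.executemany(
        "INSERT INTO elf_needed_library VALUES (?, ?)",
        [(1, "libc.so.6"), (1, "libselinux.so.1"), (2, "libc.so.6")],
    )
    return conn


def packages_providing(conn, path):
    """Which packages provide a file at this path?"""
    rows = conn.execute(
        "SELECT p.name FROM package p "
        "JOIN package_file f ON f.package_id = p.id "
        "WHERE f.path = ? ORDER BY p.name",
        (path,),
    )
    return [row[0] for row in rows]


def files_linking_against(conn, library):
    """Which files declare this library as a dynamic dependency?"""
    rows = conn.execute(
        "SELECT f.path FROM package_file f "
        "JOIN elf_needed_library n ON n.file_id = f.id "
        "WHERE n.name = ? ORDER BY f.path",
        (library,),
    )
    return [row[0] for row in rows]
```

With the demo rows, packages_providing(conn, "usr/bin/ls") names the coreutils package and files_linking_against(conn, "libc.so.6") lists both files. Against a real import, the same kind of join would run over millions of rows, which is why a local, indexed database is a good fit.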

Using a command like lpa import-debian-repository --components main,multiverse,restricted,universe --architectures amd64 http://us.archive.ubuntu.com/ubuntu impish, I can import the (currently) ~96 GB of package data from 63,720 packages defining Ubuntu 21.10 to a local ~12 GB SQLite database and answer tons of random questions. Interesting insights yielded so far include:

  • The entirety of the package ecosystem for amd64 consists of 63,720 packages providing 6,704,222 files (168,730 of them ELF binaries) comprising 355,700,362,973 bytes in total.
  • Within the 168,730 ELF binaries are 5,286,210 total sections having 606,175 distinct names. There are also 116,688,943 symbols in symbol tables (debugging info is not included, and local symbols that are neither imported nor exported are often not present in symbol tables) across 19,085,540 distinct symbol names. The unique symbol names total 1,263,441,355 bytes, or 4,574,688,289 bytes if you count every occurrence across all symbol tables (this might be an overcount due to how ELF string tables work).
  • The longest demangled ELF symbol is 271,800 characters and is defined in the file usr/lib/x86_64-linux-gnu/libmshr.so.2019.2.0.dev0 provided by the libmshr2019.2 package.
  • The longest non-mangled ELF symbol is 5,321 characters and is defined in multiple files/packages, as it is part of a library provided by GCC.
  • Only 145 packages have files with indirect functions (IFUNCs). If you discard duplicates (mainly from GCC and glibc), you are left with ~11 packages. This does not appear to be a popular ELF feature!
  • With 54,764 references in symbol tables, strlen appears to be the most widely used (recognized) libc symbol. It even bests memcpy (52,726) and free (42,603).
  • MOV is the most frequent x86 instruction, followed by CALL. (I could write an entire blog post about observations about x86 instruction use.)
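
Rankings like the strlen one boil down to a GROUP BY over the symbol data. The sketch below shows the general shape of such a query in Python; the elf_symbol table, its is_undefined column, and the sample rows are hypothetical, and it assumes a "reference" means an undefined-symbol entry in some file's symbol table, which may not be exactly how the numbers above were computed.

```python
import sqlite3


def most_referenced_symbols(conn, limit=3):
    """Rank symbol names by how many symbol tables reference them."""
    rows = conn.execute(
        """
        SELECT name, COUNT(*) AS refs
        FROM elf_symbol
        WHERE is_undefined = 1
        GROUP BY name
        ORDER BY refs DESC, name
        LIMIT ?
        """,
        (limit,),
    )
    return rows.fetchall()


# Made-up sample data: three files, each with a handful of symbols.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elf_symbol (file_id INTEGER, name TEXT, is_undefined INTEGER)"
)
conn.executemany(
    "INSERT INTO elf_symbol VALUES (?, ?, ?)",
    [
        (1, "strlen", 1), (1, "memcpy", 1),
        (2, "strlen", 1), (2, "free", 1),
        (3, "strlen", 1), (3, "memcpy", 1), (3, "main", 0),
    ],
)
top = most_referenced_symbols(conn)
```

Here top ranks strlen first, then memcpy, then free; the defined symbol main is excluded because it is not a reference.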

There's a trove of data in the SQLite database and the lpa commands only expose a fraction of it. I reckon a lot of interesting tweets, blog posts, research papers, and more could be derived from the data that lpa assembles.

lpa does all of its work in-process using pure Rust. The Debian and RPM repository interaction is handled via the debian-packaging and rpm-repository crates (which I wrote). ELF file parsing is handled by the (amazing) object crate, and x86 disassembly by the iced-x86 crate. Many tools similar to lpa call out to other processes to interface with .deb/.rpm packages, parse ELF files, disassemble x86, etc. Doing this in pure Rust makes life so much simpler, as all the functionality is self-contained and I don't have to worry about run-time dependencies on random tools. This means that lpa should just work on Windows, macOS, and other non-Linux environments.

Linux Package Analyzer is very much in its infancy, and I don't really have a grand vision for it. (I built it, and some of the packaging code it is built on, in support of some even grander projects I have cooking.) Please file bugs, feature requests, and pull requests on GitHub. The project is currently part of the PyOxidizer repo (because I like monorepos), but I may pull it and other OS/packaging/toolchain code into a new monorepo since the target audiences are different.

I hope others find this tool useful!