Matt Taggart is down from Seattle this week. Last night we ended up talking at length about a next-generation Linux packaging system. Matt thought it might take locking up a bunch of people who care about RPMs and Debs in a room to iron things out. I thought it might require locking these people up in a room, and then having a few other people design it. ;-) Anyway, I wanted to write down what we had discussed, for the record.

The idea is that there are advantages to having fewer package formats. However, it's unlikely that either of the big formats would be directly adopted by the other camp. Each one has advantages that would be hard to give up. So, what can we learn from existing package formats if a new one were to be created?

Matt is very experienced with Debian packages, and I am with RPMs.

File format

RPM uses a format that has a small header followed by a cpio archive of the files. The header contains meta-information about the package, pre/post scripts, the package signature, and similar information.

Debian packages are ar format archives, with all meta-information contained as entries within the archive. The problem with ar is that it's not platform-independent. When building packages, particularly “noarch” packages that will run on any architecture,

A format that is architecture-independent is useful, and not really very hard to achieve. RPMs have solved this pretty well, using cpio. A better choice may be the POSIX tar specification, which is truly cross-platform, except that tar headers and files are padded to 512-byte boundaries. CPIO, I believe, is not.
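To make the padding point concrete, here is a minimal sketch using Python's standard tarfile module. The file name and payload are made up for illustration. It builds a ustar archive in memory holding a single 3-byte file and reports the resulting archive size.

```python
import io
import tarfile


def tar_size_for_payload(payload: bytes) -> int:
    """Build an in-memory ustar archive holding one file; return its size."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as archive:
        # "hello.txt" is a hypothetical member name used only for this demo.
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return len(buffer.getvalue())


size = tar_size_for_payload(b"hi\n")
# The archive is always a whole number of 512-byte blocks,
# no matter how small the payload is.
print(size, size % 512)
```

The second number printed is 0: the header and the data are each rounded up to a full 512-byte block, so a package made of many tiny files carries a noticeable amount of padding before compression.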

Source packages

Debian uses a copy of the pristine source, with a patch file which contains all differences to get it to build, and a description file. So, to get a Debian package you need to download 3 files. The benefit is that when a new package is released, mirrors don't need to get the source over again, they can just get the patch and description files.

Source RPMs are a single file containing (almost) everything needed to build the binary package.

We toyed with the idea that, with an appropriately formatted package, rsync could transfer just the differences when a file changed, but usually the file name changes as well, so rsync would end up doing a full transfer anyway. With systems like “apt-get” you can easily download the source to a package, even if it's multiple files, so Debian's way is probably the best overall.

How much of a difference does this make? A full mirror of ftp.debian.org is 122GB, ftp.redhat.com is 202GB, and fedora.redhat.com is 122GB. Fedora is pretty amazing because it's really only on its 4th release, with an age of around 2 years. With the number of releases done on Red Hat, it's not as big as I'd expected.

Source patches

RPMs can carry multiple patches. This lets you keep separate patches for different functionality and add new patches without re-working an existing one. It also lets you conditionally include patches in the build, or disable or remove them, based on the builder's desires.

Debian packages only have a single patch, but a helper has been added recently to allow that patch to contain multiple patches, which are then used in the build. I haven't seen this mechanism in action, but I imagine it is a wash as to which mechanism is used.

Signed packages

RPMs have had support for signed packages for ages. You can individually sign binary and source packages with a GPG key, and at install time you can specify that packages which do not have a matching signature will not get installed by yum (the equivalent to apt-get on Debian, better than apt-get on RPM-based systems).

Sarge may (I haven't gotten a straight answer on this yet) include a kind of signed package support. Apparently it works like this: the main Release file is signed and includes md5sums of the files it refers to, which in turn eventually lead down to an md5sum of the individual files. I'm not convinced that this is actually secure though.
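As I understand the description, the check is a chain: the signed top-level file vouches for an index, and the index vouches for each package. Below is a rough sketch of that chain. The "digest name" line layout and the file names are assumptions made up for illustration, not the real Debian file formats, and the signature check on the top-level file is assumed to have happened elsewhere.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def parse_sums(text: str) -> dict:
    """Parse lines of the form '<md5> <filename>' (a made-up layout)."""
    sums = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            sums[name] = digest
    return sums


def package_is_trusted(release_text, index_name, index_bytes,
                       package_name, package_bytes) -> bool:
    # Step 1: the (already signature-checked) top-level file vouches
    # for the index file.
    if parse_sums(release_text).get(index_name) != md5_hex(index_bytes):
        return False
    # Step 2: the index file vouches for the individual package.
    index_sums = parse_sums(index_bytes.decode("utf-8"))
    return index_sums.get(package_name) == md5_hex(package_bytes)


# Hypothetical data standing in for a real archive.
package = b"pretend package contents"
index = f"{md5_hex(package)} foo_1.0.deb\n".encode()
release = f"{md5_hex(index)} Packages\n"

print(package_is_trusted(release, "Packages", index, "foo_1.0.deb", package))
print(package_is_trusted(release, "Packages", index, "foo_1.0.deb",
                         package + b"!"))
```

The first call prints True and the second prints False, because the tampered package no longer matches the digest in the index. Note that the whole scheme is only as strong as the hash function and the one signature at the top.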

My feeling is that the RPM mechanism is superior, or at the least that it would work with the Debian procedures (where the individual packages wouldn't be signed, but a master md5sum file would be).

Conditional building

SRPMs have a mechanism for doing conditional builds, which was added recently. This allows the build process to conditionally include patches or scripts based on the build environment. The problem is that there are no real standards for the name-space of these conditions. It is also poorly documented.

I don't know anything about conditional building in Debian.

The RPM mechanism is fairly good, and works fairly well. If Debian has something as good or better, I'd be fine with that. See the section on Build Environment for some more things that need to be included here, which RPM does not.

Recommends

Debian packages have a nice feature called “recommends”. It's kind of like a less serious dependency. It seems nice, because the package automation tools can inform the user that they can get extra functionality by installing other optional packages. For example, there's “vim-python”, which allows you to program vim extensions in Python. I don't know that RPMs have this ability.
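Here is a small sketch of how a tool might treat the two kinds of relationship differently: hard dependencies get pulled in, while recommends are only reported to the user. The Package class, the plan_install function, and the catalog contents (other than “vim-python”) are hypothetical and not how apt or dpkg actually work internally.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    name: str
    depends: list = field(default_factory=list)
    recommends: list = field(default_factory=list)


def plan_install(name, catalog):
    """Return (packages to install, optional extras to mention to the user)."""
    to_install, suggestions, stack = [], [], [name]
    while stack:
        current = stack.pop()
        if current in to_install:
            continue
        to_install.append(current)
        pkg = catalog[current]
        # Hard dependencies are followed and installed.
        stack.extend(pkg.depends)
        # Recommends are only collected, never followed.
        for rec in pkg.recommends:
            if rec not in suggestions:
                suggestions.append(rec)
    # Do not suggest something that is being installed anyway.
    suggestions = [s for s in suggestions if s not in to_install]
    return to_install, suggestions


# Made-up catalog for illustration.
catalog = {
    "vim": Package("vim", depends=["libc"], recommends=["vim-python"]),
    "libc": Package("libc"),
    "vim-python": Package("vim-python", depends=["vim", "python"]),
    "python": Package("python", depends=["libc"]),
}

install, suggest = plan_install("vim", catalog)
print(install)
print(suggest)
```

This prints ['vim', 'libc'] followed by ['vim-python']: the editor and its hard dependency are installed, and the Python extension is offered as an optional extra.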

Helpers

Debian packages are built by running a bunch of user-specified helpers that do things like compress man pages, compile Python module files, etc. The nice thing here is that the user specifies which ones are run.

RPM has similar functionality, but it's built into the RPM tool and is therefore less flexible and less deterministic. For example, at one point “rpmbuild” started compressing man pages by default, but didn't change the file-list in the spec file, so RPMs would stop building and would report “Hey, the man page foo.1 wasn't installed”. RPM doesn't have built-in code for compiling Python modules, but it does do things like check programs for dependencies and auto-generate a dependency list. This sometimes goes slightly wrong, which can be maddening.

I think the Debian mechanism is superior here.

Name-spaces

Both Debian packages and RPMs could do better about name-spaces. This comes about in many different ways. Of course, it would be nice to have a unified package name name-space, so you could tell a user to install a particular package without caring what distro they are on. The most obvious example is that Debian calls their development packages “*-dev” where Red Hat and similar call them “*-devel”. However, there are many other packages that could have name-space convergence. Of course, this isn't really a package format problem, but it belongs in the same name-space discussion.

RPMs also have some conditional compilation abilities, but the name-space for these options isn't well defined. This functionality is pretty new, so that's probably part of the reason. My first impulse is that Gentoo's build system is probably the most mature on this front, but I know next to nothing about it.

There is also an issue with package grouping name-space. RPMs have a “group” field in the meta-data which includes a file-system-like “/” separated group hierarchy, but this isn't well defined. You can put anything in here, and people do. In fact, it's not easy to find out what other packages are using.

Beyond just finding it, it would be nice to have better groupings. For example, it would be nice to have all related packages (or recommends) available under a particular package. Something like “Web/Servers/Apache”, where Apache would identify itself by this name, with sub-items for things like “Web/Servers/Apache/mod_python” and “mod_php” and “mod_ssl”, etc… Of course, it can get complicated by things like PHP, which is also available stand-alone, with many of its own sub-packages.

Ideally you'd like something where you could specify “Programming/Languages/PHP” which contains a list of available modules, but also have those modules available under “Web/Servers/Apache/PHP”. So, you could get to the modules by either path.
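One way to picture this is an index where a package can register itself under more than one group path. The sketch below is purely illustrative: the GroupIndex class and the “php-mysql” package name are made up, and no existing package tool works this way as far as I know.

```python
from collections import defaultdict


class GroupIndex:
    """Maps '/'-separated group paths to the packages filed under them."""

    def __init__(self):
        self._members = defaultdict(set)

    def add(self, package, *paths):
        # A package may be filed under any number of group paths.
        for path in paths:
            self._members[path].add(package)

    def packages_under(self, prefix):
        """All packages at or below a group path, sorted by name."""
        prefix = prefix.rstrip("/")
        found = set()
        for path, packages in self._members.items():
            if path == prefix or path.startswith(prefix + "/"):
                found |= packages
        return sorted(found)


index = GroupIndex()
index.add("apache", "Web/Servers/Apache")
index.add("mod_ssl", "Web/Servers/Apache/mod_ssl")
# The same module is reachable by either path.
index.add("php-mysql", "Programming/Languages/PHP", "Web/Servers/Apache/PHP")

print(index.packages_under("Web/Servers/Apache"))
print(index.packages_under("Programming/Languages/PHP"))
```

The first line printed is ['apache', 'mod_ssl', 'php-mysql'] and the second is ['php-mysql'], so the PHP module shows up whether you start from the web server or from the language.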

Build environment

I don't know if Debian packages help this any, I suspect not. However, in building RPMs there is a problem that the environment of a package build impacts the resulting package. Of course, part of this is just the system that you are on, what libraries are installed and the like. However, there are also environment variables (“CFLAGS”, “MAKEFLAGS”) that impact the build. It would be nice if there were a known environment that packages were built with, with any deviations from this included in the source package.

Of course, some things could be defined as being allowed to be overridden on the local machine, for example the build concurrency (“make -j N”) would depend on the number of CPUs and things like distcc.
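A sketch of what I have in mind: a known baseline, deviations recorded in the source package, and a short list of variables the local machine is allowed to override. The baseline values, the allow-list, and the function name are all assumptions for illustration.

```python
import os

# Hypothetical agreed-upon baseline for every package build.
BASELINE = {"CFLAGS": "-O2", "MAKEFLAGS": ""}

# Hypothetical list of variables a local machine may override.
LOCALLY_OVERRIDABLE = {"MAKEFLAGS"}


def build_environment(package_deviations, local_overrides):
    """Combine baseline, per-package deviations, and allowed local overrides."""
    env = dict(BASELINE)
    # Deviations travel with the source package, so they are reproducible.
    env.update(package_deviations)
    for key, value in local_overrides.items():
        if key not in LOCALLY_OVERRIDABLE:
            raise ValueError(f"{key} may not be overridden locally")
        env[key] = value
    return env


env = build_environment(
    {"CFLAGS": "-O2 -fno-strict-aliasing"},
    {"MAKEFLAGS": f"-j{os.cpu_count() or 1}"},
)
print(env)
```

Build concurrency changes from machine to machine without affecting the result, so it is allowed through. A local CFLAGS override is rejected because it would change the binaries that come out.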

This functionality would be extremely nice to have.

Databases

RPM uses a Berkeley database for the system package database. BSDDB is extremely hard to get right (based on discussions I've had with people using BSDDB extensively, and demonstrated by database corruption I've experienced with SVN and RPM). This seems to have been resolved recently, but there was a good year or two where we had tons of problems with RPM databases, which leads me to the conclusion that BSDDB lacks robustness because of the complications in its API or documentation.

Debian uses flat text files for the system package database. It seems to be fast, and I've never had corruption problems with it.
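Part of the appeal is how little code it takes to read such a database. The sketch below parses blank-line-separated “Field: value” stanzas, which is the general shape of the dpkg status file. The sample data and version numbers are made up, and the real format has details this ignores.

```python
def parse_stanzas(text):
    """Parse blank-line-separated 'Field: value' stanzas into dicts."""
    packages = []
    for block in text.strip().split("\n\n"):
        record, last_key = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and last_key:
                # Indented lines continue the previous field.
                record[last_key] += "\n" + line.strip()
            elif ":" in line:
                key, value = line.split(":", 1)
                last_key = key.strip()
                record[last_key] = value.strip()
        if record:
            packages.append(record)
    return packages


# Made-up sample data in the stanza style.
SAMPLE = """\
Package: vim
Status: install ok installed
Version: 6.3
Description: text editor
 with a continuation line

Package: libc
Status: install ok installed
Version: 2.3
"""

db = parse_stanzas(SAMPLE)
print([p["Package"] for p in db])
```

This prints ['vim', 'libc']. A file like this can also be inspected or repaired with a text editor, which is not true of a binary database.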

While this isn't specifically tied to the package format, it would affect the choice of which existing package format to adopt as the base for a new one. To be honest, in this regard I believe that Debian's solution wins hands down.

Dependencies

Dependencies, I believe, are well solved. I don't expect there will be much of a problem coming up with a unified dependency system which both Debian and RPM people are happy with.

Conclusion

I think this covers the topics we discussed, and then some other ideas I wanted to write down while thinking of the topics Matt and I covered. Please reply if you have any good related ideas or discussion.
