Initially I am working full time on the Netarkivet project, which harvests the Danish part of the internet several times a year and stores it for future research as part of preserving the Danish internet heritage. Think Internet Archive/Wayback Machine for Danes.
NetarchiveSuite (which controls the Heritrix harvester and archives the harvested files) is open source and available at https://sbforge.org/display/NAS/NetarchiveSuite.
One of the pending maintenance tasks is to migrate from a rather evolved Ant build to Maven (after migrating from Subversion to Git). There is a large set of tests in place, but the Ant build generates several artifacts, whereas Maven is designed to generate only one artifact per Maven project. It is therefore important to verify that the Maven build reproduces the Ant artifacts.
As the build contains a lot of classes and a lot of tests, the easiest way to programmatically confirm that the builds are the same is simply to check whether the resulting jar/war files are binary identical. In my earlier work on getting Jenkins fingerprinting to work with multi-step Maven builds, I found that Maven builds are simply not designed to be reproducible and have the following problems:
- Maven-generated artifacts depend on where and when they were built, and not only on the source file contents. Default property files enclosed in the artifacts have timestamps, but this can be turned off directly in pom.xml.
- All entries in a jar/war/ear file have a time stamp corresponding to when the file was written to disk. This needs to be explicitly set to the same value across all builds. The Maven jar plugin does not support this.
- The order of the entries in the jar/war/ear files is non-deterministic by default, which makes the files impossible to compare byte for byte. A sorting order needs to be defined - it appears that much of this work has already been done in the pack200 repackaging utility. To my knowledge this is not supported in any of the default jar/war/ear creation mechanisms in Ant or Maven.
- The operating system platform indicator stored with each entry (which apparently is used for defining how platform-dependent information is stored) needs to be normalized. A reasonable default would be the Unix platform, allowing for rwx-bits for scripts, or the FAT platform, to be as simple as possible.
After thinking it over, the simplest approach is probably to rework the current Ant build to avoid having any timestamps inside generated files, and to create a new project capable of normalizing a zip-file. Inspired by the work done by James Clark to make XML files comparable (http://jclark.com/xml/canonxml.html), I think a good name for such a normalized zip-file is Canonical ZIP.
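To make the idea a bit more concrete, here is a minimal sketch of what such a normalizer could look like, using only `java.util.zip` (Java 9 or later). It rewrites an archive with entries sorted by name and with one fixed timestamp on every entry. The class name `CanonicalZip`, the helper `sampleZip`, and the chosen timestamp are my own illustrative assumptions, not part of any existing tool. The sketch does not touch the platform indicator, since as far as I know `java.util.zip` has no public API for setting it, and the compressed bytes may still differ between JVMs that ship different zlib versions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class CanonicalZip {
    // Assumption: an arbitrary fixed timestamp; any constant works if all builds share it.
    // setTimeLocal stores it as a DOS time without consulting the default time zone.
    static final LocalDateTime FIXED_TIME = LocalDateTime.of(2000, 1, 1, 0, 0, 0);

    /** Rewrites a zip/jar/war with entries sorted by name and a fixed timestamp. */
    static byte[] canonicalize(byte[] zipBytes) {
        // TreeMap keeps the entries sorted by name; a duplicate name keeps the last copy.
        Map<String, byte[]> entries = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry e;
            while ((e = in.getNextEntry()) != null) {
                entries.put(e.getName(), e.isDirectory() ? new byte[0] : in.readAllBytes());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                // A fresh ZipEntry drops comments and extra fields from the original.
                ZipEntry e = new ZipEntry(entry.getKey());
                e.setTimeLocal(FIXED_TIME);
                out.putNextEntry(e);
                out.write(entry.getValue());
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    /** Hypothetical demo helper: builds a zip whose entries contain their own names. */
    static byte[] sampleZip(long timestampMillis, String... names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String name : names) {
                ZipEntry e = new ZipEntry(name);
                e.setTime(timestampMillis);
                out.putNextEntry(e);
                out.write(name.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Same content, different entry order and different build time.
        byte[] first = sampleZip(1_000_000_000_000L, "b.txt", "a.txt");
        byte[] second = sampleZip(1_500_000_000_000L, "a.txt", "b.txt");
        System.out.println(Arrays.equals(canonicalize(first), canonicalize(second)));
    }
}
```

Running `main` on a single JVM should report that the two canonicalized archives are identical even though the inputs differ in entry order and timestamps. A real tool would also need to decide what to do with manifests, entry comments, and file permissions, which this sketch simply discards.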