IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.

My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and a provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what is Islandora first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library thereby allowing users to deposit, create derivatives, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives. The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, and if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and DIP. The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (jpg & thumbnail), PDF, and derivatives of the warc via warctools. Those derivatives are a csv and filtered warc. The csv -- WARC_CSV -- is a listing of all the files in a given warc. This allows a user/researcher to have a quick glance at the contents of the warc. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching. The DIP is an a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage here is to provide a link to the object in a local instance of Wayback, if it exists.

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: Appraisal and selection, scoping, and data capture. We allow users to use their own Appraisal and selection, scoping, and data capture processes. So, for example, locally, we use Heritrix for cron based crawls, and our own bash script for one-off crawls.

What does it cover? All of the rest of the steps!

Storage and organization: via Fedora Commons & Islandora
QA and analysis: via display/DIP -- visualization exposes it!
Metadata/description: every web archive object has a MODS descriptive datastream
Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
Preservation: preservation depends on the policies of the repository/institution, but, in our case we have a preservation action plan for web archives, and suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
Risk management: see above.

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has ideas, comments, suggestion, or roadmapish ideas, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on Github.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that is it an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies to run it.

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is apart of a digital asset management system (it is the first of its kind that I am aware of), and the utilization of warctools for keyword searching and file inventories.

Shortcomings? That it is apart of a digital asset management system.

Very quick demo time!