Preserving websites


ToDo 9/7

Test case on one or more of the following websites:

  • CEST website (wiki CMS)

later:

  • DCA & PACKED websites (proprietary CMS)
  • SCART website (Drupal CMS)

Different ways of making a static archive of a Drupal website: https://www.drupal.org/node/27882

Goal

Take websites offline and place them in a repository in a form in which they can be preserved in the long term without loss of content.

Guidelines

Preamble

  • archive a website before it is taken offline, so you can check whether the archived copy faithfully represents the original online version
  • this also allows you to identify the external services that are essential for rendering the website, so you can decide on the scope of the crawl
  • basically, the objective is to crawl the complete content of the website, including the external services
  • basically, we are talking about taking a snapshot of a website
  • web archiving means accepting gaps
  • web archiving means reconstructing/mimicking things

1. document the technical environment

minimum

  • what server configuration the website was running on (PHP, MySQL version) (see the sketch below)
  • technical metadata (e.g. HTML, CSS, JavaScript version)
  • the CMS or wiki software used to build the website (Drupal, MediaWiki, Joomla)
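A minimal sketch of how part of this could be recorded automatically while the site is still online. The URL and output file name are examples, and headers such as Server or X-Generator are only present on some setups (Drupal, for instance, sends X-Generator).

  import json
  import requests

  url = "https://www.projectcest.be/"            # example: assumed URL of the CEST wiki
  resp = requests.get(url)

  tech_env = {
      "url": url,
      "date_checked": resp.headers.get("Date"),
      "server": resp.headers.get("Server"),          # often reveals web server / PHP version
      "generator": resp.headers.get("X-Generator"),  # e.g. "Drupal 7 (http://drupal.org)" on Drupal sites
      "content_type": resp.headers.get("Content-Type"),
  }

  with open("technical-environment.json", "w") as out:
      json.dump(tech_env, out, indent=2)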

recommended


2. record the original structure of the website.

minimum

  • keep the links as they were (the original URL names) and associate each with another URL to preserve its function
  • keep the files in the order in which they were placed in the folders

recommended

  • use the WARC format? (a sketch follows below)
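A small sketch of writing and then reading WARC files in Python, using the warcio library together with requests; neither library is named elsewhere on this page, so treat this purely as an illustration, and the file and site names are examples.

  from warcio.capture_http import capture_http
  import requests  # warcio requires requests to be imported after capture_http

  # Write every HTTP request/response made inside the block into a WARC file.
  with capture_http("snapshot.warc.gz"):
      requests.get("https://www.projectcest.be/")

  # Read the records back, e.g. to verify that the original URLs were preserved.
  from warcio.archiveiterator import ArchiveIterator

  with open("snapshot.warc.gz", "rb") as stream:
      for record in ArchiveIterator(stream):
          if record.rec_type == "response":
              print(record.rec_headers.get_header("WARC-Target-URI"))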

3. record the context of the website

minimum

  • related external pages (links to other websites/web documents)
  • related external content (YouTube, MediaWiki)
  • related external services/scripts (see the sketch below)
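As an illustration of scoping those external dependencies, the sketch below lists the external hosts referenced by a single page (links, scripts, embedded media). It uses only the Python standard library; the page URL is an example.

  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen

  page_url = "https://www.projectcest.be/"   # example page

  class RefCollector(HTMLParser):
      """Collect every href/src reference found in the page."""
      def __init__(self, base):
          super().__init__()
          self.base = base
          self.refs = set()

      def handle_starttag(self, tag, attrs):
          for name, value in attrs:
              if name in ("href", "src") and value:
                  self.refs.add(urljoin(self.base, value))

  html = urlopen(page_url).read().decode("utf-8", errors="replace")
  collector = RefCollector(page_url)
  collector.feed(html)

  own_host = urlparse(page_url).netloc
  external_hosts = {urlparse(r).netloc for r in collector.refs} - {own_host, ""}
  print(sorted(external_hosts))  # e.g. youtube.com, CDN hosts, analytics services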

recommended

4. record the evolution of the website

web inventory guideline

What is the main URL of the website? Which URL should you use to archive the website?

• Archive every URL of one website.

• Some URLs might only mirror part of the website rather than all of it (e.g. the DCA website).



  1. describe the content of the website
  2. decide the extent of the crawl (physical scope)
  3. decide how often you want to crawl (temporal scope)
  4. decide to what extent you preserve the look & feel
    1. do you only archive an original copy?
    2. do you archive a preservation file and an access file?

How to deal with changing web standards:

  • document which standards and browsers apply
  • emulate the environment
  • make static versions

ARC files

research projects

  • LiWA
  • ARCOMEM


static HTML versus dynamic pages

  • scripts/programs
  • endless pages

Tools

There are a number of tools for archiving websites. Depending on the website, some approaches or tools will be more efficient than others.

HTTrack is a good tool to get a clone/offline version of a website. The only point that could pose a problem with this tool is that, in order to have working links, it renames all the files and local URLs to enable local navigation. Therefore, the original page names are not kept. This can be a problem for archiving, but not for access.

http://www.httrack.com/page/1/en/index.html
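A sketch of driving HTTrack from Python via subprocess; the target URL, output directory and filter are examples. Because of the renaming behaviour described above, this copy is best treated as an access copy rather than the preservation copy.

  import subprocess

  subprocess.run([
      "httrack", "https://www.projectcest.be/",   # example site to mirror
      "-O", "./cest-mirror",                      # output directory for the offline copy
      "+*.projectcest.be/*",                      # filter: stay within the site's own domain
      "-v",                                       # verbose progress output
  ], check=True)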

There are, however, alternatives to HTTrack, such as the Heritrix crawler, with which you can configure what you want to archive and also use formats such as ARC and WARC to store the archive. While you can configure more parameters, it is less user-friendly. It is well known because the Internet Archive uses this tool.

https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=61E63873884C9828116A6DD8A58974B8

There is also free software, WERA, for browsing WARC archives. http://archive-access.sourceforge.net/projects/wera/

GNU Wget is another open-source and free tool for making copies of websites. It is also configurable and more user-friendly than Heritrix. Unlike Heritrix, it has no graphical web interface and is used from the command line.

http://www.gnu.org/software/wget/
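A sketch of a Wget crawl that mirrors a site for local browsing and at the same time writes a WARC file; the URL and file names are examples. Note that --convert-links rewrites links in the local copy, while the WARC records keep the unmodified responses.

  import subprocess

  subprocess.run([
      "wget",
      "--mirror",               # recursive crawl with timestamping
      "--page-requisites",      # also fetch CSS, images and scripts needed to render pages
      "--adjust-extension",     # add .html where needed for local viewing
      "--convert-links",        # rewrite links so the local copy is browsable
      "--warc-file=cest",       # additionally write cest.warc.gz with the raw responses
      "https://www.projectcest.be/",
  ], check=True)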

On Mac OS X there is also a tool for easily making local copies of online websites, called SiteSucker. It has limitations as to which tags and which file formats it can crawl. A list of these limitations is available on the software's website.

http://sitesucker.us/mac/mac.html

http://warcreate.com/

This tool is a browser extension that allows you to archive one page at a time in the WARC format. Because it only archives one page at a time, it is not a practical tool for archiving an entire website.

Grab-site

Grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling.

https://github.com/ludios/grab-site

Webrecorder and webarchiveplayer (Emanuel)

Database Archiving (Emanuel)

Database to XML tools

http://deeparc.sourceforge.net/

https://github.com/soheilpro/Xinq
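As a generic illustration of the database-to-XML idea (not DeepArc or Xinq themselves), the sketch below dumps one table of an SQLite copy of a site database to a simple XML file; the database, table and column names are hypothetical.

  import sqlite3
  import xml.etree.ElementTree as ET

  conn = sqlite3.connect("website.db")                 # hypothetical SQLite dump of the CMS database
  cur = conn.execute("SELECT id, title, body FROM pages")

  root = ET.Element("pages")
  for row_id, title, body in cur:
      page = ET.SubElement(root, "page", id=str(row_id))
      ET.SubElement(page, "title").text = title
      ET.SubElement(page, "body").text = body

  ET.ElementTree(root).write("pages.xml", encoding="utf-8", xml_declaration=True)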

SIARD1 and SIARD2 standards

KEEP Solutions import and export tools for databases

Archive Facebook (Check if it still works and compare the results with webrecorder)

https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/

An extension to your browser that allows you to make a local archive of a Facebook account.

Twitter archiving (Emanuel / Does it still work?)

"twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API. Tweets are stored as line-oriented JSON. Twarc runs in three modes: search, filter stream and hydrate. When running in each mode twarc will stop and resume activity in order to work within the Twitter API's rate limits."

https://github.com/edsu/twarc
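A minimal sketch using twarc as a Python library (the version 1 API of the edsu/twarc repository linked above); the API credentials are placeholders and the query is an example.

  import json
  from twarc import Twarc

  # Placeholder credentials obtained from the Twitter developer console.
  t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

  # Search mode: harvest recent tweets matching a query and store them as line-oriented JSON.
  with open("tweets.jsonl", "w") as out:
      for tweet in t.search("#webarchiving"):
          out.write(json.dumps(tweet) + "\n")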

Memento protocol (Emanuel)

http://www.dlib.org/dlib/september12/09inbrief.html
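A sketch of Memento (RFC 7089) datetime negotiation with the requests library; the TimeGate URL below (the Internet Archive Wayback Machine) is an assumption, and any Memento-compliant TimeGate could be substituted.

  import requests

  target = "http://www.projectcest.be/"
  timegate = "http://web.archive.org/web/" + target   # assumed TimeGate endpoint

  resp = requests.get(
      timegate,
      headers={"Accept-Datetime": "Tue, 01 Jul 2014 00:00:00 GMT"},  # ask for a capture near this date
      allow_redirects=True,
  )
  print(resp.url)                              # URL of the selected memento (archived snapshot)
  print(resp.headers.get("Memento-Datetime"))  # capture datetime, if the archive provides it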

Access to archived websites (Emanuel)

http://oldweb.today/

A project by Ilya Kreymer, supported by Rhizome, that allows you to browse websites in old versions of browsers, or in browsers that no longer exist, such as Netscape.

This website offers a good list of tools: http://netpreserve.org/web-archiving/tools-and-software

Inventorying web pages

better: web publications rather than websites

Goal

Collect all web publications of a given organisation or person, place or subject within a given period.

The central question the inventory has to answer: what did someone put on the web about something at a given moment? Regardless of whether this was done via a website, blog or social network. The search result should be a series of links to web publications. If those web publications are no longer online, they have to be retrieved from a digital repository. How to preserve those publications in a repository is the subject of the other guideline.

    • develop a good search strategy > which elements should the search strategy contain
    • document the search strategy together with the results > in which format should the archive and the metadata be preserved (see the sketch below)
    • description of the web archive > standards for metadata about web resources
    • how do you inventory the dynamic content of a web publication?
    • do you make the inventory of your web publications searchable?
    • how do you inventory the development of a publication over time?
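A hedged sketch of what one inventory record could look like, documenting both the publication and the search strategy that found it; the field names are an assumption rather than a published metadata standard, and the values are examples.

  import json

  record = {
      "url": "https://example.org/blog/2014/digital-preservation-post",  # example publication
      "creator": "Example organisation or person",
      "subject": "digital preservation",
      "date_found": "2014-07-09",
      "search_strategy": {
          "engine": "Google",
          "query": "\"digitale duurzaamheid\" site:example.org",
          "scope": "web publications 2010-2014",
      },
      "archived_copy": "identifier of the WARC file or repository object",
  }

  # Append the record to a line-oriented JSON inventory so it stays searchable.
  with open("web-inventory.jsonl", "a") as out:
      out.write(json.dumps(record) + "\n")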

search strategies:

  • depth-first/breadth-first, popularity ranks, topical crawling

see LiWA, ARCOMEM, Apache Nutch, Heritrix, the UK Web Archive, the Portuguese Web Archive, PADICAT


guessing links, extracting parameters from the program code, executing JavaScript > simulating user activities

crawl strategies (a breadth-first sketch follows below):

  1. depth-first (a sequence of dives into the depth of the page hierarchy)
  2. breadth-first (level by level down the hierarchy)
  3. select pages by popularity (based on PageRank)
  4. content-based selection
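A sketch of strategy 2 (breadth-first) as a small standard-library crawler: pages are visited level by level within the same host, and failures are simply skipped (accepting gaps). The start URL and depth limit are examples.

  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen

  start = "https://www.projectcest.be/"   # example start page
  host = urlparse(start).netloc
  max_depth = 2                           # example scope limit

  class LinkParser(HTMLParser):
      """Collect the absolute URLs of all <a href> links on a page."""
      def __init__(self, base):
          super().__init__()
          self.base = base
          self.links = set()

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.add(urljoin(self.base, value))

  seen = {start}
  queue = deque([(start, 0)])
  while queue:
      url, depth = queue.popleft()   # popleft() gives breadth-first; pop() would give depth-first
      try:
          html = urlopen(url).read().decode("utf-8", errors="replace")
      except Exception:
          continue                   # accept gaps: skip pages that fail to load
      print(depth, url)
      if depth == max_depth:
          continue
      parser = LinkParser(url)
      parser.feed(html)
      for link in parser.links:
          if urlparse(link).netloc == host and link not in seen:
              seen.add(link)
              queue.append((link, depth + 1))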

topical crawling

focused on events and rarely on entities, based on the intention of the researcher; PageRank and semantics are used for prioritizing pages