Annex F: Further Details on Harvesting Process


Explanation of the Harvesting Process
8.1      The Panel’s goal has been to propose a deposit process which is cost-efficient for the legal deposit Libraries and which imposes no administrative cost burden upon publishers. Because the process potentially involves a very large number of publications, the Panel has recommended that libraries pull them directly from the Web using an automated process in which no action is required of publishers. This process will use a software tool (‘harvester’) to crawl relevant web domains.
8.2      An initial seed list of Uniform Resource Locators (URLs) will be loaded into the harvester by library staff. These will usually be URLs for the home or root pages of web domains that are within the scope as recommended by the Panel.
8.3      For each URL, the harvester will issue an electronic request to the publisher’s web hosting server for delivery of a copy of the page or file. Each request will include information which identifies:
• the Internet Protocol (IP) address of the harvester issuing the request
• the URL for the page or file requested
• a ‘user-agent string’ which identifies the library controlling the harvester and the fact that it is a harvesting request
• the URL for a web page containing details of how to contact the library, plus contextual information about legal deposit and the terms of the regulation.
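As an illustration only, the request described in 8.3 might be prepared along the following lines. This is a minimal sketch: the library name, the information-page URL and the format of the user-agent string are hypothetical, since the text does not specify them. The harvester's IP address is not set in the request itself; it is carried by the network connection when the request is sent.

```python
from urllib.request import Request

# Hypothetical values for illustration; the source does not specify them.
LIBRARY_NAME = "ExampleLegalDepositLibrary"
INFO_URL = "https://library.example/legal-deposit-harvesting"


def build_harvest_request(url: str) -> Request:
    """Prepare (but do not send) a request that identifies the harvester."""
    # The user-agent string names the library, marks the request as a
    # harvesting request, and points to the contact/information page.
    user_agent = f"{LIBRARY_NAME}-harvester/1.0 (+{INFO_URL})"
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


request = build_harvest_request("https://publisher.example/index.html")
print(request.full_url)
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```

Placing the information-page URL inside the user-agent string is one common convention among crawlers; a real harvester could equally carry it in a separate header.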
8.4      The web hosting server responds automatically, delivering a copy of the page or file to the harvester. Once the copy has been delivered to the harvester, it may then be incorporated into the library’s archive collection.
8.5      Essentially the same process, albeit with different information contained in the ‘user-agent string’, underlies all browsing activity by every web user; web publishers will not need to make any systems changes or undertake any action to facilitate this.
8.6      Website owners may choose not to log this information, but the general practice is to log the user-agent string; many use it actively to tailor content for different users, e.g. for mobile phones as opposed to computer browsers.
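To make the point in 8.6 concrete, a website might sort incoming requests by user-agent string roughly as follows. This is a simplified sketch; the keywords are illustrative assumptions, and real sites typically use more elaborate detection.

```python
def classify_user_agent(user_agent: str) -> str:
    """Roughly sort a user-agent string into crawler, mobile or desktop."""
    ua = user_agent.lower()
    # Illustrative keywords only; these are assumptions, not a standard.
    if any(word in ua for word in ("harvester", "crawler", "bot")):
        return "crawler"
    if any(word in ua for word in ("mobile", "iphone", "android")):
        return "mobile"
    return "desktop"


print(classify_user_agent("ExampleLegalDepositLibrary-harvester/1.0"))
print(classify_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 15_0) Mobile"))
```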
8.7      Libraries will set rules and parameters for the harvester to ensure that there is no harmful impact upon the performance of the web hosting server:
• Only web pages and documents that are publicly and freely available will be requested; harvesting will not go anywhere that is not public.
• Web pages and documents will only be harvested periodically; the Panel’s cost estimates were based upon an assumed average of twice a year.
• When multiple requests for different pages and files are issued to the same web hosting server, a generous interval between each request will safeguard against any risk of using up bandwidth or overloading the server.
• The harvester will not obtain any content that is protected by a firewall or by any kind of barrier such as username/password protection.
• The harvester will not request any pages or documents that do not have web links to them; therefore any pages or files which are not part of the public website cannot be requested.
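Two of the rules above, the interval between requests to the same server and the exclusion of protected content, could be expressed roughly as follows. This is a sketch under stated assumptions: the 10-second interval is a placeholder (the text says only "generous"), and the class and function names are hypothetical.

```python
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Tracks the last request per host and enforces a minimum interval."""

    def __init__(self, min_interval_seconds: float = 10.0):
        # Assumed value; the source does not state the actual interval.
        self.min_interval = min_interval_seconds
        self._last_request = {}  # hostname -> time of the last request

    def wait_needed(self, url: str, now: float) -> float:
        """Seconds the harvester should still wait before requesting url."""
        last = self._last_request.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record_request(self, url: str, now: float) -> None:
        self._last_request[urlsplit(url).hostname] = now


def should_archive(status_code: int) -> bool:
    """Keep only successful responses; 401/403 indicate protected content."""
    return 200 <= status_code < 300


policy = PolitenessPolicy(min_interval_seconds=10.0)
policy.record_request("https://publisher.example/index.html", now=100.0)
print(policy.wait_needed("https://publisher.example/about.html", now=104.0))
print(should_archive(401))
```

Passing the current time in as an argument keeps the sketch easy to test; a running harvester would supply a clock reading such as `time.monotonic()`.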
8.8      The harvester will automatically follow links from the home or root page to the next levels down within the same domain, issuing a separate request for each page or file.
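The link-following step can be sketched as below. This is an illustrative example using only the Python standard library, and it assumes that "same domain" means the same hostname as the page being read; the source does not define the term more precisely. Because new URLs are discovered only through links, a page that nothing links to is never found.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url: str, html: str) -> list:
    """Return absolute, de-duplicated links on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).hostname
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == host:
            if absolute not in found:
                found.append(absolute)
    return found


page = '<a href="/about.html">About</a> <a href="https://other.example/">Other</a>'
print(same_domain_links("https://publisher.example/index.html", page))
```

Each URL returned would then be requested separately and its own links examined in turn, subject to the rules in 8.7.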
