8.1 The Panel’s goal has been to propose a deposit process which is cost-efficient for the legal deposit Libraries and which also imposes no administrative cost burden upon publishers. Because it potentially involves such a large number of publications, the Panel has recommended that libraries pull them directly from the Web using an automated process in which no action is required of publishers. This process will use a software tool (‘harvester’) to crawl relevant web domains.
8.2 An initial seed list of Uniform Resource Locators (URLs) will be loaded into the harvester by library staff. These will usually be URLs for the home or root pages of web domains that are within the scope as recommended by the Panel.
8.3 For each URL, the harvester will issue an electronic request to the publisher’s web hosting server for delivery of a copy of the page or file. Each request will include:
• a ‘user-agent string’ which identifies the library controlling the harvester and the fact that it is a harvesting request
• the URL for a web page containing details of how to contact the library, plus contextual information about legal deposit and the terms of the regulation.
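As an illustration of the request described in 8.3, the following is a minimal sketch in Python of how a harvester might identify itself. The harvester name, the information page URL and the publisher URL are all hypothetical placeholders, and placing the contact URL inside the user-agent string is an assumption based on common crawler practice rather than anything specified by the Panel.

```python
from urllib.request import Request

# Hypothetical values for illustration only; none of these come from the Panel's text.
HARVESTER_NAME = "ExampleLibraryHarvester/1.0"
INFO_PAGE_URL = "https://library.example/legal-deposit"


def build_user_agent(name: str, info_url: str) -> str:
    """Combine the harvester's identity with the URL of the library's information page."""
    return f"{name} (legal deposit harvesting; +{info_url})"


def build_request(url: str) -> Request:
    """Prepare (but do not send) a request for one page or file."""
    user_agent = build_user_agent(HARVESTER_NAME, INFO_PAGE_URL)
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://publisher.example/")
# urllib stores header names in capitalised form, hence "User-agent" here.
print(request.get_header("User-agent"))
```

The sketch only constructs the request; sending it and storing the response would be separate steps.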
8.4 The web hosting server responds automatically, delivering a copy of the page or file to the harvester. Once the copy has been delivered to the harvester, it may then be incorporated into the library’s archive collection.
8.5 Essentially the same process, albeit with different information contained in the ‘user-agent string’, underlies all browsing activity by every web user; web publishers will not need to make any systems changes or undertake any action to facilitate this.
8.6 Website owners may choose not to log this information, but the general practice is to log the user-agent string; many actively use it to tailor content for different users, e.g. for mobile phones as opposed to computer browsers.
8.7 Libraries will set rules and parameters for the harvester to ensure that there is no harmful impact upon the performance of the web hosting server:
• Only web pages and documents that are publicly and freely available will be requested; harvesting will not go anywhere that is not public.
• Web pages and documents will only be harvested periodically; the Panel’s cost estimates were based upon an assumed average of twice a year.
• When multiple requests for different pages and files are issued to the same web hosting server, a generous interval between each request will safeguard against any risk of using up bandwidth or overloading the server.
• The harvester will not obtain any content that is protected by a firewall or by any kind of barrier such as username/password protection.
• The harvester will not request any pages or documents that do not have web links to them; therefore any pages or files which are not part of the public website cannot be requested.
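Two of the rules in 8.7 lend themselves to a short sketch: the interval between requests to the same web hosting server, and the exclusion of content behind an access barrier. The example below is illustrative only. The ten-second interval is an assumed figure (the Panel says only that the interval will be "generous"), the class and function names are hypothetical, and treating HTTP status codes 401 and 403 as the sign of a barrier is a simplifying assumption.

```python
from urllib.parse import urlsplit

# Assumed value for illustration; the Panel does not specify a figure.
MIN_INTERVAL_SECONDS = 10.0


class HostThrottle:
    """Tracks the most recent request to each host so requests can be spaced out."""

    def __init__(self, min_interval: float = MIN_INTERVAL_SECONDS):
        self.min_interval = min_interval
        self._last_request = {}  # host name -> time of the most recent request

    def seconds_to_wait(self, url: str, now: float) -> float:
        """Return how long the harvester should pause before requesting this URL."""
        last = self._last_request.get(urlsplit(url).hostname)
        if last is None:
            return 0.0  # no earlier request to this host
        return max(0.0, self.min_interval - (now - last))

    def record_request(self, url: str, now: float) -> None:
        """Note that a request to this URL's host was issued at time `now`."""
        self._last_request[urlsplit(url).hostname] = now


def is_access_restricted(status_code: int) -> bool:
    """Treat 401 (Unauthorized) and 403 (Forbidden) responses as an access barrier."""
    return status_code in (401, 403)
```

In use, the harvester would call seconds_to_wait before each request, pause for that long (for example with time.sleep), and discard any response for which is_access_restricted returns True.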
8.8 The harvester will automatically follow links from the home or root page to the next levels down within the same domain, issuing a separate request for each page or file.
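The link-following step in 8.8 can be sketched as follows. This is a simplified illustration using only the Python standard library: it collects the links on one page and keeps those that stay on the same host. Treating "the same domain" as "the same host name" is an assumption, as are the function and class names; a production harvester would apply a more careful scoping rule.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url: str, html: str) -> list:
    """Return absolute http(s) links found in `html` that share page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).hostname
    found = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == host:
            if absolute not in found:
                found.append(absolute)
    return found


sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">Elsewhere</a> '
    '<a href="reports/2010.pdf">Report</a> '
    '<a href="/about#team">Team</a>'
)
links = same_domain_links("https://publisher.example/", sample_html)
```

Each URL returned would then be requested separately and its own links examined in turn, so that the harvester works down through the levels of the site.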