Defining the UK Web: Publications in Scope
5.1 The Recommendation only relates to freely available online publications, which can be harvested or collected by Legal Deposit Libraries (LDLs)[1] without any requirement for action by publishers (a reflection that the publications are available to the public free of charge and accessible without restriction). Restrictions that would remove publications from the scope of these proposals may include identification, authentication/authorisation, registration, subscription, and Internet Protocol (IP) address range[2]. Material that requires compliance with a basic technical formality such as downloading ‘cookies’ should be permitted, provided that this does not entail any active (human) intervention by the publisher or website owner.
5.2 The online publications to which this Recommendation applies are not intended to include:
- Sites outside the UK (see Territoriality below)
- Chargeable content/commercial content
- Sites with technical barriers
- Secured transactions
- Members-only areas within public sites
- Private intranets and restricted access content
- Recorded sound and film where such works comprise the sole or main purpose of the content or where any other material is incidental (e.g. the BBC ‘Radio Player’, any equivalents of Napster, YouTube and suchlike, and sites offering ring tones or streamed films and programmes from broadcasters).
Question 6: Do you agree that this is an appropriate definition for the type of publications that should be included in scope for regulations? Explain why. Is there anything else that should be included in this definition? Is there anything that should be excluded from this definition?
Defining the UK Web: Territoriality
5.3 Harvesting, where online publications are collected using software that facilitates their collection and archiving, provides a simple approach to the deposit of such a wide range and number of publications. The first step in harvesting is defining the parameters for collection and its required links to the UK. This definition is also a requirement of Section 1 of the 2003 Act.
5.4 In fulfilment of these proposals, the territoriality criteria proposed for publications relevant to this Recommendation are that:
- Publishers should be based in the UK or have a UK address (physical or electronic);
- Publications should be lawfully published or made available by or on behalf of that publisher from a UK address; and
- Publications should be made available to the public.
The following criteria are also thought to be relevant to fixing the place of publication and, therefore, the potential relevance to any approved harvesting for the purposes of legal deposit, namely that the site from which the publication is harvested:
- has a UK domain name
- relates to UK-based individuals or organisations which use other domain names, such as .org, .com, .net etc. or alternatives; and
- can be demonstrated, if an overseas publication, to be made available by a UK-based publisher.
5.5 Exceptions to this definition include publications:
- with no connection to the UK[3] and
- substantially consisting of sound recording or films (see Act s1 (5) a, b).
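As an illustration only, the criteria in 5.4 and the exceptions in 5.5 could be expressed as a simple check applied to each candidate site before harvesting. The sketch below is one possible reading, not a statement of how the Libraries' software works: the `Candidate` fields, the `uk_connection_shown` flag, and the decision to treat the three publisher criteria as jointly required and the site-level indicators as alternatives are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical description of a site considered for harvesting."""
    domain: str
    publisher_has_uk_address: bool      # 5.4, first criterion
    published_from_uk_address: bool     # 5.4, second criterion
    available_to_public: bool           # 5.4, third criterion
    mainly_sound_or_film: bool          # 5.5 exception
    uk_connection_shown: bool = False   # assumed flag for sites on .org, .com, .net etc.


def has_uk_site_indicator(c: Candidate) -> bool:
    """Assumption: any one site-level indicator is enough."""
    return c.domain.lower().rstrip(".").endswith(".uk") or c.uk_connection_shown


def meets_territoriality(c: Candidate) -> bool:
    """Assumption: all three publisher criteria must hold, plus one site indicator."""
    if c.mainly_sound_or_film:
        return False
    core = (
        c.publisher_has_uk_address
        and c.published_from_uk_address
        and c.available_to_public
    )
    return core and has_uk_site_indicator(c)


examples = [
    Candidate("example.co.uk", True, True, True, False),
    Candidate("example.org", True, True, True, False, uk_connection_shown=True),
    Candidate("example.com", True, True, True, False),
    Candidate("radio.example.co.uk", True, True, True, True),
]
for site in examples:
    print(site.domain, meets_territoriality(site))
```

In practice, a flag such as `uk_connection_shown` would have to be populated from evidence the paper leaves open (see footnote 3), so the sketch shows only the shape of the decision, not how each input is established.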
5.6 While territoriality establishes the parameters of the domain to be harvested, analysis of the domain growth and size identifies the scale of the work and some of the key assumptions underlying the calculation of costs.
5.7 Further information on this issue is set out in Annex E.
Question 7: Do you agree with the territorial definition of the UK web? Explain why. Is there anything else that should be included in this definition? Is there anything that should be excluded from this definition?
The UK Domain[4]
5.8 The category definition and territoriality rules govern what may be collected. Within these, the model used to calculate costs for harvesting assumes that the UK web space is defined as all .UK domains registered by Nominet (6.1 million in mid-2007) plus approximately 50,000 other domains which can be readily identified as published in the UK. See the Libraries’ key cost assumptions (Annex D) for the cost model and further information on assumptions.
5.9 It is estimated that the numbers will continue to grow by 17% per annum until 2011, then by 15% until 2016. However, 35% of the domains are inactive, i.e. registered but not live, or where static content can be ‘de-duplicated’ after a first harvest. A further 25% are primarily ‘deep web’ or protected publications outside the scope of this category. Overall, the number of online publications in scope is therefore estimated at 3.9 million in 2007 rising to 14.6 million in 2016.
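The growth assumptions in 5.8 and 5.9 can be checked with a short calculation. The sketch below compounds the 2007 base forward, reading "until 2011" as four years of 17% growth (2008 to 2011) followed by five years of 15% growth (2012 to 2016). That reading, and the choice to hold the in-scope share constant at its 2007 level, are assumptions made for the example. The detailed model is in Annex D and is not reproduced here.

```python
# Figures from paragraphs 5.8 and 5.9; the way they are combined is illustrative.
BASE_YEAR = 2007
BASE_DOMAINS = 6_100_000 + 50_000   # Nominet .UK domains plus other UK-identified domains
IN_SCOPE_2007 = 3_900_000           # in-scope publications stated for 2007


def growth_rate(year: int) -> float:
    """Assumed annual growth applied when moving into `year`."""
    return 0.17 if year <= 2011 else 0.15


def project_domains(last_year: int = 2016) -> dict[int, float]:
    """Compound the base count forward one year at a time."""
    counts = {BASE_YEAR: float(BASE_DOMAINS)}
    for year in range(BASE_YEAR + 1, last_year + 1):
        counts[year] = counts[year - 1] * (1 + growth_rate(year))
    return counts


projection = project_domains()

# Assumption: the in-scope share of all domains stays at its 2007 level.
in_scope_share = IN_SCOPE_2007 / BASE_DOMAINS
in_scope_2016 = projection[2016] * in_scope_share

for year in (2007, 2011, 2016):
    print(year, f"{projection[year] / 1e6:.2f} million domains")
print(f"in scope in 2016: {in_scope_2016 / 1e6:.2f} million")
```

Under these assumptions the gross total reaches roughly 23 million domains in 2016 and the in-scope figure comes out at about 14.7 million, close to the 14.6 million quoted above. The small difference presumably reflects rounding or detail in the underlying model.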
Question 8: Do you agree with this analysis of the UK Web Domain? Explain why. What do you think the impact of your analysis would be?
5.10 The average size of websites (and therefore the number of copyright works and publications that they contain) has been growing significantly each year. However, the cost model assumes that most audiovisual content, one of the major causes of growth, is out of scope, and therefore a more modest 5% growth per annum is appropriate. The average size also varies dramatically, from circa five megabytes for 80% of sites to one gigabyte for 0.5% of sites; this model assumes a weighted average of 25 megabytes.
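The size figures in 5.10 can be examined in the same way. The sketch below treats the two quoted bands as exact (80% of sites at 5 megabytes, 0.5% at one gigabyte) and works out what average size the remaining sites would need for the weighted average to equal 25 megabytes. It then applies the 5% annual growth to that average. The use of decimal units, the 2007 base year for the 25 megabyte figure, and the treatment of "circa" values as exact are all assumptions made for the example.

```python
MB_PER_GB = 1000  # assumption: decimal units

WEIGHTED_AVERAGE_MB = 25.0                     # stated in 5.10
SMALL_SHARE, SMALL_MB = 0.80, 5.0              # "circa five megabytes for 80% of sites"
LARGE_SHARE, LARGE_MB = 0.005, 1 * MB_PER_GB   # "one gigabyte for 0.5% of sites"

# Share of sites not covered by either quoted band.
middle_share = 1.0 - SMALL_SHARE - LARGE_SHARE

# Average size the remaining sites must have for the overall mean to be 25 MB.
implied_middle_mb = (
    WEIGHTED_AVERAGE_MB - SMALL_SHARE * SMALL_MB - LARGE_SHARE * LARGE_MB
) / middle_share


def average_size_mb(year: int, base_year: int = 2007, growth: float = 0.05) -> float:
    """Average site size after compounding the assumed 5% annual growth."""
    return WEIGHTED_AVERAGE_MB * (1 + growth) ** (year - base_year)


print(f"remaining share of sites: {middle_share:.1%}")
print(f"implied average for remaining sites: {implied_middle_mb:.1f} MB")
print(f"average size in 2016: {average_size_mb(2016):.1f} MB")
```

On these assumptions the remaining 19.5% of sites would average about 82 megabytes, and the overall average would rise from 25 megabytes to a little under 39 megabytes by 2016.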
Harvesting the Web[5]
5.11 The proposed method of collecting and preserving such a large number of publications is to ‘pull’ (harvest) them from the Web. Harvesting is an automated process, where, through the use of special software, libraries can collect publications with no action required by publishers. The costs, impact evidence, and success rate for this type of harvesting are based on a pilot implemented by the UK Web Archiving Consortium (UKWAC). The pilot, which began in 2003 with a two-year term (later extended to September 2007), involved the selection of freely available online publications to be preserved and archived. The Consortium has so far archived more than 2,700 publications and over 10,000 instances (see http://www.webarchive.org.uk).
5.12 Harvesting conducted as part of a regulation does not require the individual permissions of publishers, as exemptions from such liabilities as copyright infringement and defamation are covered under the 2003 Act.
Question 9: How do you see a Deposit Library driven system of web harvesting interfacing with a publisher driven duty to deposit under the 2003 Act?
Question 10: How could Deposit Libraries most efficaciously ensure a comprehensive body of eligible content is deposited?
UK Legal Deposit Libraries Harvesting the Web
Harvesting Costs
5.13 This proposal involves harvesting by Libraries; the costs therefore largely accrue to them. However, this does not impose a specific duty upon libraries to collect a pre-determined number or proportion of UK publications. Their duty is to collect in accordance with their overriding legal deposit obligations, to archive as much of the national cultural record as possible and make it available for research within the limitations of their resources and budgets. Therefore, these costs are not direct, bottom-line (cause and effect) consequences of each option. They are illustrations of what the libraries believe might realistically be achieved within their budget and resource constraints and after prioritising this activity and category of publications against other collection goals.
5.14 The cost of storage includes built-in redundancy to ensure safe preservation of the archive. However, the real cost of storage per terabyte has fallen by more than 30% per annum over the last 20 years and is expected to continue falling by 25% per annum until 2016.
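The effect of the projected 25% annual fall in storage cost can be shown as an index. The paper gives no base price per terabyte here, so the sketch below sets the cost to 1.0 in a base year and compounds the decline from there. The choice of 2007 as the base year is an assumption made for the example.

```python
def storage_cost_index(year: int, base_year: int = 2007, annual_decline: float = 0.25) -> float:
    """Real storage cost per terabyte relative to the base year (base year = 1.0)."""
    if year < base_year:
        raise ValueError("year must not be earlier than the base year")
    return (1 - annual_decline) ** (year - base_year)


for year in range(2007, 2017):
    print(year, f"{storage_cost_index(year):.3f}")
```

On this basis a terabyte stored in 2016 would cost about 7.5% of its 2007 price in real terms, which is why growth in the volume archived does not translate directly into growth in storage cost.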
5.15 Two infrastructures have already been designed and built (apart from certain elements) by the Libraries and will be used to store other digital or digitised collections as well as legal deposit material. Therefore, this cost model focuses only on the incremental systems costs (including renewing equipment every three years) plus staffing costs required to collect and preserve this category of publication.
5.16 Costs have been analysed under the headings of selection, obtaining copyright permission, harvesting, QA, storage & preservation, resource discovery, digital rights management (DRM) & access, and other costs. They include salaries, pensions, NI and other staff-related costs, allocations for wider costs such as IT support and expenses, plus allocations for general overheads (See Annex D for more detail on costs and assumptions).
5.17 The near elimination of selection and IPR permissions activities makes harvesting a much more efficient process than requiring every publisher to deposit their own material. Total costs are estimated at £215 per annum for every terabyte archived over a 10-year period, although higher overall costs estimated at £1,132,000 per annum would be necessary for the infrastructure, harvesting, and storage, because of the greater volume collected[6]. See the Libraries’ key cost assumptions (Annex D).
5.18 Further information on the practical arrangements is set out in Annex F.
Question 11: Do you agree with this costing model? Explain why. Are there costs that need to be factored in or excluded?
Publishers
5.19 Ascertaining publisher costs presents a difficulty that can be ascribed primarily to the broad definition of ‘publisher’ for this category of publication, a definition that is quite distinct from that of other categories for deposit. Traditionally, publishers are a group well defined and contained by type and content of publication, as well as by business model. The online publisher of freely available publications, however, runs the gamut from the individual blogger with no revenue stream to a multinational corporation. This sheer number of publications and range of publisher types impose a considerable challenge for determining costs and benefits that suit any group of publishers, let alone cover the whole spectrum.
5.20 At the beginning of 2008, the Legal Deposit Advisory Panel undertook a survey of Trade Association publisher members, as well as non-commercial publishers that participated in the UKWAC pilot. This survey provided publishers with information about deposit and asked them for feedback on costs and other impacts of harvesting and archiving. The findings from the survey were as follows:
- A majority of those commercial and non-commercial publishers surveyed supported regulation-based harvesting;
- Not only did they think this kind of harvesting the most efficient and least invasive to their business process, but they also observed that there would be relatively little cost to them;
- However, publishers were not able to assess the level of cost to them associated with permissions-based harvesting.
5.21 Generally, publishers’ cost concerns were primarily in the area of revenue and the possible impact from harvesting, and to what extent these concerns could be addressed in a rapidly changing commercial and technological environment.
5.22 As publishers do not push (deposit) publications to libraries in the traditional sense, there appears to be no specific activity from which costs can be calculated. However, there are potential risks that may have significant impacts, if not eventual costs. These include copyright protection of freely available online publications. We are awaiting the outcomes from the UK Intellectual Property Office’s Copyright Exceptions Consultation, so that this concern can be addressed in future detailed policy proposals.
Moreover, the deposit process adds a level of complexity for publishers in their agreements with third parties, either providing content or software. Indeed, there are concerns, as expressed in the Commercial Publishers Survey, over securing ongoing rights for data or images that were made available free of charge but on a time limited basis. For example, some promotional sites provide high value business information on a time-limited basis as sample data to encourage site traffic or subscription sales. Accordingly, publishers may be exposed to such liabilities as third party IPR and licensing infringement, as well as defamation, contempt of court, and libel.
Question 12: Do these assumptions adequately reflect the financial burden of publishers? Is there anything that needs to be included or excluded?
—————————–
[1] For the purposes of this paper, ‘LDLs’ or ‘Libraries’ applies to all six Legal Deposit Libraries named in this Recommendation paper: the British Library, the National Library of Scotland, the National Library of Wales, and the University Libraries of Cambridge, Oxford, and Trinity College, Dublin.
[2] Where access is only enabled for users within a specified IP address range.
[3] LDAP is reviewing the use of this phrase in connection with online publications, as its inclusion here would imply that publications on non-UK related subjects, but by British authors, would be excluded from the archive. A number of agencies with helpful practices might also aid the LDLs in identifying publications ‘connected to the UK’, such as Internet Watch and Nominet.
[4] See the notes under the Libraries’ key cost assumptions (Annex D) for sources used to support the assumptions for the growth and size of the domain.
[5] Some stakeholders have reservations about the extent of harvesting and access to the harvested content and we will look at ways to overcome these concerns in our detailed policy proposals.
[6] This figure represents the total cost across Legal Deposit Libraries. It assumes that readers in any of the six legal deposit libraries’ premises would be able to access all materials and electronic publications that are harvested and archived by the BL/NLW/NLS infrastructures. The University Libraries of Oxford, Cambridge and Trinity College do not currently plan to harvest themselves to the same extent, but would retain the entitlement to do so.