LINKROT

Link rot (also called link death, link breaking, or reference rot) is the tendency of hyperlinks to cease, over time, to point to their originally targeted file, web page, or server because that resource has been relocated to a new address or has become permanently unavailable. A link that no longer points to its target, often called a "broken" or "dead" link (or sometimes an "orphan" link), is a specific form of dangling pointer. The rate of link rot is a subject of study and research because of its significance to the internet's ability to preserve information. Estimates of that rate vary dramatically between studies.


Prevalence

A number of studies have examined the prevalence of link rot within the World Wide Web, in academic literature that uses URLs to cite web content, and within digital libraries.

A 2003 study found that on the Web, about one link out of every 200 broke each week, suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years. A 2004 study showed that subsets of Web links (such as those targeting specific file types or those hosted by academic institutions) could have dramatically different half-lives.

The URLs selected for publication appear to have greater longevity than the average URL. A 2015 study by Weblock analyzed more than 180,000 links from references in the full-text corpora of three major open access publishers and found a half-life of about 14 years, generally confirming a 2005 study that found that half of the URLs cited in D-Lib Magazine articles were active 10 years after publication. Other studies have found higher rates of link rot in academic literature but typically suggest a half-life of four years or greater. A 2013 study in BMC Bioinformatics analyzed nearly 15,000 links in abstracts from Thomson Reuters's Web of Science citation index and found that the median lifespan of web pages was 9.3 years, and just 62% were archived. A 2021 study of external links in New York Times articles published from 1996 to 2019 found that 25% of links were inaccessible. In addition, from a sample of 4,500 links still accessible, 13% did not lead to the original content, a phenomenon called "content drift".

A 2002 study suggested that link rot within digital libraries is considerably slower than on the web, finding that about 3% of the objects were no longer accessible after one year (equating to a half-life of nearly 23 years).
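
The half-life figures quoted above follow from the per-period loss rate if one assumes that links break at a constant rate (simple exponential decay). That constant-rate model is an assumption of this sketch, not a finding of the studies, and the function names are illustrative only.

```python
import math


def half_life(fraction_lost_per_period: float) -> float:
    """Periods until half of the links remain, assuming a constant loss rate."""
    if not 0.0 < fraction_lost_per_period < 1.0:
        raise ValueError("fraction_lost_per_period must be between 0 and 1")
    # Surviving share after t periods is (1 - p) ** t; solve (1 - p) ** t = 0.5.
    return math.log(2) / -math.log(1.0 - fraction_lost_per_period)


def surviving_fraction(half_life_periods: float, elapsed: float) -> float:
    """Share of links still working after `elapsed` periods."""
    return 0.5 ** (elapsed / half_life_periods)


# One link in 200 breaking per week (the 2003 Web figure), in weeks.
web_weeks = half_life(1 / 200)

# About 3% of objects lost per year (the 2002 digital library figure), in years.
library_years = half_life(0.03)

print(f"Web half-life: {web_weeks:.1f} weeks")
print(f"Digital library half-life: {library_years:.1f} years")
```

Under this model the two inputs give roughly 138 weeks and just under 23 years, which matches the figures reported for those studies.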


Causes

Link rot can result from several occurrences. A target web page may be removed. The server that hosts the target page could fail, be removed from service, or relocate to a new domain name. A domain name's registration may lapse or be transferred to another party. Some causes will result in the link failing to find any target and returning an error such as HTTP 404. Other causes will lead a link to target content other than what was intended by the link's author.

Other reasons for broken links include:
* the restructuring of websites that causes changes in URLs
* relocation of formerly free content to behind a paywall
* a change in server architecture that results in code such as PHP functioning differently
* dynamic page content such as search results that changes by design
* the presence of user-specific information (such as a login name) within the link
* deliberate blocking by content filters or firewalls
* the expiration of a domain name registration
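
As a rough illustration of how these causes surface to a client, the sketch below maps the outcome of a single request to a coarse category. The category names and the mapping are assumptions made for this example, not a standard taxonomy; in particular, a 401 or 403 response does not by itself prove a paywall or a filter.

```python
from typing import Optional


def classify_link(status: Optional[int],
                  requested_url: str,
                  final_url: Optional[str] = None) -> str:
    """Map one request outcome to a coarse, hypothetical category.

    `status` is the final HTTP status code, or None when no response
    arrived at all (DNS failure, lapsed domain, server out of service).
    `final_url` is the URL reached after following any redirects.
    """
    if status is None:
        return "unreachable"
    if status in (404, 410):
        return "missing"        # page removed or never existed
    if status in (401, 403):
        return "restricted"     # login, paywall, or filter may be involved
    if 500 <= status < 600:
        return "server_error"
    if 200 <= status < 300:
        if final_url is not None and final_url != requested_url:
            return "redirected"  # target relocated; content may differ
        return "ok"
    return "other"


print(classify_link(404, "http://example.com/old"))
print(classify_link(200, "http://example.com/a", "http://example.com/b"))
print(classify_link(None, "http://example.com/gone"))
```

A result of "ok" only means the server answered successfully; it says nothing about whether the content is still what the link's author intended.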


Prevention and detection

Strategies for preventing link rot can focus on placing content where its likelihood of persisting is higher, authoring links that are less likely to be broken, taking steps to preserve existing links, or repairing links whose targets have been relocated or removed. The creation of URLs that will not change with time is the fundamental method of preventing link rot. Preventive planning has been championed by Tim Berners-Lee and other web pioneers.

Strategies pertaining to the authorship of links include:
* linking to primary rather than secondary sources and prioritizing stable sites
* avoiding links that point to resources on researchers' personal pages
* using clean URLs or otherwise employing URL normalization or URL canonicalization
* using permalinks and persistent identifiers such as ARKs, DOIs, Handle System references, and PURLs
* avoiding linking to documents other than web pages
* avoiding deep linking
* linking to web archives such as the Internet Archive, WebCite, archive.today, Perma.cc, or Amber

Strategies pertaining to the protection of existing links include:
* using redirection mechanisms such as HTTP 301 to automatically refer browsers and crawlers to relocated content
* using content management systems which can automatically update links when content within the same site is relocated or automatically replace links with canonical URLs
* integrating search resources into HTTP 404 pages

The detection of broken links may be done manually or automatically. Automated methods include plug-ins for content management systems as well as standalone broken-link checkers such as Xenu's Link Sleuth. Automatic checking may not detect links that return a soft 404 or links that return a 200 OK response but point to content that has changed.
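
The two blind spots mentioned above can be partly addressed with heuristics. The sketch below is one possible approach and not how any particular link checker works: the phrase list, the word-level comparison, and the 0.5 threshold are all arbitrary assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical phrases that often appear on error pages served with 200 OK.
SOFT_404_PHRASES = ("page not found", "no longer available", "does not exist")


def looks_like_soft_404(status: int, body: str) -> bool:
    """Guess whether a successful response is really an error page."""
    text = body.lower()
    return status == 200 and any(phrase in text for phrase in SOFT_404_PHRASES)


def has_drifted(archived_text: str, current_text: str,
                threshold: float = 0.5) -> bool:
    """Flag content drift when two page texts share too little wording.

    Compares the texts word by word; a similarity ratio below `threshold`
    is treated as drift. The threshold is an assumption and would need
    tuning against real pages.
    """
    ratio = SequenceMatcher(None, archived_text.split(),
                            current_text.split()).ratio()
    return ratio < threshold


print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))
print(has_drifted("alpha beta gamma delta", "one two three four"))
```

Comparing against an archived copy presupposes that one was saved when the link was created, which is one reason the archiving strategies listed earlier also help with detection.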


See also

* Software rot
* Digital preservation
* Deletionism and inclusionism in Wikipedia
* Archive Team, web archiving team


External links


* Future-Proofing Your URIs
* Nielsen, Jakob (14 June 1998). "Fighting Linkrot". http://www.useit.com/alertbox/980614.html (archived 23 December 2012 at https://web.archive.org/web/20121223011620/http://www.useit.com/alertbox/980614.html)