What is the
Internet Archive Wayback Machine?
The Internet Archive Wayback Machine is a service that allows people to visit
archived versions of web sites. Visitors to the
Internet Archive Wayback Machine can type in a URL, select a date, and then
begin surfing on an archived version of the web. Imagine
surfing circa 1999 and looking at all the Y2K hype, or revisiting
an older copy of your favorite website. The Internet Archive Wayback Machine
can make all of this possible. See the Press
Release.
Can I link to old
pages on the Internet Archive Wayback Machine?
Yes! Alexa Internet has built the Internet Archive Wayback Machine so that it
can be used and referenced by anybody and everybody. If you
find an archived page that you would like to reference on your
web page or in an article, you can copy the URL and share it
with others. You can even use fuzzy URL matching and date specifications...
but that's a bit more
advanced.
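For example, archived pages are addressed by URLs that combine a
timestamp with the original address. The URL below is illustrative
(the date, time, and site are made up), but it shows the general
pattern:

    http://web.archive.org/web/20010401123456/http://www.example.com/

The fourteen digits are a year-month-day-hour-minute-second
timestamp. Supplying only part of it (for instance, just a year) is
the "fuzzy" date specification mentioned above: the Wayback Machine
picks an archived copy close to the date you asked for.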
I don't want my site's pages in the archive. How do I remove them?
By installing a robots.txt file on your web server, you can prevent
your site's pages from being archived in the future and block access
to the copies already in the archive. For more information, see our
FAQ about removing documents.
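As a minimal sketch, a robots.txt file that excludes Alexa's crawler
(which has traditionally announced itself under the user-agent name
"ia_archiver") from an entire site would look something like this:

    # Tell the archive's crawler to stay away from everything
    User-agent: ia_archiver
    Disallow: /

The removal FAQ has the exact instructions; this sketch assumes the
crawler's user-agent name has not changed.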
Are other sites
available in the Internet Archive Wayback Machine?
The Internet Archive is attempting to archive the entire
publicly available web. Some sites may not be included
because the automated crawlers were unaware of their existence
at the time of the crawl. It's also possible that some
sites were not archived because they were password protected or
otherwise inaccessible to our automated systems.
What does it mean
when a site's archive date has been "updated"?
When our automated systems crawl the web every few months or
so, we find that only about 50% of all pages on the web have
changed since our previous visit. This means that much
of the content in our archive is duplicate material. If
you don't see a "*" next to an archived document, the content
on that archived page is identical to the previously
archived copy.
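One common way to detect such duplicates, shown here as a sketch and
not necessarily the method the Archive actually uses, is to compare
a checksum of each freshly crawled page against the checksum of the
stored copy:

    import hashlib

    def is_unchanged(new_content: bytes, stored_digest: str) -> bool:
        # Identical content yields an identical digest, so the new
        # copy can be recorded as a duplicate instead of stored again.
        return hashlib.md5(new_content).hexdigest() == stored_digest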
Who was involved
in creating the Internet Archive Wayback Machine?
The idea for the Internet Archive Wayback Machine dates to 1996, when
the Internet Archive first began archiving the web. Now,
five years later, with over 100 terabytes and a dozen web crawls
completed, the Internet Archive has made the Internet Archive Wayback Machine
available to the public. The Internet Archive has relied
on donations of web crawls, technology and expertise from Alexa
Internet and others. The Internet Archive Wayback Machine is
owned and operated by the Internet Archive.
How was the Wayback
Machine made?
Over 100 terabytes of data are stored on several dozen modified
servers. Alexa Internet, in cooperation with the Internet Archive,
has designed a three-dimensional index that allows browsing of web
documents over multiple time periods, and turned this unique feature
into the Wayback Machine.
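Conceptually, and purely as a toy illustration rather than the
Archive's actual data structure, such an index maps each URL to the
list of times it was captured, so a lookup takes both an address and
a date:

    # Toy time-aware index: URL -> list of (timestamp, storage location)
    index = {
        "http://www.example.com/": [
            ("19990125084553", "crawl-042/file-17"),
            ("20000310120000", "crawl-051/file-03"),
        ],
    }

    def lookup(url, date):
        """Return the stored copy captured closest to the given date."""
        captures = index.get(url, [])
        if not captures:
            return None
        # Timestamps are YYYYMMDDHHMMSS strings; comparing them as
        # integers is a rough but serviceable measure of closeness.
        return min(captures, key=lambda c: abs(int(c[0]) - int(date)))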
How large is the
Archive?
The Internet Archive Wayback Machine contains over 100 terabytes of data and is
currently growing at a rate of 12 terabytes per month. The
archive contains multiple copies of the entire publicly
available web. This eclipses the amount of data contained
in the world's largest libraries, including the Library of
Congress. If you tried to place the
entire contents of the archive onto floppy disks (I don't
recommend this!) and laid them end to end, it would stretch from
New York, past Los Angeles, and halfway to Hawaii.
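The arithmetic behind that picture, assuming a 1.44-megabyte floppy
disk roughly 9 centimeters across:

    100 terabytes / 1.44 megabytes per disk  =  about 69 million disks
    69 million disks x 9 cm per disk         =  about 6,250 km

New York to Los Angeles is roughly 4,000 km, and Los Angeles to
Hawaii roughly another 4,000 km, so 6,250 km of floppy disks does
indeed reach past Los Angeles and about halfway to Hawaii.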
Can I search the
Archive?
Using the Internet Archive Wayback Machine, it is possible to search for the
names of sites contained in the Archive and to specify date
ranges for your search. However, we do not yet have an indexed
text search of the documents in the collection. The collection is a bit too
large and complicated for that. We continue to work on it and
should have a full-text search soon.
What type of
machinery is used in the Internet Archive?
The Internet Archive is stored on dozens of slightly modified
Hewlett Packard servers.
The computers run on the FreeBSD operating system. Each
computer has 512 MB of memory and can hold just over 300
gigabytes of data on IDE disks.
How do you
archive dynamic pages?
There are many different kinds of dynamic pages, some of which
are easily stored in an archive and some of which fall apart
completely. When a dynamic page renders standard HTML, the archive
works beautifully. When a dynamic page contains forms, JavaScript,
or other elements that require interaction with the originating
host, the archive will not accurately reflect the original site's
functionality.
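A contrived sketch of the difference (the page and addresses are
hypothetical): the first link below is plain HTML and archives
perfectly, while the second is assembled by JavaScript when the page
is viewed, so the full URL never appears in the page for a crawler
to follow:

    <!-- Static link: the URL is visible in the HTML itself. -->
    <a href="http://www.example.com/page2.html">Next page</a>

    <!-- Dynamic link: pieced together by script at view time. -->
    <script language="JavaScript">
      var base = "http://www.example.com/";
      document.write('<a href="' + base + 'page2.html">Next page</a>');
    </script>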
Why are
some sites harder to archive than others?
If you look at our collection of archived sites, you will find
some broken pages, missing graphics, and some sites that aren't
archived at all. We have tried to create a complete archive,
but have had difficulties with some sites. Here are some things
that make it difficult to archive a web site:
- Robots.txt -- If our robot crawler is forbidden
from visiting a site, we can't archive it.
- Javascript -- JavaScript elements are often
hard for us to archive, especially when a script generates links
without the full URL appearing in the page. Also, if JavaScript
needs to contact the originating server in order to work, it will
fail when archived.
- Server side image maps -- Like any functionality
on the web, if it needs to contact the originating server
in order to work, it will fail when archived.
- Unknown sites -- If Alexa doesn't know about
your site, it won't be archived. Use the Alexa service and we will
learn about your pages, or visit our Archive Your Site page.
- Orphan pages -- If there are no links
to your pages, our robot won't find them (our robots don't enter
queries into search boxes).
As a general rule of thumb, simple HTML is the easiest to archive.
Some sites are
not available because of Robots.txt or other exclusions.
What does that mean?
The Standard for Robot Exclusion (SRE) is a means by which web
site owners can instruct automated systems not to crawl their
sites. Web site owners can specify files or directories that
are allowed or disallowed from a crawl, and they can even create
specific rules for different automated crawlers. All of this
information is contained in a file called robots.txt. While
robots.txt has been adopted as the universal standard for robot
exclusion, compliance with robots.txt is strictly voluntary.
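For example, a robots.txt file that shuts one particular crawler out
entirely while keeping all others away from a single directory might
look like this (the crawler name and directory are made up for
illustration):

    # One named crawler is excluded from the whole site
    User-agent: somebot
    Disallow: /

    # Every other crawler must skip the /private/ directory
    User-agent: *
    Disallow: /private/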
In fact, most web sites do not have a robots.txt file, and many web
crawlers are not programmed to obey its instructions anyway.
However, Alexa Internet, the company that crawls the web for the
Internet Archive, does respect robots.txt instructions, and even
does so retroactively: if a web site owner decides they would rather
not have a web crawler visiting their files and sets up robots.txt
on the site, the Alexa crawlers will stop visiting those files and
will mark all files previously gathered as unavailable. This means
that, while using the Internet Archive Wayback Machine, you may
sometimes find a site that is unavailable due to robots.txt or other
exclusions. Other exclusions? Yes: sometimes a web site owner will
contact us directly and ask us to stop crawling or archiving a site,
and we comply with these requests.
How can I
get my site included in the Archive?
Alexa Internet has been crawling the web since 1996, which has
resulted in a massive archive. If you have a web site, and you
would like to ensure that it is saved for posterity in the
Archive, chances are that it's already there. We make every
effort to crawl the entire publicly available web. However,
if you wish to take extra measures to ensure that we archive
your site, you can visit the "Archive
Your Site" page.
How can I help?
The Internet Archive actively seeks donations of digital materials
for preservation. Alexa Internet provides access to a web-wide
crawl that contains copies of the publicly accessible web. If
you have digital materials that may be of interest to future
generations, let
us know. The Internet Archive is also seeking additional
funding to continue this important mission. Please contact
us if you wish to make a contribution.