
The Invisible Web: Finding Resources that Search Engines Miss

A RECAP Workshop by Paula Edmiston
West Chester University, Pennsylvania, May 14, 2004.

Still round the corner there may wait,
A new road or a secret gate.
-- J. R. R. Tolkien

What is the Invisible Web (sometimes referred to as the Deep Web)? It is composed of those resources not accessible through the search engines. Revealing the Invisible Web requires taking a slightly different road, opening new gates. The resources are there but traditional paths, via search engines, don't lead to them.

A study by faculty and students at the School of Information Management and Systems at the University of California at Berkeley (Lyman) estimates the size of the Internet in 2002 at 532,897 terabytes. The surface web, accessible via search engines, was measured at 167 terabytes, and the Deep Web (Invisible Web) at 91,850 terabytes. N.B. If digitized with full formatting, the seventeen million books in the Library of Congress would contain about 136 terabytes of information (Lyman).
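The figures quoted above invite a quick comparison. The following is only a small arithmetic sketch using those numbers; the variable names are made up for illustration.

```python
# Sizes in terabytes, as quoted above from the Lyman study (2002).
SURFACE_WEB_TB = 167
DEEP_WEB_TB = 91_850
LIBRARY_OF_CONGRESS_TB = 136

# How many times larger is the Deep Web than the surface web?
deep_to_surface = DEEP_WEB_TB / SURFACE_WEB_TB

# How does the surface web compare with the Library of Congress estimate?
surface_to_loc = SURFACE_WEB_TB / LIBRARY_OF_CONGRESS_TB

print(f"Deep Web / surface web: about {deep_to_surface:.0f} to 1")
print(f"Surface web / Library of Congress: about {surface_to_loc:.2f} to 1")
```

By these estimates the Deep Web is roughly 550 times the size of the surface web, while the surface web alone is only a little larger than the digitized Library of Congress.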

Search engines were originally designed to index web pages created using HTML, a markup language used to organize and structure text. Search engines are struggling to keep pace with the evolution of databases, multimedia and other formats used to present information on the web.

What makes Resources Invisible?

The resources are not really invisible, of course. But they are not included in the indexing of the web conducted by search engines, so they are, in a sense, invisible to the search engines. Sometimes the phrase "Deep Web" is used to describe these resources because you have to dig deep to find them. Easy-to-find web pages can be considered "surface" resources.
Databases
In addition to the proprietary databases such as Psych Abstracts, the MLA Bibliography, Chemical Abstracts and the citation indexes, there are many databases available for free via the web. Academic organizations, professional associations and non-profits usually sponsor these databases. An example of this is the Group Psychotherapy, Psychodrama & Sociometry Bibliography Database, a collection of over 4,000 citations in several languages. This is a tremendous resource but the citations are in a database and not accessible from a search engine.
Format
Formats that are difficult to index include word processing and spreadsheet files, PDF, PostScript, Flash, Shockwave, executables, streaming audio and video, and images. More search engines are indexing a greater variety of formats all the time, but there are still limitations. Although some search engines include the PDF format, they index only a portion of each document. Search engines like AltaVista and Google do offer image searches, but these search file names only.
Pages that are Not linked
Search engines discover new web pages only by following links on known pages (they also learn when webmasters register their addresses directly). If a web page has not been linked to a page already being indexed (or registered by hand), the likelihood of it being indexed is small.
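Link-following can be sketched as a walk over a graph of pages. The example below is a minimal illustration, not how any particular search engine works; the page names and the link graph are invented.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "home.html": ["about.html", "catalog.html"],
    "about.html": ["home.html"],
    "catalog.html": ["item1.html"],
    "item1.html": [],
    "orphan.html": ["home.html"],  # links out, but no page links to it
}

def crawl(seeds, links):
    """Return every page reachable by following links from the seed pages."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered

found = crawl(["home.html"], LINKS)
invisible = sorted(set(LINKS) - found)
print(invisible)  # pages the crawl never reached
```

Starting from home.html, the crawl never reaches orphan.html because nothing links to it. Adding orphan.html to the seed list, the equivalent of registering it by hand, makes it discoverable.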
Robots and Spiders
Site owners can instruct search engine spiders (also called robots) to stay away, and these instructions can be applied sitewide or page by page. Authors of individual pages can include a meta tag to turn away search engine spiders. Web hosts can place special commands on their sites that will turn spiders away from entire sites.
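The sitewide commands usually live in a file named robots.txt at the top of the site. As a rough sketch, Python's standard urllib.robotparser module can show how a well-behaved spider would read such a file; the rules, the robot name "ExampleBot" and the example.org addresses are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider checks each address before fetching it.
blocked = parser.can_fetch("ExampleBot", "http://www.example.org/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.org/index.html")
print(blocked, allowed)
```

The page-by-page equivalent is a tag such as `<meta name="robots" content="noindex, nofollow">` placed in the head of an individual page. Both methods are requests rather than locks: they work only because spiders choose to honor them.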
Password Protection
Some web pages are password protected. Even demonstration sites containing valuable information can be rendered invisible because the spiders cannot read and use the login information that might be displayed on the page.
Search Engine limitations
Some search engines place restrictions on indexing, such as choosing to index only a portion of a page.
Ephemeral Pages
Search engines avoid pages that are generated "on the fly" or that carry information useful only for a short period of time: weather conditions, flight arrivals, etc.
The web is maturing and search engine spiders are improving. Slowly more types of resources are being indexed (such as images and PDF documents). In the meantime there are a number of ways you can identify relevant resources and manage your link collections.

Revealing the Invisible Web

Searching for information is a tricky process requiring the searcher to be precise and exact in phrasing. The way a search is conducted can affect the success of finding appropriate resources. Consider the needle and the haystack from Koll:

A known needle in a known haystack
A known needle in an unknown haystack
An unknown needle in an unknown haystack
Any needle in a haystack
The sharpest needle in a haystack
Most of the sharpest needles in a haystack
All the needles in a haystack
Affirmation of no needles in a haystack
Things like needles in any haystack
Let me know whenever a new needle shows up
Where are the haystacks?
Needles, haystacks -- whatever

[from Matthew Koll, "Major Trends and Issues in the Information Industry," http://www.asis.org/Bulletin/Jan-00/track_3.html]

Search engines
Although search engines can't index the contents of databases they can sometimes be used to identify potentially useful databases. Try a search in Teoma:

database "art history"
then:
"database of art history"

Note the use of quotation marks. This strategy can be used in most search engines. Quoting text marks it as a "phrase": the words must be next to each other and in the order given.
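The difference between the two searches can be sketched in a few lines. This is a simplified illustration of phrase matching, not the algorithm of Teoma or any other search engine; the function names and the two sample titles are invented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_words(query, text):
    """Unquoted search: every word must appear somewhere, in any order."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

def matches_phrase(phrase, text):
    """Quoted search: the words must be adjacent and in the order given."""
    terms = tokenize(phrase)
    words = tokenize(text)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))

title_a = "A Database of Art History Images"
title_b = "History of the Art Database Project"

print(matches_all_words("database art history", title_a))  # True
print(matches_all_words("database art history", title_b))  # True
print(matches_phrase("art history", title_a))               # True
print(matches_phrase("art history", title_b))               # False
```

Both titles contain all three words, but only the first contains "art history" as a phrase, which is why quoting narrows the results.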
Portals and Specialized Search Engines
Portals are software programs that organize resource titles into lists of links by subject and usually offer some method of searching the collection.
Some universities and many libraries are maintaining portals of databases and subject-specific resources.
  • The Librarian's Index to the Internet is a well-organized point of access for reliable, trustworthy, librarian-selected Internet resources, serving California, the nation, and the world.
  • Digital Librarian is a librarian-maintained portal to myriad resources.
  • Infomine is a collection of scholarly resources - some chosen by librarians, some added automatically by the software. Infomine is a joint project of several libraries.
  • Scirus - for scientific information is a search engine covering scientific, scholarly, technical and medical data on the Web. It includes peer-reviewed articles and journals that other search engines miss.
  • The Public Library of Science (PLoS) is a non-profit organization of scientists and physicians committed to making the world's scientific and medical literature a freely available public resource.
  • OAIster is a project of the University of Michigan Digital Library Production Service. It includes over 3 million resources from 277 institutions.
Word of mouth
Make note of resources mentioned by colleagues; add these resources to the tools you use to keep track of specialized sites (see the section on tools, below).

Tools for Keeping the Resources Visible

Bookmarks
You can use your browser bookmarks to save and organize links to tools and resources. Browser bookmarks can be very convenient because you can quickly bookmark a site and file it into a specific subject folder. But if you work at more than one computer you'll run into a problem, because the bookmark file remains on the computer on which it was saved. Once you walk away from that computer you lose access to the bookmark file.

Netscape saves its bookmarks in a single file. If you have a web account you can copy your bookmark file to your web site and access it from anywhere in the world.

MS Internet Explorer saves each "favorite" as an individual file and you cannot copy all those files to your web site. It is possible, however, to export the favorites (that is, save them in a different format) to a file compatible with Netscape bookmarks. In MSIE version 5 the export command is on the File menu. Once you've exported the favorites, the file can be placed in your web site.
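A Netscape-style bookmark file is an ordinary HTML file, so a short script can list what it contains. The sketch below uses Python's standard html.parser module; the sample bookmarks and the BookmarkLister class are invented for illustration, and a real exported file will be longer and may carry extra attributes.

```python
from html.parser import HTMLParser

# Hypothetical excerpt of a Netscape-format bookmark file.
SAMPLE_BOOKMARKS = """\
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>Invisible Web</H3>
    <DL><p>
        <DT><A HREF="http://www.example.org/databases">Example Database Portal</A>
        <DT><A HREF="http://www.example.org/images">Example Image Archive</A>
    </DL><p>
</DL><p>
"""

class BookmarkLister(HTMLParser):
    """Collect (title, address) pairs from the links in a bookmark file."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # address of the link currently being read
        self._text = []     # pieces of that link's title

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

lister = BookmarkLister()
lister.feed(SAMPLE_BOOKMARKS)
for title, address in lister.links:
    print(f"{title}: {address}")
```

The same list could be written back out as a simple web page of links and copied to your web site.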
Web Pages
If you have your own web site you can create pages to keep track of useful sites, including notes about search strategies.
Personal Portals
You can acquire ready-made programs to install in your web site to act as a portal, similar to the subject directory portals offered by Yahoo and the Open Directory Project.

One example is the Linker script, a Perl CGI script for managing lists of links in subject categories.

The iVia software behind Infomine (see above) is available as open source software.

CWIS (pronounced see-wis) from the Internet Scout Project is software to assemble, organize, and share collections of data about resources, like Yahoo! or Google Directory but conforming to international and academic standards for metadata. CWIS was specifically created to help build collections of Science, Technology, Engineering, and Math (STEM) resources and connect them into NSF's National Science Digital Library, but can be (and is being) used for a wide variety of other purposes. CWIS is open source.

The Future of Web-based Resources

Scholarly and institutional organizations are beginning to work toward making previously inaccessible resources available. Here are a few examples.
The Scout Portal Toolkit
The Scout Portal Toolkit, also from the Internet Scout Project, allows groups or organizations that have a collection of knowledge or resources they want to share via the World Wide Web to put that collection online without making a big investment in technical resources or expertise.
Open Archives
The Open Archives Initiative is a project to make web-based scholarly resources accessible through the use of a special language tool called "metadata". Metadata offers a way to standardize how documents are harvested from institutions. The OAI is developing tools that can be used by institutions to make their resources more available.
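In practice, harvesting is done with the OAI Protocol for Metadata Harvesting (OAI-PMH), in which a harvester sends a simple web request and receives records described in Dublin Core metadata. The following is a minimal sketch under stated assumptions: the repository address is hypothetical, the sample record is invented and much simpler than a real response, and no request is actually sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository address; a real one is published by each institution.
BASE_URL = "http://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH request asking for records in Dublin Core format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

# Dublin Core element namespace, in the form ElementTree expects.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented, simplified record standing in for one item of a harvest.
SAMPLE_RECORD = """\
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample Working Paper</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
</record>
"""

def describe(record_xml):
    """Pull the title and creator out of a simplified Dublin Core record."""
    root = ET.fromstring(record_xml)
    return {
        "title": root.findtext(f"{DC}title"),
        "creator": root.findtext(f"{DC}creator"),
    }

print(list_records_url(BASE_URL))
print(describe(SAMPLE_RECORD))
```

Because every participating institution answers the same kind of request with the same kind of metadata, a service such as OAIster can gather records from many collections into one searchable index.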
MIT DSpace
DSpace is a groundbreaking digital library system to capture, store, index, preserve, and redistribute the intellectual output of a university's research faculty in digital formats.

The Invisible Web can be rendered visible. As tools such as search engines continue to improve, and researchers continue to learn new search techniques, more resources will become accessible.

References and Readings

All pages viewed May 2004.

Bergman, Michael K. "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing. 7:1. August 2001.

Block, Marylaine. The Invisible Web. 14 Nov 2003.

Koll, Matthew. "Information Retrieval". Bulletin of The American Society for Information Science. 26:2 (December / January 2000)

Lyman, Peter et al. How Much Information? 2003

Norvig, Peter [Director of Search Quality, Google] Internet Searching (PDF). [to appear, Report on the Fundamentals of Computer Science, The National Academies, 2003]

Search Tools Consulting. Search Indexing Robots and Robots.txt. Page Updated 2002-12-18.

Sherman, Chris and Gary Price. "The Invisible Web: Uncovering Sources Search Engines Can't See." Library Trends 52:2 Fall 2003, pp.282-298.

Learn to see,
and then you'll know
there is no end to the
new worlds of our vision.
-- Carlos Castaneda


http://paula.edmiston.org/conf/2004recap-invisibleweb.html
Last Edited: 15 Apr 2012