Novak, David
Online , 27 , 1 , 18(5)
Jan-Feb , 2003
ISSN: 0146-5422
Language: English
Record Type: Fulltext
Word Count: 3488
Line Count: 00275
Text:
It has taken me several
years to grasp just how vast a gulf lies between
searching and researching
the Internet. Searching the Internet is computer
science. We practice our
understanding of search technologies and search
engines. Researching the
Internet is library science. We act upon our
understanding of how information
is arranged.
These two approaches
are very distinct.
I shall attempt to trace a gradual evolution in how we find
information using the Internet.
I believe we have been moving from Internet
searching to Internet research--from
computer science to library science.
If I am right, this portends
perhaps the single most dramatic change to
library science in decades:
a renaissance of library science and
librarianship.
CONSTANT CHANGE
The two or three most effective ways to search the Internet change
every year or two. It comes
as a bit of a shock to realize, but even the
very short history of the
Internet has seen a wide range of tools and
techniques come and go.
Today, there appears to be a consensus that Google
is the primary tool
for searching for Internet information. And yet
this same conviction was
directed to Yahoo! just 2 years ago. What has
happened?
In the very early days, before the Web arrived, I remember pleading
with my Internet service
provider to mirror a copy of the many guidebooks
that made up the Internet
Clearinghouse Project. You may know of this
project as its later re-incarnation:
the Argus Clearinghouse. In its heyday
it was internationally famous.
One of its typical text guidebooks, Not Just
Cows, described in detail
all of the better Internet resources and active
mailing lists for agriculture.
When I met this archive, it was racing past
130 guidebooks.
Archie complemented this as a database of all the publicly accessible
files found on FTP sites.
Actually, Archie was not a complete database but
was thought to index well
over 95 percent of all FTP material. This
coverage was so complete,
it started the tradition that the publisher was
responsible for informing
a nearby Archie if a new FTP site was launched.
How far we have come today. Most of the guidebooks have grown up or
disintegrated in time. Argus
has not been updating for several years and is
being folded into the Internet
Public Library (IPL) directory. Argus
founder Lou Rosenfeld formed
his own consulting company
(www.lourosenfeld.com) and
gives seminars in conjunction with the Nielsen
Norman Group. Argus' direct
competitor, AlphaSearch, is gone, too. Even
Archie gave way to Shareware.com,
which was then purchased by CNET, then
lost all pretense at completeness.
But much more was lost. The idea that a
single person could organize
all the resources in a given topic was one
casualty. So was the idea
of a search engine that indexed all Internet
resources, as Archie did
for FTP. The Internet simply outgrew these ideas.
In the early days, it was
both possible and brilliantly executed.
With the arrival of Gophers, Veronica stepped in and became a third
vital approach to finding
Internet information. Veronica was a
quasi-definitive list of
all Gopher categories. It never attained the
completeness that Archie
had for FTP resources and its fame slipped rapidly
away once it became apparent
that the Web was going to be far more
interesting than Gopherspace.
THE WEB ARRIVES
The early search engines, with names like the World Wide Web Worm and
Webcrawler, changed this
environment significantly. These search engines
indexed most of the Web,
certainly achieving initially over 50 percent
coverage, then slipping
to 30 percent as the Web grew. These tools were as
famous as Google and Yahoo!
are today. Everyone used them. And when the Web
was young, these tools sparkled.
Unfortunately, the search algorithms used by early search engines
were of the kind used by
commercial databases of the day. A search for
"Internet Research" returned
a list of Web pages ranked by frequency and
title. Web pages with "Internet
Research" in their titles would lead the
list, followed by pages
with the words "Internet research" occurring
several times in the text.
This gave rise to the uninspired marketing maxim
that you must place your
primary keywords in the title and three or four
times in the first paragraph.
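The title-then-frequency ranking described above can be sketched in a few
lines of Python. This is only an illustration of the general idea, not the
code of any actual early search engine. The function names, the PAGES sample
data, and the example.com addresses are all hypothetical.

```python
import re

# Hypothetical sample pages, invented for this illustration.
PAGES = [
    {"url": "a.example.com/notes", "title": "My notes",
     "body": "Internet research tips. Internet research is hard."},
    {"url": "b.example.com/guide", "title": "Internet Research Guide",
     "body": "A short guide."},
    {"url": "c.example.com/cats", "title": "Cats",
     "body": "Nothing relevant here."},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_phrase(words, phrase):
    """Count occurrences of phrase as a contiguous run inside words."""
    n = len(phrase)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)

def rank(pages, query):
    """Rank matching pages: title matches first, then by body frequency."""
    phrase = tokenize(query)
    if not phrase:
        return []
    scored = []
    for page in pages:
        in_title = count_phrase(tokenize(page["title"]), phrase) > 0
        body_hits = count_phrase(tokenize(page["body"]), phrase)
        if in_title or body_hits:
            scored.append((in_title, body_hits, page["url"]))
    # True sorts above False, so title matches lead; ties fall back to frequency.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [url for _, _, url in scored]

print(rank(PAGES, "Internet Research"))
```

With this scoring, a page gains rank simply by repeating the phrase, which
is the weakness the marketing maxim exploited.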
These early search engines also invited and even expected publishers
to inform them of new Web
pages. The search engines would dutifully send
out their spiders, sometimes
immediately. For some reason, though, I don't
remember much use of field
searching in these early days. Perhaps the early
search engines did not permit
Title and URL searching, or perhaps we didn't
know we needed these tools.
Complementing these early search engines were two simple techniques
that gave the motion to
Internet surfing. Initially, we would search for a
hotlinks page. A search
for "Accounting Hotlinks" would likely unearth a
page created by someone
who had just finished a scan of accounting
resources. If it was a month
or two old, it served as a very fine starting
point for your efforts to
do the same.
About a year later, as Hotlinks stopped being the word du jour, we
would visit the "further
links" section of an interesting Web site.
Publishers were kindly creating
these lists more and more, pointing out and
linking to comparable sites.
This may have been where the habit of surfing
arose--you could hop on
and gradually move from one Web site, to its
further links page, to the
next Web site, to its further links
page--surfing to the information
that peers recognized as useful.
THE AGE OF THE DIRECTORY
The World Wide Web Virtual Library, soon followed by Yahoo!, began to
succeed as the guidebooks
began to falter. Yahoo! required much less effort
to update, so rapidly delivered
a far more extensive list of
resources--though sadly
listing few of the cherished mailing lists.
Yahoo! really made its move at a time when the early search engines
were struggling to make
the transition to popularity ranking. There were
too many resources out there.
The basic search algorithms that had
delivered such brilliant
results only a year earlier were now increasingly
exasperating. They didn't
work any more. The best information was often
buried deep within a mass
of other information.
Essentially, as the Web grew and search engine databases struggled
unsuccessfully to keep pace,
the search engine results deteriorated. It did
not help that these early
search engines defaulted to OR, so that even a
simple search for three
blind mice would deliver millions of results.
Adding the + symbol before
each word--making an explicit request for a
Boolean AND search--initially
tamed this mess, but the trouble was more
fundamental. It required
a major rethink in how information was ranked to
revitalize these search
engines.
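The difference between an OR default and an explicit AND can be shown with a
toy inverted index. This is a minimal sketch under stated assumptions: the
DOCS collection, the function names, and the exact handling of the + prefix
are hypothetical and do not reproduce any particular engine.

```python
# Hypothetical four-document collection, invented for this illustration.
DOCS = {
    1: "three blind mice",
    2: "three little pigs",
    3: "blind date",
    4: "mice and men",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query, default="OR"):
    """Return the ids of matching documents.

    Terms prefixed with + are always required. Unprefixed terms are
    optional under an OR default and required under an AND default.
    """
    required, optional = [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif default == "AND":
            required.append(term)
        else:
            optional.append(term)
    if required:
        # Every required term must be present: intersect the posting sets.
        result = set(index.get(required[0], set()))
        for term in required[1:]:
            result &= index.get(term, set())
        return result
    # No required terms: any single term is enough, so take the union.
    result = set()
    for term in optional:
        result |= index.get(term, set())
    return result

INDEX = build_index(DOCS)
print(sorted(search(INDEX, "three blind mice")))        # OR default
print(sorted(search(INDEX, "+three +blind +mice")))     # explicit AND
```

In this sketch the OR default returns every document that contains any one
of the three words, while the + form returns only the document that
contains all of them. On an index of millions of pages the same gap is what
separated a flood of results from a usable list.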
In this chaotic transition, Yahoo! reigned supreme. Suddenly you
could not move fast enough
to see what Yahoo! had to offer. The age of the
directory also heralded
a raging business model that, through massive
promotion, made Yahoo! synonymous
with Internet research for a time.
LATE ERA SEARCH ENGINES
The growth of the Internet continued. When Google introduced ranking
technologies, it changed
everything. Here was a way to float the more
popular and, coincidentally,
the more recognized, resources to the top of
the long search engine lists.
With the default changed to AND, the search
engines began to work again
as an effective research tool. Then the
databases searched by search
engines swelled in size.
There were fundamental shifts taking place. With these new
algorithms, the search engines
no longer required the assistance of
publishers to index the
best information. Initially, the engines began
asking for e-mail addresses--often
bathing a publisher in spam as a price
for indexing--and then some
gradually stopped altogether. At the same time,
as databases grew, the potential
pay-off for a publisher shrank. Most new
publishers would only occasionally
see a visitor sent their way from any
effort in informing the
search engines of new pages.
When Google crested 1 billion records, the limitations of Yahoo! were
becoming increasingly apparent.
No directory could ever index the complete
volume of the Internet effectively,
it was said, forgetting that only a few
years earlier Archie had
effectively indexed all FTP resources. What had
happened, of course, was
rapid Internet growth, which diluted earlier
achievements to the point
of being inadequate. It did not help that at this
time Yahoo! began to charge
a consideration fee for publishers wishing to
be indexed.
BOOLEAN, FIELDS, AND MORE
Another change happened. The search engines allowed for field
searching, and those in
the know began to make much greater use of
additional techniques to
further refine their searching. A title search
could be most helpful in
certain circumstances. AlltheWeb permitted a
title search using title.normal:words.
This was later changed to match
Alta Vista's simpler title:words,
though Google persisted for a long time
in not inviting users to
use its title search capability.
Almost by accident, many researchers began extending a skill I refer
to as URL interpretation.
From an early understanding that .gov means
government and .au, Australia,
researchers could intuit additional
information from the Web
address. On a good day, I can tell the format,
date, publisher, and type
of author from the URL. Guessing these elements
helps me to anticipate the type
and quality of information on the site.
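A small part of URL interpretation can be mechanized. The sketch below reads
only three clues from an address: a sector hint, a country hint, and a file
format. The hint tables are deliberately tiny and are assumptions made for
illustration, as are the function name and the example.gov.au address. Date,
publisher, and type of author still call for human judgment.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Deliberately small, illustrative hint tables (assumptions, not a full list).
SECTOR_HINTS = {"gov": "government", "edu": "education",
                "com": "commercial", "org": "organization"}
COUNTRY_HINTS = {"au": "Australia", "uk": "United Kingdom", "ca": "Canada"}

def interpret_url(url):
    """Guess sector, country, and file format from a Web address."""
    # urlparse only recognizes the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    labels = host.split(".") if host else []
    clues = {"host": host or None, "sector": None,
             "country": None, "format": None}
    # A trailing country code, such as .au, comes last in the host name.
    if labels and labels[-1] in COUNTRY_HINTS:
        clues["country"] = COUNTRY_HINTS[labels[-1]]
        labels = labels[:-1]
    # The sector label, such as .gov, then sits at the end of what remains.
    if labels and labels[-1] in SECTOR_HINTS:
        clues["sector"] = SECTOR_HINTS[labels[-1]]
    # The file extension hints at the format of the document.
    suffix = PurePosixPath(parsed.path).suffix.lower()
    clues["format"] = suffix.lstrip(".") or None
    return clues

print(interpret_url("www.example.gov.au/reports/2002/summary.pdf"))
print(interpret_url("www.SpireProject.com/art10.htm"))
```

The first address is hypothetical; the second appears in this article. A
PDF on a government host suggests a formal report, while an .htm page on a
commercial host suggests something written for the Web.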
Region also came into play. A simple url:.au would limit results to
Australia. Even more effectively,
Bryan Strome with his
SearchEngineCollossus.com
would (and still does) lead you quickly to a
regional search engine,
an Australian-only search engine. Predictions swept
the Web that the next great
step forward would be in regional Webspace and
in topic-specific search
engines. Both predictions, I am mindful, play as
yet minor roles in Internet
research.
BACK TO CHAOS
As the Internet grows further, search engines have begun to run into
trouble again. Google stands
at just about 3 billion records now, but the
Web races ahead at a much
faster pace. There are complex reasons for this
pace--not least that the
number of people capable of Internet publishing
grows at an exponential
rate. I've explained my views at
www.SpireProject.com/art10.htm
and www.SpireProject.com/art13.htm. This
growth is real and seriously
disrupts popularity ranking. Estimating an
absolute size of the Web
is perilous, but if you accept an estimate of 15
billion Web pages, only
14 percent of the Web is indexed. Next year, as
this figure surely dips
below 7 percent, ranking technology will take on a
whole new meaning.
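The coverage arithmetic is a simple ratio of indexed pages to the estimated
size of the Web. The sketch below uses hypothetical function names. Note
that 3 billion records against 15 billion pages works out to 20 percent; the
14 percent figure corresponds to roughly 2.1 billion indexed Web pages,
presumably a narrower count than the 3 billion records, though that reading
is an assumption and is not stated in the text.

```python
def coverage_percent(indexed_pages, estimated_web_pages):
    """Share of the estimated Web held in an index, as a percentage."""
    if estimated_web_pages <= 0:
        raise ValueError("estimated_web_pages must be positive")
    return 100.0 * indexed_pages / estimated_web_pages

def implied_indexed_pages(percent, estimated_web_pages):
    """Number of indexed pages implied by a stated coverage percentage."""
    return percent / 100.0 * estimated_web_pages

WEB_ESTIMATE = 15_000_000_000  # the 15 billion page estimate quoted above

# Coverage if all 3 billion records are counted as Web pages.
print(coverage_percent(3_000_000_000, WEB_ESTIMATE))

# Index size implied by the 14 percent figure.
print(implied_indexed_pages(14, WEB_ESTIMATE))

# Coverage a year later if the Web doubles and the index stands still.
print(coverage_percent(implied_indexed_pages(14, WEB_ESTIMATE), 2 * WEB_ESTIMATE))
```

The last line shows why the percentage halves when the Web doubles in size
and the index does not grow.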
Where once ranking would float the best information to our attention,
by next year it will retreat
to become similar to Yahoo! with its emphasis
on site, time, and money.
Google is not losing its battle but is definitely
losing the technological
war on organizing chaos. However, this war is
being fought more successfully
on other fronts.
CHANGES IN APPROACH
There is more to this evolution than a change in tools. This is
really a story about a change
in approach. In the early days, we expected
almost all FTP resources
to be indexed by Archie. With the early search
engines, we expected most
important Web pages to be represented. Tomorrow,
we will expect most important
Web sites to be represented. Yes, we will
leap from Web pages to Web
sites.
There is another message here. Over time, we discover better ways to
find information.
For a simple illustration, consider how we judge the quality of
Web-based information. In
the early days, there were murmurs about
assessing quality based
on the .gov versus .com or perhaps just assuming
the worst. Even today, some
online advice suggests an assessment based on
the presence of a copyright
notice and date. Is the author identified on
the article? Are the links
working? Is the spelling correct?
Thankfully, we've progressed. We now look to context, format, and
source. Who wrote it--and
if we have a name, what else have they written
(found with a simple search)?
Make an assessment of the author and
publisher based on other
items they have published. (Hack the URL or query
Google with a URL field
search to find information logically located
nearby.) Look for evidence
of peer review by considering the format in
which the information was
prepared. Perhaps consider Web site popularity
(found with a link field
search). We can still consider spelling.
MORE AND MORE LIBRARY SCIENCE
Let's have a research example. One of my frequent tasks as a
traveling public speaker
is to find suitable auditoriums. This is not
simple. Bluntly querying
Google for a list of auditoriums in Dallas will
only give me a list of those
with Web sites, primarily those with some
popularity. What I really
want is a list of auditoriums. It turns out two
organizations create such
lists. The local convention and visitors bureau
often has a list of meeting
room venues that include auditoriums. The state
agency involved in disability
legislation also may have a definitive list
of auditoriums and their
respective handicap access status.
I learned this through a bit of feedback research. After I stumbled
upon two such lists in other
cities, I began to actively seek such lists
with a purpose. The key,
however, is to realize Google rarely indexes these
lists. But knowing they
exist, I'll first strike out and find the local
convention and visitors
bureau (with the help of Google or a list of
convention centers) and
then move through the Web site towards the list of
meeting facilities. I may
also consult a directory of museum Web
sites--since they occasionally
have auditoriums.
What has happened? Simple. Searching failed me. Without library
science--knowledge of source,
anticipating information, feedback
research--I would have to
admit defeat and choose a hotel.
Internet research continues to mature. About a year ago, I had a
delightful afternoon with
Lecturer Theresa Anderson at University of
Technology Sydney (UTS).
She was completing her thesis on the criteria
experienced researchers
use to select information. With the help of
multiple video cameras and
computer memory, she has traced how skilled
commercial-zone searchers
interact with the information world dynamically,
predicting what was out
there, selecting and guiding their attention based
on clues.
As we watched while I executed a difficult Internet search, we saw
the same techniques at play.
I was intimately aware of what I thought was
out there, what I was finding,
and constantly comparing the two. There was
an internal dialogue selecting,
reformulating, seeking a certain type of
information, and being frustrated
when I didn't find it. At the
experiential level, Internet
research techniques merge with commercial and
information research techniques.
WHAT THIS MEANS
We have witnessed a voyage away from an era in which the Internet was
controlled and deeply understood
from a computer science perspective.
Internet research was initially
about technically searching the Internet.
It extended from search
engines, to Boolean logic, to popularity
ranking--all elements of
computer science. Because most early adopters were
computer techies, Internet
research adopted this computer tech mantle.
This is changing, and the change is accelerating.
Over time, the Internet has grown. It has gradually morphed from a
shallow pool, into a deep
lake, into an ocean where the depths are largely
unknown and not directly
searchable. We simply can no longer see much of
the information from a single
vantage point.
The Internet transformed into the very beast found in the older
information world--very
much requiring library science and a research
heritage distilled from
years of working with incompletely indexed
information with multiple
and overlapping layers of organization.
The Internet became not congested, or chaotic, since it is clearly
neither. The Internet began
to grow up, add weight, and resemble its
information birth parent.
Evidence of this lies in the amusement we now hold for early search
techniques. Why don't we
still search for hotlink pages? Why can't a single
person write a guidebook
organizing all the resources in agriculture? Do we
really need to use quotes
with search engines?
GUIDEBOOKS PRECURSOR TO MODERN SEARCHING
The one ill-fitting piece to this jigsaw is the early guidebooks I
long held so dear. It reminds
me that even in the early, pre-Web era,
library science was there,
evident, and making an impact. But that impact
was initially minor compared
to the results of computer science and
visibility of commercially
viable search engines.
As the Internet has grown up, dwarfing our simplistic search tools
and techniques, we have
put in its place more and more library science to
deliver us from confusion.
This trend will continue. In fact, it will
continue until the very
nature of Internet research shifts monumentally
from computer science to
library science.
The relative gifts of computer science will be eclipsed by an
understanding that Internet
research is more about finding information than
about searching--and finding
information is intimately library science.
SHIFTING ALLEGIANCES
Yes, the whole concept of Internet research will detach itself from
computing science and merge
as a discipline of library science. It will
shift allegiance. The move
is inevitable and I personally think it will
take about 3 years.
What else could transpire? Could computer science absorb library
science? Not likely. In
the vast Internet, resembling in so many ways the
reality of information research,
computing science is relegated to a role
in organizing discrete baskets
of information--not the task of guiding
research itself. The computing
aspect of searching will become a sub-topic
to the concept of Internet
research.
As an aside, Internet cataloging actually runs the opposite risk, of
being absorbed into computer
science. The relative gifts of thesaurus and
classification schemes can
be eclipsed by the more visible gifts of
computer science--but that
is another story.
How will we find information on an Internet with 50 billion records,
in which the largest index
is but 3 or 4 billion records in size? The
answer is with intellect,
with skill, and primarily with the arsenal
provided by library science.
We will have a multi-tiered approach, where
individuals with more skill
will dig deeper and be more effective. We have
been moving in this direction
for a decade.
The totality and inevitability of this move is the inspiring event.
Slowly, Internet searching
will come to be seen as an element of Internet
research. Internet research
will assume the undisputed mantle of library
science.
WHERE TO GO FROM HERE
The digitizing of our lives never altered the need for
assistance--just the type
of assistance the community required. The new
forms of assistance will
relate to digital information. In viewing the
library community in its
widest context, that of assisting and facilitating
access to information, we
see that the library community belongs here. This
is your home. I see three
effects:
1) There is no urgency to selling a message that the Internet needs a
librarian. There is no need
to sell your role to the community: There is
only the need to be there
when the community learns it needs you.
2) Priorities within the library community are changing. There are
ways to prepare for these
changes with training and legislation. I
personally want to see libraries
involved in teaching Internet research to
the community. Soon the
community will come to you seeking advice on how to
undertake a challenging
bit of Internet research. Will you be ready?
3) This should inspire the library community. Its destiny is assured.
Librarians will be as important
as they've always been.
History will describe the early Internet as an aberration, the one
time when the Internet did
not resemble the whole information sphere, in
all its complexity, organization,
and beauty. History will remember these
last few years as the one
time when Internet research was not part of
library science.
David Novak (david@spireproject.com) is a public speaker and founder
of the Spire Project.
Comments? E-mail letters to the editor to marydee@xmission.com.