How methods from computational linguistics, such as lemmatization or parsing, can improve retrieval results.
Unfortunately, literature in this domain is sparse. The volume edited by Strzalkowski comprises a series of articles and contains a good overview of the domain by Karen Sparck Jones:
You can also have a look at some special publications in this area, for example the TREC proceedings:
There are some sample texts for today. Please download them (Devon1, Leicester1, Devon1dt and Leicester1dt as well as de_text1 and de_text2). The Devon and Leicester texts exist in a plain text version and in a prepared version comprising the lemma for each word and NP brackets. The latter two texts are also provided in a P(art)O(f)S(peech)-tagged version (de_text1.tts and de_text2.tts).
For the Devon & Leicester texts please write the following programs:
For the POS tagged texts please write the following program:
Connect to the Twenty-One demonstrator:
http://twentyone.tpd.tno.nl/cgi-bin/21searchv3/21demomooi/deutsch/21main_frame.html

Try to find out, by submitting queries to the 21 database, which options have been realized for this system (please note that all documents are about "sustainable development"/"nachhaltige Entwicklung"/"duurzame ontwikkeling"/"développement durable"; you can therefore ask for a lot of information in the ecological domain):
Natural Language Processing techniques and resources can be used for the purposes of Information Retrieval in quite a few ways. From the historical point of view, NLP techniques (such as more or less deep parsing) were applied first, mainly for indexing. NLP resources such as semantic networks or lexical databases (bilingual dictionaries, for example) have only been applied to a broader extent in recent years.
There are two different approaches to embedding NLP techniques in Information Retrieval:
This chapter is organized in the following way: section 2 deals with attempts and results of linguistically motivated indexing (LMI) and the changing motivation for the application of NLP techniques and resources in IR during the last decades of the 20th century. Section 3 deals with AI-based systems.
It is worth noting that this chapter only deals with NLP applications to text retrieval.
The attempts to apply NLP techniques to create indices are nearly as old as automatic retrieval in general. The early developments tried to imitate human indexing, and quite a few of these systems applied rather elaborate linguistic analysis to this end.
As experience with search engines on the WWW shows, searching in large databases by means of a single term will often yield (too) many hits. It is therefore important to restrict searches by means of, for example, multi-term queries. This requires the ability to compare such a query with the index terms in the documents, which can be based on two ways of indexing:
As LMI techniques allow the treatment of multiword index terms in a different, perhaps more elaborate way than NLI, we complete the series of distinctions. In both cases precoordination and postcoordination are possible:
An index term may be a single term (a single word or a stem) or a compound one: the latter can be a complex term (found by any LMI) or a joined term (in the case of NLI: "joined" can mean "adjacent" or "in the neighbourhood"):
These concepts are important for our considerations of the early approaches in LMI. A more detailed overview of the LMI approaches is found in Sparck Jones (1999).
The SYNTOL system (Bely et al.) is designed to recognise thesaurus terms and to establish logical relations between them (such as "cause"). For this purpose rather elaborate NLP techniques were applied. Like many other of these early systems, the application was restricted: SYNTOL was applied to abstracts. There has been no retrieval performance evaluation of the SYNTOL system; Bely et al. only mention that the automatic descriptions were similar to manual ones.
Salton (1968) also used a thesaurus, but with the additional feature of a syntactic analysis in order to establish proper dependency relationships. The result of his analysis (the so-called criterion trees) could then be matched against similarly analysed query phrases.
Large-scale evaluation of retrieval results was not possible in these early days of IR; only in the nineties did the TREC events allow thorough testing. Nevertheless, the Cranfield tests (in the late sixties) were already designed to compare different index languages. The outcome of the Cranfield 2 test (Cleverdon 1967) was rather surprising: very different index languages resulted in similar performance. Heavily controlled languages with complex terms, for example, were no better than manually provided natural language descriptions. Salton's experiments comparing the MEDLARS controlled language to natural language showed similar results.
Fagan's tests (Fagan 1987; 1988) marked the end of this first phase. He showed that approaches using joined terms performed better than approaches using complex terms; the latter in turn proved to be better than approaches using simple terms. The most interesting aspect of these tests is not only the results, but also the conclusions Fagan could draw from them. The relatively bad result for compound terms was seen to be due to the fact that compounds were not shared between queries and documents; the reason for this was that the documents were relatively short (about abstract length).
The availability of large text databases, as well as the possibility (due to the TREC collections) to make retrieval runs with such amounts of text and to evaluate the results, can be seen as characteristic of the IR approaches of the nineties.
Current practice in Information Retrieval is mostly based on statistical techniques, but recent research projects aim at investigating whether Natural Language Processing techniques could improve retrieval results or offer additional features to the users.
In this section, we will restrict our considerations of NLP applications in IR to the classic tasks of indexing and retrieval, although new applications such as text classification, routing and filtering, as well as summarization, also make intensive use of NLP techniques.
NLP techniques applied for both processes mainly concern the use of linguistic knowledge for the document and query language(s):
Morphological analysis can be applied to terms in the queries and in the documents. The idea behind this is that retrieval results might be improved if morphological variants of a term are reduced to a single form.
A simple technique is suffix stripping on the basis of a list of frequent endings for the language under consideration. There exist efficient algorithms for such a "stemming" procedure, because no dictionary look-up is necessary, but they produce quite a number of errors due to their lack of linguistic information. In addition, the produced stems might not correspond to existing words (eg "comput" for "computer").
A well-known stemmer is the Porter stemmer; other tools are the S-stemmer and the Lovins stemmer. They have all been developed for English.
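To make this concrete, here is a minimal sketch of dictionary-free suffix stripping in Python; the suffix list and the minimum-stem-length check are invented for illustration and are much cruder than the actual Porter rules:

```python
# Minimal dictionary-free suffix stripper (illustrative only; far
# cruder than the real Porter algorithm). No dictionary look-up is
# needed, which makes it fast, but the stems need not be real words.
SUFFIXES = sorted(["ational", "ization", "ing", "ers", "er", "es", "ed", "s"],
                  key=len, reverse=True)  # try longest endings first

def strip_suffix(word: str) -> str:
    for suffix in SUFFIXES:
        # keep at least 3 characters so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(strip_suffix("computers"))  # -> 'comput' (not an existing word)
print(strip_suffix("computing"))  # -> 'comput' (variants are conflated)
```

Both morphological variants are reduced to the same stem, which is exactly what the retrieval system needs, even though "comput" itself is not a word.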
Linguistically motivated algorithms for morphological analysis check the resulting forms against the entries of a dictionary. In spite of quite a number of problems (eg inconsistency or incompleteness of the dictionary, spelling errors in the test corpus, proper nouns, hyphenation variation), this technique shows improvements compared to the Porter algorithm.
Compounding is a further important aspect of morphological analysis with regard to information retrieval; it concerns languages such as Dutch or German. Compound splitting requires combination rules and a lexicon. Compounds may occur in the queries and in the texts. IR systems may differ in the way they allow the retrieval of compounds:
(From the technical viewpoint, these "matching procedures" are often realized by means of query expansion with the terms or compounds under consideration).
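The splitting step itself can be sketched as a recursive search against a lexicon. The toy lexicon below and the handling of only the linking "s" (Fugen-s) are simplifying assumptions; real splitters need full-form lexica and more linking rules:

```python
# Sketch of German compound splitting: recursively try to cover the
# word with lexicon entries, optionally skipping a linking "s".
LEXICON = {"arbeit", "zimmer", "wetter", "bericht"}  # toy lexicon

def split_compound(word):
    word = word.lower()
    if word in LEXICON:
        return [word]
    for i in range(3, len(word) - 2):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        # try the remainder as-is and with a linking "s" removed
        for tail in (rest, rest[1:] if rest.startswith("s") else None):
            if tail:
                parts = split_compound(tail)
                if parts:
                    return [head] + parts
    return None  # no analysis found

print(split_compound("Wetterbericht"))  # ['wetter', 'bericht']
print(split_compound("Arbeitszimmer"))  # ['arbeit', 'zimmer']
```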
Syntactic knowledge can be used for the indexing process as well as at retrieval time. Some systems only index terms which occur in specific phrases; others extract phrase structures from the texts and the query and compare these structures during retrieval.
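A very simple version of such phrase-level indexing can be sketched as a pattern over POS-tagged tokens; the tag set and the "adjectives and nouns with at least one noun" pattern are illustrative simplifications:

```python
# Sketch: index only simple noun phrases, matched as a run of
# adjectives and nouns that contains at least one noun.
def noun_phrases(tagged):
    """tagged: list of (word, pos) pairs."""
    phrases, current, has_noun = [], [], False
    for word, pos in tagged:
        if pos in ("ADJ", "NOUN"):
            current.append(word)
            has_noun = has_noun or pos == "NOUN"
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases

tagged = [("the", "DET"), ("sustainable", "ADJ"),
          ("development", "NOUN"), ("is", "VERB"), ("slow", "ADJ")]
print(noun_phrases(tagged))  # ['sustainable development']
```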
Semantic information is used for query expansion. Semantic networks such as the one developed in the EuroWordNet project are used to add terms automatically to the queries:
The process of query expansion aims at also retrieving texts containing semantically linked terms (eg synonyms). Paul Buitelaar (DFKI, Saarbrücken) has developed such a query expansion module using GermaNet (the EuroWordNet version for German). This tool allows the user to choose the degree of semantic relatedness between the terms (50 - 100%). With a high degree chosen, the German term "verwaltung" is only related to "rathaus", whereas on a lower level, terms such as "kirche" or "schule" are also in the list of related concepts. You can try a demo of Paul's query expansion tool (it has to be noted here that the demo shows actual research work; the author will be happy if you mail us your comments):
http://cl-www.dfki.uni-sb.de/~paulb/expand/expand.html
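The behaviour of such a module can be sketched as follows; the relatedness table and its scores are invented stand-ins for what GermaNet would deliver:

```python
# Sketch of threshold-based query expansion. The scores mimic the
# 50-100% relatedness degree in the demo; the values are invented.
RELATED = {
    "verwaltung": [("rathaus", 0.9), ("kirche", 0.6), ("schule", 0.55)],
}

def expand(query_terms, threshold=0.5):
    expanded = list(query_terms)
    for term in query_terms:
        for related_term, score in RELATED.get(term, []):
            if score >= threshold:
                expanded.append(related_term)
    return expanded

print(expand(["verwaltung"], 0.8))  # ['verwaltung', 'rathaus']
print(expand(["verwaltung"], 0.5))  # ... plus 'kirche' and 'schule'
```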
Multilingual information is used to allow users to access information in a language different from their mother tongue. Some systems allow users to formulate queries in various languages in order to retrieve documents in those languages. These systems are called multilingual, but they don't perform any translation. Cross-Language Information Retrieval (CLIR) systems allow the retrieval of documents by means of queries in a language different from the document language. There are several possibilities to achieve this goal [Oard 1996]:
Translation can be performed during indexing time (off-line) or as a pre-processing step in the retrieval process (on-line). The translation process can be based on three sources of transfer knowledge:
Queries are in general translated on-line. If dictionaries are used for this purpose, each (lemmatized) word in the query is replaced by its possible translations. The disadvantage of this method is that word senses can hardly be disambiguated, especially in the case of short queries (eg the German word "Satz" can be translated into "sentence", "theorem" or "setting", depending on the context). This may lead to a decrease in Precision. On the other hand, Recall might be increased, because synonyms are added to the query. Multi-word expressions such as idiomatic expressions, terminology or collocations are a problem in CLIR. Word-based translation fails here because the meaning of multi-word expressions is not compositional (eg "yellow pages" vs. "Gouden Gids"). MT-based approaches can yield good results for longer queries, because such systems can exploit syntactic and semantic aspects of the context in order to improve the translation. For short queries, there is not enough context to be exploited during the translation process. A corpus-based approach exploits parallel corpora in order to derive bilingual dictionaries from them. Domain-specific terminology can thus be taken into account, and multi-word expressions can be detected.
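A word-by-word dictionary translation of a query takes only a few lines to sketch; the toy dictionary below shows how the ambiguity of "Satz" inflates the query:

```python
# Sketch of dictionary-based query translation: every lemmatized
# query word is replaced by all of its listed translations.
DICT_DE_EN = {
    "satz": ["sentence", "theorem", "setting"],  # ambiguous
    "kurz": ["short", "brief"],
}

def translate_query(terms):
    translated = []
    for term in terms:
        # untranslatable terms (eg proper nouns) are kept as they are
        translated.extend(DICT_DE_EN.get(term.lower(), [term]))
    return translated

print(translate_query(["Satz"]))
# ['sentence', 'theorem', 'setting'] - Recall may profit, Precision suffers
```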
Documents can be translated by means of a machine translation system. The disadvantage of this approach is the dependency on the quality of the MT system. The advantage, however, is the possibility to display the translated texts to the users if desired. A second approach consists in only partially translating the documents. Using this option, only terms or phrases which are regarded as relevant for indexing are translated. Partial translations of lemmatized index words in the documents are hampered by the same problems as dictionary based query translation. If terms occurring in documents are translated, however, the context can be used for disambiguation purposes.
In this section we will look at the combination of Artificial Intelligence and Information Retrieval and consider some examples of AI-based systems. These examples will cover a rule-based IR system (TOPIC/RUBRIC) and two systems that depend on parsing and frames (SCISOR and the German TOPIC). We will also look at an example of a neural network in IR. Approaches based on genetic algorithms may also be argued to belong to Artificial Intelligence, but because of their vector-oriented operation we will treat them separately in a different chapter. For a general introduction see one of the many textbooks on the subject, e.g. [Winston, 1991].
The notion of a frame is nothing but a reflection of the fact that almost any concept may be analysed into smaller concepts. A concept is represented by a frame and the sub-concepts may themselves be frames that fit in 'slots' in the bigger frame. Concretely, a frame is a memory structure with a number of fixed slots.
A car may be the frame and the motor of the car fills the slot 'motor'. The motor itself may again be represented by a frame and then the slots may be the cylinders, the spark plugs or the carburettor, each to be filled by a value or a new frame. In short: frames can be seen as collections of semantic nodes and slots that together describe a stereotyped object, act, or event.
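The car example can be expressed directly as a data structure; this minimal sketch leaves the set of allowed slot names informal:

```python
# Sketch of a frame: a memory structure with named slots whose
# fillers are either plain values or other frames (sub-concepts).
class Frame:
    def __init__(self, name, **slots):
        self.name = name
        self.slots = slots  # slot name -> value or Frame

motor = Frame("motor", cylinders=4, spark_plugs="standard",
              carburettor="twin")
car = Frame("car", motor=motor, colour="red")

# Follow the slot chain: the car's motor is itself a frame.
print(car.slots["motor"].slots["cylinders"])  # 4
```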
Frames are often used in the context of scripts, and scripts themselves may be considered frames with a temporal sequence. The keyword again is 'stereotype': a script describes expectations in certain stereotyped situations. If, in a restaurant-script, the waiter puts something on the table, the knowledge in the script expects the 'something' to be plates, food or the bill; not a boulder or the tyre of a car. It must be stressed here that the sheer size of knowledge to be analysed and described in advance prohibits the use of frames and scripts in all but dedicated applications that act in very small domains (the scaling problem). More about frames and scripts can be found in the work of Schank, e.g. [Schank and Abelson, 1977], Minsky [Minsky, 1981] or in the Handbook of Artificial Intelligence [Barr and Feigenbaum, 1989].
Frames and scripts are especially useful when a system wants to extract relevant information using top-down analysis of the events or situations described in a document.
Early attempts at intelligent IR, such as [DeJong, 1982] or Lebowitz [Lebowitz, 1986], started as it were 'from the top'; hence they were called top-down systems. This 'top' consisted of a collection of known, stereotyped situations such as earthquakes or railway accidents. The systems then tried to match incoming documents with such stereotypes and, if a match could be made, to fill in slots that represented events such as the number of casualties, the place and time of the quake, or the strength on the Richter scale. As these systems work with 'expectations' about which concepts will occur together in a text or text passage, they are also known as expectation-driven systems. Later systems were more sophisticated in that they actively tried to build new representations of objects, instead of acting passively on a number of pre-cooked stereotypes. However, to do so they had to incorporate knowledge that was distilled from the texts themselves, and this is called bottom-up processing.
Bottom-up modelling is more difficult, as it starts with the text itself and the individual words. A process called 'parsing' then tries to identify the parts of the sentences and their relation to each other. Obvious problems here are disambiguation and the resolution of anaphors and deixis. On the other hand, it can produce accurate results in arbitrary texts and texts that contain unexpected information.
In the following pages we will describe two systems that made use of these AI techniques: SCISOR and the German TOPIC.
Developed at General Electric, SCISOR (System for Conceptual Information Summarization, Organization and Retrieval) is an experimental system that detects and stores information about financial transactions, such as mergers and takeovers, in an input stream of financial news (the Wall Street Journal). It subsequently answers simple questions about this domain, for instance ``What was offered for Polaroid?'', or even incomplete questions such as ``Acquisitions by Shamrock'' [Rau and Jacobs, 1990]. The system contains the following main functions:
The system is connected to an input stream of financial news stories. Leaving aside for the moment the rather trivial processing needed to recreate the story structures, such as headers, bylines and datelines, the first task of the system is to analyze the input stream to decide whether the incoming stories are about corporate mergers and take-overs. It passes each story through a number of sieves, each trying to decide if the story is definitely about the merger/take-over domain, definitely not about this domain, or if there is still doubt left. In the latter case it is passed to the next sieve.
The sieves start with rather coarse filtering on headlines and keywords, becoming more sophisticated, and thus computationally more expensive, later on. This arrangement ensures that the expensive techniques have to be called in only on a subset of the documents. The modular architecture also makes it easy to plug new algorithms in or out, making comparisons between them relatively easy.
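A sketch of such a cascade is given below; the two sieves and their tests are invented stand-ins for SCISOR's actual filters, but the control structure is the point:

```python
# Sketch of a sieve cascade: each sieve answers YES, NO or MAYBE;
# only a MAYBE is passed on to the next, more expensive sieve.
YES, NO, MAYBE = "yes", "no", "maybe"

def headline_sieve(story):
    # cheap keyword spotting on the headline (toy keyword list)
    if any(w in story["headline"].lower() for w in ("merger", "takeover")):
        return YES
    return MAYBE

def body_sieve(story):
    # more expensive: scan the whole text (a real system parses here)
    return YES if "acquired" in story["body"].lower() else NO

SIEVES = [headline_sieve, body_sieve]  # ordered cheap to expensive

def classify(story):
    for sieve in SIEVES:
        verdict = sieve(story)
        if verdict != MAYBE:
            return verdict
    return NO  # still in doubt after the last sieve

story = {"headline": "Revere gets offer",
         "body": "Revere said ... to be acquired for $16 a share."}
print(classify(story))  # yes
```

New sieves can be plugged into the `SIEVES` list without touching the rest, which is the modularity the text refers to.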
The performance of this sieve in terms of precision and recall may be estimated from an example given in [Rau and Jacobs, 1990], where 88.5% recall and 92% precision were obtained. This may seem rather high, but the measure for correctness was the estimate of a single person and a 10% error margin was assumed.
Figure 3.3: Analysis of the input sentence ``Revere said it had received an offer from an investment group to be acquired for $16 a share, or about $127 million.'' The original figure showed four stages: partial (bottom-up) analysis, partial (bottom-up) semantic analysis, conceptual expectations (top-down), and the final semantic analysis.
The next step in the processing is the application of natural language analysis to the stories thus selected. This analysis consists of an integration of both bottom-up linguistic parsing and top-down conceptual analysis. The bottom-up parsing module, TRUMP, identifies linguistic structures and tries to map these to a conceptual framework; the top-down analysis module, TRUMPET, tries to fit partial information from the text into conceptual expectations (see figure 3.3). The input sentence is ``Revere said it had received an offer from an investment group to be acquired for $16 a share, or about $127 million.'' As the words are already in a lexicon, TRUMP understands all words, but cannot complete its conclusions by bottom-up parsing alone. For example, the phrase starting with ``to be acquired'' might be attached to ``an investment group'' or ``an offer'', but in this example ``Revere'' is the subject of the phrase. The knowledge in TRUMPET is such that the offerer must be the same as the acquirer and that the acquirer must be different from the acquiree. This implies that ``Revere'' is the acquiree and therefore the target of the take-over.
The conceptual representation of the story that is created in this way is stored as a network of unique instances, i.e. as individual members of conceptual categories in a knowledge base. These instances serve as indices for information retrieval.
Figure 3.4: Processing of the question ``How much was Bruck Plastics sold for?''
Answering questions takes the form of reporting on slots. Consider the question ``How much was Bruck Plastics sold for?'' (figure 3.4). The question is also passed through TRUMP and TRUMPET to obtain a representation that is compatible with the knowledge base. The processing is done in two stages: first a rough comparison is made with the features of stored representations; the second stage consists of a more careful match of the relationships that are asked for or implicit in the question. In the figure, the category selling-2 causes the system to pass through merchandise-transfer and corporate-takeover and to find a match for the target ``Bruck Plastics'' in the suitor ``M.A. Hanna''. It then finds that the slot terms is filled with the slot-filler `undisclosed'. This knowledge in its turn is passed to KING (the Knowledge INtensive Generator, not shown in the figure), which generates English responses.
Nothing new about SCISOR has been published after [Rau and Jacobs, 1990], and the authors have since been involved in other occupations. It must be assumed that the system did not survive the paradigm shift from rule- and frame-based to statistical and quantitative methods.
Although the German TOPIC has the same name as the commercial descendant of RUBRIC, mentioned below, it is an entirely different system. It belongs to those systems that create a knowledge representation in the same vein as SCISOR and many others. However, unlike those systems, the German TOPIC seems to have survived the paradigm shift mentioned before, as work on this system still continues at the University of Konstanz [Hahn and Reimer, 1988].
TOPIC is presented as a text condensation system. Text parsing augments an initially given frame knowledge base that describes the domain of discourse by adding text-specific knowledge. The extraction of knowledge is driven by script-like structures and controlled by so-called word experts, which apply grammatical constraints to the matching of text items to frames and connected structures. Word experts are lexicalized grammatical modules which do the actual job of mapping text items onto the knowledge representation structures. These word experts also resolve anaphors and other referential structures.
The frames are connected to each other by semantic relations. The central feature, however, that defines the use of TOPIC is that the system maintains counts of the references to these structures. This happens not, as in keyword-based systems, by considering occurrences of word tokens, but by every reference to the frames, slots or slot-fillers that constitute the knowledge representation of a database. Such references cause the corresponding counters, and by inheritance the counters of the higher structures, to be incremented, and so create a basis for judging the importance of these structures.
The combination of this hierarchical knowledge structure and the 'activation weights' assigned to the various structures and substructures is a powerful tool for text summarization and the determination of dominant concepts in the text.
A major measure for identifying an important concept in a text is the frequency of its explicit and implicit mention in the text. The activation weights that are attached to the structures in the text representation are rather independent of linguistic surface phenomena and can therefore be used directly as relevance indicators. Since these weights are adjusted by both the explicit and the implicit occurrence of a concept in the text (e.g. by resolving anaphors), they are better indicators of the importance of the concept than plain word occurrence (for the relative importance of anaphora in the weighting of keywords see [Bonzi and Liddy, 1989]).
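The counting mechanism can be sketched with a toy frame hierarchy; the frames and the parent relation below are invented for illustration:

```python
# Sketch of TOPIC-style activation counting: a reference to a frame
# increments its own counter and, by inheritance, the counters of
# all of its superordinate frames.
PARENT = {"spark_plug": "motor", "motor": "car", "car": None}
counts = {frame: 0 for frame in PARENT}

def activate(frame):
    while frame is not None:
        counts[frame] += 1
        frame = PARENT[frame]

# Three references (explicit mentions or resolved anaphors) ...
for mention in ["spark_plug", "motor", "spark_plug"]:
    activate(mention)

# ... make the more general concepts the most highly activated ones.
print(counts)  # {'spark_plug': 2, 'motor': 3, 'car': 3}
```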
The actual text condensation is done in two steps. In the first one, the dominant concepts are identified; in the second step those concepts are recombined to form the topic description of a thematically coherent part of the text.
A slot-filler is considered dominant if any of these conditions is true:
A further measure of importance investigated by Hahn and Reimer is the role of connectivity patterns, based on a generalized hierarchy of frames.
A number of active frames with a common superordinate frame may constitute a cluster of frames. This superordinate frame is called the cluster frame, but it does not have to be active or even be mentioned explicitly in the text. Cluster frames are detected by recursively searching downwards from the most general concepts as long as no significant loss of active concepts occurs (according to an empirically chosen threshold, or until the summed activation weight of the frames drops below a certain level).
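The downward search can be sketched as follows; the hierarchy, the activation weights and the 80% share criterion are invented for illustration:

```python
# Sketch of cluster-frame detection: descend from the most general
# frame as long as a single child still covers (almost) all of the
# activation below; stop where significant loss would occur.
CHILDREN = {"vehicle": ["car", "bicycle"], "car": [], "bicycle": []}
ACTIVATION = {"vehicle": 0.0, "car": 4.0, "bicycle": 3.0}

def summed_activation(frame):
    return ACTIVATION[frame] + sum(summed_activation(c)
                                   for c in CHILDREN[frame])

def cluster_frame(frame, total, min_share=0.8):
    for child in CHILDREN[frame]:
        if summed_activation(child) >= min_share * total:
            return cluster_frame(child, total, min_share)
    return frame

total = summed_activation("vehicle")
# "vehicle" is the cluster frame although it is not active itself.
print(cluster_frame("vehicle", total))  # 'vehicle'
```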
The dominance measures result in a collection of formally unconnected concepts, which may be represented as linear graphs. Complete descriptions of topics are arrived at by checking for overlapping nodes of the same type, but occurring in different descriptions, adding links where possible. The result is a text graph, which allows flexible, content-oriented access to full-text information.
The system TOPIC by Verity Inc. is the commercial offshoot of the rather well-publicised experimental RUBRIC system [McCune et al., 1985], [Appelbaum and Tong, 1988]. Although TOPIC is a complete system, with indexing modules, a retrieval engine and a user interface for interactive querying, we will limit ourselves to the document representations and related issues, and we will refer to the system by its original name, RUBRIC, the more so because after commercialization nothing much of interest has been published about the system.
TOPIC approaches the problem of document retrieval in two stages. In the first stage an inverted file is created of all strings occurring in the document. Positional information about paragraphs or particular segments of the documents is preserved in this inverted file. Together with Boolean and proximity operators and the ability to recognise fields in the document, this puts RUBRIC alongside systems such as STAIRS/VS, which also enable Boolean retrieval on strings in full-text documents. The document representation that consists of the set of words occurring in the document, and that is stored in the inverted file, acts as a primary access mechanism. At retrieval time the original document is consulted to obtain information on the proximity of the words. Thus it might be said that the document representation so far consists of terms from the complete text of the documents, which makes it a typical derived index.
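This first stage can be sketched as a positional inverted file; the whitespace tokenization and the NEAR operator below are simplifying assumptions:

```python
# Sketch of a positional inverted file: for every string we record
# the document and position of each occurrence, so that proximity
# operators can be evaluated at retrieval time.
from collections import defaultdict

index = defaultdict(list)  # term -> [(doc_id, position), ...]

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].append((doc_id, pos))

def near(term1, term2, window=3):
    # proximity: both terms in the same document, within `window` words
    return {d1 for d1, p1 in index[term1]
               for d2, p2 in index[term2]
               if d1 == d2 and abs(p1 - p2) <= window}

add_document(1, "general motors acceptance corp reported earnings")
print(near("general", "acceptance"))  # {1}
```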
However, the second stage that is grafted on top of this retrieval engine, is a knowledge representation tool that may be thought of as essentially a weighted thesaurus, but that is implemented as a rulebase. More importantly, this rulebase is filled by the user and thus is a system for expressing personal preferences about documents, rather than an ``expert'' on specific topics [Appelbaum and Tong, 1988]. Concepts are arranged in trees, or rather acyclic graphs, in which the strings that should occur in the documents are the leaves. The occurrence of such strings, using Boolean and/or proximity operators, is taken as weighted proof for the relevance of the higher concept and such concepts in their turn support other concepts.
To do this, RUBRIC uses two referential rules: EVIDENCE and IMPLIES. So for example:
EVIDENCE moscow ((*OR* ``MOSCOW'', ``KREMLIN'') 0.6)
where ``MOSCOW'' and ``KREMLIN'' are text strings, and the number 0.6 is the degree of belief to be assigned to the concept moscow if either of the two strings is found in the document. If neither string is present, a zero degree of belief is assigned. The IMPLIES rule works slightly differently:
IMPLIES Jeltsin(moscow 0.3)
where 0.3 is the degree of belief that the user wishes to assign to a document that describes moscow, when he is in fact interested in documents about the Russian president with that name.
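The effect of the two rules can be sketched as follows; combining degrees of belief by taking the maximum is a simplification of RUBRIC's actual calculus:

```python
# Sketch of evaluating EVIDENCE and IMPLIES rules on a document.
def moscow(text):
    # EVIDENCE moscow ((*OR* "MOSCOW", "KREMLIN") 0.6)
    return 0.6 if ("MOSCOW" in text or "KREMLIN" in text) else 0.0

def jeltsin(text):
    # IMPLIES Jeltsin (moscow 0.3): belief in moscow lends weight
    # 0.3 to the concept Jeltsin.
    direct = 1.0 if "JELTSIN" in text else 0.0
    implied = 0.3 if moscow(text) > 0 else 0.0
    return max(direct, implied)  # simplified combination function

text = "THE KREMLIN ANNOUNCED NEW BUDGET FIGURES"
print(moscow(text), jeltsin(text))  # 0.6 0.3
```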
In a typical RUBRIC application many such concepts (topics) will be built in advance by an information specialist, thus effectively adding a knowledge base to the system. Subsequently the user can build and add topics of his own, and those topics may or may not be accessible to other users.
```
GENERAL-MOTORS
  * 1.00 GM-COMPANIES
    ** 0.50 GENERAL-MOTORS-ACCEPTA-PHRS
       *** "general"
       *** "motors"
       *** "acceptance"
       *** "corp"
    ** 0.50 "gmac"
    ** 0.50 "hughes aircraft co"
  * 0.50 GM-PEOPLE
    ** 1.00 GM-EX-CEO
       *** 1.00 "roger smith"
    ** 1.00 GM-PRES
       *** 1.00 LLOYD-REUSS
            **** "lloyd"
            **** "reuss"
    ** 1.00 GM-CEO
       *** 1.00 ROBERT-STEMPEL
  * 0.50 GM-PRODUCTS
    ** 0.50 "pontiac"
    ** 0.50 "oldsmobile"
    ** 0.50 "buick"
  ...
```
This second layer may be considered part of the document representation too. Indeed, if a topic is built and documents are recognized by the rules in that topic, these documents are added to a list with postings for that topic. Thus it is relatively easy to extract the sets of topics that may be said to belong to a document (i.e. score above a threshold for that document).
The difference between the topics of RUBRIC and the entries of an orthodox classification system or a thesaurus is that the topics here are ultimately defined as properties of documents instead of in semantic terms. This gives the system great flexibility, but also ample opportunity for snap decisions, ad hoc constructs and heuristics that may work fine in small collections, but may break down when applied to large databases. A possible reason for this breakdown is that in large databases different sub-populations of documents will come into existence that all cover more or less the same subject, but approach it from widely different angles and (therefore) will use different vocabularies. Experiments [Gey and Chan, 1988] in which the performance of RUBRIC and the Vector Space Model were compared showed that the Vector Space Model, using the cosine similarity measure, yielded results comparable to RUBRIC, with a slight edge for RUBRIC on marginally relevant documents.
An extensive summing up of the limitations of the commercial offshoot of RUBRIC, named TOPIC, is given in [Information Dimensions Inc., 1990]. One should keep in mind that this report was written by an unsuccessful competitor for the library system of Tilburg University.
As we already indicated, AI approaches using scripts, frames or rule bases floundered on the scaling problem, and mainstream IR returned to statistical and quantitative models and methods. Neural networks and genetic algorithms are numerical methods in the sense that they operate on quantities of simple data, rather than by recognition and reconstruction of symbolic representations.
A survey of Information Retrieval would not be complete without a mention of the attempts to use connectionist approaches for the problems of aboutness and document matching. An experiment by Belew [Belew, 1989] shows how such a connectionist approach can be applied to Information Retrieval.
In this system, called AIR, ways are explored to improve retrieval performance by changing the document representation, using relevance feedback from the users of the system. It operates on a database of bibliographic citations; each document is represented by its title, its author(s) and a number of keywords or descriptors. In the experiment described here, the keywords are taken from the title.
To start with, a representation of the information in the database is built by creating nodes for all documents. These nodes are connected with the nodes for the authors (one for every author) and the nodes for the keywords (one for every keyword); for every connection there are two links, one in each direction. The links are initially weighted according to an inverse frequency weighting scheme. The sum of all the weights departing from a node is forced to be constant.
Figure 3.6 shows a very simple network with four keywords, three documents and two authors.
When an initial query is put to the system, all nodes that correspond to that query are activated, and this activity is allowed to propagate through the network. The answer of the system is ranked according to the final activation of the nodes and presented to the user.
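A single propagation step can be sketched as follows; the toy network, its weights and the one-step propagation are simplifications of the real system, which also normalizes weights and lets activity spread further:

```python
# Sketch of query answering by spreading activation: activate the
# query's keyword nodes and let the activity flow over the weighted
# links; documents are ranked by the activation they receive.
LINKS = {  # node -> [(neighbour, link weight), ...]; toy network
    "kw:neural":    [("doc1", 0.7), ("doc2", 0.5)],
    "kw:retrieval": [("doc2", 0.4), ("doc3", 0.4)],
}

def query(keywords):
    activation = {}
    for kw in keywords:
        for node, weight in LINKS.get("kw:" + kw, []):
            activation[node] = activation.get(node, 0.0) + weight
    # rank the answers by their final activation
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

print(query(["neural", "retrieval"]))
# doc2 ranks first (0.9), then doc1 (0.7), then doc3 (0.4)
```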
Subsequent queries in the same session are handled quite differently. The user has to indicate which features (nodes) from the ranked output he judges relevant and which not, on a scale of ++, +, -, --. Not all features have to be commented upon. The system creates a new query based on this feedback, strengthening or weakening the links according to this scale, and so is effectively trained by its users to recognize associations that are useful for retrieval.
If a query contains a new term, i.e. one for which no node exists, the query is first handled without that term; subsequently (after the user's response) a new node is created for that term and connected to the network.
The net result of all this is that the network will evolve towards a consensus of its users about which keywords and documents belong together. This 'democratic' view of the aboutness of documents contrasts with the omniscient notion of aboutness that is present in almost all other IR systems. That is: the relevance of a document with respect to a query in orthodox IR systems tends to be absolute, as if determined by an omniscient indexer.
In this chapter we have given an overview of how automation became an important tool for Information Retrieval. In the beginning, computers could only perform elementary clerical tasks, but soon attempts were made to extend the assistance of the computer to more complicated areas, notably those to do with natural language processing and even the "understanding" of texts. This culminated in the late eighties in a number of AI-based systems, of which we presented some salient examples.
In the meantime, experiments with frequency-based processing went on. In the late sixties and seventies of the twentieth century most principles were already established by prominent scholars and scientists like Edmundson, Salton, van Rijsbergen, and Sparck Jones, to mention but a few, but progress was hampered by the absence of sizeable quantities of machine-readable text. When the micro-computer became a common appliance, and the availability of electronic storage increased correspondingly, critical mass was reached at the beginning of the last decade, and quantitative and statistical text processing soared. In the next chapter we will present various principles and models for the creation of document representations as vectors, and in the chapter after that, the similarity functions that can be applied to compare those vectors.