Sign In

Communications of the ACM

Communications of the ACM

Mining Online Text

Natural-language applications, such as machine translation, speech recognition, information retrieval, and summarization, are reaching a broader range of users today. Anyone who has used these products knows how imperfect they are. Still, people use them because they are desperate for ways to organize and sift through the vast amount of information available to them—textually—online. In addition, vast online texts are a potential gold mine of linguistic information for natural language processing (NLP) application builders to improve the quality of their systems.

NLP problems, like most in AI, require large amounts of formally codified knowledge, that is, knowledge about words, parts of speech, grammar, word meanings, phonetics, text structure, and the world itself. For small or circumscribed domains, this knowledge can be typed in manually, but for general-purpose programs (the kind that most users want), the amount of knowledge is overwhelming. The quest for automating or semiautomating the acquisition of NLP knowledge has recently spawned a collection of new techniques that sometimes go under the heading "statistical NLP." The phrase is not a particularly good one, as it seems to exclude approaches that do not use statistical frequency counting or probability theory. But it is brief and commonly used. The type of research described here might more broadly be called "automated or semiautomated knowledge acquisition from linguistic resources."

Let's begin with a straightforward example. Languages like Japanese, Chinese, and Russian lack the small words we call articles (a, an, the). When we translate these languages into English, we must insert articles at the appropriate points, or the translation will be awkward. Interestingly, this problem can be viewed as one of English composition rather than translation. Indeed, if I were to simply remove all the articles on this page, a native English speaker would be able to be re-insert them with very high accuracy. This has been experimentally confirmed. Now, can we build a computer program to do the same thing?

No one has yet been able to extract even somewhat accurate syntactic parses from raw text databases.

We might try to hand code a small number of effective rules, but this approach is doomed. (Ask any non-native English speaker!) People only learn to use English articles correctly through extensive experience in reading, listening, and speaking. A different idea is to expose a computer to a tremendous amount of online text so that it, too, can learn from experience. Any online text resource, like a daily newspaper, can be viewed as a database of article usage. At each position in the text, the database tells you whether to use an article, and which one to use. The bulk of the database can be used for training an article selector; the rest can be used to measure the selector's performance.

To exploit this database, a designer must specify which features are relevant to the article selection task. It is possible to automatically collect a large set of binary features such as "Is the next word 'some'?" or "Am I at the beginning of a noun phrase whose final word is 'year'?" These features can be used to train a decision tree to make a decision between every pair of words (either insert "the," insert "a," insert "an," or do nothing). Performance is reasonably good when the system is exposed to 20 million words of English text [3]. Because articles are rarely enunciated clearly, this sort of program has applications for speech recognition as well as translation.

Consider another NLP task, that of determining the meaning of a word in context. Most words are ambiguous; for example, the word "duty" may either be a type of obligation or a type of tax, depending on how it is used in a sentence. If we were to sit down and try to write rules to disambiguate "duty," we would eventually want to see a lot of examples of how it is used by real authors. Again, a computer can build its own decision structure by examining such examples autonomously. Unlike in the article-selection case, however, raw text does not already contain the correct answer when it comes to word sense.

When faced with this sort of situation, NLP researchers often decide to create a database by hand. They annotate text manually; then, data mining proceeds as usual. Annotation can be very labor-intensive, but at least it can be done by non-specialists, and the results can subsequently be distributed widely throughout the community. In this case, we can take 1,000 uses of the word "duty" in various contexts, and label each as Class A (tax) or Class B (obligation). Features can be things like "Does the word "customs" appear within a 15-word window?" For dual-sense nouns like these, it is possible to obtain high accuracies in the range of 96% correct [5].

It would be better if we could mine raw, unannotated text directly. Surprisingly, it is indeed possible to achieve equally high accuracy in word-sense disambiguation without the need for large-scale annotation. It is possible to bootstrap from a very small set of labeled examples [6].

Another critical NLP task is that of determining the syntactic structure of a sentence, also known as parsing. Correct parsing of unrestricted text is an exceedingly difficult task, due to ambiguities in part of speech (noun, verb, and so on) and structure. The famous sentence "time flies like an arrow" is parseable in many different ways. In a practical machine translation system, is not feasible to present all of these analyses to a user; the machine has to pick one interpretation and go with it. One of the earliest linguistic databases was a million-word collection of text from various genres in which words were tagged with information that included parts of speech, for example, "time/N flies/V like/P an/DET arrow/N ..." From even a portion of this database, we see that verbs never follow determiners, among other useful facts. It is possible to train very accurate part-of-speech taggers from manually tagged data [2]. Once a likely sequence of tags has been established, structural parsing can proceed on more sure footing. For parsing, many researchers rely on an extensive database of manually parsed newspaper sentences created at the University of Pennsylvania. This million-word collection is called the Penn Treebank [4].

Is it possible to learn parsing knowledge from raw text databases, perhaps supplemented with a bit of annotated data? For part-of-speech taggers, the results are somewhat positive. But despite the existence of promising learning algorithms, no one has yet been able to extract even somewhat accurate syntactic parses from raw text databases.

I hope this article gives readers a feel for the conditions in which statistical NLP researchers work. When faced with a problem, they ask: "Is there a database that can help? If not, can we create such a database?" On the flip side, when faced with some naturally occurring database such as online text, they wonder: "What could we use this for?"

As a final example of a naturally occurring database, consider bilingual text, that is, the same text written in two languages, say, English and French. The largest such database in existence is the online proceedings of the Canadian parliamentary debate, which is mandated to be published in both official languages of Canada. Many online technical manuals also exist in several languages. These bilingual texts can be mined to create technical glossaries and even whole translation systems [1].

Back to Top


1. Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (June 1993) 263–312.

2. Church, K. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd Conference Applied Natural Language Processing (Austin, Tex., June 1988). Morgan Kaufmann, San Francisco, 136–143

3. Knight, K., and Chander, I. Automated postediting of documents. In Proceedings of the National Conference on Artificial Intelligence. (Seattle, Wash., July 31–Aug. 4, 1994). AAAI Press, Menlo Park, CA, MIT Press, Cambridge, London.

4. Marcus, M., Santorini, B., and Marcinkiewicz, M. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 2 (June 1993), 313–330.

5. Yarowsky, D. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the Conference of the Association for Computational Linguistics. Las Cruces, NM., June 27–30, 1994). Morgan Kaufmann, San Francisco, 88–95.

6. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the Conference of the Association for Computational Linguistics. (Cambridge, MA, June 26–30, 1995) Morgan Kaufmann, San Francisco, 189–196.

Back to Top


Kevin Knight ( is a senior research scientist at the University of Southern California's Information Sciences Institute in Marina del Rey, Calif.

©1999 ACM  0002-0782/99/1100  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 1999 ACM, Inc.