Communications of the ACM

Evaluating the Efficacy of a Terrorism Question/Answer System

Terrorism education has become a topic of growing interest of late. With many agencies and private organizations scrambling to provide information, the actual process of finding relevant material can sometimes become lost in the chaos. To the credit of the U.S. Department of Homeland Security and several private organizations, there is some valuable information available; however, it is mainly geared toward first responders and not the general public.

In both the 9/11 Commission Report and Making the Nation Safer, the authors propose to bridge this gap through the use of systems embodying command, control, and communications (C3) elements. These systems would allow for the deployment of communications channels during an emergency to support decision management as well as communicate instructions to the public during emergency situations [3].

One potential approach to C3 is through the use of Artificial Linguistic Internet Chat Entity robots, or ALICEbots. The advantage of these Question Answer (QA) chatterbots, developed in 1995 by Richard Wallace, is their ability to be quickly programmed with terrorism-specific knowledge, as well as their robust and scalable nature. ALICEbots are built first and foremost for conversation and are a promising vehicle in disseminating terrorism-related information to the public.

ALICEbots and QA Systems

ALICEbots work by matching user input against preexisting XML-based input patterns and returning the template response. This simple method of conversational mimicking eliminates the computational overhead that would normally be associated with deeper reasoning systems. The technique can also permit expansion into new knowledge domains, allowing the ALICEbot to convey an expert appearance [10]. An example knowledge base entry is as follows:

<pattern>WHAT IS AL QAIDA</pattern>
<template>'The Base.' An international terrorist group founded in approximately 1989 and dedicated to opposing non-Islamic governments with force and violence.</template>
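
The pattern-to-template lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Program D implementation: it performs exact matching on normalized input, whereas real AIML patterns may also contain wildcards. The names `normalize`, `respond`, and `DEFAULT_RESPONSE` are hypothetical.

```python
import re

# Toy knowledge base: normalized input pattern -> template response.
# The single entry mirrors the example entry above; a real set holds thousands.
KNOWLEDGE_BASE = {
    "WHAT IS AL QAIDA": (
        "'The Base.' An international terrorist group founded in "
        "approximately 1989 and dedicated to opposing non-Islamic "
        "governments with force and violence."
    ),
}

# Hypothetical fallback used when no pattern matches.
DEFAULT_RESPONSE = "I do not have an answer for that."


def normalize(text: str) -> str:
    """Uppercase the input, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text.upper())
    return " ".join(cleaned.split())


def respond(user_input: str) -> str:
    """Return the template whose pattern exactly matches the normalized input."""
    return KNOWLEDGE_BASE.get(normalize(user_input), DEFAULT_RESPONSE)
```

Because the lookup is a single dictionary access, the cost per query stays small regardless of how the answer was authored, which is the sense in which shallow matching avoids the overhead of deeper reasoning.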

In studying the ALICEbot system, it was found that because of its architecture, ALICE cannot properly answer all of the queries given to it. This is the inevitable outcome of using a shallow method of conversational parsing. As simplistic as it may be, this system has proven itself against deeper algorithmic systems by winning the Loebner Prize in 2000, 2001, and 2004 for most human-like conversational partner.

QA systems fall into two different categories—domain dependent and domain independent systems. Domain independent systems do not try to understand text; instead they focus on retrieving a snippet of text from the source [8]. An example system is MURAX, which uses external encyclopedic information to answer queries.

Domain dependent systems rely on a specially crafted knowledge base, and are composed of two subcategories: the traditional, or narrow, domain, and open domain [4, 8]. In the traditional domain, systems attempt conversational fluency in limited domains of expertise. Example systems include Winograd's SHRDLU, which could answer natural language queries about a fictitious Block World [11]; STUDENT, which could solve algebraic word problems [11]; and LUNAR, which could answer queries on lunar rock data [12]. Open domain systems are those that still rely on internal knowledge bases, but have a more diverse repertoire of topics. These systems can be further classified into two major camps—information retrieval and information extraction [4, 8].

The goal of information extraction is to extract the "who did what to whom" and "where" type of information from text and fill in predefined templates with the extracted data. This field is serviced by the Message Understanding Conferences (MUCs).

There are two basic classes of information retrieval: document-based and sentence-based retrieval systems [9]. The goal of document-based systems is to return a set of relevant documents to the user, which is one of the tenets of the Text REtrieval Conferences (TREC) [4]. Sentence-based retrieval systems return a small snippet of text to the user, similar to the domain independent systems, except that sentence-based IR uses internal knowledge bases. Crossover systems that utilize both document-based and sentence-based approaches exist, but are mainly confined to the arena of search engine design [6]. ALICEbots fit into the sentence-based category [9] primarily because of their sentence-directed response capability and use of internal knowledge bases.

Two of the most relevant studies on public emergency communications systems and how they fit into the C3 model involve the Citizen Awareness System for Crisis Mitigation (CAMAS) system and the Integrated Community Care (INCA) system. CAMAS utilizes a "humans as sensors" approach where event information is extracted from victims and first responders in a disaster area. That information is then analyzed and the appropriate response personnel are dispatched to priority areas [2]. The downside is that CAMAS does not handle bidirectional information flow back to the public. INCA is a network of health care agents that monitor patients and coordinate emergency services when needed [1]. Like CAMAS, INCA also does not handle bidirectional information flow.

There are several aspects to the C3 model that must be maintained. Emergency communications capability should include channels from emergency personnel to the general public, channels from emergency personnel to specific groups or individuals, and channels of communication from within the disaster area to those affected outside of it. Applying QA systems to this problem addresses many of the information dissemination needs. In particular, the use of ALICEbots can help fill the role of information dissemination and can deal with answering many simultaneous disaster-related inquiries in a tireless and personable manner. Specific disaster-related knowledge can be quickly gathered by scanning news sites or entered manually to meet precise needs. Since most search queries focus on the use of interrogatives and definitional types of answers [7], ALICEbots can capitalize on this fact by focusing attention on reputable definitional sources.

Terrorism Activity Resource Application (TARA)

Our research is aimed at examining the efficacy of shallow QA systems for disseminating terrorism-related information to the general public. We created three modified ALICEbots that differed from each other on the dimension of terrorism knowledge bases used. One chatterbot used only general conversational knowledge, the second used only terrorism domain knowledge, and the third was a combination of both conversation and terrorism knowledge. This leads us to the following research questions:

  • How will users perceive the usefulness of the different chatterbots?
  • How will the different chatterbots perform?
  • What are the most frequent types of interrogatives used in the study?

To answer these questions, we created the Terrorism Activity Resource Application (TARA), which is based on a modified version of the freely available ALICE Program D chatterbot engine.

There are some notable differences between TARA and ALICE, as illustrated in Figure 1. The only component that remained unaltered was the actual Chat Engine.

As mentioned earlier, we used three chatterbots with differing knowledge bases. The control chatterbot, "Dialog," the general conversationalist, was loaded with the Standard and Wallace knowledge set that allowed ALICE to win the early Loebner contests. This set consists of 41,873 knowledge base entries. The second chatterbot, "Domain," was loaded with 10,491 terrorism-related entries. The third chatterbot, "Both," combined Dialog and Domain, so it can carry on a general conversation and handle terrorism inquiries as well. It contains 52,354 entries, 10 fewer than a true summation because of an overlap between the dialog and domain knowledge bases.
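
The entry counts follow from treating each knowledge base as a map from input pattern to template, so that a pattern present in both sets is stored only once in the combined set. The sketch below illustrates this with toy data. The function name `merge_knowledge_bases` and the placeholder entries are hypothetical, and the choice that the domain entry wins on overlap is an assumption; the article does not say which entry was kept.

```python
def merge_knowledge_bases(dialog: dict, domain: dict) -> dict:
    """Combine two pattern -> template maps; a shared pattern is stored once."""
    merged = dict(dialog)
    merged.update(domain)  # assumption: the domain entry replaces a dialog duplicate
    return merged


# Toy data with one overlapping pattern (placeholder templates, not real entries).
dialog = {"HELLO": "dialog template 1", "WHAT IS TERRORISM": "dialog template 2"}
domain = {"WHAT IS TERRORISM": "domain template 1", "WHAT IS ANTHRAX": "domain template 2"}

merged = merge_knowledge_bases(dialog, domain)
overlap = len(dialog.keys() & domain.keys())

# Size of the union = sum of the sizes minus the overlap.
assert len(merged) == len(dialog) + len(domain) - overlap

# The same identity applied to the counts reported above.
assert 41_873 + 10_491 - 10 == 52_354
```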

Terrorism entries were collected through a mixture of automatic and manual means. The majority were gathered automatically from several reputable Web sites. Manual entry was used sparingly to augment the terrorism knowledge set.

For our research we used 90 participants, 30 for each chatterbot. They represented a mix of undergraduate and graduate students taking various MIS classes. Participants were randomly assigned to one of the chatterbots, were asked to interact with the system for approximately 30 minutes, and were permitted to talk about any terrorism-related topic. They were further given an incentive to perform through the random awarding of gift cards for use at popular local businesses.

The evaluation of chatterbot responses was an integrated process in which users would enter a line of chat and then immediately evaluate the chatterbot's response. Users were asked to evaluate each line with the following two measures: appropriateness of response (Yes/No), and satisfaction level of the response using a Likert scale of values (1–7). Users were also given the opportunity to provide open-ended comments on a line-by-line basis. Figure 2 shows the evaluation interface.
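
One way to picture the data this procedure produces is as one record per chat line, carrying the two measures and the optional comment. The following sketch is an assumption about a convenient representation, not the schema TARA actually used; `LineEvaluation` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class LineEvaluation:
    """One user input, the chatterbot's reply, and the user's ratings of it."""
    user_input: str
    bot_response: str
    appropriate: bool   # the Yes/No appropriateness measure
    satisfaction: int   # Likert value, 1 (lowest) to 7 (highest)
    comment: str = ""   # optional open-ended comment

    def __post_init__(self):
        if not 1 <= self.satisfaction <= 7:
            raise ValueError("satisfaction must be between 1 and 7")


def summarize(evaluations):
    """Return (percent of lines rated appropriate, mean satisfaction)."""
    if not evaluations:
        raise ValueError("no evaluations to summarize")
    percent = 100.0 * sum(e.appropriate for e in evaluations) / len(evaluations)
    return percent, mean(e.satisfaction for e in evaluations)
```

Aggregating such records per chatterbot yields the two kinds of figures reported in the results: a response appropriateness percentage and a response satisfaction score.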

Participants were then given a survey at the end of the study that was aimed at extracting user impressions about the system and its potential impact. The survey asked the following questions:

  • Do you feel comfortable using this system to find terrorism information?
  • Would you use it to find terrorism information?
  • Would you recommend it to a friend who wanted to find terrorism information?

These questions were asked in Yes/No format, with space for open-ended comments after each question.

Testing TARA

Users appeared to prefer a natural flow of conversation instead of a definitional approach to knowledge dissemination. For the first experimental design question (how will users perceive the usefulness of the different chatterbots), we studied the results from the end-of-study survey questions that dealt specifically with chatterbot responses. We had expected that the chatterbot with both conversation and terrorism knowledge bases would rate the highest in all three categories: user comfort level, system usability, and recommendation potential. As it turned out, the Both chatterbot failed to rate highest in the "user comfort" category (31.0%), beaten unexpectedly by the terrorism-only Domain chatterbot (33.3%) at a p-value < 0.05. Table 1 summarizes the results.

From the examination of user comments, we quickly discovered the cause of this discrepancy. The Domain chatterbot returned only terrorism-related definitions to its participants. Therefore, participants of the Domain chatterbot came to accept these definitions as normal. For users of the Both chatterbot, the dialog knowledge set returned responses in conversational form and the terrorism domain set returned responses in definitional form, a mix that participants found to be in conflict. It was further found that participants of the Both chatterbot were uncomfortable with the idea of having the domain-specific knowledge returned to them in definitional style. Users mentioned they would have preferred the responses in a conversational context. One user in particular mentioned the usefulness of the system's conversational aspects: "Some comments are very appropriate and most of them make sense even when they're off topic."

The Both chatterbot performed better than Dialog and Domain. For the second question—how will the different chatterbots perform—measurements were conducted on the appropriateness and satisfaction rating of the chatterbot responses. Because the Both chatterbot is composed of dialog and domain parts, we took the Both chatterbot and broke its responses into its constituent parts of dialog and domain. We then compared those results against the actual Dialog and Domain chatterbots. This comparison is shown in Table 2.

When comparing the dialog component of Both against the actual Dialog chatterbot, the Both component rated higher in response appropriateness, 68.4% to 66.3%, as per our expectation. When looking at the domain component of Both against the Domain chatterbot, again the Both component rated higher, 39.6% compared to 21.6%. Likewise, response satisfaction scores from the Both chatterbot rated higher than those of the corresponding "Actual" chatterbots. This analysis shows the Both chatterbot performed better in its constituent areas compared against the standalone chatterbots. We believe this is the result of the dialog portion responding to unrecognized queries and steering communication back to terrorism topics.
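
The component breakdown can be sketched as a grouping step. The sketch assumes each logged response of the Both chatterbot is tagged with the knowledge base its matching entry came from; the article does not describe how that attribution was recorded, so the tags and the function name `appropriateness_by_component` are hypothetical.

```python
from collections import defaultdict


def appropriateness_by_component(logged):
    """Given (component, appropriate) pairs, return {component: percent appropriate}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for component, appropriate in logged:
        totals[component] += 1
        hits[component] += int(appropriate)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Made-up log entries for illustration only; these are not study data.
sample_log = [
    ("dialog", True),
    ("dialog", False),
    ("domain", True),
    ("domain", True),
]
rates = appropriateness_by_component(sample_log)
```

Each per-component rate can then be set beside the rate of the corresponding standalone chatterbot, which is the comparison Table 2 reports.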

The 'wh*' interrogatives appear to be a good place to focus future knowledge acquisition activities. For the third question (what are the most frequent types of interrogatives used in the study), we investigated the input/response pairs of the Both chatterbot. In particular, we were interested in only those user inputs that were in the form of a question; 68.4% of the terrorism domain inputs were interrogatives. Table 3 summarizes the most frequently observed interrogatives.

Investigating the interrogatives further, it was found that the interrogative "what" started the most user queries at 27.5% of all queries. We had expected that interrogatives beginning with "wh*" would be the most prevalent, and indeed they were, making up 51.5% of all interrogatives. It is interesting to note how often "Do" and "Is" were used, as these were unexpected. In the vein of work done by Moore and Gibbs [3], where students used the chatterbot as a search engine, focusing future efforts of knowledge collection at these selected interrogatives should best improve chatterbot accuracy.
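
A tally of this kind can be approximated by counting the first word of each query. The sketch below is a simplified illustration, not the procedure used in the study: it assumes the interrogative is always the leading word, and the function names and sample queries are made up.

```python
from collections import Counter


def leading_word_shares(queries):
    """Return {first word: percent of queries that start with it}."""
    first_words = [q.split()[0].lower().strip("?,.!") for q in queries if q.split()]
    counts = Counter(first_words)
    total = len(first_words)
    return {word: 100.0 * n / total for word, n in counts.items()}


def wh_share(queries):
    """Return the percent of queries whose first word begins with 'wh'."""
    shares = leading_word_shares(queries)
    return sum(pct for word, pct in shares.items() if word.startswith("wh"))


# Made-up queries for illustration only.
sample_queries = [
    "What is anthrax?",
    "Who funds them?",
    "Do you know about smallpox?",
    "Is it contagious?",
]
shares = leading_word_shares(sample_queries)
```

Sorting the resulting shares in descending order gives a ranking of the form shown in Table 3, and the words at the top of that ranking indicate which input patterns new knowledge base entries should be keyed on.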


We discovered that users appeared to prefer a natural flow of conversation instead of a definitional approach to knowledge dissemination. This means that terrorism-specific knowledge bases must be adjusted, and that terrorism sources must be screened more carefully before they are incorporated into the knowledge bases. It was also found that the Both chatterbot, with dialog and domain knowledge bases, performed better than its stand-alone counterparts. This appears to be the result of the dialog and domain knowledge working together to provide more appropriate responses than either provides alone. Lastly, consistent with Voorhees [7], interrogatives are a major source of user inquiries. The "wh*" interrogatives, and "what" especially, appear to be a good place to focus future knowledge acquisition activities.

Since most users appear to use ALICEbots as specialized search engines [3], it would make sense to approach ALICEbot input in the same terms. Following up on the search engine context, Voorhees [7] noted that search terms are predominately definitional in nature. This would imply that the best method of acquiring knowledge for the ALICEbot would be to obtain definitional responses keyed on interrogative input.

Finally, as discussed earlier, ALICEbots already meet many of the challenges required of a C3 system. ALICEbots were built first and foremost for conversation and they can leverage that ability in specialized knowledge domains.

In the future, it would be a good idea to implement a spell-checking component on user inquiries. This would eliminate 6.6% of the observed bad responses due to misspelled domain terms. It would also be good to investigate adding more knowledge to the system. Although our domain-specific knowledge base appeared to be sufficient for the task, it would be interesting to test even larger corpuses of knowledge and see what impact they may have over dialog knowledge. Another possible aspect worth considering is the addition of a C3 variant: the "I'm Alive" boards. Following the 9/11 attacks, multiple boards sprang up around New York City announcing the names and present shelter location of survivors to concerned friends and family members outside of the disaster area. Adding such functionality would be a trivial programming exercise, and would provide a quicker and more concentrated way for bidirectional communications.
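
As a rough illustration of the proposed spell-checking component, misspelled domain terms could be mapped to their closest known term before pattern matching. The sketch uses Python's standard `difflib` module; the vocabulary, the `cutoff` value of 0.8, and the function name `correct_terms` are assumptions made for the example and were not part of TARA.

```python
import difflib

# Hypothetical vocabulary of domain terms drawn from the knowledge base.
DOMAIN_TERMS = ["anthrax", "terrorism", "smallpox"]


def correct_terms(user_input, vocabulary=DOMAIN_TERMS, cutoff=0.8):
    """Replace each word with its closest vocabulary term when one is similar enough."""
    corrected = []
    for word in user_input.lower().split():
        # get_close_matches returns up to n candidates scoring at least cutoff.
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```

For instance, `correct_terms("what is terorism")` yields `"what is terrorism"`, which could then be passed to the chat engine in place of the raw input. A lower cutoff would catch more misspellings at the risk of rewriting ordinary words into domain terms.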


1. Beer, M.D. et al. Deploying an agent-based architecture for the management of community care. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems. (Melbourne, Australia, 2003).

2. Light, M. and Maybury, M.T. Personalized multimedia information access. Communications of the ACM 45, 5 (May 2002), 54–59.

3. Moore, R. and Gibbs, G. Emile: Using a chatbot conversation to enhance the learning of social theory. Univ. of Huddersfield, Huddersfield, U.K., 2002.

4. Pasca, M. and Harabagiu, S.M. High performance question/answering. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (New Orleans, LA, 2001). ACM Press, NY.

5. Potter, S. A survey of knowledge acquisition from natural language. TMA of Knowledge Acquisition from Natural Language. Edinburgh, Scotland, 2003.

6. Radev, D. et al. Probabilistic question answering on the Web. J. American Society for Information Science and Technology 56, 6 (2005), 571–583.

7. Voorhees, E.M. Overview of the TREC 2001 question answering track. In Proceedings of Text REtrieval Conference (2001).

8. Voorhees, E.M. and Tice, D.M. Building a Question Answering Test Collection. 2000, 200–207.

9. Vrajitoru, D. Evolutionary sentence building for chatterbots. In Proceedings of the Genetic and Evolutionary Computation Conference (Chicago, IL, 2003).

10. Wallace, R.S. The anatomy of A.L.I.C.E. A.L.I.C.E. Artificial Intelligence Foundation, 2004.

11. Winograd, T. Five lectures on artificial intelligence. Fundamental Studies in Computer Science. A. Zampolli, Ed. North Holland, 1977, 399–520.

12. Woods, W.A. Lunar rocks in natural English: Explorations in natural language question answering. Fundamental Studies in Computer Science. A. Zampolli, Ed. North Holland, 1977, 521–569.


Robert P. Schumaker is an assistant professor in the Information Systems Department at Iona College, New Rochelle, NY.

Ying Liu is an assistant professor in the Information Systems Department at California State University, Long Beach.

Mark Ginsburg is a former professor in the Department of Management Information Systems, The University of Arizona, Tucson.

Hsinchun Chen is McClelland Endowed Professor in the Department of Management Information Systems, The University of Arizona, Tucson.


Figure 1. Differences between original ALICE Program D and the TARA chatterbot.

Figure 2. The evaluation interface.


Table 1. End Survey analysis of the three chatterbots.

Table 2. Comparing the components of "Both" against the Dialog and Domain chatterbots.

Table 3. Results of the most frequently observed interrogatives.

©2007 ACM  0001-0782/07/0700  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2007 ACM, Inc.
