Communications of the ACM

Testing and Evaluating Computer Intrusion Detection Systems

On May 22, 1996, the U.S. General Accounting Office (GAO) disclosed that approximately 250,000 break-ins into Federal computer systems were attempted over the previous year. At least 10 major agencies, accounting for 98% of the total Federal budget, had been attacked. The GAO went on to say that an estimated 64% of these attacks (about 160,000) were successful. It gets worse: the number of attacks is doubling every year. Based on previous studies, the GAO estimates that only between 1% and 4% of these attacks will be detected and only about 1% will be reported.1

Inventing and refining computer intrusion detection (ID) techniques is an ongoing research problem at the Defense Advanced Research Projects Agency (DARPA), which has sponsored over 20 different projects at various companies and universities. The Air Force Research Laboratory (AFRL), which manages over $10 million worth of these ID programs, recently accepted responsibility for developing a common test architecture and process to evaluate DARPA-sponsored research efforts. AFRL will publicize the results of its ongoing evaluations at semiannual DARPA Intrusion Detection program reviews—to date AFRL has completed one iteration of the test-and-publicize cycle. The results are described here.

Threats and Attack Techniques

One problem facing government computer network managers is that many threats are undefined and open-ended. Worse, software for breaking into computer networks is freely available on the Internet, complete with instructions.2 Hackers constantly invent new attacks and disseminate them over the Web. These problems are not limited to the government—many corporations fight industrial espionage daily to protect trade secret information. Juvenile hackers, while not necessarily as malicious as dedicated adversaries, can still wreak significant damage to systems and their defenses.3 Moreover, disgruntled employees, bribery, and coercion make networks vulnerable to attacks from the inside. ActiveX, Java, and increasing reliance on "commercial off-the-shelf" technology help infiltrators make unknowing accomplices of legitimate users. Add accidental vulnerabilities lying undiscovered in large software programs, and the only reliable guiding principles read like a paranoid manifesto: your mission-critical software has vulnerabilities; sooner or later those vulnerabilities will be exploited; there will always be super-hackers more clever than you; and you can never prevent all attacks.

Combinations of host-based, network-based, and router-based ID systems enhance an organization's ability to detect attacks automatically. In practice, however, most attention is given to the point of entry into an entire domain (a firewall, for example) and little elsewhere. This leaves the entire network susceptible to the full range of insider attacks just discussed. Furthermore, encryption and other coding techniques easily bypass network-based ID systems by scrambling network packet contents.

Intrusion Detection Basics

Host-based ID systems directly monitor the computers on which they run, often through tight integration with the operating system. Although they monitor insiders with the same vigilance as outsiders, and although network encryption doesn't affect them, the number and diversity of computers often make it impossible to protect each computer individually with a host-based ID system. Furthermore, while host-based ID systems have direct access to a wealth of trustworthy information straight from the operating system, they are generally resource-intensive and slow down the very computers they protect. Several DARPA efforts (including Purdue University COAST Laboratory's Enhanced Intrusion and Misuse Detection Techniques program) currently focus on new techniques for leveraging the advantages of direct operating system access with a much smaller impact on performance and with much greater reliability. Other efforts will provide a trustworthy automated mechanism for managing hundreds of host-based ID sensors so that eventually every computer in an organization can be directly protected.

Network-based ID systems monitor network traffic between hosts. Unlike host-based ID systems, which detect malicious behavior outright, these systems deduce behavior based on the content and format of data packets on the network. Among other things, they analyze overt requests for sensitive information and repeated failed attempts to violate security policy. Many current network-based ID systems are quite primitive, only watching for the words and commands of a hacker's vocabulary. A few are more sophisticated and analyze protocol-specific information. If host-based ID sensors are analogous to a guard dog for each computer "home," network-based ID systems are the neighborhood police patrols. Many of the network monitors under research at DARPA (such as Boeing's Automatic Response to Intrusions and Stanford Research Institute's Emerald project) can even respond to calls for help, either by decisively terminating an intrusion or by more graduated responses. Such responses include filtering, isolation, changing logging or even more drastic actions such as disconnection. Other DARPA efforts (Columbia University's learning-based JAM project and Berkeley's Common Intrusion Detection Framework, for example) are currently investigating techniques for more reliable detection of intrusions through collaboration between different kinds of detection systems.

Router-based ID systems protect the network infrastructure. These systems ensure safe, reliable connections between computers over large networks. In the home-host analogy mentioned previously, router-based monitors are like highway patrols between neighborhoods. Unfortunately, although monitors installed on routers stop intruders from even entering a network, the slightest additional load imposed by an ID system impairs a router's ability to shuttle huge amounts of data between networks. For this reason, some DARPA projects (MCNC's Scalable Intrusion Detection for the Emerging Network Infrastructure, and Boeing's Dynamic Cooperating Boundary Controllers) are currently investigating methods of retaining router monitors' advantages while minimizing the impact to performance.

Evaluating Intrusion Detection Systems

The Air Force Research Laboratory began testing many of the DARPA-sponsored intrusion detection systems in 1998. Because few network administrators and no security professionals wanted to invite AFRL to stage attacks against these systems over their networks, AFRL established a self-contained testing environment to model an actual base network.

Supporting the AFRL test program, the Massachusetts Institute of Technology's Lincoln Laboratory developed non-real-time evaluation tools for assessing the performance of individual ID systems. Their simple testbed enabled dramatically accelerated playback of huge volumes of data to quickly produce high-confidence performance specifications. However, when these individual ID systems are deployed, they will have to defend larger, more complex networks and will interact with other components. This interaction and the resulting overall performance are impossible to reproduce in a non-real-time environment.

Network architecture. To explore these real-time relationships, AFRL developed an immersive test environment simulating the complexity of a typical metropolitan-area network (MAN) found at many military installations. In theory, a top-level firewall protects a single point of entry into the base MAN from the outside. From the firewall, arrays of routers branch out to their respective subnetworks. In addition to Unix and Microsoft hosts using the Internet protocol to communicate, base networks carry a large volume of traffic using proprietary and legacy protocols including Novell IPX, Microsoft NetBEUI, DECNET, VAX clusters and others. The AFRL network model simulates a network as illustrated in Figure 1, but the actual physical network is illustrated in Figure 2. AFRL's physical test network models the base network shown in Figure 1. In the model, there is a single point of entry from the Internet into the base network, with a firewall at that single point protecting the entire base. Diamonds denote routers and circles denote hosts; shaded routers and hosts are virtual components, and are indistinguishable from real components as far as the physical routers and hosts can determine. AFRL simulated the size and diversity of a base MAN in Figure 1 by developing software to dynamically assign arbitrary source IP addresses to individual network sessions running on two real computers outside the physical network of Figure 2 (these two computers were "traffic generators"). The actual physical network consists of a small array of routers and hosts and a firewall inside the base, and two traffic-generating controllers outside the base. Two border routers seem like normal routers to inside components, but connect to the external controller network. Components on the inside cannot see beyond the border routers or the firewall. 
Traffic-generating hosts on this external network simulate both the Internet and other hosts deep inside the test network, and also flood the test network with prerecorded background traffic. In the simulation, the "inside" traffic generator pretended to be hosts within the base MAN's sphere of control, simulating the presence of a much larger network than actually represented by the real machines. The "outside" traffic generator pretended to be hosts outside the base MAN, simulating the presence of the Internet. The entire testbed, including the physical network under test and the two traffic generators, was completely isolated in AFRL's laboratory and was not connected to any live network.4

The traffic generators implemented a technique to assign an arbitrary source IP address on a per-process basis. For example, the outside traffic generator could run 10 simultaneous network sessions (Telnet, FTP, and so forth), with each session appearing to originate from a unique IP address. AFRL implemented this IP-swapping technique by modifying the Linux 2.0 kernel,5 and the modifications allow assigning arbitrary addresses to TCP, UDP, ICMP and raw-protocol IP packets. Examination of packets produced by the traffic generators shows that they are indistinguishable from packets produced by an array of real computers, with one exception: all packets leaving the traffic generator have the same source Ethernet address. AFRL connected the traffic generators to the network under test through "border routers," and this removed all artifacts when the packets actually entered the network. In order to preserve the integrity of the simulated network, AFRL permitted ID systems under test to be installed anywhere on the physical network except the border routers.
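
The bookkeeping side of this per-session addressing can be sketched as follows. This is only an illustration under stated assumptions, not AFRL's kernel modification: the function name `assign_source_addresses`, the session labels, and the documentation subnet `192.0.2.0/24` are all hypothetical, and the sketch only decides which virtual address each session should appear to come from. Actually emitting packets with those source addresses is what required the modified kernel and is not shown.

```python
import ipaddress
from itertools import islice


def assign_source_addresses(session_names, virtual_subnet):
    """Map each session label to its own address drawn from a virtual subnet.

    session_names must be unique labels; virtual_subnet is a CIDR string.
    Both are illustrative inputs, not part of the AFRL tooling.
    """
    hosts = ipaddress.ip_network(virtual_subnet).hosts()
    addresses = list(islice(hosts, len(session_names)))
    if len(addresses) < len(session_names):
        raise ValueError("virtual subnet too small for the requested sessions")
    return dict(zip(session_names, addresses))


# Three simultaneous sessions, each appearing to come from a different host.
sessions = ["telnet-1", "ftp-1", "http-1"]
mapping = assign_source_addresses(sessions, "192.0.2.0/24")
for name, address in mapping.items():
    print(name, address)
```

Keeping the mapping explicit also gives the evaluator a record of which virtual host each session claimed to be, which helps when comparing ID system alerts against the sessions that were actually run.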

The IP-swapping technique allowed arbitrary assignment of IP addresses to network sessions, but actually creating and running those sessions was a significant challenge as well. Lincoln Laboratory already had committed to developing a suite of test sessions and procedures for its non-real-time test, so AFRL made its network a superset of the Lincoln Laboratory network and used Lincoln Laboratory's session-generation tools and data files.

Lincoln Laboratory invented a generic session data file format and an engine for playing back those sessions. The sessions included various services such as Telnet, FTP, HTTP, email, finger, ping, and so on. In their non-real-time evaluation, Lincoln Laboratory played the sessions across their network and recorded the resulting network packets and audit trails. This recorded data was sent to ID researchers, who applied their own ID systems to locate the embedded attacks. The ID researchers' results were sent back to Lincoln Laboratory for scoring.

The AFRL testbed was designed to evaluate the ID systems in real time, by playing a four-hour subset of the Lincoln Laboratory sessions. Its architecture allowed ID systems to interact with other machines in the evaluation, both physical machines within the network under test and virtual machines synthesized by the traffic generators.6

Evaluation procedure. All ID systems under evaluation were installed in the physical AFRL test network. For consistency and reliability of results, all candidate ID systems were tested against the same static physical network configuration.7 AFRL planned to allow any system relying on special hardware to include it in the physical network as long as the additional hardware observed the same constraints as the other hosts in the network—namely, no machine inside the network was allowed to see directly beyond the firewall or border routers—but no ID systems tested in this first iteration required additional hardware. Also, any system that supported automated responses such as blocking or counterattacking would have been required to disable those features in order to test its sensor components. Again, no systems included this feature.

Each candidate ID system faced the same four-hour set of Lincoln Laboratory's network sessions, and these sessions were run from a master script that controlled the timing and sequence of each session. The set AFRL ran provided hundreds of network sessions as normal background traffic, implementing various network services. It also included attacks with a variety of severities and against a variety of operating systems represented in the physical test network. Those attacks are described in Table 1.

To actually conduct the evaluation, AFRL first started each ID system to be evaluated. Then various other network packet loggers were started to monitor the sessions when they ran and to provide a consistent yardstick against which to compare ID systems' results. Next the traffic generators were started. Four hours later, all traffic-generation and ID systems were stopped and the ID systems' results were archived for later analysis. Finally, AFRL personnel manually inspected each ID system's results and compared them to various other recorded results to determine the ID system's performance.

Actual evaluation was partly objective and partly subjective. Some attacks manifested themselves as single discrete network sessions and ID systems either caught them or missed them (or often caught "something" but incorrectly identified them). Other attacks such as probes or ping sweeps consisted of hundreds of sessions, so if an ID system caught a fraction of these sessions and identified them as an attack, it counted as a detection instead of hundreds of misses. Sessions that were incorrectly identified but still tagged as evidence of an attack were generally scored as positive attack detections. Sessions that were grossly mislabeled, or that were labeled as attacks when they were legitimate sessions that happened to be part of an overall sequence of sessions implementing an attack,8 were scored as false alarms.
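
The scoring rules just described can be approximated in a few lines. The sketch below is a simplification under assumptions: the real scoring was done by manual inspection and was partly subjective, and the function name `score_run`, the session identifiers, and the attack names are hypothetical. An attack counts as detected if any one of its evidence sessions was flagged, so a multi-session probe yields one detection rather than hundreds of misses; a flagged session that is not evidence of any attack counts as a false alarm. Legitimate sessions that merely belong to an attack sequence (footnote 8) are modeled by leaving them out of the evidence sets.

```python
def score_run(flagged, attacks):
    """Score one ID system's output.

    flagged: set of session ids the ID system tagged as attack evidence.
    attacks: dict mapping an attack name to the set of session ids that
             count as evidence of that attack.
    Returns (detected attack names, false-alarm session ids).
    """
    detected = {name for name, sessions in attacks.items() if sessions & flagged}
    evidence = set().union(*attacks.values())
    false_alarms = flagged - evidence
    return detected, false_alarms


# Hypothetical run: a three-session ping sweep and a single-session attack.
attacks = {
    "ping_sweep": {"s1", "s2", "s3"},
    "ftp_core_dump": {"s9"},
}
flagged = {"s2", "s7"}  # one sweep session caught, one legitimate session flagged

detected, false_alarms = score_run(flagged, attacks)
print(detected)       # the sweep counts once, although only s2 was caught
print(false_alarms)   # s7 is not evidence of any attack
```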

Evaluation results. AFRL evaluated three DARPA-funded ID systems chosen for their maturity for a real-time evaluation. One of these systems comprised several network monitors placed at strategic points around the network, providing 100% coverage of the infrastructure. The other two systems were host-based systems that protected isolated computers. AFRL also evaluated a government off-the-shelf (GOTS) system as a baseline; this system is typical of the type used at many military installations. The GOTS system was placed at the top of the network and monitored all traffic crossing the boundary between the inside and outside domains, as is typically done at military bases.

Each system was integrated into AFRL's physical test network environment, and then evaluated using normal traffic mixed with approximately 30 attacks (spanning the entire array of attack types described in Table 1). The normal traffic included the services listed in Table 2; topology and complexity of intrusions are listed in Table 3. The attacks were drawn roughly evenly from four general attack classes: 1) surveillance, 2) denial of service, 3) user to root, and 4) remote to local. The surveillance attacks consisted of a mix of ping sweeps and port scans. The denial of service attacks in this evaluation were directed against individual hosts and not against the network at large. The user-to-root attacks were all buffer-overflow attacks9 and assumed the attacker already had access, but all took place over the network in the clear. The remote to local attacks all took advantage of vulnerable or misconfigured services to perform unauthorized actions on the victim host and to gain information or privileges.

Detection rates and false alarm rates were measured for all systems and these measurements were used to form receiver operating characteristic (ROC) curves. An ROC curve plots correct detection rate against false alarm rate; or, perhaps a more natural way to interpret, an ROC curve shows the false alarm rate incurred by choosing a particular detection rate. Detection rates for each system were calculated using all the attacks that were performed against the entire network—not just the attacks the detector was designed to cover—because the goal of the evaluation was to measure each system's contribution to the mission of protecting a critical installation.
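
As a minimal sketch of the two quantities behind each ROC point: the detection rate is taken over all attacks run against the network, as stated above. The article does not spell out the denominator of the false alarm rate, so the sketch assumes it is the number of legitimate sessions. The function name and the counts in the example are illustrative, not results from the evaluation.

```python
def roc_point(attacks_detected, attacks_total, false_alarms, normal_sessions):
    """Return (false_alarm_rate, detection_rate) for one operating point.

    attacks_total counts every attack run against the whole network, not only
    the attacks a given detector was designed to cover.
    normal_sessions as the false-alarm denominator is an assumption.
    """
    if attacks_total <= 0 or normal_sessions <= 0:
        raise ValueError("totals must be positive")
    detection_rate = attacks_detected / attacks_total
    false_alarm_rate = false_alarms / normal_sessions
    return false_alarm_rate, detection_rate


# Made-up counts: 12 of 30 attacks caught, 5 false alarms in 1,000 normal sessions.
print(roc_point(12, 30, 5, 1000))
```

Dividing by all attacks rather than by in-scope attacks lowers a specialized detector's score, which matches the stated goal of measuring contribution to protecting the whole installation.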

Figure 3 shows the ROC curve for the baseline GOTS system. This system logs and analyzes all sessions that pass its sensor and assigns each session a warning value from 1 to 10, based on keyword matches found in the packet data. Ideally, attacks always get warning values higher than legitimate sessions, but the ID system often false-alarms or under-scores attacks. The network log, annotated with warning values for each session, is then analyzed by human operators who try to identify genuine attacks from false alarms. Because the installations these GOTS systems are protecting may pass hundreds of thousands of sessions, even a 1% false alarm rate produces too much data to examine, so the operators generally only inspect those sessions that scored highest. The detection/false alarm data was captured for discrete data points, but we can connect adjacent points on the graph with straight lines by reasoning that an intrusion detection system could get an increasing percentage of attacks going from one point to the next by randomly guessing (thus linearly increasing both the percentage detected and the percentage of false alarms). The area under the curve is one measure of an ID system's effectiveness. Figure 4, which is the distribution of warning values reported by the GOTS system, may help to clarify the meaning of Figure 3. Four of the attacks scored 3.162 or lower by the GOTS system, so to confirm that they were indeed attacks, an analyst would have to review transcripts of over 12,000 sessions that scored higher than 3.162.
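
The construction described above (one point per warning-value cutoff, straight lines between adjacent points, area under the result) can be sketched as follows. This is an illustration only: the function names are hypothetical, the six sessions are made-up data rather than the GOTS system's output, and the sketch assumes a session is flagged when its warning value is at or above the cutoff.

```python
def roc_points(scores, is_attack):
    """Return (false_alarm_rate, detection_rate) points, one per cutoff.

    scores: warning value per session; is_attack: matching list of booleans.
    Cutoffs are swept from the highest warning value down to the lowest.
    """
    n_attack = sum(is_attack)
    n_normal = len(is_attack) - n_attack
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        flagged = [s >= cutoff for s in scores]
        hits = sum(f and a for f, a in zip(flagged, is_attack))
        false_alarms = sum(f and not a for f, a in zip(flagged, is_attack))
        points.append((false_alarms / n_normal, hits / n_attack))
    return points


def area_under_curve(points):
    """Trapezoidal area under straight lines joining adjacent points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up data: three attacks and three legitimate sessions.
scores = [9, 7, 7, 3, 2, 1]
is_attack = [True, False, True, True, False, False]
curve = roc_points(scores, is_attack)
print(curve)
print(area_under_curve(curve))  # about 0.83 for this toy data
```

Lowering the cutoff moves to the right along the curve: more attacks are caught, but the operators must read more legitimate sessions to find them.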

All of the DARPA systems tested in this part of the evaluation were signature-based systems that produced a binary result of 1 for a declared attack and 0 for a normal session. For such a system only one point on the ROC curve is produced, percent detected vs. percent false alarms. The rest of the curve is filled out by drawing a line from the origin (representing 0% detected, 0% false alarms) to the point on the graph, and from there a line to the upper-right corner (representing 100% detected, 100% false alarms).
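
For a binary detector the curve therefore has three vertices, and the area under it reduces to (1 + detection rate - false alarm rate) / 2. The sketch below checks that closed form against the trapezoid sum; the function name and the example rates are hypothetical and are not measurements from the evaluation.

```python
def binary_roc_area(detection_rate, false_alarm_rate):
    """Area under the three-point ROC curve of a yes/no detector.

    The curve runs from (0, 0) through the single measured point to (1, 1).
    """
    points = [(0.0, 0.0), (false_alarm_rate, detection_rate), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up operating point: half the attacks caught at a 1% false alarm rate.
d, f = 0.5, 0.01
area = binary_roc_area(d, f)
closed_form = (1 + d - f) / 2
print(area, closed_form)
```

A detector that guesses at random sits on the diagonal (detection rate equal to false alarm rate) and gets an area of 0.5, so any area above that reflects real discrimination.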

Figure 5 shows the performance of the three DARPA systems compared to the GOTS system. The graph only shows performance for false alarm rates up to 1%, since anything greater in a large enterprise would be unmanageable. In this experiment the DARPA systems clearly performed better than the existing GOTS technology. The most significant improvement was due to the reduction in false alarms.

Although the sample of attacks was small for the 1998 real-time evaluation, several rough trends emerged from the test:

  • Signature-based detection systems can be effective in reducing false alarms if implemented properly.
  • Neither one of the network-based systems did very well against host-based, user-to-root attacks.
  • Several of the surveillance attacks were able to probe the network and retrieve significant information, undetected, by limiting the speed and scope of the probes.
  • Attacks for which there was no training data available from the Lincoln Laboratory evaluation were generally missed, possibly indicating that techniques other than signature detection need to be developed in order to catch novel attacks.
  • String-matching network monitors, typical of most ID systems employed by the military, are expensive in terms of false alarm rates and miss most kinds of attacks.

For 1999 several modifications will be made to the evaluation method. As the testbed becomes more stable, AFRL will increase the duration of each test, which will allow better measurements of false alarm rates, and will execute a larger and more diverse set of attacks. In addition, several DARPA systems that did well in the 1998 Lincoln Laboratory non-real-time test will be ported to the testbed for evaluation.

Both industry and the government have an active (and growing) interest in computer security. DARPA's sponsorship and AFRL's evaluation of $10 million worth of groundbreaking intrusion detection research will set a new standard for automated security monitoring at the host, LAN, regional, and global levels. Current government systems are protected by a patchwork collection of tools of dubious quality. The picture is even worse for the commercial world. Recent major acquisitions totaling hundreds of millions of dollars for little more than ad-hoc security solutions indicate a desperate, indiscriminate need for computer security. If it sounds like panic, it very well may be. The frequency and sophistication of attacks, like the power and performance of the microprocessors they run on, are ever increasing. The number of computers interconnected through the Internet and other networks is increasing at a tremendous rate. Always remember that the attacks publicized in the news are only the ones detected. The other 96% slipped by. Who will catch those?

Robert Durst, formerly with the Air Force Research Laboratory's INFOSEC Technology Base, is a research scientist at SenCom Corporation.

Terrence Champion is a research scientist and program manager in the Air Force Research Laboratory's INFOSEC Technology Office at Hanscom Air Force Base.

Brian Witten is a research scientist and program manager in the Air Force Research Laboratory's INFOSEC Technology Office at Hanscom Air Force Base.

Eric Miller, formerly a research scientist and program manager with the Air Force Research Laboratory's INFOSEC Technology Office at Hanscom Air Force Base, is a graduate student at Stanford University.

Luigi Spagnuolo is the chief of the Air Force Research Laboratory's INFOSEC Technology Office at Hanscom Air Force Base.

1. Information Security: Computer Attacks at Department of Defense Pose Increasing Risks, GAO/AIMD-96-84, May 22, 1996; (Testimony before the Permanent Subcommittee of Investigations, Committee on Governmental Affairs, U.S. Senate), May 22, 1996; Information Security: Opportunities for Improved OMB Oversight of Agency Practices, GAO/AIMD-96-110, Sept. 24, 1996.

2. Information Security: Computer Hacker Information Available on the Internet (Testimony before the Permanent Subcommittee of Investigations, Committee on Governmental Affairs, U.S. Senate), June 5, 1996.

3. Two California teens suspected of breaking into government computers. The Washington Post (Feb. 28, 1998), A6.

4. Over the course of conducting evaluations, additional hosts were added to the traffic generators' local subnet, and for convenience this subnet was attached through firewalls to AFRL's office computer network, which is accessible via the Internet.

5. The IP-swapping technique and the Linux kernel modifications to implement it are under consideration for a U.S. patent.

6. None of the ID systems tested in this first iteration performed any interaction with other hosts inside or outside the network under test.

7. As much as possible; MCNC required the use of routers running FreeBSD so all of the Linux routers were converted to accommodate them.

8. For example, crashing an anonymous FTP server to try to get the shadow password file from the resulting core dump is an attack, but labeling the actual anonymous FTP connection and nothing else as an attack was considered a false alarm. Anonymous FTP is a legitimate service, and identifying only it as an attack misses the more sinister actions of crashing the FTP server and reconnecting to download the core file.

9. "Buffer overflow" or "buffer overrun" attacks consist of providing to vulnerable programs data that is too big to fit in the program's internal buffer. The programs will then crash and begin executing the data that was provided to them as input.

Figure 1. Virtual network.

Figure 2. Actual physical network.

Figure 3. GOTS ID system's ROC, as a measure of its detection and false alarm rates.

Figure 4. Distribution of warning values GOTS ID system assigned to sessions.

Figure 5. Comparison of DARPA ID system with GOTS ID system.

Table 1. Types and descriptions of attacks run against ID systems.

Table 2. Approximate distribution of session types in the AFRL test suite.

Table 3. Topology and complexity of computer intrusions.

©1999 ACM  0002-0782/99/0700  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 1999 ACM, Inc.