
Communications of the ACM

ACM TechNews

Cornell Develops Analysis Tools for Large-Scale Web Data


The Web Lab project has developed a family of data analysis tools for searching the Internet Archive. The project, a joint effort by researchers at Cornell University and the Internet Archive, is funded by the National Science Foundation (NSF). "The aim of the Web Lab is to organize large portions of these collections, so that they can be used by researchers who are not experts in data-intensive computing," says Cornell professor William Arms.

One of the tools, the Web Lab Collaboration Server, is a service for large-scale collaborative Web data analysis that demonstrates how to support nontechnical users during the search, extraction, and analysis of Web data. Cornell periodically transfers Web crawls from the Internet Archive in San Francisco to the Cornell Center for Advanced Computing over a high-speed NSF TeraGrid connection, and has completed more than four Web crawls comprising billions of pages, says Cornell professor Johannes Gehrke.

Gehrke says there are three major obstacles to creating data analysis applications: customized data sets must be prepared by writing extraction scripts tailored to the specific task; data sets must be cleaned or formatted, work that end users often repeat needlessly; and analysis code must be written to take advantage of parallelism, shared memory, or distributed computing power and storage. To solve these problems, the researchers developed a graphical user interface for complex extraction and analysis tasks, enabled reuse and sharing of data among a community of researchers, and packaged the tools in a Web-based, software-as-a-service architecture that lets users run extraction and analysis tasks on a distributed computing and archiving platform.

From Cornell University

