Communications of the ACM

ACM TechNews

Data Mining the Past

Digitizing a newspaper article.

The algorithm ranks people's names by importance based on a number of attributes, including the context of the name, the title before the name, article length, and how frequently the name is mentioned in an article. The algorithm learns these attributes from


Researchers at the State University of New York at Buffalo and India's International Institute of Information Technology Bangalore have developed an algorithm to convert old newspapers into searchable data by identifying and ranking people's names in order of their importance.

The researchers worked with the New York Public Library to analyze over 14,000 articles from The Sun published in November and December of 1894.

The algorithm relies exclusively on attributes drawn from text produced by optical character recognition (OCR) software, such as the context of a name, the title before the name, article length, and how often the name is mentioned in an article.
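
As a rough illustration of how such attributes might be combined into a ranking, here is a minimal Python sketch. It is not the researchers' method: the function names, the title list, and the hand-picked weights are all hypothetical, and the name-context attribute is left out for brevity. The published algorithm learns its attributes statistically rather than using fixed weights.

```python
import re

# Hypothetical list of titles that may precede a name (an assumption, not from the study).
TITLES = ("Mr.", "Mrs.", "Dr.", "Gov.", "Mayor", "Judge")

# Hypothetical hand-picked weights; the real algorithm models attributes statistically.
WEIGHTS = {"mentions": 1.0, "has_title": 2.0, "length": 0.5}


def name_features(text, name):
    """Compute three simple attributes for one candidate name in one article."""
    # How often the name is mentioned in the article.
    mentions = len(re.findall(re.escape(name), text))
    # Whether any mention is directly preceded by a known title.
    titles = "|".join(re.escape(t) for t in TITLES)
    has_title = 1 if re.search(r"(?:%s)\s+%s" % (titles, re.escape(name)), text) else 0
    # Article length in hundreds of words. It is the same for every name in one
    # article, so it only matters when comparing names across articles.
    length = len(text.split()) / 100.0
    return {"mentions": mentions, "has_title": has_title, "length": length}


def score(features):
    """Weighted sum of the attributes."""
    return sum(WEIGHTS[key] * value for key, value in features.items())


def rank_names(text, names):
    """Return (name, score) pairs sorted from most to least important."""
    scored = [(name, score(name_features(text, name))) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


article = "Mayor Gilroy spoke today. Gilroy said the city would act. John Smith attended."
ranking = rank_names(article, ["John Smith", "Gilroy"])
```

In this toy article, "Gilroy" is mentioned twice and carries a title, so it ranks above "John Smith", which appears once with no title.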

Because the OCR text was garbled, the researchers modeled the attributes statistically and tested the algorithm on both raw OCR-generated text and articles that schoolchildren had cleaned up manually.

They found that, measured against the cleaned-up versions, the algorithm could rank names very precisely even when working from the raw OCR text.

From University at Buffalo News Center


Abstracts Copyright © 2021 SmithBucklin, Washington, DC, USA

