
Communications of the ACM

ACM TechNews

JPL Creates PDF Archive to Aid Malware Research

The 8 million PDFs were downloaded from websites across the globe.

The Digital Corpora project hosts the huge data archive as part of Amazon Web Services’ Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.

Credit: Science RF/Adobe

Data scientists at the U.S. National Aeronautics and Space Administration's Jet Propulsion Laboratory (JPL) have compiled 8 million PDF files into an open-source archive intended to improve online security.

The corpus is part of the Defense Advanced Research Projects Agency (DARPA) Safe Documents program.

Experts can examine the archive for information on malware that may be concealed within a file's code, helping to predict emerging online threats and improve PDF technology.

The researchers identified the PDFs for inclusion using Common Crawl, a public repository of Web-crawl data, while specialized software re-fetched truncated files.
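As a rough illustration of what spotting a truncated file might involve, the sketch below flags PDF byte strings that lack the usual header or end-of-file marker. It is not the project's actual software: the function names, the `(url, data)` record layout, and the 1,024-byte search window are assumptions chosen for the example, and a marker check like this is only a heuristic.

```python
from typing import Iterable, List, Tuple


def looks_truncated(data: bytes, window: int = 1024) -> bool:
    """Heuristic check: True if the bytes do not look like a complete PDF.

    Assumption for this sketch: a complete PDF has a "%PDF-" header near
    the start and a "%%EOF" marker near the end of the file.
    """
    if b"%PDF-" not in data[:window]:
        # No header found near the start, so treat the file as incomplete.
        return True
    # A file cut off mid-download usually loses its trailing %%EOF marker.
    return b"%%EOF" not in data[-window:]


def urls_to_refetch(records: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return the URLs whose downloaded bytes look truncated.

    `records` is a hypothetical iterable of (url, pdf_bytes) pairs.
    """
    return [url for url, data in records if looks_truncated(data)]


if __name__ == "__main__":
    complete = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
    cut_off = complete[:20]  # simulate a download that stopped early
    print(urls_to_refetch([("https://example.org/a.pdf", complete),
                           ("https://example.org/b.pdf", cut_off)]))
```

Files flagged this way would then be candidates for downloading again from their original URLs.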

The approximately 8-terabyte dataset is the largest publicly available corpus of its type.

From Jet Propulsion Laboratory


Abstracts Copyright © 2023 SmithBucklin, Washington, D.C., USA

