Massive and highly repetitive text collections are arising in several modern applications. For example, a U.K. project completed the sequencing of 100,000 human genomes in 2018; stored in plain form, they require 300 terabytes. Further, the data structures needed to efficiently perform the complex searches required in bioinformatics would add another order of magnitude to the storage space, reaching the petabytes.
How can we cope with this flood of repetitive data? We can think of compression (after all, two human genomes differ by only about 0.1%), but classical compression is not the definitive answer: we must decompress the data before we can use it. A more ambitious research area, compressed data structures, promises to store the data, together with the structures required to handle it efficiently, within space close to that of the compressed data. The data will never be decompressed; it will always be used directly in compressed form.
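To make the idea concrete, here is a minimal sketch (my own illustration, not taken from the article) of one of the simplest compressed data structures: a run-length-compressed string whose random-access queries are answered by binary search over the compressed runs, so the plain text is never rebuilt. Run-length encoding is a toy stand-in for the Lempel-Ziv, grammar, and BWT-run compressors used on real repetitive collections, and the names `rle_build` and `rle_access` are hypothetical.

```python
from bisect import bisect_right
from itertools import groupby

def rle_build(text: str):
    """Compress `text` into runs (symbol, length) plus cumulative run endpoints.

    The cumulative endpoints let a binary search locate the run covering any
    text position, so queries never need the plain text again.
    """
    runs = [(sym, len(list(group))) for sym, group in groupby(text)]
    ends, total = [], 0
    for _, length in runs:
        total += length
        ends.append(total)          # ends[i] = position just past run i
    return runs, ends

def rle_access(runs, ends, i: int) -> str:
    """Return text[i] using only the compressed representation."""
    r = bisect_right(ends, i)       # index of the run containing position i
    return runs[r][0]

# A toy text with long runs; real repetitive collections need stronger compressors.
text = "A" * 100_000 + "N" * 50 + "C" * 100_000
runs, ends = rle_build(text)
assert all(rle_access(runs, ends, i) == text[i] for i in range(0, len(text), 997))
print(f"{len(text)} symbols stored as {len(runs)} runs")
```

The point of the sketch is only that the query operates on the compressed form: access costs a logarithmic-time search over the runs, while the space is proportional to the number of runs rather than to the length of the text.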