Parallel vs. Distributed Data Access for Giga-pixel-resolution Histology Images: Challenges and Opportunities

Journal of Biomedical and Health Informatics (JBHI)

Recent advances in digital pathology technology have significantly improved both the quality and the resolution of the resulting images, which now often exceed several gigabytes each. Today, several leading institutions across the US use whole-slide imaging (WSI) as part of their routine workflow, and whole-slide images (WSIs) are useful in a wide range of diagnostic and investigative pathology applications. However, these images are large (about 30-50 GB when uncompressed) and are generated in proprietary formats. This has limited wider adoption of these technologies and makes accessing, processing, and analyzing the images at high throughput extremely challenging. The common approach in such data analytics applications is to pre-process the images into smaller files and store them in a generic format. Such strategies add processing time to the workflow and are inflexible when resolution levels and tile sizes change dynamically. In this paper, we present novel scalable access methods for parallel and distributed file/object storage systems. The first approach introduces an algorithm that takes the directory path of the WSIs and automatically calculates the total number of tiles for the given tile size. It then distributes the tile coordinates to parallel processes that have access to a parallel file system, and each process retrieves the tiles assigned to it using those coordinates. The second approach includes a parallel preprocessing step that converts the unstructured WSIs into structured compressed files in which each tile is a key-value pair: the key includes the tile's coordinates and the value includes its pixel values. The source files can reside on a web server, a cloud storage system, or a local distributed file system. Experimental results show that both approaches scale well. With parallel access, the files can be kept in their original format, whereas with distributed access they have to be converted into key-value pairs.
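The tile bookkeeping behind both approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the round-robin assignment, and the key layout are assumptions made here for illustration, and reading the actual pixels from a proprietary WSI format is deliberately left out. The sketch counts the tiles for a given image and tile size, enumerates their coordinates, splits them among processes, and builds a coordinate-based key of the kind the second approach could use.

```python
from typing import Iterator, List, Tuple

# A tile is described by its top-left corner and its size, in pixels.
Tile = Tuple[int, int, int, int]  # (x, y, width, height)


def count_tiles(width: int, height: int, tile_w: int, tile_h: int) -> int:
    """Total number of tiles needed to cover the image (integer ceiling)."""
    cols = -(-width // tile_w)
    rows = -(-height // tile_h)
    return cols * rows


def tile_coords(width: int, height: int, tile_w: int, tile_h: int) -> Iterator[Tile]:
    """Yield tiles row by row; tiles on the right and bottom edges are clipped."""
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            yield (x, y, min(tile_w, width - x), min(tile_h, height - y))


def assign_tiles(tiles: List[Tile], rank: int, n_procs: int) -> List[Tile]:
    """Round-robin assignment (an assumption): process `rank` takes every n_procs-th tile."""
    return tiles[rank::n_procs]


def tile_key(slide_id: str, level: int, x: int, y: int) -> str:
    """Hypothetical key layout for the key-value form: slide, resolution level, coordinates."""
    return f"{slide_id}/{level}/{x}_{y}"


if __name__ == "__main__":
    tiles = list(tile_coords(1000, 600, 256, 256))
    print(count_tiles(1000, 600, 256, 256), "tiles")
    for rank in range(4):
        print(rank, assign_tiles(tiles, rank, 4))
```

Because the coordinates are computed on the fly from the image dimensions and the tile size, changing the tile size or resolution level only changes the arguments, which is the flexibility that fixed pre-tiled files lack. In a real parallel run, `rank` and `n_procs` would come from the process launcher (for example an MPI communicator), and each process would pass its coordinates to a WSI reader.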