Review of 'Processing Megapixel Images with Deep Attention-Sampling Models'

‘Processing Megapixel Images with Deep Attention-Sampling Models’ (referred to as ‘ATS’ below) [1] proposes a new model that avoids much of the unnecessary computation performed by Deep MIL [2].

  1. They first compute an attention map over all possible patch locations in an image. They do so by feeding a downsampled image to a shallow CNN with few pooling operations.
  2. They sample a small number of patches from the attention distribution and show that feeding these sampled patches to the MIL classifier gives an unbiased, minimum-variance estimator of the prediction made with all patches.
  3. They show that both the attention maps and the test error closely match those of Deep MIL, while the model is up to 30 times more efficient.
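
To make step 2 concrete, here is a minimal sketch of the sampling idea in plain Python. It is not the authors' implementation: the attention values and the scalar "features" are made-up toy numbers, the function names are hypothetical, and patches are drawn with replacement for simplicity. It only illustrates that averaging the features of patches sampled from the attention distribution matches, in expectation, the attention-weighted sum over all patches.

```python
import random

def full_prediction(attention, features):
    """Attention-weighted sum over every patch (the expensive path)."""
    return sum(a * f for a, f in zip(attention, features))

def sampled_prediction(attention, features, n_samples, rng):
    """Monte Carlo estimate: draw patch indices from the attention
    distribution (with replacement) and average their features."""
    indices = rng.choices(range(len(features)), weights=attention, k=n_samples)
    return sum(features[i] for i in indices) / n_samples

rng = random.Random(0)
attention = [0.70, 0.15, 0.10, 0.05]  # toy attention map, sums to 1
features = [2.0, -1.0, 0.5, 4.0]      # toy scalar feature per patch

exact = full_prediction(attention, features)

# Each estimate looks at only 2 of the 4 patches; averaging many such
# estimates should approach the exact value if the estimator is unbiased.
estimates = [sampled_prediction(attention, features, 2, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
```

In the real model the features would be vectors produced by a CNN on full-resolution patches, so the saving comes from running that CNN on only the sampled patches.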

ATS makes it possible to use high-capacity CNNs on high-resolution images without requiring pixel-level labels or excessive computation. It can also provide visualizations of where the model is looking.


I am impressed with both the rigorous mathematical formulation of the proposed model and the availability of the open-source repository. The authors even provide documentation that explains how to best use each part of their code. I appreciate the authors’ effort to make their work accessible, both in the clarity of the explanations and in its reproducibility. (The implementation is in TensorFlow. If you are interested in a PyTorch implementation, there is a third-party implementation at this link.)

Training deep neural networks on large datasets of megapixel images can easily take weeks, if not months. This work can significantly speed up development iterations for deep learning researchers in the medical imaging field.

The entropy regularizer seems to be a crucial piece in making ATS work. Without this regularizer, the model ends up with extremely sparse attention maps that highlight only a couple of patch locations. In that case, I suspect that (1) the feature model will overfit quickly and (2) the attention model will take a long time to train. While the authors included ablation-study figures for the attention map, it would have been nice to see the regularizer's effect on the classification task as well.
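
As a rough illustration of what such a regularizer does, the sketch below computes the Shannon entropy of an attention map and subtracts it, scaled by a weight, from a task loss. This is my own minimal version, not code from the ATS repository: the names `regularized_loss` and `strength` are hypothetical, and the task loss of 1.0 is a placeholder.

```python
import math

def entropy(attention):
    """Shannon entropy (in nats) of an attention map that sums to 1."""
    return -sum(a * math.log(a) for a in attention if a > 0)

def regularized_loss(task_loss, attention, strength):
    """Subtract the scaled entropy so that spread-out maps are rewarded.
    `strength` is a hypothetical name for the regularization weight."""
    return task_loss - strength * entropy(attention)

sparse = [0.97, 0.01, 0.01, 0.01]   # attention collapsed onto one location
uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly

# With the same task loss, the sparse map is penalized more.
loss_sparse = regularized_loss(1.0, sparse, 0.1)
loss_uniform = regularized_loss(1.0, uniform, 0.1)
```

A larger weight pushes the attention map toward uniform, so in practice it would need to be tuned against the task loss.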

For practitioners who plan on using this model, there are a couple of caveats that should be noted.

  1. This work requires heavily downsampling the original image to be efficient.
    • If the important features in your dataset consist of only a handful of pixels (e.g. calcifications in mammography), they might disappear entirely.
    • Depending on the dataset, it might be better to use convolutional layers with large strides than to resize images.
  2. Attention maps can be more difficult for radiologists to interpret than weakly- or strongly-supervised saliency maps.
    • An attention score is relative to the other locations in the image, and the scores must sum to 1.
    • In the cancer detection task, for example, the attention score of ATS will not necessarily correspond to the probability of malignancy.
      • If only one cancer is found in a given image, that location might have an attention score close to 1.
      • If multiple cancers are found, however, the attention is spread across the equally informative regions, and each receives a much smaller score.
      • Most concerningly, if there is no cancer in the image, the attention map might end up highlighting every location almost equally. Small variations will then be emphasized, and the map might highlight some random location in the same way it highlights cancer.
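
The three cases in the second caveat can be reproduced with a toy example. The sketch below assumes the attention map is obtained by a softmax over per-location scores; the scores themselves are invented for illustration and do not come from any real model.

```python
import math

def softmax(scores):
    """Normalize raw per-location scores into an attention map summing to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four candidate locations per image; a high raw score stands for a lesion.
one_lesion = softmax([8.0, 0.0, 0.0, 0.0])   # almost all attention on location 0
two_lesions = softmax([8.0, 8.0, 0.0, 0.0])  # attention split roughly in half
no_lesion = softmax([0.1, 0.0, 0.0, 0.0])    # near-uniform map

# After rescaling each map by its own maximum, as a heatmap viewer might,
# location 0 is the brightest spot in every case, lesion or not.
brightest = [max(range(4), key=m.__getitem__)
             for m in (one_lesion, two_lesions, no_lesion)]
```

The same lesion evidence gives a score near 1 in the first image and near 0.5 in the second, while in the third image a negligible difference in raw score still produces a clear "hot spot" once the map is rescaled for display.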

Lastly, if you are interested in learning more about efficient deep neural networks, I recently found a PhD thesis on this topic [3]. It describes four aspects of efficiency: model training and inference, data acquisition, hardware acceleration, and architecture search. It seems to be a comprehensive overview of recent efforts.



[1] Processing Megapixel Images with Deep Attention-Sampling Models. Angelos Katharopoulos and François Fleuret. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3282–3291, 2019.

[2] Attention-based Deep Multiple Instance Learning. Maximilian Ilse, Jakub M. Tomczak and Max Welling. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[3] Efficient Deep Neural Networks. Bichen Wu. arXiv:1908.08926, 2019.

Jungkyu (JP) Park
Deep Learning Researcher, PhD Student

Deep learning in medical imaging

