The main claims of the paper:
- A certain amount of localization labeling is inevitable for weakly supervised object localization (WSOL). In fact, prior works that claim to be weakly supervised implicitly use strong supervision.
- Therefore, the authors propose standardizing a protocol in which models are allowed to use pixel-level masks or bounding boxes to a limited degree.
- Under this protocol and the proposed evaluation method, the authors observe no improvement in WSOL performance since CAM (2016).
This paper presents convincing arguments and visual examples showing that, when the extent of consulting human judgement is constrained equally for all models, there is no evidence that recently proposed WSOL methods perform any better than CAM. The authors also propose better evaluation metrics for WSOL that measure the model’s ability to distinguish foreground from background pixels without being tied to an arbitrarily chosen threshold. It would be a significant contribution to WSOL if this work helps researchers save computing time by focusing on few-shot learning rather than on hyperparameter search and implicit strong supervision.
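To make the threshold-free idea concrete, here is a minimal sketch of a pixel-wise average precision in the spirit of PxAP: every distinct score value acts as a threshold, and the precision at each threshold is weighted by the recall gained there. The function name `pixel_average_precision` and the input conventions (real-valued score maps paired with binary masks of the same shape) are assumptions made for illustration; this is not the authors’ reference implementation.

```python
import numpy as np


def pixel_average_precision(score_maps, masks):
    """Area under the pixel-level precision-recall curve over all thresholds.

    score_maps: iterable of float arrays (higher means more foreground-like).
    masks: iterable of binary arrays with matching shapes (1 = foreground).
    """
    scores = np.concatenate([np.asarray(s, dtype=float).ravel() for s in score_maps])
    labels = np.concatenate([np.asarray(m).ravel() for m in masks]).astype(bool)
    if not labels.any():
        raise ValueError("at least one foreground pixel is required")

    # Rank pixels from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = labels[order]

    # Cumulative true and false positives as the threshold is lowered.
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)

    # Tied scores share one threshold: keep the last index of each distinct score.
    distinct = np.r_[np.diff(sorted_scores) != 0, True]
    tp, fp = tp[distinct], fp[distinct]

    precision = tp / (tp + fp)
    recall = tp / labels.sum()

    # Sum precision weighted by the recall gained at each threshold.
    recall_steps = np.diff(np.r_[0.0, recall])
    return float(np.sum(precision * recall_steps))
```

Because the score aggregates over all thresholds, two methods can be compared without first tuning a binarization threshold for each of them.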
However, their claim that a certain amount of localization labeling is inevitable does not apply to classifiers whose classification performance depends heavily on localization performance. For example, GMIC has a local module that operates on small patches cropped according to the saliency map generated by another CNN from the original image. By choosing the model with the best classification performance on the local module, we can select a model with high PxAP without having to verify the localization performance against pixel-level ground truth labels.
What the authors call “few-shot learning” is simply N-way K-shot classification during training, where 100 <= N <= 1000 and 0 <= K <= 15, without any architecture designed specifically for few-shot learning. A maximum of 15 examples per class might seem like a very small subset, but because of the large number of classes, they end up using up to 10k images for training. It is then not very surprising that these models perform very well, given that the task is relatively simple. The model only needs to distinguish foreground from background features within a dataset in which either (1) most categories are similar (subspecies of birds in CUB) or (2) many categories share a similar-looking background (e.g. sky, water, ground).
For medical imaging, on the other hand, this might not necessarily work well. For example, the NYU Breast Cancer Screening Dataset has only two classes (benign and malignant) associated with each image for the cancer detection task. We do not yet have any evidence that a small fully supervised subset will still work when it contains at most 30 images. Moreover, the WSOL task in medical imaging is much harder than in natural images (e.g. there are no obvious background cues such as water or sky) and will likely require many more fully supervised examples to reproduce similar results.
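The gap between the two settings is simple arithmetic: the fully supervised budget is the number of classes times the number of examples per class. A minimal sketch follows, with values chosen only for illustration from the ranges quoted above (the helper name `fully_supervised_images` is hypothetical):

```python
def fully_supervised_images(num_classes, shots_per_class):
    """Total fully supervised images when every class gets the same number of shots."""
    return num_classes * shots_per_class


# Illustrative values only, picked from the ranges quoted in this review.
natural_image_budget = fully_supervised_images(num_classes=1000, shots_per_class=10)
medical_image_budget = fully_supervised_images(num_classes=2, shots_per_class=15)

print(natural_image_budget, medical_image_budget)
```

With the same per-class cap, a two-class medical dataset receives several hundred times fewer localization labels than a 1000-class natural image dataset.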
Lastly, the authors claim that WSOL is an ill-posed problem when the background is more strongly associated with a class than the foreground. They are correct that in this case the model will never be able to predict the intended mask. However, this seems to be a problem with the data rather than with the WSOL formulation. As the authors themselves suggest, it should instead be fixed by providing more diverse data, so that relying on the wrong features, such as water, is no longer beneficial for predicting ducks. With or without data cleaning, the model is still finding patterns present in the given dataset.
Evaluating Weakly Supervised Object Localization Methods Right. Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata and Hyunjung Shim. arXiv:2001.07437, 2020.
An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization. Yiqiu Shen, Nan Wu, Jason Phang, Jungkyu Park, Kangning Liu, Sudarshini Tyagi, Laura Heacock, S. Gene Kim, Linda Moy, Kyunghyun Cho and Krzysztof J. Geras. arXiv:2002.07613, 2020.
The NYU Breast Cancer Screening Dataset v1.0. Nan Wu, Jason Phang, Jungkyu Park, Yiqiu Shen, S. Gene Kim, Laura Heacock, Linda Moy, Kyunghyun Cho and Krzysztof J. Geras. https://cs.nyu.edu/~kgeras/reports/datav1.0.pdf