Questions about some of the details in the MIMIC-CXR-VQA dataset construction process #5
Hi @zihui-debug, sorry for the late reply, and thank you for asking about the sampling strategy used when constructing the MIMIC-CXR-VQA dataset. Initially, we struggled to set up the sampling strategies, as you mentioned. We faced several questions, such as "How many (image, question, answer) triples do we need?" and "For each template, should we sample all objects and attributes equally, or let them be sampled randomly?". After much consideration, I can briefly summarize our VQA sampling strategy (similar to the "VQA dataset generation - dataset balancing" part in Appendix B.2.2 of our paper) as follows:
Specifically, to address your questions, we aimed to use as many question combinations as possible (i.e., all combinations of placeholders for each template). In other words, while not every combination was fully utilized, most were considered. For the negative options, we sampled images so as to maximize the answer entropy. If you have any detailed questions or would like to see the sampling code, I can share the raw-level code with you, but please note that it's a bit messy. Best,
@zihui-debug Here is an addendum: to ensure a balanced frequency of different objects and attributes, I almost certainly sampled a similar number of diverse questions within each template. In other words, I sampled different combinations of placeholders with roughly the same frequency (though I'm not 100% certain). However, due to the long-tailed distribution of X-ray findings themselves, my algorithm cannot guarantee sampling all combinations equally. As a result, some combinations fail (e.g., there is no case with both object1 and attribute1 to sample for a specific question template, especially in the gold dataset), and some are infrequent (e.g., there are many cases (i.e., related attributes) involving the left lung but few involving the trachea).
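The balancing idea described above could be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: `sample_balanced` and its arguments are hypothetical names, and the strategy (greedily preferring the least-used placeholder combination) is only one plausible way to keep per-combination frequencies near-uniform while skipping combinations with no matching study.

```python
import random
from collections import Counter

def sample_balanced(valid_combos, n_samples, seed=0):
    """Sample placeholder combinations for one template, always preferring
    the least-used combination so far to keep frequencies roughly uniform.
    Combinations with no matching study are assumed to be absent from
    valid_combos, which is why rare findings (e.g., involving the trachea)
    stay under-represented no matter how the sampler balances.
    Hypothetical sketch; not the MIMIC-CXR-VQA implementation."""
    rng = random.Random(seed)
    counts = Counter()
    samples = []
    for _ in range(n_samples):
        least = min(counts[c] for c in valid_combos)
        pool = [c for c in valid_combos if counts[c] == least]
        combo = rng.choice(pool)          # break ties randomly
        counts[combo] += 1
        samples.append(combo)
    return samples
```

With two valid combinations and four samples, each combination is drawn exactly twice, regardless of the random tie-breaking.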
Thank you very much for your reply! Can I understand the whole sampling process as follows:
Some supplementary questions:
Maybe I misunderstood, and some of the questions may seem elementary... Thanks for your meticulous answers about this great work! If convenient, could you please send the raw-level code to my email address, [email protected]? It would be very helpful to me.
@baeseongsu Moreover, I see that the data doesn't use attributes of the 'nlp' and 'texture' types; what was the consideration behind that?
@zihui-debug Yes, that's exactly what we've done. Thank you for the clear summarization. Regarding the supplementary questions, my responses are as follows:
A1: Not exactly. The combinations for placeholder values depend on the template used for sampling. For example: (1) If the template contains only one {object} placeholder, the possible combinations would be all objects (i.e., less than 40 possible combinations); (2) If the template contains multiple placeholders (e.g., {attribute1}, {attribute2}, and {object}), there could be over 1,000 possible combinations to sample. Note that we consider all possible combinations during sampling. However, some combinations may not be sampled if corresponding studies don't exist in the MIMIC-CXR cases.
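The combination counts in A1 can be reproduced with a quick enumeration. This is an illustrative sketch with toy object/attribute lists, not the dataset's real vocabulary: a single-placeholder template yields one combination per object, while a template with one `{object}` and two distinct `{attribute}` slots yields an ordered cross product.

```python
from itertools import product

objects = ["left lung", "right lung", "trachea"]            # toy subset
attributes = ["atelectasis", "pneumothorax", "lung opacity"]  # toy subset

# Template with a single {object} placeholder: one combination per object.
single = [(o,) for o in objects]

# Template with {object}, {attribute1}, {attribute2}: all ordered pairs of
# distinct attributes crossed with every object.
multi = [(o, a1, a2)
         for o, a1, a2 in product(objects, attributes, attributes)
         if a1 != a2]

print(len(single), len(multi))  # → 3 18
```

With the full vocabulary (dozens of objects, dozens of attributes), the multi-placeholder count grows multiplicatively, which is how a template can exceed 1,000 possible combinations even before filtering out those with no matching MIMIC-CXR study.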
A2: This is an important question. We assume that if there's no information in the report, especially regarding the five categories we've covered, we regard it as "no." This is because radiology reports should be comprehensive; incompleteness could lead to patient care issues. This is why we do further preprocessing on the original Chest Imagenome dataset.
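The "absence means no" assumption in A2 amounts to a simple lookup rule. The function and annotation schema below are hypothetical (the real Chest ImaGenome scene graphs are richer), but they capture the stated logic: if the report's annotations never attach the queried attribute to the queried object, the answer defaults to "no" on the assumption that radiology reports are comprehensive.

```python
def answer_presence(report_annotations, obj, attribute):
    """Return the yes/no answer for '(obj, attribute) present?'.
    report_annotations maps each object to the set of attributes the
    report mentions for it (hypothetical simplified schema).
    A missing mention is treated as 'no', not as 'unknown'."""
    mentioned = report_annotations.get(obj, set())
    return "yes" if attribute in mentioned else "no"
```

Under this rule, even an object that never appears in the annotations (e.g., the trachea in a report that only discusses the lungs) yields a definite "no" rather than a missing answer.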
A3: No, we did not balance the answer distribution for query types. It is more complicated to construct such logic, but the main scheme is the same as for the others: sampling answers that maximize the answer entropy. Also, I will share the code ASAP.
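The entropy-maximizing scheme mentioned in A1 and A3 can be sketched as a greedy step. This is a hypothetical illustration, not the authors' code: among candidate (image, answer) pairs, pick the one whose answer would most increase the Shannon entropy of the running answer distribution, which nudges the distribution toward uniform.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a Counter of answer frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def pick_next(candidates, counts):
    """Greedy step: among candidate (image, answer) pairs, choose the one
    whose answer yields the highest entropy after being added to the
    running answer counts. Hypothetical sketch of entropy-based balancing."""
    def score(ans):
        trial = counts.copy()
        trial[ans] += 1
        return entropy(trial)
    return max(candidates, key=lambda pair: score(pair[1]))
```

For example, if "yes" already dominates the running counts, the greedy step prefers a candidate whose answer is "no", since that flattens the distribution and raises the entropy.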
@zihui-debug For the 'nlp' category, we've replaced its concept with our own "abnormality" concept by pre-defining it as a superset of the other four categories (i.e., 'anatomicalfinding', 'device', 'disease', and 'tubesandlines', excluding 'technicalassessment'). As for the 'texture' category, it occurs less frequently compared to the other categories and is tricky; it's more of a modifier that can attach to other attributes rather than an independent attribute.
Very helpful answer, thanks! Regarding the 'nlp' attribute (with the value normal or abnormal): how do you deal with cases where both normal and abnormal annotations exist for the same object in the Chest ImaGenome gold dataset?
@zihui-debug Sorry for the late reply!
Thank you for your outstanding work! I'm sorry to bother you, but I still have questions about some of the details of the data construction process. I'm wondering how you choose the value of each placeholder when filling in the question templates. The Chest ImaGenome dataset contains a large number of annotations for anatomical regions and attributes that, if used in full, would yield quite a number of samples. So what is the specific sampling strategy? For example, when constructing an abnormality "choose"-type question related to an X-ray, are the anatomical regions randomly selected? In addition, how do you set the negative samples among the candidate options?