Tanveer shared with the participants an exciting challenge that was taken up at IBM Research. Is artificial intelligence (AI) as accurate as a radiologist? Can AI pass the Turing test (a method of inquiry in AI, used to ascertain whether a computer can think like a human) in chest x-ray interpretation?
The idea was to solve the problem at full scale using a single modality – going from a chest x-ray image to a text report – at a level where it could not be told whether a human or a machine had written the report. Why were chest x-rays chosen? Of the two billion x-rays taken in the United States, 60 percent are chest x-rays. Their findings had not been completely catalogued before, so benchmarks and evaluation mechanisms needed to be set. AI would need to capture fine-grained descriptions [that is, distinguish subordinate categories within entry-level categories] in natural language. Here, the challenge was to show that artificial intelligence performed at the level of an entry-level radiologist.
Chest x-rays are among the hardest medical images to interpret. This is because: (a) there are technical assessment issues; (b) opacities (grey-coloured areas) can have many causes, such as pneumonia; (c) the same finding can have multiple origins; (d) the tubes/lines present could be, for example, a Swan-Ganz catheter or a nasogastric tube, and there may be placement issues, artifacts and ambiguities. These pose hard data science problems.
Radiologists were asked how they interpreted chest x-rays. First, a technical assessment is done. The viewpoint and position (frontal or lateral view; anterior or posterior view) are ascertained. Are there any tubes or lines? Are they properly positioned? Are there any devices or artifacts? Are there any diseases or anatomical abnormalities? From these observations, a report is generated.
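As an illustration only, the reading workflow described above could be captured as a structured record before a report is written. The following Python sketch uses hypothetical field names and values; it is not IBM's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChestXrayAssessment:
    """Hypothetical structured record mirroring the radiologist's checklist."""
    view: str                        # "frontal" or "lateral"
    projection: str                  # "AP" (anterior-posterior) or "PA"
    tubes_and_lines: List[str] = field(default_factory=list)
    placement_issues: List[str] = field(default_factory=list)
    devices_and_artifacts: List[str] = field(default_factory=list)
    abnormal_findings: List[str] = field(default_factory=list)

# Example instance: a frontal PA film with a nasogastric tube and an
# elevated right hemidiaphragm (values are made up for illustration).
example = ChestXrayAssessment(
    view="frontal",
    projection="PA",
    tubes_and_lines=["nasogastric tube"],
    abnormal_findings=["elevated hemidiaphragm, right, mild"],
)
```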
All the possible findings in chest x-rays were catalogued. This is the largest assembly of chest x-ray findings – 237 discrete findings validated against textbooks, radiology education material and other resources. The vocabulary was curated bottom-up, accelerated by an IBM DLA (discovery library adapter) tool.
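To illustrate what bottom-up vocabulary curation can look like in practice, here is a minimal sketch that counts frequent word n-grams in report sentences as candidate finding terms. The corpus, thresholds and function names are assumptions for illustration, not the actual IBM DLA pipeline, which also involved expert validation of every term.

```python
from collections import Counter
import re

def mine_candidate_findings(report_sentences, min_count=5, max_ngram=3):
    """Count word n-grams across report sentences as candidate finding terms.

    A real curation pipeline would also normalise synonyms, drop stop words
    and have radiologists validate each term; this only does the frequency step.
    """
    counts = Counter()
    for sentence in report_sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        for n in range(1, max_ngram + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return [phrase for phrase, c in counts.most_common() if c >= min_count]

# Tiny made-up corpus of report sentences, for illustration only.
sentences = [
    "The right hemidiaphragm is mildly elevated.",
    "Elevated right hemidiaphragm, otherwise clear lungs.",
    "No pleural effusion or pneumothorax.",
]
print(mine_candidate_findings(sentences, min_count=2))
```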
Instead of separate models for specialised findings, a single fine-grained deep learning model was built for all the findings. The question asked was “Is a given chest x-ray normal or abnormal?” The approach used for report generation was to combine deep learning with document retrieval methods. For example, if the coarse finding was ‘elevated hemidiaphragm’, the fine-grained finding would be ‘elevated hemidiaphragm mild’ and the report would say ‘The right hemidiaphragm is mildly elevated.’ This automatic report generation was tested against a ground-truth dataset of 2964 unique reports from a certain collection.
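A minimal sketch of the label-to-sentence step described above is shown below; the label encoding and templates are assumptions for illustration, not the system's actual mapping. In the described system, report text is assembled with the help of document retrieval over prior reports; the fixed dictionary here simply stands in for that retrieval step.

```python
# Hypothetical mapping from fine-grained finding labels to report sentences.
# The "finding|modifier|location" encoding is an assumption for illustration.
FINDING_TO_SENTENCE = {
    "elevated hemidiaphragm|mild|right": "The right hemidiaphragm is mildly elevated.",
    "pleural effusion|small|left": "There is a small left pleural effusion.",
    "no finding": "The lungs are clear. No acute cardiopulmonary abnormality.",
}

def generate_report(predicted_labels):
    """Assemble a simple report from a model's predicted fine-grained labels."""
    if not predicted_labels:
        predicted_labels = ["no finding"]
    sentences = [FINDING_TO_SENTENCE.get(label, f"Finding noted: {label}.")
                 for label in predicted_labels]
    return " ".join(sentences)

print(generate_report(["elevated hemidiaphragm|mild|right"]))
```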
Clinical studies were performed to establish the benchmark, including studies to discriminate between normal and abnormal cases. A field pilot study was conducted at Deccan Hospital, Hyderabad, India, and a performance comparison was done between radiologists and artificial intelligence. In some cases the radiologists performed better, and in others the AI did; there were also differences among the radiologists themselves. Overall, there was no statistically significant difference in sensitivity between artificial intelligence and an entry-level radiologist!
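To make the comparison concrete, here is a minimal sketch of how sensitivity for AI and a radiologist might be computed against a ground truth and compared with an exact McNemar-style test on paired reads. The data and function names are invented for illustration; this is not the actual study's protocol or analysis.

```python
from scipy.stats import binom

def sensitivity(predictions, truth):
    """Fraction of truly abnormal cases (truth == 1) flagged as abnormal."""
    positives = [p for p, t in zip(predictions, truth) if t == 1]
    return sum(positives) / len(positives)

def mcnemar_exact_p(pred_a, pred_b, truth):
    """Exact two-sided McNemar test on the abnormal cases: is one reader
    missing findings the other catches more often than chance suggests?"""
    b = sum(1 for a, r, t in zip(pred_a, pred_b, truth) if t == 1 and a == 1 and r == 0)
    c = sum(1 for a, r, t in zip(pred_a, pred_b, truth) if t == 1 and a == 0 and r == 1)
    n = b + c
    if n == 0:
        return 1.0
    return min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))

# Toy paired reads (made up): 1 = abnormal, 0 = normal.
truth    = [1, 1, 1, 1, 0, 0, 1, 1]
ai_pred  = [1, 1, 0, 1, 0, 1, 1, 1]
rad_pred = [1, 0, 1, 1, 0, 0, 1, 1]
print("AI sensitivity:", sensitivity(ai_pred, truth))
print("Radiologist sensitivity:", sensitivity(rad_pred, truth))
print("McNemar p-value:", mcnemar_exact_p(ai_pred, rad_pred, truth))
```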