Artificial intelligence is better at analyzing medical images for illnesses like pneumonia and skin cancer than doctors are, according to a number of academic papers. But that conclusion is being called into question by recent research.
A paper published in March in the medical journal The BMJ found that many of those studies exaggerated their conclusions, making A.I. technologies seem more effective than they were in reality. The finding is significant because it undermines a huge ongoing shift in the health care industry, which is looking to use technology to more quickly diagnose ailments.
It also calls into question a tech industry that is scrambling to develop and sell A.I. technology for analyzing medical imagery. The paper’s authors are worried that overzealous companies and their investors may push to sell the technology before it has been thoroughly vetted.
“With no disrespect to venture capitalists—obviously they’re an important part of the funding process for a lot of this innovation—but obviously their enthusiasm is always to try and get things to market as quickly as possible,” says Myura Nagendran, a coauthor of the BMJ paper. “While we share that enthusiasm, we’re also acutely aware of how important it is to make sure these things are safe and work effectively if we institute them en masse.”
The finding also touches on the current coronavirus pandemic, which has claimed over 30,000 lives in the U.S. Some researchers maintain that they’ve developed A.I. systems that are faster than humans at examining chest CT scans for COVID-19 infections.
The recent BMJ review looked at nearly 100 studies of a type of artificial intelligence called deep learning that had been used on medical scans of various disorders, including macular degeneration, tuberculosis, and several types of cancer.
The review found that 77 studies that lacked randomized testing included specific comments in their abstracts, or summaries, comparing their A.I. system’s performance to that of human doctors. Of those, 23 said that their A.I. was “superior” to clinical physicians at diagnosing certain illnesses.
One of the main problems with these papers is “an artificial, contrived nature of a lot of these studies,” in which researchers basically claim that their technology “outperformed a doctor,” says Eric Topol, one of the BMJ paper’s authors and the founder and director of the nonprofit Scripps Research Translational Institute. It’s absurd to compare an A.I.’s performance to that of human doctors, he explains, because in the real world, the choice between an A.I. system and a human doctor is not an either-or situation. Doctors will always review the findings.
“There’s this kind of nutty inclination to pit machines versus doctors, and that’s really a consistent flaw because it’s not only going to be machines that do readings of medical images,” Topol says. “You’re still going to have oversight if there’s anything reported that’s life-threatening or serious.”
Topol adds, “The point I’m just getting at, is if you look at all these papers, the vast majority—90%—do the man-versus-machine comparison, and it really isn’t necessary to do that.”
Nagendran, an academic clinical fellow for the U.K.’s National Institute for Health Research, says that studies describing A.I.’s superiority to human doctors can mislead people.
“There’s been a lot of hype out there, and that can very quickly translate through the media into stories that patients hear, saying things like, ‘It’s just around the corner, the A.I. will be seeing you rather than your doctor,’” says Nagendran.
Besides the core fallacy of pitting A.I. against humans, one of the big problems is that these papers typically fail to follow the more robust reporting standards that health care professionals have been trying to establish over the past decade, Nagendran says. One sore point, for instance, is that the papers generally fail to measure the accuracy of their deep-learning models on multiple data sets, which could include different populations of people, rather than on just a limited number.
Luke Oakden-Rayner, a director of medical imaging research at the Royal Adelaide Hospital in Australia, noticed a similar problem when he examined a handful of recently published papers on using deep learning to diagnose COVID-19 via chest CT scans. Like the faulty medical imaging studies that the BMJ paper described, the coronavirus-related papers based their conclusions on a limited amount of data that was not representative of the entire population, a problem that’s known as selection bias.
In one paper, Oakden-Rayner noted, researchers developed a deep-learning system to recognize the coronavirus using data taken from 1,014 patients at Tongji University in Shanghai. These patients were diagnosed as having COVID-19 via the conventional swab tests used to detect the illness; they also had chest CT scans to see whether the infection had reached their lungs.
But that deep-learning system was likely trained on skewed data. Doctors probably suspected those patients had lung problems related to COVID-19, which is why they ordered CT scans of the patients’ chests. The same technology would be unlikely to work well with people who have COVID-19 but don’t have any symptoms in their lungs.
“As a general rule more accurate and complete data sets are more useful,” Oakden-Rayner says in an email to Fortune.
Oakden-Rayner questioned the need for A.I. researchers to even publish papers about using deep learning to diagnose the coronavirus, explaining that current testing is already effective and that there are more important jobs that A.I. can help with.
“Simply detecting COVID-19 on CT scans is unlikely to be very helpful,” Oakden-Rayner says in the email. “If there is a bottleneck that A.I. can solve in the medical workflow, then data for that task specifically will need to be collected.”
Topol agrees with Oakden-Rayner, saying, “It can be useful to have an algorithm review of a CT scan of lungs as to whether they are potentially related to COVID, but you don’t really need a CT scan.”
More conventional testing tools are increasingly being distributed worldwide, making them more available than CT scans, which are more expensive, Topol explains.
The takeaway from all these recent A.I. medical imaging studies is that people should apply some skepticism when considering their findings, says Topol. These are essentially preliminary research papers that highlight potential uses of A.I. in the current health care system, but researchers still need deeper clinical trials to verify the technology’s effectiveness.
“You can’t just go right ahead to a prospective study,” Topol says regarding a more formal type of academic study that typically follows preliminary research. “You just don’t want to overstate the conclusions.”