Dilation may be an important step to reduce the rate of ungradable images
This article was reviewed by Aaron Y. Lee, MD, MSCI
Artificial intelligence (AI) screening systems for diabetic retinopathy (DR) are not all created equal, and as a result, their performance differs.
Researchers compared several AI screening systems using the US Department of Veterans Affairs (VA) screening program. The results highlight the need for independent validation studies before clinical use, according to Aaron Y. Lee, MD, MSCI, from the University of Washington in Seattle.
The study
Initially, 23 companies were invited to participate in the masked study. Several agreed to participate and contributed AI models: ADCIS (Advanced Concepts in Imaging Software), Airdoc, Eyenuk, Retina-AI Health, and Retmarker.
The investigators extracted data from the VA teleretinal screening program for VA Puget Sound in Seattle and the Atlanta VA Health Care System in Georgia, including the images and the original VA teleretinal grades.
This yielded a full data set of 311,604 images from 23,724 patients with diabetes, along with a subset of about 7000 images that was set aside for arbitration. All the patients had type 2 diabetes and no previous diagnosis of DR.
The 7 algorithms were run, and the output indicated whether the patients should be referred. The performances were then compared with the VA teleretinal grades.
In the subset used for arbitration, 2 ophthalmologists graded the images independently and a retinal specialist performed masked arbitration, according to Lee.
A few baseline differences were seen between the Atlanta and Seattle groups. Investigators found a 10-fold difference in the proliferative DR (PDR) rate between the 2 locations.
In Atlanta, dilation is routine practice, but, Lee noted, that is not the case in Seattle. This resulted in a large discrepancy in the rate of ungradable images between the 2 sites.
The algorithm output, he explained, was set for no DR versus the presence of any degree of DR because of the VA’s practice pattern.
Results
Analysis of the full data sets from both sites showed an overall high negative predictive value and a low positive predictive value.
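The pairing of a high negative predictive value with a low positive predictive value is what one would expect when a test is applied to a population in which the disease is uncommon. The following is a minimal sketch of that relationship using Bayes' rule. The sensitivity, specificity, and prevalence figures are hypothetical illustrations, not values reported in the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical inputs chosen only to illustrate the effect of prevalence.
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.05)
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.30)

print(f"prevalence  5%: PPV={ppv_low:.2f}, NPV={npv_low:.2f}")
print(f"prevalence 30%: PPV={ppv_high:.2f}, NPV={npv_high:.2f}")
```

With the same assumed sensitivity and specificity, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is why the same algorithm can look different from one screening population to another.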
In the arbitrated data set, the VA teleretinal grader was compared directly with the AI models: Algorithm A had a significantly lower sensitivity but higher specificity, B had lower sensitivity and specificity, C and D had the same sensitivity but lower specificity, E and F had significantly higher sensitivity and lower specificity, and G could not be differentiated from the VA grader.
Lee explained what he considers to be the most important finding: how the sensitivity of the various algorithms changed with the severity of retinopathy.
“The VA grader demonstrated 100% sensitivity for moderate and severe non-PDR [NPDR] and PDR. Algorithms E, F, and G were statistically similar to the VA grader for moderate NPDR or higher, and these algorithms were carried forward for future analysis,” he said.
The investigators simulated a 2-stage screening system to measure the amount of labor savings in a cost analysis if the 3 algorithms were implemented within the VA. The cost of an ophthalmologist reading the images would be about $15 per encounter.
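A rough sketch of how such a labor-savings estimate could be set up follows. It assumes, as an illustration only, that the AI reads every encounter first and that a human reads only the encounters the AI flags. The function name, the encounter count, and the sensitivity, specificity, and prevalence values are hypothetical. Only the $15 reading cost comes from the study, and the sketch ignores ungradable images.

```python
def two_stage_savings(n_encounters, sensitivity, specificity, prevalence,
                      cost_per_read=15.0):
    """Estimate human reading cost saved when AI screens first.

    Assumption: only AI-positive encounters are passed on to a human reader.
    """
    # Fraction of encounters the AI flags: true positives plus false positives.
    flagged_rate = (sensitivity * prevalence
                    + (1 - specificity) * (1 - prevalence))
    human_reads = n_encounters * flagged_rate
    baseline_cost = n_encounters * cost_per_read   # a human reads everything
    two_stage_cost = human_reads * cost_per_read   # a human reads flagged only
    return baseline_cost - two_stage_cost


# Hypothetical scenario: 1000 encounters, 5% prevalence.
savings = two_stage_savings(1000, sensitivity=0.90, specificity=0.90,
                            prevalence=0.05)
print(f"Estimated savings: ${savings:,.0f}")
```

In this setup the savings depend heavily on specificity, because false positives are the main source of encounters that still need a human read when disease is uncommon.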
A limitation of this study is that some analyses, including the cost analysis, may apply only to the VA setting.
According to Lee, the investigators found that dilation may be important to reduce the rate of ungradable images.
“The algorithms varied tremendously in performance despite having regulatory approval and/or having been clinically deployed somewhere,” he concluded. “It is important to understand the AI models in the context of the underlying disease prevalence in order to understand the negative and positive predictive values. We believe that external, independent validation with real-world imaging is crucial before deployment, even after algorithms receive regulatory approval.”
Read more by Lynda Charters
---
Aaron Y. Lee, MD, MSCI
e:leeay@uw.edu
Lee is a consultant to Genentech, Verana Health, and Topcon.