Ioanna Constantinidi

PhD candidate in Psychiatric Epidemiology





2nd Department of Psychiatry, Attikon General Hospital

National and Kapodistrian University of Athens



How can AI assist in the screening process of a systematic review?


September 16, 2025

Note: These are just my first impressions and experiments with AI-assisted screening tools. I haven’t done a comprehensive literature check — this post is more about sharing early observations and sparking discussion.
Framing the questions 
As part of my PhD, I am in the pilot phase of a systematic review, currently experimenting with available approaches to screening. Given that I use LLMs like ChatGPT every day, I wondered whether AI could help here. You could imagine an application where an LLM screens the title and abstract of thousands or tens of thousands of papers and flags which ones are relevant and worth reading in full text. Could AI realistically automate title/abstract screening?
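Before getting into any tools, here is roughly the shape I have in mind, as a minimal sketch. Every name in it is a placeholder: the record fields, the criteria string, and especially screen_record, which stands in for whatever model or service would actually make the decision (a concrete prompt example comes later in this post).

```python
# The overall shape of automated title/abstract screening (a sketch,
# not a real tool): loop over records, ask a model for a decision,
# keep the ones flagged for full-text review.

def screen_record(title: str, abstract: str, criteria: str) -> str:
    """Placeholder for an LLM (or other model) call.

    Here it just returns 'include'; a real version would read the
    title/abstract against the eligibility criteria.
    """
    return "include"

criteria = "Adults; observational epidemiological studies; ..."  # hypothetical

records = [  # in practice: thousands of title/abstract records from the search
    {"id": 1, "title": "Example title A", "abstract": "Example abstract A."},
    {"id": 2, "title": "Example title B", "abstract": "Example abstract B."},
]

decisions = [
    {"id": rec["id"],
     "decision": screen_record(rec["title"], rec["abstract"], criteria)}
    for rec in records
]

# Only the records flagged 'include' would move on to full-text reading.
full_text_ids = [d["id"] for d in decisions if d["decision"] == "include"]
print(full_text_ids)
```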
In this series of blog posts, I’ll go through the things I tried. But before that, I want to highlight that it’s not just about automation.
The impact on reproducibility 
Most published reviews state that “two independent reviewers screened the records, with a third in case of conflicts”. However, to my knowledge, very little detail is shared about how the thousands of abstracts are actually handled. What was the reasoning for excluding a particular paper? Can a human reviewer truly apply the exact same criteria when screening thousands of abstracts? Humans are inevitably prone to subtle biases and inconsistencies that are difficult to detect. In contrast, an AI could potentially offer greater consistency when handling such large volumes of work.
Testing Rayyan’s AI 
With this in mind, I set up Rayyan, as it stood out as a more affordable option compared to Covidence and Distiller. The first thing I noticed is that Rayyan needs at least 50 manual screening decisions before its AI can start suggesting classifications. This suggested to me that Rayyan is not using LLMs, but perhaps an earlier generation of machine-learning models.
Once trained, it sorts references into five categories: most likely to include, likely to include, no recommendation, likely to exclude, and most likely to exclude. The AI can be re-run as more manual decisions are added. 
Being used to LLMs, I had expected that, like ChatGPT, it would be able to share its reasoning. Instead, the tool does not disclose which algorithms it uses, how decisions are made, or whether it has been validated — leaving it opaque and difficult to trust.
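Since the algorithm is undisclosed, I can only guess at what is happening under the hood, but a classical (pre-LLM) prioritiser could look something like the sketch below: a simple text classifier trained on the manually labelled records, with its predicted probabilities binned into the same five categories. This is a scikit-learn illustration of the general idea, not Rayyan's actual method.

```python
# A guess at how a non-LLM screening prioritiser might work:
# train a simple classifier on the manually labelled records,
# then bin predicted inclusion probabilities into five levels.
# This is NOT Rayyan's disclosed method - just an illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical manually screened records (title + abstract concatenated).
labelled_texts = [
    "cohort study of depression incidence in adults ...",
    "case report of a rare drug reaction ...",
]
labels = [1, 0]  # 1 = include, 0 = exclude (Rayyan asks for >= 50 of these)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(labelled_texts)

model = LogisticRegression()
model.fit(X, labels)

def recommend(text: str) -> str:
    """Map the predicted inclusion probability to five categories."""
    p = model.predict_proba(vectorizer.transform([text]))[0, 1]
    if p >= 0.8:
        return "most likely to include"
    if p >= 0.6:
        return "likely to include"
    if p > 0.4:
        return "no recommendation"
    if p > 0.2:
        return "likely to exclude"
    return "most likely to exclude"

print(recommend("prospective cohort study of anxiety disorders ..."))
```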
Beyond Rayyan: the potential of LLMs 
But could LLMs address some of these gaps? 
Unlike pre-set systems, an LLM like ChatGPT could be instructed with tailored prompts, making the process reproducible and customizable. The prompt itself could be reported in the methods section, increasing transparency. An LLM could even provide its rationale for each decision, rather than only a binary include/exclude label. In theory, this would allow reviewers not only to check AI output but also to compare it with their own reasoning. 
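As an illustration, a prompt of this kind could look like the sketch below, written with the openai Python client. The criteria, the model name, and the JSON output format are all assumptions I made up for this example (it is not the prototype from my next post), but the key point is that the full prompt text could be reported verbatim in a methods section.

```python
# Sketch of LLM-assisted title/abstract screening with a reportable prompt.
# The criteria and model name are placeholders; adapt to your own review.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCREENING_PROMPT = """You are screening records for a systematic review.
Inclusion criteria (hypothetical example):
- Population: adults with a psychiatric diagnosis
- Design: observational epidemiological studies
Exclusion criteria:
- Case reports, editorials, animal studies

Based ONLY on the title and abstract below, reply with JSON:
{{"decision": "include" | "exclude" | "unclear", "rationale": "<one sentence>"}}

Title: {title}
Abstract: {abstract}
"""

def screen(title: str, abstract: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; use whichever model you have access to
        messages=[{
            "role": "user",
            "content": SCREENING_PROMPT.format(title=title, abstract=abstract),
        }],
    )
    # The decision and its rationale come back together, so both can be
    # logged and compared against the human reviewers' decisions.
    # A more robust version would validate or retry if this is not valid JSON.
    return json.loads(response.choices[0].message.content)
```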
The critical question, of course, is reliability: could such an approach be robust enough to act as an additional screener, or even replace a human reviewer? This is what I have begun to explore by building a prototype using ChatGPT-5, which I will share in my next post. 
Toward greater transparency 
Finally, I think there is room for broader improvement. Tools like Rayyan already record individual screening decisions — but what if this information were shared at the time of publication? Making the screening logs accessible would provide proof that dual reviewing was actually carried out and could add substantial value for reproducibility and quality assurance in systematic reviews. 
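As an example, the log I have in mind could be as simple as one row per record with both reviewers' decisions and the reason for exclusion, published as a supplementary file. The column names below are just my own suggestion:

```python
# A hypothetical screening log that could be shared alongside a review.
import csv

log = [
    {"record_id": 1, "reviewer_1": "include", "reviewer_2": "include",
     "final_decision": "include", "exclusion_reason": ""},
    {"record_id": 2, "reviewer_1": "include", "reviewer_2": "exclude",
     "final_decision": "exclude",
     "exclusion_reason": "wrong study design (conflict resolved by third reviewer)"},
]

with open("screening_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(log[0].keys()))
    writer.writeheader()
    writer.writerows(log)
```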
