
Ioanna Constantinidi

PhD candidate in Psychiatric Epidemiology





2nd Department of Psychiatry, Attikon General Hospital

National and Kapodistrian University of Athens

Can AI screen titles/abstracts in a systematic review?


My experience with Rayyan and implications for reproducibility and transparency


September 16, 2025

As part of my PhD, I am in the pilot phase of a systematic review, currently experimenting with different approaches to screening. Given that I use large language models (LLMs) like ChatGPT every day, I wondered if AI could help me do it faster. You could imagine a workflow where LLMs screen the titles and abstracts of thousands or tens of thousands of papers, and indicate which ones are relevant and worth reading full-text. Could AI realistically automate title/abstract screening? 
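To make the idea concrete, here is a minimal sketch of what such a loop could look like with the OpenAI Python client. Everything in it is an assumption for illustration: the criteria text, the model name, and the `records` list (e.g., parsed from a RIS or CSV export) are placeholders, not a tested pipeline.

```python
# Sketch: ask an LLM for an include/exclude label per title/abstract.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = """Include if: (1) observational study in adults,
(2) reports a psychiatric outcome. Exclude otherwise."""  # placeholder criteria

def screen(title: str, abstract: str) -> str:
    """Return a single-word INCLUDE/EXCLUDE label from the model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model
        temperature=0,   # reduce run-to-run variation
        messages=[
            {"role": "system",
             "content": "You screen records for a systematic review. "
                        "Answer with exactly one word: INCLUDE or EXCLUDE.\n"
                        + CRITERIA},
            {"role": "user", "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip()

# records: list of dicts parsed from your reference-manager export (placeholder)
for record in records:
    record["decision"] = screen(record["title"], record.get("abstract", ""))
```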
In this series of blog posts, I’ll document my attempts to answer the question. But before that, I want to highlight that it’s not just about automation. 
The impact on transparency and consistency  
Most systematic reviews state that “two independent reviewers screened the records, with a third reviewer in case of conflicts”. However, to my knowledge, very little detail is shared about how the thousands of abstracts were actually handled. What was the reasoning behind excluding a particular paper? Which papers did the reviewers disagree on? To me, the process suffers from a lack of transparency.

Additionally, can a human reviewer truly apply the exact same criteria when screening thousands of abstracts? Humans are inevitably prone to subtle biases and inconsistencies that are difficult to detect. In contrast, an AI workflow could potentially offer greater consistency when handling such large volumes of work. 
Testing Rayyan’s AI 
With this in mind, I set up Rayyan, as it is more affordable than Covidence or Distiller. The first thing I noticed is that Rayyan required me to manually screen at least 50 titles/abstracts before its AI could start suggesting classifications. This made me suspect that Rayyan is not using LLMs but rather an earlier generation of machine learning (ML) models, which typically need a set of labeled training examples before they can classify anything.
Once trained, Rayyan's AI rates papers into five categories: most likely to include, likely to include, no recommendation, likely to exclude, and most likely to exclude. Some of these ratings can be wrong, and the reviewer is free to adjust them; the AI can then be re-run as more manual decisions are added.
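I don't know Rayyan's internals, so the following is purely a guess at how a classic ML screener of this kind might work: fit a simple text classifier (here, TF-IDF features plus logistic regression via scikit-learn) on the manually screened records, then map predicted probabilities onto the five recommendation levels. The variables `labeled_texts`, `labels`, and `unscreened_texts` are placeholders.

```python
# Guess at a classic ML screener: train on the manually screened records,
# then bucket predicted probabilities into Rayyan-style recommendations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rate(prob: float) -> str:
    """Map an inclusion probability onto the five recommendation levels."""
    if prob >= 0.9:
        return "most likely to include"
    if prob >= 0.7:
        return "likely to include"
    if prob > 0.3:
        return "no recommendation"
    if prob > 0.1:
        return "likely to exclude"
    return "most likely to exclude"

# labeled_texts / labels: the >=50 manually screened abstracts (1 = include)
vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression().fit(vectorizer.fit_transform(labeled_texts), labels)

# score the remaining, unscreened abstracts
probs = model.predict_proba(vectorizer.transform(unscreened_texts))[:, 1]
suggestions = [rate(p) for p in probs]
# as more manual decisions come in, refit and re-score (the "run again" step)
```

If something like this is what is happening under the hood, it would also explain the requirement for ~50 manual decisions first: a classifier of this kind has nothing to learn from until it sees labeled examples.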
The papers rated as most likely to exclude were relatively few compared to the total. Many classifications were wrong, especially for papers whose abstract was missing. I was left feeling that this does not meaningfully speed up the screening.

But most importantly, being used to LLMs, I had expected that, like ChatGPT, it would be able to share its thinking process. Instead, the tool gives no insight into how its decisions are made. Below, I explain why I think this matters.
Toward greater reproducibility 
My experience with AI has made me think of it more or less as an automated logical person. I can give it tasks, and it will give me back answers, while also explaining its reasoning.

In the context of title/abstract screening, I should be able to give it a precise list of inclusion/exclusion criteria, and it should apply them consistently to each paper while also sharing its thought process.

This means that the prompt itself, along with the rationale for each decision, could be reported in the methods section and the supplementary material of a systematic review, increasing transparency. A systematic review could include:
  • the instructions to the AI (how the inclusion/exclusion criteria were translated into a very precise and logical prompt)
  • the decision of the AI for each paper
  • the AI's thought process in making this decision.
With this setup, if other researchers used the same prompt, the same AI tool, and the same database of papers, they should obtain the same screening decisions. Improving consistency and transparency enables greater reproducibility.
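As a rough illustration of such a setup, here is a sketch that logs each decision together with the model's rationale to a JSONL file that could be deposited as supplementary material, alongside the prompt itself. The model name, criteria text, and record fields are again placeholders; note also that even at temperature 0, providers do not guarantee bit-identical outputs across runs.

```python
# Sketch: a screening run that leaves an auditable trail. The full prompt,
# each decision, and the model's rationale go into a JSONL file that could
# be shared as supplementary material.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """You screen records for a systematic review.
Criteria: <the precise inclusion/exclusion criteria go here>.
Reply as JSON: {"decision": "include" or "exclude", "rationale": "..."}"""

with open("screening_log.jsonl", "w") as log:
    for record in records:  # same placeholder record list as above
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            temperature=0,   # deterministic settings, though not a hard guarantee
            response_format={"type": "json_object"},  # request valid JSON
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user",
                 "content": f"Title: {record['title']}\n"
                            f"Abstract: {record.get('abstract', '')}"},
            ],
        )
        entry = json.loads(response.choices[0].message.content)
        entry["id"] = record.get("id")  # whatever unique key your export has
        log.write(json.dumps(entry) + "\n")  # one line per paper
```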
Of course, this could already happen without AI, if researchers shared their screening decision for each paper. In practice, however, the sheer volume of papers makes this prohibitive.

Beyond Rayyan: the potential of LLMs 
Could GPT-5 help tackle some of these questions? This is what I have begun to explore by building a prototype, which I will share in my next post.

Notes: 
  • These are just my first impressions and experiments with AI-assisted screening tools. I haven’t done a comprehensive literature check — this post is more about sharing early observations and sparking discussion. 
