🧠 Can AI Improve Scientific Reviews?

PLUS: How a Research Team Almost Made a Perfect Scientific LLM

Welcome back, AI prodigies!

In today’s Sunday Special:

  • 🔬Superb Science, Restricted Reviewers

  • 🫐Meta’s Failure

  • 🧪A Refined Approach

  • 🔑Key Takeaway

Read Time: 8 minutes

🎓Key Terms

  • Natural Language Processing (NLP): the ability of computer programs to understand, interpret, and generate human language.

  • Large Language Models (LLMs): AI systems pre-trained on vast amounts of data to generate human-like text.

  • Hallucinations: AI confidently generates inaccurate, misleading, or incorrect information.

  • Retrieval-Augmented Generation (RAG): a framework designed to make AI models more reliable and accurate by pulling relevant, up-to-date data directly related to a user’s query.

  • Stochastic: involving random variables or chance.

🩺PULSE CHECK

Will incorporating AI into literature reviews reduce trust in science?

Vote Below to View Live Results


🔬SUPERB SCIENCE, RESTRICTED REVIEWERS

As the number of controversial issues across the world grows, some ideas or frameworks must stay above the fray of social media rants, comment-section battles, and rapper feuds.

We hope the scientific method is one of them. Like many things in modern society, the scientific method is on a collision course with AI advancements, a technology some label as the “misinformation maker.” We previously highlighted four benefits and three risks of incorporating AI systems into the academic research process. Let’s recall the exact steps of the academic research process before reiterating its Pros and Cons.

First, researchers observe a phenomenon and formulate a null hypothesis describing the status quo and an alternative hypothesis about their observation. Second, they design and conduct experiments, gather data, and test the null hypothesis, ensuring rigor and anticipating potential biases. Third, they analyze the results to determine whether they support or refute the null hypothesis. Fourth, if the findings are significant and reproducible, researchers compile them into a detailed report. Finally, they submit this report to a peer-reviewed journal, where experts in the field, known as peer reviewers, evaluate its validity and significance. Upon approval, the research is published.
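
To make the third step concrete, here’s a minimal sketch of a null-hypothesis test in Python (the two sample groups are invented purely for illustration):

```python
# A minimal sketch of step three: testing a null hypothesis.
# The data below are invented purely for illustration.
from scipy import stats

control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]    # status quo (null hypothesis)
treatment = [4.6, 4.8, 4.5, 4.9, 4.7, 4.4]  # observed phenomenon

# Two-sample t-test: H0 says the two group means are equal.
t_stat, p_value = stats.ttest_ind(treatment, control)

# A p-value below the conventional 0.05 threshold leads researchers
# to reject the null hypothesis in favor of the alternative.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```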

We believe generative AI (GenAI) shows immense potential as a research assistant in the experimental design phase. AI-powered academic search engines like Consensus can assist humans in retrieving relevant past academic research. Once human labelers teach GenAI the “truth,” industry-grade frameworks like SegmentsAI can label large datasets in seconds. These, however, are narrow tasks where efficiency benefits may trump error risks. The tradeoff is less clear during the literature review process, which requires subject matter experts to wrestle with nuances in technical ecosystems, the epitome of critical thinking. The literature review process is far from perfect, but GenAI may be able to help.
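
For intuition, here’s a rough sketch of the embedding-based retrieval that powers this kind of academic search. The model name and abstracts are placeholders, not any vendor’s actual stack:

```python
# A rough sketch of embedding-based paper retrieval, in the spirit of
# AI-powered academic search engines. The model choice and abstracts
# are placeholders, not any vendor's actual stack.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "We study propane dehydrogenation over Pt-Sn catalysts...",
    "A survey of reinforcement learning for robotic grasping...",
    "Zeolite-supported catalysts for light alkane activation...",
]
query = "catalyst design for propane dehydrogenation"

# Embed the query and abstracts, then rank abstracts by cosine similarity.
doc_vecs = model.encode(abstracts, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]

for score, abstract in sorted(zip(scores.tolist(), abstracts), reverse=True):
    print(f"{score:.3f}  {abstract[:60]}")
```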

Peer reviewers aren’t paid, and reviewing manuscripts takes several months, so many researchers aren’t interested in dedicating themselves to unpaid labor to ensure scientific research quality. According to the International Association of Scientific, Technical, and Medical Publishers (STM), fewer and fewer researchers are available to conduct literature reviews while the number of articles submitted to journals keeps growing. Over the years, the literature review process has exhibited three persistent problems:

  1. Lack of Rigor and Clarity: Reviewers may not follow the Committee on Publication Ethics (COPE) guidelines, as its best practices are somewhat subjective and entirely voluntary.

  2. Research Affiliation Is Overvalued: Though there is no conclusive evidence that researcher affiliation affects approval chances, one study found evidence of bias towards faculty from a journal’s home institution.

  3. Potential Gender Bias: Over 60% of reviewers are men. Based on scientifically verifiable gender differences (e.g., personality), this may produce blind spots, particularly in the hard sciences.

🫐META’S FAILURE

For these reasons, various AI experts wonder whether part of the review process could be automated or accelerated by Natural Language Processing (NLP). As one might expect, prior attempts to automate the review process have fallen short. For example, the Bidirectional Encoder Representations from Transformers (BERT) framework was designed to understand the context of words in a sentence. Unfortunately, BERT couldn’t process images, charts, or graphs and lacked reasoning capabilities.
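
To see what “understanding the context of words in a sentence” means in practice, here’s a minimal sketch using the Hugging Face transformers library; the same word receives a different vector depending on its surroundings:

```python
# A minimal sketch of BERT's contextual word representations:
# the same word gets a different vector depending on its sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual hidden state of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs.input_ids[0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

river = embed_word("The boat drifted toward the bank.", "bank")
money = embed_word("She deposited cash at the bank.", "bank")

# Lower similarity than identical usage would give: context matters.
print(torch.cosine_similarity(river, money, dim=0).item())
```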

Several AI experts have deemed Large Language Models (LLMs) the solution to these problems because they excel at making accurate predictions after training on only a handful of labeled examples. However, LLMs stop learning once their pre-training on vast amounts of data ends, and they suffer from hallucinations. For instance, Meta’s Galactica was an LLM trained in the scientific domain to generate automatic reviews. Still, it was quickly discredited due to hallucinations, as outlined by the MIT Technology Review:

“A fundamental problem with Meta’s Galactica is that it cannot distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. People found that it made up fake papers and generated wiki articles about the history of bears in space as readily as about the speed of light.”

-MIT Technology Review/“Why Meta’s Galactica survived only three days online.”

Though many attempts have proven unsuccessful, there remains particular interest in automatic review generation. Theoretically, it would reduce the backlog of articles awaiting analysis, save time for article submitters, standardize the review process, and reduce bias. To generate automatic reviews valuable to human reviewers, AI models must be specialists trained on specific niches (e.g., the physics of light in solar environments) and grounded with Retrieval-Augmented Generation (RAG) techniques.
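
For intuition, here’s a bare-bones sketch of the RAG pattern: fetch the most relevant snippets first, then force the model to answer only from them. The corpus snippets are invented, and call_llm is a hypothetical stand-in for a real LLM API:

```python
# A bare-bones sketch of the RAG pattern: retrieve relevant snippets,
# then ground the LLM's answer in them. The corpus is invented for
# illustration; `call_llm` is a hypothetical stand-in for a real API.
corpus = [
    "PDH over Pt-Sn/Al2O3 reaches high propane conversion at 600 C.",
    "Coke formation is a major deactivation route for PDH catalysts.",
    "CrOx-based catalysts are used in commercial PDH processes.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank snippets by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda s: -len(q & set(s.lower().split())))
    return ranked[:k]

def call_llm(prompt: str) -> str:
    return "(model output would appear here)"  # hypothetical stub

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = ("Answer ONLY from the context below; reply 'unknown' if the "
              f"context is insufficient.\n\nContext:\n{context}\n\n"
              f"Question: {query}")
    return call_llm(prompt)

print(answer("What causes PDH catalyst deactivation?"))
```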

🧪A REFINED APPROACH

To some, the hubris behind Meta’s Galactica led to its failure: Meta’s developers underestimated the heterogeneity of the scientific domain. Building on Meta’s attempt, Chinese scientists tried using an LLM for automatic literature review generation while limiting it to one narrow domain: Propane Dehydrogenation (PDH) catalysts. They started by gathering all articles in that field from a curated list of chemical engineering journals, filtered out obvious duplicates, and then used an LLM to select the articles relevant to the project, explaining:

“To address the challenge of hallucinations in LLMs, a high priority has been placed on the detection and prevention of such phenomena. We adopted a multi-level filtering and verification quality control strategy in the automated review generation process, similar to Retrieval-Augmented Generation (RAG).”

-School of Chemical Engineering and Technology, Tianjin University/“Automated Review Generation Method Based on Large Language Models (LLMs)”
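
Here’s a hedged sketch of the screening pipeline described above: drop obvious duplicates, then let an LLM flag the relevant papers. The llm_judge function is a hypothetical stand-in; the paper’s exact prompts aren’t reproduced here:

```python
# A hedged sketch of the screening pipeline: deduplicate by DOI, then
# let an LLM flag relevant papers. `llm_judge` is a hypothetical
# stand-in; the paper's actual prompts are not reproduced here.
def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep one record per DOI (case-insensitive)."""
    seen, unique = set(), []
    for p in papers:
        key = p["doi"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

def llm_judge(title: str, abstract: str) -> bool:
    # Hypothetical relevance check; a real system would prompt an LLM
    # with the project scope (e.g., PDH catalysts) and parse a yes/no.
    return "dehydrogenation" in (title + abstract).lower()

papers = [
    {"doi": "10.1000/a1", "title": "Pt-Sn catalysts for propane dehydrogenation", "abstract": "..."},
    {"doi": "10.1000/A1", "title": "Pt-Sn catalysts for propane dehydrogenation", "abstract": "..."},
    {"doi": "10.1000/b2", "title": "Robotic grasping survey", "abstract": "..."},
]

relevant = [p for p in deduplicate(papers)
            if llm_judge(p["title"], p["abstract"])]
print([p["doi"] for p in relevant])  # ['10.1000/a1']
```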

Since hallucinations are the fundamental limitation of using LLMs in scientific publications, the Chinese scientists were meticulous about their methodology. They decomposed the review process into several subtasks (e.g., “Reading and Summarization”) to maintain factual consistency and flexibility. Then, they established a list of questions to help the AI model extract relevant content grounded in the source context. Moreover, they followed up with several additional steps:

  1. Standardized Formatting: Some hallucinations originated from disruptions in text structure, so they verified that outputs conformed to the expected structure (e.g., well-formed XML).

  2. Traceability: They installed a complete data-stream traceability mechanism, ensuring every output can be traced back to its stage in the text-generation process.

  3. Verification of DOI, Relevance, and Self-Consistency: The Digital Object Identifier (DOI) is a unique identifier for each research article, allowing them to filter out fabricated citations (see the sketch after this list).
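
Here’s a sketch of what checks one and three might look like in code. The DOI regex follows Crossref’s recommended pattern for modern DOIs; the paper’s actual implementation may differ:

```python
# A sketch of structural and DOI checks in the spirit of steps 1 and 3.
# The regex follows Crossref's recommended modern-DOI pattern; the
# paper's exact checks may differ.
import re
import xml.etree.ElementTree as ET

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")

def is_well_formed_xml(text: str) -> bool:
    """Step 1: reject outputs whose text structure is disrupted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def is_plausible_doi(doi: str) -> bool:
    """Step 3: cheap syntactic filter; a real pipeline would also
    resolve the DOI against a registry to catch fabricated ones."""
    return bool(DOI_PATTERN.match(doi))

print(is_well_formed_xml("<review><claim doi='10.1000/x1'/></review>"))  # True
print(is_plausible_doi("10.1038/s41586-023-06291-2"))  # True
print(is_plausible_doi("not-a-doi"))                   # False
```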

After a rigorous refinement process, the Chinese scientists checked for hallucinations with a focus on two types of inaccuracies:

  1. False Positives: Fabricated or inconsistent information.

  2. False Negatives: Overlooked or partially extracted content.

They focused on reducing false positives while tolerating a modest rate of false negatives. This process dramatically reduced hallucinations: the AI model achieved a 93% accuracy rate, and the Chinese scientists were 95% confident that the likelihood of hallucinations in the “Knowledge Extraction Phase” was less than 0.5%.
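
A claim of the form “95% confident the hallucination rate is below 0.5%” can be reproduced with a standard one-sided binomial bound. Here’s a sketch; the paper’s exact statistical procedure may differ, and the sample numbers below are invented:

```python
# A sketch of a one-sided Clopper-Pearson upper bound on a binomial
# proportion, one standard way to back a "95% confident the rate is
# below X%" claim. The sample numbers are invented for illustration.
from scipy.stats import beta

def upper_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """Upper confidence limit for the true error rate after observing
    k hallucinated statements among n manually checked statements."""
    if k == n:
        return 1.0
    return beta.ppf(confidence, k + 1, n - k)

# E.g., zero hallucinations found in 600 spot-checked statements:
print(f"{upper_bound(0, 600):.4%}")  # ~0.4980%, just under the 0.5% mark
```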

🔑KEY TAKEAWAY

AI can potentially revolutionize scientific research, particularly in tasks like literature review and data analysis. However, significant challenges, primarily related to hallucinations and bias, must be overcome.

The successful integration of AI into the scientific process will require a combination of human expertise and advanced AI techniques.

If produced alongside an NLP expert and paired with manual verification of cited sources, LLM-based reviews may serve as a template or spark ideation for time-starved reviewers.

📒FINAL NOTE

If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.

How was today’s newsletter?

❤️TAIP Review of the Week

“The AI Pulse always supplies valuable content.👍”

-Logan (⭐️⭐️⭐️⭐️⭐️Nailed it!)

REFER & EARN

🎉Your Friends Learn, You Earn!

You currently have 0 referrals, only 1 away from receiving the ⚙️Ultimate Prompt Engineering Guide.

Refer 5 friends to enter 🎰October’s $200 Gift Card Giveaway.
