China Focus: Researchers use AI algorithm to reveal hidden RNA viruses
Xinhua | Updated: 2024-10-23 17:28
BEIJING -- This year's Nobel Prize results signify that artificial intelligence (AI) technology is not only leading trends in computer science, but also has a growing impact in disciplines such as biology and chemistry. It offers scientists a new research approach: using AI to unlock the secrets of nature.
One of the latest examples comes from virology. An international research team used AI technology to discover hundreds of thousands of RNA viruses from global ecosystems, showing the immense potential of AI algorithms in virus discovery and paving new paths for virology.
Researchers from Sun Yat-sen University School of Medicine, Zhejiang University, Guangzhou University, the University of Sydney and other institutions carried out the study, reporting the discovery of 180 RNA virus supergroups and over 160,000 RNA virus species worldwide.
The study, which was published recently in the journal Cell, is the largest RNA virus study to date, significantly expanding knowledge of RNA viruses worldwide.
NEW AI ALGORITHM
Viruses are an essential component of Earth's ecosystems and closely related to human health. However, the number of known virus species is still quite limited. Scientists can use gene sequencing technology to compare the similarity of unknown viruses with known viral nucleic acid sequences, thereby identifying new viruses.
However, this method relies on existing knowledge of viruses. For RNA viruses, which are highly divergent, numerous and prone to mutation, sequence homology comparison no longer works well.
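The limitation can be illustrated with a toy example. The Python sketch below is not the method used in the study or by real homology tools, which rely on proper sequence alignment; it simply compares equal-length sequences position by position. The function names, the made-up sequences and the 30 percent threshold are all assumptions chosen for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Share of positions (in percent) where two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("this toy comparison needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def looks_homologous(query: str, known: list[str], threshold: float = 30.0) -> bool:
    """True if the query resembles any known sequence (threshold is hypothetical)."""
    return any(percent_identity(query, ref) >= threshold for ref in known)


# Made-up protein fragments, for illustration only.
known_viral = ["MKTAYIAKQR"]
close_relative = "MKTAYIAKQK"   # differs from the reference at one position
divergent_virus = "GHWLPNDESV"  # shares no position with the reference

print(looks_homologous(close_relative, known_viral))
print(looks_homologous(divergent_virus, known_viral))
```

A sequence close to a known reference is flagged, while a highly divergent one is missed even if it is viral, which is the gap the researchers set out to close.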
The researchers have proposed a new solution using AI technology. According to Shi Mang from Sun Yat-sen University School of Medicine, one of the corresponding authors of the research paper, the AI model can uncover viruses that were previously overlooked or entirely unknown.
"During epidemics, the speed and accuracy of AI technology can help scientists quickly pinpoint potential pathogens," Shi said.
He led the team in using LucaProt, a deep-learning Transformer model, as the core algorithm of the study. After extensive training on viral and non-viral genomic sequences, the model autonomously forms a set of criteria for virus identification and uses them to find viral sequences in large RNA sequencing datasets.
NEW RNA VIRUS SPECIES
According to the study, LucaProt demonstrated high accuracy and specificity, with a false positive rate of 0.014 percent and a false negative rate of 1.72 percent.
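To put the two error rates in more familiar terms, the short Python sketch below converts them into specificity and sensitivity, assuming the standard definitions (specificity = 100 percent minus the false positive rate, sensitivity = 100 percent minus the false negative rate). The function name and the batch of 100,000 non-viral sequences are hypothetical; only the two rates come from the study as reported.

```python
def rates_to_metrics(false_positive_pct: float, false_negative_pct: float):
    """Convert error rates (in percent) to specificity and sensitivity (in percent)."""
    specificity = 100.0 - false_positive_pct
    sensitivity = 100.0 - false_negative_pct
    return specificity, sensitivity


# Rates reported for LucaProt in the study.
specificity, sensitivity = rates_to_metrics(0.014, 1.72)
print(f"specificity: {specificity:.3f}%")
print(f"sensitivity: {sensitivity:.2f}%")

# Hypothetical batch size: expected false alarms among 100,000 non-viral sequences.
non_viral_sequences = 100_000
expected_false_alarms = round(non_viral_sequences * 0.014 / 100)
print(f"expected false alarms: {expected_false_alarms}")
```

Under these assumptions, the model would wrongly flag about 14 of every 100,000 non-viral sequences and miss fewer than two of every 100 viral ones.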
The team searched for viruses in 10,487 RNA sequencing datasets from biological environmental samples around the world, and discovered over 510,000 viral genomes representing more than 160,000 potential viral species and 180 supergroups of RNA viruses.
Among them, 23 supergroups could not be identified by traditional sequence homology methods. They can be referred to as the "dark matter" of the viral community.
The study found that these viruses are distributed across various ecological environments on Earth. The highest viral diversity is found in leaf litter, wetlands, freshwater, and wastewater environments. Considerable virus diversity and abundance are also found in extreme environments such as Antarctic sediments, deep-sea hydrothermal vents, activated sludge, and saline-alkali wastelands.
According to Hou Xin, the first author of the paper, these viruses include not only pathogens that infect humans but also those that exist in the environment and infect various organisms. They can infect a variety of animals, plants, single-celled protists, fungi and bacteria.
"A deeper understanding of viruses in the environment can help us better study the workings of the entire ecosystem. Moreover, we can use this method to discover viruses closely related to human diseases for the surveillance and early warning of emerging diseases," Hou said.
"The traditional classification system has become inadequate for the new viruses, whose diversity far exceeds human imagination. What we see now is just the tip of the iceberg," Shi said.
NEW TOOL FOR MORE STUDIES
LucaProt is a model specifically designed for discovering RNA viruses, but it also integrates the ability to recognize protein sequences and implicit structural information, and can be used to identify protein functions.
According to the study, the LucaProt model helped researchers identify genomic structures beyond previous virus knowledge, revealing the flexibility of RNA virus genomic evolution.
It also revealed a variety of viral functional proteins, especially those related to bacteria, indicating that there are more types of RNA bacteriophages, the viruses that infect bacteria, to be explored.
The research team has open-sourced the model and shared it with scientists worldwide online.
Li Zhaorong from Apsara Lab of Alibaba Cloud Intelligence, another corresponding author, believes that AI is gradually changing the way scientists tackle various scientific challenges.
"This model is becoming a cutting-edge tool in virus identification and is also being applied to other types of protein identification and discovery of functions," Li said.
Xu Jianguo, an academician of the Chinese Academy of Engineering, said that the success of LucaProt marks a breakthrough for AI algorithms in virus discovery. In the future, AI is expected to become a major tool in microbiology and can be applied to predict the pathogenicity of viruses to humans.