Microbial communities in the human gut strongly influence human health. Gastrointestinal diseases such as inflammatory bowel disease (IBD) are driven by intestinal bacteria and viruses [1]. Viruses that infect bacteria, known as bacteriophages, play a key role in modulating the bacterial communities residing in the human gut [2,3]. However, identifying and characterizing novel bacteriophages in the gut microbiome remains a challenge.
High-throughput sequencing technologies have paved the way for metagenomics, the study of uncultivated microbial and viral communities. A variety of tools exist to identify viral sequences in metagenomic data [4-7]. These tools typically rely on sequence similarity, nucleotide composition, and the presence of viral genes or proteins. Most of them consider each sequence individually and determine whether it is of viral origin. Because viral assembly is challenging, viral genomes are often fragmented across multiple sequences [8], and tools that classify each sequence in isolation may not produce optimal results [9].
Metagenomic assemblers build a structure known as the assembly graph by overlapping reads to produce longer sequences called contigs [10]. Genomes typically correspond to long paths within the assembly graph, and contigs within the same connected component are more likely to belong to the same genome [11]. Hence, the assembly graph retains connectivity and neighbourhood information even when an assembly is fragmented. Previous studies have used assembly graphs in taxonomy-independent metagenomic binning [12-14], mostly to identify bacterial species. This work explores the use of assembly graphs and machine learning techniques to identify viral sequences in fragmented assemblies. Specifically, we demonstrate the identification of bacteriophage genomes in samples collected from patients with IBD.
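To make concrete the observation that contigs in the same connected component of the assembly graph are likely to originate from the same genome, the following minimal Python sketch groups contigs by connected component. It is an illustration only and not the method developed in this work: the function name `connected_components`, the contig identifiers, and the links between them are hypothetical, and the graph is treated as undirected with no use of sequence content or coverage.

```python
from collections import defaultdict, deque


def connected_components(contigs, links):
    """Group contigs into connected components of an undirected graph.

    contigs: iterable of contig identifiers (hypothetical names).
    links:   iterable of (contig_a, contig_b) pairs, one per graph edge.
    Returns a list of components, each a sorted list of contig identifiers.
    """
    # Build an undirected adjacency map from the edge list.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        # Breadth-first search collects every contig reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Toy assembly graph (assumed data): two fragmented genomes, five contigs.
contigs = ["c1", "c2", "c3", "c4", "c5"]
links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
components = connected_components(contigs, links)
print(components)
```

In this toy example the contigs c1, c2 and c3 fall into one component and c4 and c5 into another, so each group becomes a candidate set of fragments from a single genome. A tool that judges each contig in isolation has no access to this grouping.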