Motif-based algorithms for detection of functional classes of long non-coding RNAs
Final Report Abstract
The function of non-coding RNA sequences is largely determined by their spatial conformation, that is, the secondary structure of the molecule, which is formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. Two essential tasks for discovering as yet unknown RNA families and inferring their possible functions are the structural alignment of RNAs and the subsequent search for the derived structural motifs. In WP 1 – WP 4, two SeqAn-based software tools were developed that implement algorithms for finding sequence-structure motifs in genomic sequences. In contrast to other programs, the tools can handle arbitrary pseudoknots. They use multithreading for parallel execution and are implemented in modern C++ for longevity and performance. The first tool, LaRA 2, is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. It uses a new heuristic for computing a lower bound on the solution and employs vectorization to speed up the time-critical parts of the algorithm, making it up to 130× faster than its predecessor LaRA 1. For computing multiple structural alignments, we provide two methods that use MAFFT or T-Coffee to progressively combine pairwise alignments. The second tool, MaRs, can be applied in a workflow directly after LaRA 2. In a first step, it derives sequence-structure motifs from the structural alignments. The motifs are descriptors that we designed to store the relevant information in the form of stem loops. The stem loop descriptors usually characterize an RNA family, and they are used to find homologs in a genome sequence, that is, the positions in the genome where further members of the same RNA family are encoded.
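To make the idea of a stem loop descriptor concrete, the following is a minimal sketch and not the MaRs descriptor format or search code. It assumes a hypothetical, much simplified descriptor (`StemLoop`, with a fixed stem length and a loop length range) and a naive scan that anchors at each candidate loop and extends the stem outward base pair by base pair. Treating G-U wobble pairs as valid is also an assumption of this sketch.

```python
from dataclasses import dataclass

# Watson-Crick pairs plus the G-U wobble pair (the wobble pair is an
# assumption of this sketch, not something stated in the report).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


@dataclass(frozen=True)
class StemLoop:
    """Hypothetical, simplified descriptor: a fixed-length stem around a loop."""
    stem_len: int   # number of base pairs in the stem
    min_loop: int   # minimum number of unpaired loop bases
    max_loop: int   # maximum number of unpaired loop bases


def find_stem_loops(seq: str, motif: StemLoop) -> list[tuple[int, int]]:
    """Return (start, end) spans, end exclusive, where the motif can fold."""
    hits = []
    n = len(seq)
    for loop_start in range(motif.stem_len, n):
        for loop_len in range(motif.min_loop, motif.max_loop + 1):
            left = loop_start - 1           # innermost 5' stem base
            right = loop_start + loop_len   # innermost 3' stem base
            paired = 0
            # Extend the stem outward from the loop in both directions.
            while (paired < motif.stem_len and left >= 0 and right < n
                   and (seq[left], seq[right]) in PAIRS):
                left -= 1
                right += 1
                paired += 1
            if paired == motif.stem_len:
                hits.append((left + 1, right))
    return hits


if __name__ == "__main__":
    # Toy "genome": a GGGC / GCCC stem around a UUCG loop, flanked by AA.
    genome = "AAGGGCUUCGGCCCAA"
    print(find_stem_loops(genome, StemLoop(stem_len=4, min_loop=4, max_loop=4)))
```

This scan is linear in the sequence length for every loop length tried, so it only illustrates what a match is; it says nothing about how a match is found efficiently on genome-scale input.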
In a second step, MaRs employs an optimized multithreaded search strategy for finding the stem loop matches quickly. For the search we employ a bi-directional index data structure, which allows searching in sub-linear time and extending the search pattern in both directions, as the nature of stem loops requires. The use of a thread pool, effective pruning strategies, and a low memory footprint ensure that MaRs is capable of processing extremely large data sets.
Similarly to proteins, RNA function is mainly dictated by its structure. In particular, lncRNAs perform their function by binding to RNA Binding Proteins (RBPs) or by interacting with other nucleic acids. We therefore considered that analyzing binding patterns of RBPs directly from in vivo data, together with other genomic features, might be more meaningful when trying to predict lncRNA function. During this project we focused our efforts on improving machine learning methods for the prediction of protein-RNA interactions, leveraging large amounts of RNA-bound sequences from publicly available CLIP-seq data. In addition, prompted by the increasing number of studies reporting subcellular localization and dynamics of lncRNAs as primary determinants of lncRNA function, we investigated three main questions. a) To what extent does lncRNA localization reflect molecular function? We bring together data from recent studies, as well as our own work, indicating that nuclear lncRNAs produced from active enhancers and their localization dynamics (i.e. dissociation flows from chromatin) constitute an additional layer of enhancer-targeted gene expression. b) What is the role played by chromatin interactions in the context of lncRNA-mediated gene regulation? To better understand this, we have developed a novel multi-step graph modeling approach to examine the chromatin interaction network involving lncRNAs, genes and other genomic regions in the K562 human cell line.
Our approach uses Markov State Model (MSM) clustering to detect regulatory modules based on network properties, and co-expression analysis of gene-gene and gene-lncRNA pairs to annotate novel lncRNA functions. c) Which molecular factors dictate lncRNA subcellular localization and ultimately their functional classification? To answer this question we developed a combined approach that pairs novel experimental data, generated in our lab to measure lncRNA chromatin dissociation dynamics in MCF7 cells, with machine learning models trained on several high-throughput genomic data sets to predict localization. This allowed us to link the localization dynamics of lncRNAs to their regulatory functions.
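The random-walk view behind the module detection in question b) can be illustrated with a small sketch. This is not the project's MSM clustering pipeline: it swaps in a much simpler technique, a two-way spectral split of a random walk on an unweighted toy graph, and the function name `two_module_split`, the toy edges and the restriction to two modules are all assumptions made for illustration. The idea is that a random walker leaves a densely connected module only slowly, so the slowest-decaying mode of the walk separates the modules by sign.

```python
def two_module_split(edges, n_nodes, n_iter=200):
    """Split an undirected graph into two modules using a lazy random walk.

    Power iteration recovers the slowest non-stationary mode of the walk;
    the sign of each node's entry assigns it to one of two modules.
    Assumes a connected graph in which every node has at least one edge.
    """
    neighbors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = [len(nb) for nb in neighbors]
    total = sum(degree)
    # Stationary distribution of the walk is proportional to node degree.
    pi = [d / total for d in degree]

    x = [float(i) for i in range(n_nodes)]  # deterministic start vector
    for _ in range(n_iter):
        # Project out the stationary (constant) mode.
        mean = sum(p * xi for p, xi in zip(pi, x))
        x = [xi - mean for xi in x]
        # One lazy step: stay put with probability 1/2, else move to a
        # uniformly chosen neighbor. Laziness keeps all eigenvalues >= 0.
        x = [0.5 * x[i] + 0.5 * sum(x[j] for j in neighbors[i]) / degree[i]
             for i in range(n_nodes)]
        # Rescale to avoid numerical underflow.
        norm = max(abs(xi) for xi in x)
        x = [xi / norm for xi in x]

    module_a = {i for i, xi in enumerate(x) if xi < 0}
    module_b = set(range(n_nodes)) - module_a
    return module_a, module_b


if __name__ == "__main__":
    # Toy interaction network: two triangles joined by a single edge (2-3).
    toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    print(two_module_split(toy_edges, n_nodes=6))
```

A real chromatin interaction network would be weighted and contain many modules, which is where a full MSM-style clustering with more than one slow mode becomes necessary.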
Publications
- Ntini, E. and Marsico, A. (2019). Functional impacts of non-coding RNA processing on enhancer activity and target gene expression. J Mol Cell Biol, 10(11):868–879.
- Ntini, E., Budach, S., Vang Orom, U., and Marsico, A. (2020). Predictive modeling of long non-coding RNA chromatin (dis-)association. bioRxiv.
- Winkler, J., Urgese, G., Ficarra, E., and Reinert, K. LaRA 2: parallel and vectorized program for sequence–structure alignment of RNA sequences. BMC Bioinformatics, 23(1).