TRIPBASE
Long-non-coding RNAs (lncRNAs) are defined as RNA sequences which are >200 nt with no coding capacity. These lncRNAs participate in various biological mechanisms and are widely abundant in a diversity of species. There is well-documented evidence that lncRNAs can interact with genomic DNAs by forming triple helices (triplexes). Previously, several computational methods have been designed based on the Hoogsteen base-pair rule to find theoretical RNA–DNA:DNA triplexes. However, these methods suffer from a high false-positive rate between predicted triplexes and biological experiments. In this paper, we present TRIPBASE, a new database built as the first comprehensive collection of genome-wide triplex predictions of human lncRNAs.
A Database for Identifying the Human Genomic DNA and lncRNA Triplexes
Long-non-coding RNAs (lncRNAs) are a type of non-coding RNA greater than 200 nt in length. Advances in sequencing technology have revealed that lncRNAs are widely transcribed in species ranging from invertebrates to humans. There is well-documented evidence that lncRNAs can interact with double-stranded DNA (dsDNA) by forming triple helices (triplexes) to mediate gene expression.
Chromatin isolation by RNA purification (ChIRP) and other experimental methods like RADICL-seq have been developed to investigate chromatin-RNA associations, though they are not always suitable for studying direct RNA-DNA interactions. ASO-based sequencing has provided a genome-wide approach to specifically detect lncRNA-DNA:DNA triplexes without cross-link interference.
Methodology
Dataset
The human whole genome was downloaded from the Ensembl database (GRCh38, release 99). Only the longest isoforms of lncRNAs were selected from GENCODE (release 35). Enhancer regions were collected from EnhancerAtlas 2.0, including regions from 197 human cell and tissue types. A total of 7,325,744 enhancer regions were used.
Computational Approach
We used Triplexator, the most widely used tool for lncRNA-DNA interaction, with parameters set as follows:
- Triplex length: 15
- Maximal error rate: 20
- Consecutive error number: 2
For a given lncRNA, the nucleotide sequence that can form hydrogen bonds with the DNA duplex is called a “triplex-forming oligonucleotide” (TFO), while the triplex on DNA is called a “triplex-target site” (TTS).
Results and Discussion
The ASO-capture data of NEAT1-associated DNA were used as ground truth to evaluate the TMRs (TTS-Merged Regions) of the lncRNA NEAT1.
Filters for Selecting Potential lncRNA-dsDNA Triplexes
The original dataset contains 7,601,780 raw TMRs, of which 18,986 TMRs intersected with assay peaks (positive group), and 7,582,794 TMRs were located outside of peak regions (negative group). The following criteria were used to filter TMRs:
Attribute | Description |
---|---|
Width | Number of base pairs in a TMR |
Area | Sum of the hitting numbers of a TMR |
Maximal Height | Highest hitting number in the TMR sequence |
Average Height | Average hitting number across the TMR sequence |
n-edge | Number of triplex formations in a TFO-lncRNA with DNA |
Triplexator Filtering Results
We found that filtering the data using width ≥ 15, area ≥ 30, and maximal height ≥ 2 reached the highest true positive rate (TPR) of 85%, with a false positive rate (FPR) of 54%.
How to cite TRIPBASE
TRIPBASE allows a user to search for TMRs in cis-regulatory regions, including promoters, enhancer regions, and segments of chromosomes.
@article{Lin2023,
author = {Lin, Tzu-Chieh and Liu, Yen-Ling and Liu, Yu-Ting and Liu, Wan-Hsin and Liu, Zong-Yan and Chang, Kai-Li and Chang, Chin-Yao and Ni, Hung Chih and Huang, Jia-Hsin and Tsai, Huai-Kuang},
title = {TRIPBASE: A database for identifying the human genomic DNA and lncRNA triplexes},
journal = {NAR Genomics and Bioinformatics},
volume = {5},
number = {2},
year = {2023},
doi = {10.1093/nargab/lqad043},
URL = {https://doi.org/10.1093/nargab/lqad043},
eprint = {https://academic.oup.com/nargab/article/5/2/lqad043/7175332}
}