Chloroplasts Predicted Operon Data (CpPOD)
Introduction
​
CpPOD is a database of predicted primary transcriptomes of chloroplasts based on a random-forest classifier. The prediction pipeline follows the procedures described in Shahar et al., 2019, Nucleic Acids Research.
​
IMPORTANT. Due to the nature of primary polycistronic processing in chloroplasts and the presence of multiple promoters, the database reflects the primary plastid transcriptome (i.e. primary and dominant polycistrons). Moreover, its predictive power is best for standard growth conditions.
File composition
​
The database comprises 2,018 plastomes divided into three main groups: green plastids, which contain most available plastomes of green algae, higher plants and euglenoids; red plastids, which include the Rhodophyta division and other secondary-endosymbiosis plastids (e.g. heterokonts); and glaucophytes, which currently contain Cyanophora paradoxa only.
​
There are two main Excel files, ‘pair’ and ‘operon’:
- Pair – preliminary predictions for gene pairs. Each pair of adjacent genes is predicted to be either co-transcribed (di-cistron; label ‘1’) or transcribed separately (non di-cistron; label ‘0’). A probability score between zero and one is calculated for each gene pair; it reflects the likelihood of co-transcription (e.g. 0.9 is a predicted 90% chance of co-transcription).
- Operon – predicted primary transcripts. Each plastome’s ‘pair table’ is concatenated into an ‘operon table’. If consecutive gene pairs are predicted to be co-transcribed (and they are found on the same DNA strand), they are concatenated into a single primary polycistron.
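The concatenation rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the tuple layout, the function name build_operons and the gene names are hypothetical, the pairs are assumed to be listed in genome order, and the circularity of the plastome is ignored.

```python
from typing import List, Tuple

# Hypothetical record for one adjacent gene pair:
# (gene_a, gene_b, strand_a, strand_b, label), where label 1 = co-transcribed.
Pair = Tuple[str, str, str, str, int]


def build_operons(pairs: List[Pair]) -> List[List[str]]:
    """Chain consecutive co-transcribed, same-strand pairs into polycistrons."""
    operons: List[List[str]] = []
    for gene_a, gene_b, strand_a, strand_b, label in pairs:
        if not operons:
            # The very first gene opens the first transcript.
            operons.append([gene_a])
        if label == 1 and strand_a == strand_b:
            # Extend the current transcript with the downstream gene.
            operons[-1].append(gene_b)
        else:
            # Predicted break (or strand switch): start a new transcript.
            operons.append([gene_b])
    return operons


# Made-up pairs in genome order; each pair shares a gene with the previous one.
pairs = [
    ("geneA", "geneB", "+", "+", 1),
    ("geneB", "geneC", "+", "+", 1),
    ("geneC", "geneD", "+", "-", 1),  # different strands, so never joined
    ("geneD", "geneE", "-", "-", 0),
]
operons = build_operons(pairs)
print(operons)
```

In this sketch a gene that is not joined to either neighbour ends up as a single-gene (monocistronic) transcript.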
Organism search
​​
Each Excel file contains 2,018 plastomes, each in its own tab/sheet. A plastome can be found by its scientific name as follows:
- In Excel, press F5. A ‘Go To’ window will pop up.
- In the ‘Reference’ field, type the desired organism’s name as: 'Genus species'!A1
- For example, to search for Chlamydomonas reinhardtii, type: 'Chlamydomonas reinhardtii'!A1
- If the organism is found in the database, its tab will open.
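For scripted access, the same lookup by sheet name could be done outside Excel. The sketch below is an assumption-laden illustration: find_sheet is a hypothetical helper, the sheet names are sample values, and the workbook file name in the comments is a placeholder rather than the real file name.

```python
from typing import Iterable, Optional


def find_sheet(sheet_names: Iterable[str], organism: str) -> Optional[str]:
    """Return the sheet matching `organism`, ignoring case and extra spaces."""
    wanted = " ".join(organism.split()).lower()
    for name in sheet_names:
        if " ".join(name.split()).lower() == wanted:
            return name
    return None


# With a real workbook, the sheet names could come from pandas, e.g.:
#   import pandas as pd
#   workbook = pd.ExcelFile("operon.xlsx")   # placeholder file name
#   sheet = find_sheet(workbook.sheet_names, "Chlamydomonas reinhardtii")
#   table = workbook.parse(sheet) if sheet else None
sheets = ["Arabidopsis thaliana", "Chlamydomonas reinhardtii", "Cyanophora paradoxa"]
match = find_sheet(sheets, "chlamydomonas  reinhardtii")
print(match)
```

If no sheet matches, the helper returns None, which mirrors the case where the organism is not in the database.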
Intergenic spacer sequences
​
The sequences found between each pair of adjacent genes (intergenic spacers) are also available.
Each plastome is represented by a FASTA file whose records are organized as follows:
>gene1_gene2/locus_tag1~locus_tag2/gene-type1_gene-type2/predicted_label/probability_score
--SEQUENCE--
​
For example:
>petG_rpoBa/AT029_gp003~AT029_gp006/CDS_CDS/0.0/0.377
TTGTTAAACTTTTAAAGA...
​
In cases where the distance between two adjacent genes is <= 0 (e.g. two overlapping genes), no intergenic sequence will be available.
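A header in this layout can be split into its fields with a short parser. The following is a minimal sketch, assuming that gene names and gene types contain no underscores and that no field contains a slash; the SpacerHeader class and parse_header function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpacerHeader:
    gene1: str
    gene2: str
    locus_tag1: str
    locus_tag2: str
    gene_type1: str
    gene_type2: str
    label: int          # 1 = predicted co-transcribed, 0 = not
    probability: float  # probability score between zero and one


def parse_header(line: str) -> SpacerHeader:
    """Parse one '>gene1_gene2/tag1~tag2/type1_type2/label/score' header."""
    fields = line.lstrip(">").strip().split("/")
    if len(fields) != 5:
        raise ValueError(f"expected 5 '/'-separated fields, got {len(fields)}")
    genes, tags, types, label, prob = fields
    gene1, gene2 = genes.split("_", 1)   # assumes no '_' inside gene1
    tag1, tag2 = tags.split("~", 1)      # locus tags are separated by '~'
    type1, type2 = types.split("_", 1)   # assumes no '_' inside gene types
    # The label is written as a decimal (e.g. '0.0'), so convert via float.
    return SpacerHeader(gene1, gene2, tag1, tag2, type1, type2,
                        int(float(label)), float(prob))


header = parse_header(">petG_rpoBa/AT029_gp003~AT029_gp006/CDS_CDS/0.0/0.377")
print(header)
```

Headers that break the underscore assumption would need a different splitting rule, so it is worth checking a few records of the target plastome first.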
​
To retrieve the FASTA files, refer to: https://github.com/noamshahar/Plastids-intergenic-spacers
NOTE: Due to the large number of FASTA files, GitHub truncates the displayed file list to 1,000 files. Clicking 'Clone or download' -> 'Download ZIP' downloads all FASTA files in a single ZIP archive.
​