The purpose of this project is to consider the predictors of longevity in organisms across the tree of life. I was particularly interested in whether there is a correlation between the degree of difference in size of heterogametic sex chromosomes and organism lifespan.
The paper The sex with the reduced sex chromosome dies earlier: a comparison across the tree of life (Xirocostas et al., 2020) performs a meta-analysis that shows that the homogametic sex (meaning XX-female or ZZ-male) lives longer than the heterogametic sex. This opens up questions on the importance of chromosome morphology. I wanted to look at the data from this paper and supplement it with my own data on the difference in size between the heteromorphic sex chromosomes. The purpose being to determine whether the degree of differentiation correlates with the lifespan differences in species.
Expanding the code below will show you the packages used in this project.
# CRAN Packages
library(phangorn)
library(tidyverse)
library(janitor)
library(ape)
library(dplyr)
library(kableExtra)
library(seqinr)
library(reutils)
library(shiny)
library(modelr)
library(MASS)
library(easystats)
library(plotly)
# Bioconductor Packages
library(Biostrings)
library(DECIPHER)
library(ggtree)
library(msa)
# Github Packages
library(ggmsa)
The table below gives a glimpse into the data that was gathered. The majority of the data available was longevity data. Where we are constrained in this study is in the availability of sex specific genome size data. In my digging through literature, I only found about 25 species that have sex specific genome size and lifespan data.
class | order | family | species | sex_determination | heterogametic_lifespan | homogametic_lifespan | ln_r_rlifespan | population_source | lifespan_data_type | gs_female_mb | gs_male_mb | gs_diff_mb |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Actinopterygii | Salmoniformes | Salmonidae | Coregonus hoyi | male heterogametic | 9.00 | 11.00 | 0.2006707 | wild | max | NA | NA | NA |
Actinopterygii | Salmoniformes | Salmonidae | Coregonus kiyi | male heterogametic | 8.00 | 10.00 | 0.2231436 | wild | max | NA | NA | NA |
Amphibia | Anura | Alytidae | Discoglossus pictus | female heterogametic | 3.08 | 3.00 | -0.0263173 | wild | mean | NA | NA | NA |
Amphibia | Anura | Dicroglossidae | Fejervarya limnocharis | male heterogametic | 2.00 | 2.20 | 0.0953102 | wild | mean | NA | NA | NA |
Amphibia | Anura | Pyxicephalidae | Pyxicephalus adspersus | female heterogametic | 4.00 | 6.00 | 0.4054651 | wild | mean | NA | NA | NA |
Amphibia | Anura | Ranidae | Rana sakuraii | male heterogametic | 3.25 | 3.45 | 0.0597192 | wild | mean | NA | NA | NA |
Here, we see a lot of NA’s. These are due to the lack of data found for genome sizes. We will sift through this data, first focusing on lifespan differences which we have more information for. Then we will do a tighter filter, exploring the genome size differences.
Lets look at all the species we’ve gathered that have lifespan data for the sexes.
The figure above shows the difference between the lifespan of homogametic and heterogametic sexes of certain species. The x-axis plots ln(homogametic lifespan/heterogametic lifespan) to normalize the results. If the point is greater than 0, the homogametic sex lives longer. If it is less than 0, the heterogametic sex lives longer. The colors are based on whether the species’ sex determination system is female or male heterogametic.
The following interactive version of this plot allows you to look at the data based on class, order, and family by changing the colors of the points.
The following plot is another way of illustrating how much of the difference in lifespan data resides above zero.
From our lifespan plots, we can see that the homogametic sex lives longer. However, this is not new information. We want to see if the degree of difference in the size of the sex chromosomes contributes to the difference in longevity. To do that we have to narrow our data again to the observations where genome size data was found.
Let’s now look at all the species who have both lifespan and genome size data. This is the smaller dataset of 25 species.
This plot reveals that most of the data we have is male heterogametic. There is only one female heterogametic data point. Also, the outliers make it appear as though there is a trend, but really it is hard to believe that there is actually a positive trend worth noting. From this plot, we can not conclude a correlation between the difference in size of the heterogametic sex chromosomes and the difference in lifespan between the sexes.
Lets quickly restrict the data to a subset of -250:250 Mb to see how the trendline behaves.
As expected, when the more extreme values are removed. the trendline starts to flatten out. There does not seem to be a significant relationship between difference in size of the sex chromosomes and longevity.
I want to collect more data on the longevity and differences in genome size of species so that I can further investigate a pattern. Until then, let’s look at the linear model of the data and see what predictions would be made if the data followed the model.
The only predictor being explored for this data set is the difference in genome size data. This makes a simple linear model that predicts along the equation y = 0.1276 + 0.0004x. Only 0.55% of the variance is explained by the predictor. Here are the predictions of difference in lifespan data based off the real data.
This model informs us that the p-value is 0.742. There is no significant relationship between the difference in lifespan and the difference in genome size shown by our sample.
Though I still have the goal of collecting more data, I wanted to learn a method for creating a phylogenetic tree so that I could implement it in further analyses.
Dr. Carl Hjelman at Utah Valley University worked with me on this piece of code. It’s purpose is to take a list of species and genes from the user and create a data frame of their accession numbers from GenBank. The lengths of the genes found are included in the table to help identify suspicious results.
#### Retrieving Accession Numbers from Genbank ####
#### function that takes a term and finds genbank data for it ####
grab.results <- function (term) {
# Search for the given term on nuccore. This gives us a list of
# record IDs.
ids <- esearch(term, db="nuccore")
# Grab summaries for the given record IDs, as a sort-of data frame.
sum <- esummary(ids, db="nuccore")
data <- content(sum, as="parsed")
# For some reason, this parser gives us lists of lists instead of a
# proper data frame (which should be lists of vectors). Return a
# fixed-up version.
data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
}
# read in my data and only use the data that I have both lifespan and genome size data for
spec<-read.csv("lifespan_species.csv")
# put in list of genes to collect accessions for
gene.list<-c("mitochondrion","coi","coii", "coiii", "cytb", "hunchback")
# make matrix to be populated
accession_out<-matrix(,nrow=length(spec$names),ncol=(length(gene.list)*2+1))
# make list of column names for table
columns <- list()
columns <- append(columns, "Spec")
for(i in 1:length(gene.list)){
columns <- append(columns, paste(gene.list[i],"Acc"))
columns <- append(columns, "Len")
}
colnames(accession_out) <- columns
#loop to add data to table
for(i in 1:length(spec$names)){
accession_out[i,1] <- spec$names[i]
for(k in 1:6){
testing<-(grab.results(term=(paste(spec$names[i],gene.list[k]))))
if(length(testing)<1){
(k<-(k+1))}
#the below lets us know if after we filter to only DNA and remove other species that we still have data
#this lets us filter out RNA and also organisms that don't match but might appear
#like parasites, worms, etc.
else if(length((subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))[1])
<1){
k<-(k+1)
}
#this pulls data if we have data, DNA, and the right species, and filters out things larger than mitochondria
else{
testagain<-(subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))
accession_out[i,(k*2)]<-testagain$AccessionVersion[1]
accession_out[i,(k*2+1)]<-testagain$Slen[1]}
}
}
write.csv(accession_out,"accession/accession_numbers.csv", row.names = FALSE)
Spec | mitochondrion Acc | Len…3 | coi Acc | Len…5 | coii Acc | Len…7 | coiii Acc |
---|---|---|---|---|---|---|---|
Amblyomma cajennense | NC_020333.1 | 14780 | MF363087.1 | 489 | KY284602.1 | 488 | NA |
Aphelocoma coerulescens | NA | NA | DQ432737.1 | 652 | NA | NA | NA |
Nauphoeta cinerea | NA | NA | JN615381.1 | 1362 | MH360271.1 | 709 | NA |
Periplaneta americana | NA | NA | JX402724.1 | 642 | NC_016956.1 | 15584 | NC_016956.1 |
Callosobruchus maculatus | NC_053358.1 | 18321 | MG458971.1 | 657 | CAACVG010000153.1 | 14841 | CAACVG010002514.1 |
Chrysomya megacephala | NC_019633.1 | 15273 | JX402727.1 | 618 | JN014846.1 | 2303 | AJ426041.2 |
Cochliomyia hominivorax | NA | NA | MT020449.1 | 494 | MT876038.1 | 686 | NA |
Lucilia cuprina | NA | NA | OR574367.1 | 663 | JN014889.1 | 2304 | NA |
Musca domestica | NC_024855.1 | 16108 | JX402726.1 | 657 | NC_024855.1 | 16108 | NC_024855.1 |
Episyrphus balteatus | NC_036481.1 | 16175 | OM794674.1 | 623 | AY573151.1 | 669 | NA |
Ovis aries | OR459774.1 | 16618 | OR459774.1 | 16618 | OR459774.1 | 16618 | OR459774.1 |
Pan troglodytes | NC_001643.1 | 16554 | AY846293.1 | 634 | AY549430.1 | 330 | NA |
Symphalangus syndactylus | NC_014047.1 | 16516 | NC_014047.1 | 16516 | NC_014047.1 | 16516 | NC_014047.1 |
Drosophila mojavensis | NA | NA | EU493608.1 | 2011 | EU493738.1 | 1460 | EU493868.1 |
Drosophila mulleri | NA | NA | DQ437708.1 | 408 | DQ437711.1 | 688 | DQ437714.1 |
Drosophila montana | MK659829.1 | 15984 | MH423339.1 | 658 | MK659829.1 | 15984 | EU493880.1 |
Drosophila virilis | HQ849831.1 | 15480 | MH423345.1 | 658 | KX275252.1 | 549 | EU493881.1 |
Drosophila kikkawai | MK742876.1 | 16019 | MW122898.1 | 658 | MK659822.1 | 15655 | NA |
Drosophila erecta | MK659815.1 | 16309 | KX771108.1 | 658 | GQ244453.1 | 684 | NA |
Drosophila yakuba | NC_001322.1 | 16019 | KX771113.1 | 658 | X03240.1 | 16019 | X03240.1 |
Drosophila sechellia | AF200832.1 | 14950 | KJ426007.1 | 560 | MK659840.1 | 16689 | NA |
Drosophila simulans | AY518674.1 | 14946 | OK380938.1 | 619 | MN046104.1 | 18956 | MN046104.1 |
Drosophila melanogaster | OR555819.1 | 19116 | OK380942.1 | 574 | EU493757.1 | 1369 | EU493887.1 |
Drosophila bipectinata | MK659811.1 | 15780 | MK551134.1 | 503 | MK659811.1 | 15780 | NA |
Drosophila ananassae | CM029947.1 | 15821 | MW122904.1 | 658 | MK659805.1 | 15830 | NA |
We surprisingly got pretty complete coi data! The lengths reveal some issues where the sequence pulled from GenBank probably wasn’t the best option. I manually checked on those that were further away from 600 bp to see if they have more comparable options on GenBank. The coi gene will be what I construct the tree off of.
The next code chunk converts the accession data into a multi fasta with simple descriptor lines.
# read in your file of accession numbers to feed into script
df <- read_csv("accession/accession_numbers_fixed.csv") %>%
clean_names()
gene_accession_numbers <- df$coi_acc
# Fetching the sequences from GenBank
gene_sequences <- read.GenBank(gene_accession_numbers)
# creates a list of DNAbins
#accession number corresponding to species names/gene
gene_sequences_GenBank_IDs <- paste(attr(gene_sequences, "species"), names(gene_sequences), sep=" coi ")
#, nbcol = 6, colsep = " ", colw = 10 -> possible addition but I do not think it is needed.
# Write the sequences to a FASTA file
write.dna(gene_sequences, file = "accession/coi.fasta", format = "fasta", append = FALSE)
# Read the sequences from the FASTA file
gene_seq_format <- read.FASTA(file = "accession/coi.fasta")
# Modify the names of the sequences
names(gene_seq_format) <- gene_sequences_GenBank_IDs
# Write sequences to a new FASTA file
write.FASTA(gene_seq_format, file = "accession/coi_seq_format.fasta")
# this output a nice multi fasta, I just don't know for sure if I have all the species I'm aiming for
My future goals for making multi fastas is to create a super matrix that consists of multiple genes for all the species so that I can build larger trees.
Next, I wanted to learn how to do an alignment and a simple tree. Here is a section of the alignment I performed in R. There is lots of variation since I am using species that are not closely related.
With DECIPHER and Phangorn, I was able to make a tree with neighbor joining. I do not believe this tree is very accurate.
gimme_fast <- readDNAStringSet("accession/coi_seq_format.fasta") # read in multi fasta as a string set
alignment <- AlignSeqs(gimme_fast) # perform profile-to-profile alignment
distance <- dist.dna(as.DNAbin(alignment)) # find pairwise distances
tree <- njs(distance) # have to use this because there is studd missing from the distance matrix
ape::write.tree(tree, file='tree/tree.txt') # write tree out in newick format
This tree shows weird groupings that I do not trust. I have a lot of drosophila data, but it is weird that it is popping up in random areas.
Using software outside of R (MAFFT & BEAST), I performed an alignment and made a tree with the sequences. While using MAFFT online, I saw that some of my sequences could have been reversed which could explain some of the weird relationships seen with the neighbor joined tree. The relationships shown in the BEAST tree follow expectations more closely.
Let’s look at the phylogeny along side the data. Here are visualizations of the BEAST tree that show the data in continuous scale gradient.
This final plot shows all the data that I was focusing on side by side. While there still doesn’t seem to be a correlation between the data, this visualization helps us to ask more questions. Do organisms with smaller genomes tend to have smaller differences in sex chromosome size? Is this affecting the trends that we are seeing?
I want to collect more data to see if the lack of correlation is due to a small sample size. I also want to create more robust and accurate trees to perform phylogenetically independent contrasts that will help me to see if phylogenetic relationships affect the data.