Project Aim

The purpose of this project is to consider the predictors of longevity in organisms across the tree of life. I was particularly interested in whether there is a correlation between the degree of difference in size of heterogametic sex chromosomes and organism lifespan.


Background Information

The paper The sex with the reduced sex chromosome dies earlier: a comparison across the tree of life (Xirocostas et al., 2020) performs a meta-analysis that shows that the homogametic sex (meaning XX-female or ZZ-male) lives longer than the heterogametic sex. This opens up questions on the importance of chromosome morphology. I wanted to look at the data from this paper and supplement it with my own data on the difference in size between the heteromorphic sex chromosomes. The purpose being to determine whether the degree of differentiation correlates with the lifespan differences in species.


Project Methods

Expanding the code below will show you the packages used in this project.

# CRAN Packages
library(phangorn)
library(tidyverse)
library(janitor)
library(ape)
library(dplyr)
library(kableExtra)
library(seqinr)
library(reutils)
library(shiny)
library(modelr)
library(MASS)
library(easystats)
library(plotly)

# Bioconductor Packages
library(Biostrings)
library(DECIPHER)
library(ggtree)
library(msa)

# Github Packages
library(ggmsa)

Take a look at the data we have

The table below gives a glimpse into the data that was gathered. The majority of the data available was longevity data. Where we are constrained in this study is in the availability of sex specific genome size data. In my digging through literature, I only found about 25 species that have sex specific genome size and lifespan data.

class order family species sex_determination heterogametic_lifespan homogametic_lifespan ln_r_rlifespan population_source lifespan_data_type gs_female_mb gs_male_mb gs_diff_mb
Actinopterygii Salmoniformes Salmonidae Coregonus hoyi male heterogametic 9.00 11.00 0.2006707 wild max NA NA NA
Actinopterygii Salmoniformes Salmonidae Coregonus kiyi male heterogametic 8.00 10.00 0.2231436 wild max NA NA NA
Amphibia Anura Alytidae Discoglossus pictus female heterogametic 3.08 3.00 -0.0263173 wild mean NA NA NA
Amphibia Anura Dicroglossidae Fejervarya limnocharis male heterogametic 2.00 2.20 0.0953102 wild mean NA NA NA
Amphibia Anura Pyxicephalidae Pyxicephalus adspersus female heterogametic 4.00 6.00 0.4054651 wild mean NA NA NA
Amphibia Anura Ranidae Rana sakuraii male heterogametic 3.25 3.45 0.0597192 wild mean NA NA NA

Here, we see a lot of NA’s. These are due to the lack of data found for genome sizes. We will sift through this data, first focusing on lifespan differences which we have more information for. Then we will do a tighter filter, exploring the genome size differences.

Make some initial plots of the data we have

Initial Plots: Lifespan

Lets look at all the species we’ve gathered that have lifespan data for the sexes.

The figure above shows the difference between the lifespan of homogametic and heterogametic sexes of certain species. The x-axis plots ln(homogametic lifespan/heterogametic lifespan) to normalize the results. If the point is greater than 0, the homogametic sex lives longer. If it is less than 0, the heterogametic sex lives longer. The colors are based on whether the species’ sex determination system is female or male heterogametic.

The following interactive version of this plot allows you to look at the data based on class, order, and family by changing the colors of the points.

The following plot is another way of illustrating how much of the difference in lifespan data resides above zero.

From our lifespan plots, we can see that the homogametic sex lives longer. However, this is not new information. We want to see if the degree of difference in the size of the sex chromosomes contributes to the difference in longevity. To do that we have to narrow our data again to the observations where genome size data was found.

Initial Plots: Genome Size

Let’s now look at all the species who have both lifespan and genome size data. This is the smaller dataset of 25 species.

This plot reveals that most of the data we have is male heterogametic. There is only one female heterogametic data point. Also, the outliers make it appear as though there is a trend, but really it is hard to believe that there is actually a positive trend worth noting. From this plot, we can not conclude a correlation between the difference in size of the heterogametic sex chromosomes and the difference in lifespan between the sexes.

Lets quickly restrict the data to a subset of -250:250 Mb to see how the trendline behaves.

As expected, when the more extreme values are removed. the trendline starts to flatten out. There does not seem to be a significant relationship between difference in size of the sex chromosomes and longevity.

I want to collect more data on the longevity and differences in genome size of species so that I can further investigate a pattern. Until then, let’s look at the linear model of the data and see what predictions would be made if the data followed the model.

View predictions

Model for the smaller subset of Genome Size data

The only predictor being explored for this data set is the difference in genome size data. This makes a simple linear model that predicts along the equation y = 0.1276 + 0.0004x. Only 0.55% of the variance is explained by the predictor. Here are the predictions of difference in lifespan data based off the real data.

This model informs us that the p-value is 0.742. There is no significant relationship between the difference in lifespan and the difference in genome size shown by our sample.

Build a phylogeny

Though I still have the goal of collecting more data, I wanted to learn a method for creating a phylogenetic tree so that I could implement it in further analyses.

Get accession numbers

Dr. Carl Hjelman at Utah Valley University worked with me on this piece of code. It’s purpose is to take a list of species and genes from the user and create a data frame of their accession numbers from GenBank. The lengths of the genes found are included in the table to help identify suspicious results.

#### Retrieving Accession Numbers from Genbank ####
#### function that takes a term and finds genbank data for it ####
grab.results <- function (term) {
  # Search for the given term on nuccore. This gives us a list of
  # record IDs.
  ids <- esearch(term, db="nuccore")
  # Grab summaries for the given record IDs, as a sort-of data frame.
  sum <- esummary(ids, db="nuccore")
  data <- content(sum, as="parsed")
  # For some reason, this parser gives us lists of lists instead of a
  # proper data frame (which should be lists of vectors). Return a
  # fixed-up version.
  data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
}

# read in my data and only use the data that I have both lifespan and genome size data for
spec<-read.csv("lifespan_species.csv")
# put in list of genes to collect accessions for
gene.list<-c("mitochondrion","coi","coii", "coiii", "cytb", "hunchback")
# make matrix to be populated
accession_out<-matrix(,nrow=length(spec$names),ncol=(length(gene.list)*2+1))
# make list of column names for table
columns <- list()
columns <- append(columns, "Spec")
for(i in 1:length(gene.list)){
  columns <- append(columns, paste(gene.list[i],"Acc"))
  columns <- append(columns, "Len")
}
colnames(accession_out) <- columns

#loop to add data to table
for(i in 1:length(spec$names)){
  accession_out[i,1] <- spec$names[i]
  for(k in 1:6){
    testing<-(grab.results(term=(paste(spec$names[i],gene.list[k]))))
    if(length(testing)<1){
      (k<-(k+1))}
    #the below lets us know if after we filter to only DNA and remove other species that we still have data
    #this lets us filter out RNA and also organisms that don't match but might appear
    #like parasites, worms, etc.
    else if(length((subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))[1])
            <1){
      k<-(k+1)
    }
    #this pulls data if we have data, DNA, and the right species, and filters out things larger than mitochondria
    else{
      testagain<-(subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))
      accession_out[i,(k*2)]<-testagain$AccessionVersion[1]
      accession_out[i,(k*2+1)]<-testagain$Slen[1]}
  }
}

write.csv(accession_out,"accession/accession_numbers.csv", row.names = FALSE)
Spec mitochondrion Acc Len…3 coi Acc Len…5 coii Acc Len…7 coiii Acc
Amblyomma cajennense NC_020333.1 14780 MF363087.1 489 KY284602.1 488 NA
Aphelocoma coerulescens NA NA DQ432737.1 652 NA NA NA
Nauphoeta cinerea NA NA JN615381.1 1362 MH360271.1 709 NA
Periplaneta americana NA NA JX402724.1 642 NC_016956.1 15584 NC_016956.1
Callosobruchus maculatus NC_053358.1 18321 MG458971.1 657 CAACVG010000153.1 14841 CAACVG010002514.1
Chrysomya megacephala NC_019633.1 15273 JX402727.1 618 JN014846.1 2303 AJ426041.2
Cochliomyia hominivorax NA NA MT020449.1 494 MT876038.1 686 NA
Lucilia cuprina NA NA OR574367.1 663 JN014889.1 2304 NA
Musca domestica NC_024855.1 16108 JX402726.1 657 NC_024855.1 16108 NC_024855.1
Episyrphus balteatus NC_036481.1 16175 OM794674.1 623 AY573151.1 669 NA
Ovis aries OR459774.1 16618 OR459774.1 16618 OR459774.1 16618 OR459774.1
Pan troglodytes NC_001643.1 16554 AY846293.1 634 AY549430.1 330 NA
Symphalangus syndactylus NC_014047.1 16516 NC_014047.1 16516 NC_014047.1 16516 NC_014047.1
Drosophila mojavensis NA NA EU493608.1 2011 EU493738.1 1460 EU493868.1
Drosophila mulleri NA NA DQ437708.1 408 DQ437711.1 688 DQ437714.1
Drosophila montana MK659829.1 15984 MH423339.1 658 MK659829.1 15984 EU493880.1
Drosophila virilis HQ849831.1 15480 MH423345.1 658 KX275252.1 549 EU493881.1
Drosophila kikkawai MK742876.1 16019 MW122898.1 658 MK659822.1 15655 NA
Drosophila erecta MK659815.1 16309 KX771108.1 658 GQ244453.1 684 NA
Drosophila yakuba NC_001322.1 16019 KX771113.1 658 X03240.1 16019 X03240.1
Drosophila sechellia AF200832.1 14950 KJ426007.1 560 MK659840.1 16689 NA
Drosophila simulans AY518674.1 14946 OK380938.1 619 MN046104.1 18956 MN046104.1
Drosophila melanogaster OR555819.1 19116 OK380942.1 574 EU493757.1 1369 EU493887.1
Drosophila bipectinata MK659811.1 15780 MK551134.1 503 MK659811.1 15780 NA
Drosophila ananassae CM029947.1 15821 MW122904.1 658 MK659805.1 15830 NA

We surprisingly got pretty complete coi data! The lengths reveal some issues where the sequence pulled from GenBank probably wasn’t the best option. I manually checked on those that were further away from 600 bp to see if they have more comparable options on GenBank. The coi gene will be what I construct the tree off of.

Make multi fasta

The next code chunk converts the accession data into a multi fasta with simple descriptor lines.

# read in your file of accession numbers to feed into script
df <- read_csv("accession/accession_numbers_fixed.csv") %>% 
  clean_names()

gene_accession_numbers <- df$coi_acc

# Fetching the sequences from GenBank
gene_sequences <- read.GenBank(gene_accession_numbers)
# creates a list of DNAbins

#accession number corresponding to species names/gene
gene_sequences_GenBank_IDs <- paste(attr(gene_sequences, "species"), names(gene_sequences), sep=" coi ")
#, nbcol = 6, colsep = " ", colw = 10 -> possible addition but I do not think it is needed. 

# Write the sequences to a FASTA file 
write.dna(gene_sequences, file = "accession/coi.fasta", format = "fasta", append = FALSE)

# Read the sequences from the FASTA file
gene_seq_format <- read.FASTA(file = "accession/coi.fasta")

# Modify the names of the sequences
names(gene_seq_format) <- gene_sequences_GenBank_IDs

# Write sequences to a new FASTA file 
write.FASTA(gene_seq_format, file = "accession/coi_seq_format.fasta")

# this output a nice multi fasta, I just don't know for sure if I have all the species I'm aiming for

My future goals for making multi fastas is to create a super matrix that consists of multiple genes for all the species so that I can build larger trees.

Conduct an alignment

Next, I wanted to learn how to do an alignment and a simple tree. Here is a section of the alignment I performed in R. There is lots of variation since I am using species that are not closely related.

Build Trees

With DECIPHER and Phangorn, I was able to make a tree with neighbor joining. I do not believe this tree is very accurate.

gimme_fast <- readDNAStringSet("accession/coi_seq_format.fasta") # read in multi fasta as a string set
alignment <- AlignSeqs(gimme_fast) # perform profile-to-profile alignment
distance <- dist.dna(as.DNAbin(alignment)) # find pairwise distances
tree <- njs(distance) # have to use this because there is studd missing from the distance matrix
ape::write.tree(tree, file='tree/tree.txt') # write tree out in newick format

This tree shows weird groupings that I do not trust. I have a lot of drosophila data, but it is weird that it is popping up in random areas.

Using software outside of R (MAFFT & BEAST), I performed an alignment and made a tree with the sequences. While using MAFFT online, I saw that some of my sequences could have been reversed which could explain some of the weird relationships seen with the neighbor joined tree. The relationships shown in the BEAST tree follow expectations more closely.

Let’s look at the phylogeny along side the data. Here are visualizations of the BEAST tree that show the data in continuous scale gradient.

This final plot shows all the data that I was focusing on side by side. While there still doesn’t seem to be a correlation between the data, this visualization helps us to ask more questions. Do organisms with smaller genomes tend to have smaller differences in sex chromosome size? Is this affecting the trends that we are seeing?

Future Steps

I want to collect more data to see if the lack of correlation is due to a small sample size. I also want to create more robust and accurate trees to perform phylogenetically independent contrasts that will help me to see if phylogenetic relationships affect the data.