Relationship Between Heteromorphic Sex Chromosomes and Organism Longevity

Project Aim

The purpose of this project is to consider the predictors of longevity in organisms across the tree of life. I was particularly interested in whether there is a correlation between the degree of difference in size of heterogametic sex chromosomes and organism lifespan.

Background Information

The paper The sex with the reduced sex chromosome dies earlier: a comparison across the tree of life (Xirocostas et al., 2020) performs a meta-analysis that shows that the homogametic sex (meaning XX-female or ZZ-male) lives longer than the heterogametic sex. This opens up questions on the importance of chromosome morphology. I wanted to look at the data from this paper and supplement it with my own data on the difference in size between the heteromorphic sex chromosomes. The purpose being to determine whether the degree of differentiation correlates with the lifespan differences in species.

Project Methods

Expanding the code below will show you the packages used in this project.

# CRAN Packages
library(phangorn)
library(tidyverse)
library(janitor)
library(ape)
library(dplyr)
library(kableExtra)
library(seqinr)
library(reutils)
library(shiny)
library(modelr)
library(MASS)
library(easystats)
library(plotly)

# Bioconductor Packages
library(Biostrings)
library(DECIPHER)
library(ggtree)
library(msa)

# Github Packages
library(ggmsa)

Take a look at the data we have

The table below gives a glimpse into the data that was gathered. The majority of the data available was longevity data. Where we are constrained in this study is in the availability of sex specific genome size data. In my digging through literature, I only found about 25 species that have sex specific genome size and lifespan data.

class	order	family	species	sex_determination	heterogametic_lifespan	homogametic_lifespan	ln_r_rlifespan	population_source	lifespan_data_type	gs_female_mb	gs_male_mb	gs_diff_mb
Actinopterygii	Salmoniformes	Salmonidae	Coregonus hoyi	male heterogametic	9.00	11.00	0.2006707	wild	max	NA	NA	NA
Actinopterygii	Salmoniformes	Salmonidae	Coregonus kiyi	male heterogametic	8.00	10.00	0.2231436	wild	max	NA	NA	NA
Amphibia	Anura	Alytidae	Discoglossus pictus	female heterogametic	3.08	3.00	-0.0263173	wild	mean	NA	NA	NA
Amphibia	Anura	Dicroglossidae	Fejervarya limnocharis	male heterogametic	2.00	2.20	0.0953102	wild	mean	NA	NA	NA
Amphibia	Anura	Pyxicephalidae	Pyxicephalus adspersus	female heterogametic	4.00	6.00	0.4054651	wild	mean	NA	NA	NA
Amphibia	Anura	Ranidae	Rana sakuraii	male heterogametic	3.25	3.45	0.0597192	wild	mean	NA	NA	NA

Here, we see a lot of NA’s. These are due to the lack of data found for genome sizes. We will sift through this data, first focusing on lifespan differences which we have more information for. Then we will do a tighter filter, exploring the genome size differences.

Make some initial plots of the data we have

Initial Plots: Lifespan

Lets look at all the species we’ve gathered that have lifespan data for the sexes.

The figure above shows the difference between the lifespan of homogametic and heterogametic sexes of certain species. The x-axis plots ln(homogametic lifespan/heterogametic lifespan) to normalize the results. If the point is greater than 0, the homogametic sex lives longer. If it is less than 0, the heterogametic sex lives longer. The colors are based on whether the species’ sex determination system is female or male heterogametic.

The following interactive version of this plot allows you to look at the data based on class, order, and family by changing the colors of the points.

The following plot is another way of illustrating how much of the difference in lifespan data resides above zero.

From our lifespan plots, we can see that the homogametic sex lives longer. However, this is not new information. We want to see if the degree of difference in the size of the sex chromosomes contributes to the difference in longevity. To do that we have to narrow our data again to the observations where genome size data was found.

Initial Plots: Genome Size

Let’s now look at all the species who have both lifespan and genome size data. This is the smaller dataset of 25 species.

This plot reveals that most of the data we have is male heterogametic. There is only one female heterogametic data point. Also, the outliers make it appear as though there is a trend, but really it is hard to believe that there is actually a positive trend worth noting. From this plot, we can not conclude a correlation between the difference in size of the heterogametic sex chromosomes and the difference in lifespan between the sexes.

Lets quickly restrict the data to a subset of -250:250 Mb to see how the trendline behaves.

As expected, when the more extreme values are removed. the trendline starts to flatten out. There does not seem to be a significant relationship between difference in size of the sex chromosomes and longevity.

I want to collect more data on the longevity and differences in genome size of species so that I can further investigate a pattern. Until then, let’s look at the linear model of the data and see what predictions would be made if the data followed the model.

View predictions

Model for the smaller subset of Genome Size data

The only predictor being explored for this data set is the difference in genome size data. This makes a simple linear model that predicts along the equation y = 0.1276 + 0.0004x. Only 0.55% of the variance is explained by the predictor. Here are the predictions of difference in lifespan data based off the real data.

This model informs us that the p-value is 0.742. There is no significant relationship between the difference in lifespan and the difference in genome size shown by our sample.

Build a phylogeny

Though I still have the goal of collecting more data, I wanted to learn a method for creating a phylogenetic tree so that I could implement it in further analyses.

Get accession numbers

Dr. Carl Hjelman at Utah Valley University worked with me on this piece of code. It’s purpose is to take a list of species and genes from the user and create a data frame of their accession numbers from GenBank. The lengths of the genes found are included in the table to help identify suspicious results.

#### Retrieving Accession Numbers from Genbank ####
#### function that takes a term and finds genbank data for it ####
grab.results <- function (term) {
  # Search for the given term on nuccore. This gives us a list of
  # record IDs.
  ids <- esearch(term, db="nuccore")
  # Grab summaries for the given record IDs, as a sort-of data frame.
  sum <- esummary(ids, db="nuccore")
  data <- content(sum, as="parsed")
  # For some reason, this parser gives us lists of lists instead of a
  # proper data frame (which should be lists of vectors). Return a
  # fixed-up version.
  data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
}

# read in my data and only use the data that I have both lifespan and genome size data for
spec<-read.csv("lifespan_species.csv")
# put in list of genes to collect accessions for
gene.list<-c("mitochondrion","coi","coii", "coiii", "cytb", "hunchback")
# make matrix to be populated
accession_out<-matrix(,nrow=length(spec$names),ncol=(length(gene.list)*2+1))
# make list of column names for table
columns <- list()
columns <- append(columns, "Spec")
for(i in 1:length(gene.list)){
  columns <- append(columns, paste(gene.list[i],"Acc"))
  columns <- append(columns, "Len")
}
colnames(accession_out) <- columns

#loop to add data to table
for(i in 1:length(spec$names)){
  accession_out[i,1] <- spec$names[i]
  for(k in 1:6){
    testing<-(grab.results(term=(paste(spec$names[i],gene.list[k]))))
    if(length(testing)<1){
      (k<-(k+1))}
    #the below lets us know if after we filter to only DNA and remove other species that we still have data
    #this lets us filter out RNA and also organisms that don't match but might appear
    #like parasites, worms, etc.
    else if(length((subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))[1])
            <1){
      k<-(k+1)
    }
    #this pulls data if we have data, DNA, and the right species, and filters out things larger than mitochondria
    else{
      testagain<-(subset(testing, testing$MolType == "dna" & testing$Organism == spec$names[i] & as.numeric(testing$Slen)<21000))
      accession_out[i,(k*2)]<-testagain$AccessionVersion[1]
      accession_out[i,(k*2+1)]<-testagain$Slen[1]}
  }
}

write.csv(accession_out,"accession/accession_numbers.csv", row.names = FALSE)

Spec	mitochondrion Acc	Len…3	coi Acc	Len…5	coii Acc	Len…7	coiii Acc
Amblyomma cajennense	NC_020333.1	14780	MF363087.1	489	KY284602.1	488	NA
Aphelocoma coerulescens	NA	NA	DQ432737.1	652	NA	NA	NA
Nauphoeta cinerea	NA	NA	JN615381.1	1362	MH360271.1	709	NA
Periplaneta americana	NA	NA	JX402724.1	642	NC_016956.1	15584	NC_016956.1
Callosobruchus maculatus	NC_053358.1	18321	MG458971.1	657	CAACVG010000153.1	14841	CAACVG010002514.1
Chrysomya megacephala	NC_019633.1	15273	JX402727.1	618	JN014846.1	2303	AJ426041.2
Cochliomyia hominivorax	NA	NA	MT020449.1	494	MT876038.1	686	NA
Lucilia cuprina	NA	NA	OR574367.1	663	JN014889.1	2304	NA
Musca domestica	NC_024855.1	16108	JX402726.1	657	NC_024855.1	16108	NC_024855.1
Episyrphus balteatus	NC_036481.1	16175	OM794674.1	623	AY573151.1	669	NA
Ovis aries	OR459774.1	16618	OR459774.1	16618	OR459774.1	16618	OR459774.1
Pan troglodytes	NC_001643.1	16554	AY846293.1	634	AY549430.1	330	NA
Symphalangus syndactylus	NC_014047.1	16516	NC_014047.1	16516	NC_014047.1	16516	NC_014047.1
Drosophila mojavensis	NA	NA	EU493608.1	2011	EU493738.1	1460	EU493868.1
Drosophila mulleri	NA	NA	DQ437708.1	408	DQ437711.1	688	DQ437714.1
Drosophila montana	MK659829.1	15984	MH423339.1	658	MK659829.1	15984	EU493880.1
Drosophila virilis	HQ849831.1	15480	MH423345.1	658	KX275252.1	549	EU493881.1
Drosophila kikkawai	MK742876.1	16019	MW122898.1	658	MK659822.1	15655	NA
Drosophila erecta	MK659815.1	16309	KX771108.1	658	GQ244453.1	684	NA
Drosophila yakuba	NC_001322.1	16019	KX771113.1	658	X03240.1	16019	X03240.1
Drosophila sechellia	AF200832.1	14950	KJ426007.1	560	MK659840.1	16689	NA
Drosophila simulans	AY518674.1	14946	OK380938.1	619	MN046104.1	18956	MN046104.1
Drosophila melanogaster	OR555819.1	19116	OK380942.1	574	EU493757.1	1369	EU493887.1
Drosophila bipectinata	MK659811.1	15780	MK551134.1	503	MK659811.1	15780	NA
Drosophila ananassae	CM029947.1	15821	MW122904.1	658	MK659805.1	15830	NA

We surprisingly got pretty complete coi data! The lengths reveal some issues where the sequence pulled from GenBank probably wasn’t the best option. I manually checked on those that were further away from 600 bp to see if they have more comparable options on GenBank. The coi gene will be what I construct the tree off of.

Make multi fasta

The next code chunk converts the accession data into a multi fasta with simple descriptor lines.

# read in your file of accession numbers to feed into script
df <- read_csv("accession/accession_numbers_fixed.csv") %>% 
  clean_names()

gene_accession_numbers <- df$coi_acc

# Fetching the sequences from GenBank
gene_sequences <- read.GenBank(gene_accession_numbers)
# creates a list of DNAbins

#accession number corresponding to species names/gene
gene_sequences_GenBank_IDs <- paste(attr(gene_sequences, "species"), names(gene_sequences), sep=" coi ")
#, nbcol = 6, colsep = " ", colw = 10 -> possible addition but I do not think it is needed. 

# Write the sequences to a FASTA file 
write.dna(gene_sequences, file = "accession/coi.fasta", format = "fasta", append = FALSE)

# Read the sequences from the FASTA file
gene_seq_format <- read.FASTA(file = "accession/coi.fasta")

# Modify the names of the sequences
names(gene_seq_format) <- gene_sequences_GenBank_IDs

# Write sequences to a new FASTA file 
write.FASTA(gene_seq_format, file = "accession/coi_seq_format.fasta")

# this output a nice multi fasta, I just don't know for sure if I have all the species I'm aiming for

My future goals for making multi fastas is to create a super matrix that consists of multiple genes for all the species so that I can build larger trees.

Conduct an alignment

Next, I wanted to learn how to do an alignment and a simple tree. Here is a section of the alignment I performed in R. There is lots of variation since I am using species that are not closely related.

Build Trees

With DECIPHER and Phangorn, I was able to make a tree with neighbor joining. I do not believe this tree is very accurate.

gimme_fast <- readDNAStringSet("accession/coi_seq_format.fasta") # read in multi fasta as a string set
alignment <- AlignSeqs(gimme_fast) # perform profile-to-profile alignment
distance <- dist.dna(as.DNAbin(alignment)) # find pairwise distances
tree <- njs(distance) # have to use this because there is studd missing from the distance matrix
ape::write.tree(tree, file='tree/tree.txt') # write tree out in newick format

This tree shows weird groupings that I do not trust. I have a lot of drosophila data, but it is weird that it is popping up in random areas.

Using software outside of R (MAFFT & BEAST), I performed an alignment and made a tree with the sequences. While using MAFFT online, I saw that some of my sequences could have been reversed which could explain some of the weird relationships seen with the neighbor joined tree. The relationships shown in the BEAST tree follow expectations more closely.

Let’s look at the phylogeny along side the data. Here are visualizations of the BEAST tree that show the data in continuous scale gradient.

This final plot shows all the data that I was focusing on side by side. While there still doesn’t seem to be a correlation between the data, this visualization helps us to ask more questions. Do organisms with smaller genomes tend to have smaller differences in sex chromosome size? Is this affecting the trends that we are seeing?

Future Steps

I want to collect more data to see if the lack of correlation is due to a small sample size. I also want to create more robust and accurate trees to perform phylogenetically independent contrasts that will help me to see if phylogenetic relationships affect the data.