GENERAL INFORMATION

1. This is data for the project DEPP: Deep Learning Enables Extending Species Trees using Single Genes (Yueyu Jiang, Metin Balaban, Qiyun Zhu and Siavash Mirarab). 

2. Author information
  * Yueyu Jiang
    - affiliation: Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
    - email: y5jiang@eng.ucsd.edu
  * Metin Balaban
    - affiliation: Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
  * Qiyun Zhu
    - affiliation: Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ 85281, USA
  * Siavash Mirarab
    - affiliation: Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA

3. This dataset includes the trees, sequences and pretrained models used in the manuscript. Pretrained models have the .ckpt extension and can be loaded using train_depp.py and depp_distance.py in the DEPP package. For more information, please refer to the Model training and Calculating distance matrix sections of https://github.com/yueyujiang/DEPP

DATA & FILE OVERVIEW

1. accessory.tar.gz
This contains the data needed for placing queries onto the WoL tree using marker genes or rRNA data.
  * ${gene}_a.fasta: backbone sequences for gene ${gene}
  * ${gene}.nwk: gene tree inferred using FastTree. This is for aligning queries to the backbone sequences using UPP.
  * ${gene}.ckpt: DEPP model
  * wol.nwk: backbone WoL species tree
If you want to place your query sequences onto the WoL tree using DEPP, this is the file you need (use DEPP version >= 0.1.51). The command to use is wol_placement.sh. For more information on usage, please refer to the WoL placement section of https://github.com/yueyujiang/DEPP
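
The file naming scheme above can be checked before running a placement. The following is a minimal sketch, not part of DEPP: it groups archive members by gene using only the suffixes listed above. The function names and the example gene name are hypothetical, and the internal directory layout of the archive is not assumed (only base names are inspected).

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

# Suffixes taken from the file list above; the role names are illustrative.
SUFFIXES = {
    "_a.fasta": "backbone_sequences",
    ".nwk": "gene_tree",
    ".ckpt": "model",
}


def group_by_gene(member_names):
    """Map each gene name to the archive members found for it."""
    genes = defaultdict(dict)
    for name in member_names:
        base = PurePosixPath(name).name
        if base == "wol.nwk":  # species tree, not a per-gene file
            continue
        for suffix, role in SUFFIXES.items():
            if base.endswith(suffix):
                genes[base[: -len(suffix)]][role] = name
                break
    return dict(genes)


def list_archive(path):
    """Read member names from the archive without extracting it."""
    with tarfile.open(path, "r:gz") as tar:
        return group_by_gene(m.name for m in tar.getmembers() if m.isfile())
```

For example, list_archive("accessory.tar.gz") returns one entry per gene, and a gene whose entry lacks "model" or "gene_tree" points to an incomplete download.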

2. 16s_accessory.10k.tar.gz
This contains the data needed for placing queries onto the WoL tree using rRNA data.
  * ${gene}_a.fasta: backbone sequences for gene ${gene}
  * ${gene}.nwk: gene tree inferred using FastTree. This is for aligning queries to the backbone sequences using UPP.
  * ${gene}.ckpt: DEPP model
  * ${gene}.recon.ckpt: DEPP model with reconstruction module
  * ${gene}_emb.pt: backbone embeddings. This is a PyTorch tensor in which each row is the embedding of a backbone sequence. It can be loaded with torch.load.
  * ${gene}_id.pt: backbone sequence IDs. This is a list saved with PyTorch. It can be loaded with torch.load.
  * ${gene}_gap.pt: gap information for the backbone alignment. This is a PyTorch tensor in which each row is a vector with the same length as the aligned sequences, indicating whether each site in the alignment is a gap or a letter. It can be loaded with torch.load.
  * ${gene}_emb.recon.pt: backbone embeddings for the DEPP model with reconstruction module. This is a PyTorch tensor in which each row is the embedding of a backbone sequence. It can be loaded with torch.load.
If you want to place your query sequences onto the WoL tree using DEPP, this is the file you need (use DEPP version >= 0.1.51). The queries should contain only rRNA sequences. The command to use is wol_placement.sh. For more information on usage, please refer to the WoL placement section of https://github.com/yueyujiang/DEPP
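
The .pt files above can be read directly with torch.load. The following is a minimal sketch, not part of DEPP: the helper names and dictionary keys are hypothetical, and the value substituted for ${gene} depends on the actual file names in the archive.

```python
from pathlib import Path


def backbone_files(directory, gene):
    """Return the paths of the PyTorch files listed above for one gene."""
    d = Path(directory)
    return {
        "embeddings": d / f"{gene}_emb.pt",
        "ids": d / f"{gene}_id.pt",
        "gaps": d / f"{gene}_gap.pt",
        "recon_embeddings": d / f"{gene}_emb.recon.pt",
    }


def load_backbone(directory, gene):
    """Load every listed file that exists. Requires PyTorch."""
    import torch  # imported lazily so the path helper works without PyTorch

    loaded = {}
    for key, path in backbone_files(directory, gene).items():
        if path.exists():
            # map_location="cpu" avoids needing a GPU to read the tensors.
            loaded[key] = torch.load(path, map_location="cpu")
    return loaded
```

After loading, comparing the number of rows in the embeddings with the length of the ID list is a quick sanity check. That the rows and IDs are stored in the same order is an assumption here, not something stated above.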

3. Simulated data
  * simulated_data/ils_data
    this directory contains the simulated data with ILS we used in the paper. For multiple genes, we concatenate sequences from all the genes.
    - simulated_data/ils_data/$c: data for model condition $c. model.200.500000.0.000001, model.200.2000000.0.000001 and model.200.10000000.0.000001 correspond to high, medium and low discordance in the paper, respectively.
    - simulated_data/ils_data/$c/$r/$n: data for tree replicate $r with $n genes. Files in it include:
	- seq.fa: sequence file
	- query_label.txt: sequence IDs of the selected queries
	- model.ckpt: DEPP model
    - simulated_data/ils_data/$c/$r/backbone.nwk: backbone tree used in the paper with branch lengths re-estimated using 32 genes
    - simulated_data/ils_data/$c/$r/s_trees.tree: complete species tree
    - simulated_data/ils_data/$c/$r/RAxML_bestTree.r100.run: backbone tree with branch lengths re-estimated by RAxML
    - simulated_data/ils_data/$c/$r/subsample.fa: sequences for re-estimating branch lengths, generated by subsampling sites from 32 genes (each providing 500 sites).
  * simulated_data/hgt_data
    this directory contains the simulated data with HGT we used in the paper.
    - simulated_data/hgt_data/rep.$r: data for tree replicate $r.
    - simulated_data/hgt_data/rep.$r/$g/seq.fa: sequence data for tree replicate $r, gene $g
    - simulated_data/hgt_data/rep.$r/$g/model.ckpt: DEPP model
    - simulated_data/hgt_data/rep.$r/subsample.fa: sequences for re-estimating branch lengths, generated by subsampling sites from 5 genes (each providing 500 sites).
    - simulated_data/hgt_data/rep.$r/RAxML_bestTree.r100.run: backbone tree with branch lengths re-estimated by RAxML
    - simulated_data/hgt_data/rep.$r/s_tree.trees: species tree for tree replicate $r
    - simulated_data/hgt_data/rep.$r/query_label.txt: sequence IDs of the selected queries
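
Since seq.fa holds all sequences and query_label.txt names the selected queries, the two files can be combined to separate queries from the remaining sequences. The following is a minimal sketch under two assumptions that are not stated above: query_label.txt has one sequence ID per line, and those IDs match the first word of the FASTA header lines. The function names are hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into {sequence_id: sequence}."""
    records, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                current = header[0] if header else ""
                records[current] = []
            elif current is not None:
                # Sequences may be wrapped over several lines.
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def split_queries(seq_path, label_path):
    """Split sequences into (non-query, query) dicts using the ID list."""
    with open(label_path) as handle:
        query_ids = {line.strip() for line in handle if line.strip()}
    sequences = read_fasta(seq_path)
    queries = {k: v for k, v in sequences.items() if k in query_ids}
    others = {k: v for k, v in sequences.items() if k not in query_ids}
    return others, queries
```

Treating every non-query sequence as a backbone sequence is a further assumption; check it against the paper before reusing the split.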

4. WoL data
  * WoL_data
    this directory contains the WoL data we used in the paper. 
    - wol.nwk: WoL species tree
    - WoL_data/30_marker_genes: data for 30 marker genes with high, medium or low discordance (10 genes for each condition)
	- WoL_data/30_marker_genes/$gene_id: data for gene with id $gene_id
    - WoL_data/50_marker_genes: data for 50 randomly selected marker genes
	- WoL_data/50_marker_genes/$gene_id: data for gene with id $gene_id
    - WoL_data/5s: 5S data
    - WoL_data/16s: 16S data
	- WoL_data/16s/full_length: full length 16S data
	- WoL_data/16s/v3_v4: V3+V4 region of 16S data
	- WoL_data/16s/v4_150: part of V4 region in length ~150bp
	- WoL_data/16s/v4_100: part of V4 region in length ~100bp
    - files in each of the above directories include:
	- query_label.txt: sequence IDs of the selected queries
	- seq.fa: all the sequences of the gene
	- model.ckpt: DEPP model for the gene
	- model.recon.ckpt (for 30 marker genes): DEPP model with reconstruction network
    - WoL/380_marker_genes: sequences for the 380 WoL marker genes (backbone sequences for the Traveler's Diarrhea experiment in the paper).
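
After unpacking, the per-gene directories can be checked against the file list above. The following is a minimal sketch with a hypothetical function name. It treats model.recon.ckpt as optional because it is only listed for the 30 marker genes.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED = ("query_label.txt", "seq.fa", "model.ckpt")
OPTIONAL = ("model.recon.ckpt",)


def inventory(dataset_dir):
    """Report, for each subdirectory, which expected files are missing."""
    report = {}
    subdirs = sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir())
    for sub in subdirs:
        report[sub.name] = {
            "missing": [f for f in REQUIRED if not (sub / f).is_file()],
            "optional_present": [f for f in OPTIONAL if (sub / f).is_file()],
        }
    return report
```

For example, inventory("WoL_data/30_marker_genes") gives one entry per $gene_id directory, and inventory("WoL_data/16s") gives one entry per 16S region.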

5. Traveler's Diarrhea
  * travelers_diarrhea
    this directory contains the Traveler's Diarrhea data we used in the paper.     
    - travelers_diarrhea/MAG: MAG data
	- travelers_diarrhea/MAG/$gene_id.fa: query sequences for gene with id $gene_id
    - travelers_diarrhea/ASV: ASV data. Files in the directory include:
	- query.fa: query sequences from Traveler's Diarrhea data
	- backbone.fa: backbone sequences in WoL tree (trimmed to have the same region as the ASV in queries)