This is the data for the DEPP project.

1. If you want to place your query sequences onto the WoL tree using DEPP, the file you need is accessory.tar.gz (use DEPP version <= 0.1.36).
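
The archive has to be unpacked before use. A minimal Python sketch using only the standard library is given below; the output directory name "accessory" and the helper name extract_archive are arbitrary choices for illustration, not something DEPP requires.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    # "accessory.tar.gz" is the file named above; the destination directory
    # "accessory" is an assumption made for this sketch.
    if Path("accessory.tar.gz").exists():
        for name in extract_archive("accessory.tar.gz", "accessory"):
            print(name)
```

The same helper should work for the other .tar.gz files listed in this README.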

2. Simulated data
  * simulated_data/agg_dis_mat
    This directory contains the simulated data we used in the paper. For multiple genes, we aggregated the distance matrices from all the genes.
    - simulated_data/agg_dis_mat/$c.tar.gz: data for $c discordance. $c is high, medium or low
    - simulated_data/agg_dis_mat/$c/$r/$n contains data for tree replicate $r and the $n-th gene. Files in it include:
	- depp.csv: distance matrix from the DEPP model before correction
	- depp_correction.csv: distance matrix from the DEPP model after correction
	- model: DEPP model for tree replicate $r and the $n-th gene
	- seq: query sequences for tree replicate $r and the $n-th gene
    - simulated_data/agg_dis_mat/$c/$r/placement contains placement results for tree replicate $r using different numbers of genes. Files in it include:
	- $n.csv: distance matrix obtained by aggregating (n+1) genes
	- $n_placement.newick and $n_placement.jplace: placement tree using (n+1) genes, in Newick and jplace formats
  * simulated_data/cat_seq
    This directory contains the simulated data we used in the paper. For multiple genes, we concatenated the sequences from all the genes.
    - simulated_data/cat_seq/$c.tar.gz: data for $c discordance. $c is high, medium or low
    - simulated_data/cat_seq/high/$r/$n contains data for tree replicate $r with $n genes. Files in it include:
	- $id_seq.fasta: query sequences with id $id
	- $id_true.newick: true placement tree for query $id
	- true.newick: backbone tree file
	- backbone_tree_seq_f.fasta: backbone sequences file
	- $id_depp.csv: distance matrix from the DEPP model after correction, for query $id
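
As a rough illustration of the agg_dis_mat layout described above, the Python sketch below builds the path to one gene directory and reads a distance matrix from it. The file layout is an assumption (a header row of column labels, then one labelled row per query, separated by tabs or commas) and should be checked against an extracted file. The helper names gene_dir, read_distance_matrix and closest_label are hypothetical and not part of DEPP.

```python
import csv
from pathlib import Path


def gene_dir(root, c, r, n):
    """Build the documented path simulated_data/agg_dis_mat/$c/$r/$n under root."""
    return Path(root) / "simulated_data" / "agg_dis_mat" / str(c) / str(r) / str(n)


def read_distance_matrix(path):
    """Read a labelled distance matrix into {row_label: {column_label: distance}}.

    Assumed layout (not stated in this README): a header row of column labels,
    then one row per query whose first field is the row label.
    """
    with open(path, newline="") as handle:
        first_line = handle.readline()
        handle.seek(0)
        # Assumption: fields are separated either by tabs or by commas.
        delimiter = "\t" if "\t" in first_line else ","
        rows = list(csv.reader(handle, delimiter=delimiter))
    columns = rows[0][1:]
    matrix = {}
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {col: float(val) for col, val in zip(columns, row[1:])}
    return matrix


def closest_label(distances):
    """Return the column label with the smallest distance for one query."""
    return min(distances, key=distances.get)
```

For example, read_distance_matrix(gene_dir(".", "high", r, n) / "depp_correction.csv") would load the corrected matrix for one replicate and gene, with r and n replaced by directory names found in the extracted archive.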

3. WoL data
  * WoL_data
    This directory contains the WoL data we used in the paper.
    - wlogdate.nwk: WoL species tree
    - WoL_data/30_marker_genes.tar.gz: data for 30 marker genes with high, medium or low discordance (10 genes for each condition)
	- WoL_data/30_marker_genes/$gene_id: data for gene with id $gene_id
    - WoL_data/50_marker_genes.tar.gz: data for 50 randomly selected marker genes
	- WoL_data/50_marker_genes/$gene_id: data for gene with id $gene_id
    - WoL_data/5s: 5S data
    - WoL_data/16s.tar.gz: 16S data
	- WoL_data/16s/full_length: full length 16S data
	- WoL_data/16s/v3_v4: V3+V4 region of 16S data
	- WoL_data/16s/v4_150: fragment of the V4 region, ~150 bp long
	- WoL_data/16s/v4_100: fragment of the V4 region, ~100 bp long
    - WoL_data/50_marker_genes_cat: data for 50 randomly selected marker genes. The 50 genes are concatenated, and a single DEPP model is trained on the concatenated sequences.
    - Files in each of the above directories include:
	- backbone_seq_id.txt: IDs of the backbone sequences
	- seq.fa: all the sequences of the gene
	- model: DEPP model for the gene
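
The file list above suggests that seq.fa can be split into backbone and non-backbone sequences using backbone_seq_id.txt. A minimal Python sketch follows. It assumes that backbone_seq_id.txt holds one ID per line and that each FASTA header starts with the sequence ID; neither is stated in this README. The helper names are hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into {sequence_id: sequence}."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the ID is the first whitespace-delimited token.
                tokens = line[1:].split()
                current = tokens[0] if tokens else ""
                records[current] = []
            elif current is not None:
                records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def read_ids(path):
    """Read one sequence ID per line (assumed layout), skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def split_by_backbone(sequences, backbone_ids):
    """Split sequences into (backbone, remaining) dictionaries."""
    backbone = {k: v for k, v in sequences.items() if k in backbone_ids}
    remaining = {k: v for k, v in sequences.items() if k not in backbone_ids}
    return backbone, remaining
```

Calling split_by_backbone(read_fasta("seq.fa"), read_ids("backbone_seq_id.txt")) inside one gene directory returns the backbone sequences and all other sequences as two dictionaries.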

4. Traveler's Diarrhea
  * travelers_diarrhea
    This directory contains the Traveler's Diarrhea data we used in the paper.
    - travelers_diarrhea/MAG.tar.gz: MAG data
	- travelers_diarrhea/MAG/$gene_id.fa: query sequences for gene with id $gene_id
    - travelers_diarrhea/ASV: ASV data. Files in the directory include:
	- query.fa: query sequences from Traveler's Diarrhea data
	- backbone.fa: backbone sequences in the WoL tree (trimmed to cover the same region as the ASVs in the queries)