This is data for the DEPP project.

1. If you want to place your query sequences onto the WoL tree using DEPP, the file you need is accessory.tar.gz (use DEPP version <= 0.1.36).

2. Simulated data
  * simulated_data/agg_dis_mat: this directory contains the simulated data we used in the paper. For multiple genes, we summarized the distance matrices from all the genes.
    - simulated_data/agg_dis_mat/$c.tar.gz: data for $c discordance, where $c is high, medium or low
    - simulated_data/agg_dis_mat/$c/$r/$n contains data for tree replicate $r and gene $n. Files in it include:
      - depp.csv: distance matrix from the DEPP model before correction
      - depp_correction.csv: distance matrix from the DEPP model after correction
      - model: model for tree replicate $r and gene $n
      - seq: query sequences for tree replicate $r and gene $n
    - simulated_data/agg_dis_mat/$c/$r/placement contains placement results for tree replicate $r using different numbers of genes. Files in it include:
      - $n.csv: distance matrix summarized over (n+1) genes
      - $n_placement.newick and $n_placement.jplace: placement tree using (n+1) genes, in newick and jplace format
  * simulated_data/cat_seq: this directory contains the simulated data we used in the paper. For multiple genes, we concatenated the sequences from all the genes.
    - simulated_data/cat_seq/$c.tar.gz: data for $c discordance, where $c is high, medium or low
    - simulated_data/cat_seq/high/$r/$n contains data for tree replicate $r with $n genes. Files in it include:
      - $id_seq.fasta: query sequences with id $id
      - $id_true.newick: true placement tree for query $id
      - true.newick: backbone tree file
      - backbone_tree_seq_f.fasta: backbone sequences file
      - $id_depp.csv: distance matrix from the DEPP model after correction, for query $id

3. WoL data
  * WoL_data: this directory contains the WoL data we used in the paper.
    - wlogdate.nwk: WoL species tree
    - WoL_data/30_marker_genes.tar.gz: data for 30 marker genes with high, medium or low discordance (10 genes for each condition)
    - WoL_data/30_marker_genes/$gene_id: data for the gene with id $gene_id
    - WoL_data/50_marker_genes.tar.gz: data for 50 randomly selected marker genes
    - WoL_data/50_marker_genes/$gene_id: data for the gene with id $gene_id
    - WoL_data/5s: 5S data
    - WoL_data/16s.tar.gz: 16S data
    - WoL_data/16s/full_length: full-length 16S data
    - WoL_data/16s/v3_v4: V3+V4 region of 16S data
    - WoL_data/16s/v4_150: part of the V4 region, ~150 bp long
    - WoL_data/16s/v4_100: part of the V4 region, ~100 bp long
    - WoL_data/50_marker_genes_cat: data for 50 randomly selected marker genes. The 50 genes are concatenated and a single DEPP model is trained on the concatenated sequences.
    - Files in the above directories include:
      - backbone_seq_id.txt: sequence ids for the backbone
      - seq.fa: all the sequences of the gene
      - model: DEPP model for the gene

4. Traveler's Diarrhea
  * travelers_diarrhea: this directory contains the Traveler's Diarrhea data we used in the paper.
    - travelers_diarrhea/MAG.tar.gz: MAG data
    - travelers_diarrhea/MAG/$gene_id.fa: query sequences for the gene with id $gene_id
    - travelers_diarrhea/ASV: ASV data. Files in the directory include:
      - query.fa: query sequences from the Traveler's Diarrhea data
      - backbone.fa: backbone sequences in the WoL tree (trimmed to cover the same region as the ASVs in the queries)
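
As an illustration of the simulated_data/agg_dis_mat layout described in section 2, here is a minimal Python sketch that builds the expected file paths. It is not part of DEPP. The function names (gene_dir, gene_files, placement_files) and the root argument are hypothetical, and the sketch assumes that the $c.tar.gz archives have already been unpacked in place and that $r and $n appear in directory names as plain integers. The only rule taken directly from the description above is that $n.csv and $n_placement.* hold results for (n+1) genes.

```python
from pathlib import Path

# Discordance levels named in this README ($c).
DISCORDANCE_LEVELS = ("high", "medium", "low")


def gene_dir(root, discordance, replicate, gene):
    """Return <root>/simulated_data/agg_dis_mat/$c/$r/$n (hypothetical helper)."""
    if discordance not in DISCORDANCE_LEVELS:
        raise ValueError(f"discordance must be one of {DISCORDANCE_LEVELS}")
    # Assumption: replicate and gene are written as plain integers in the paths.
    return (Path(root) / "simulated_data" / "agg_dis_mat"
            / discordance / str(replicate) / str(gene))


def gene_files(root, discordance, replicate, gene):
    """Map each per-gene file listed in the README to its expected path."""
    d = gene_dir(root, discordance, replicate, gene)
    return {
        "depp": d / "depp.csv",                        # distances before correction
        "depp_correction": d / "depp_correction.csv",  # distances after correction
        "model": d / "model",
        "seq": d / "seq",
    }


def placement_files(root, discordance, replicate, num_genes):
    """Return the placement outputs obtained with num_genes genes.

    The README states that $n.csv summarizes (n+1) genes, so the file
    index is num_genes - 1.
    """
    if discordance not in DISCORDANCE_LEVELS:
        raise ValueError(f"discordance must be one of {DISCORDANCE_LEVELS}")
    if num_genes < 1:
        raise ValueError("num_genes must be at least 1")
    n = num_genes - 1
    d = (Path(root) / "simulated_data" / "agg_dis_mat"
         / discordance / str(replicate) / "placement")
    return {
        "distances": d / f"{n}.csv",
        "newick": d / f"{n}_placement.newick",
        "jplace": d / f"{n}_placement.jplace",
    }
```

For example, placement_files("data", "low", 2, num_genes=4) points at 3.csv, 3_placement.newick and 3_placement.jplace under data/simulated_data/agg_dis_mat/low/2/placement. The sketch only builds paths; check that they exist before reading them, since the actual replicate and gene naming may differ from the assumption above.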