INSTALLATION AND IMPLEMENTATION OF D-I-TASSER SUITE
(Copyright 2020 by Zhang Lab, University of Michigan, All rights reserved)
(Version 2.0, 2021/04/30)

1. What is D-I-TASSER Suite?

   The D-I-TASSER Suite is a composite package of programs for protein
   structure prediction and function annotation. The Suite includes the
   following programs:

   a) D-I-TASSER: A hierarchical program for protein structure prediction
   b) DeepMSA/DeepMSA2: A program for multiple sequence alignment generation
   c) MUSTER: A threading program for protein template identification
   d) CEthreader: A contact-based threading program for protein template
      identification
   e) LOMETS3: A meta-server approach consisting of multiple threading
      programs
   f) AttentionPotential: An attention-network-based deep-learning
      algorithm for residue-residue contact/distance prediction
   g) DeepPotential: A residual-convolutional-network-based deep-learning
      algorithm for residue-residue contact/distance/hydrogen-bond
      prediction
   h) ResTriplet, TripletRes, ResPRE, ResPLM and DeepPLM: Deep-learning-based
      programs for residue-residue contact prediction
   i) DeepFold: A protein ab initio structure prediction program based on
      AttentionPotential or DeepPotential predicted restraints
   j) PotentialFold: A protein ab initio structure prediction program based
      on AttentionPotential or DeepPotential predicted restraints
   k) SPICKER: A clustering program for structure decoy selection
   l) HAAD: Quickly adding hydrogen atoms to protein heavy-atom structures
   m) EDTSurf: Construct triangulated surfaces of protein molecules
   n) ModRefiner: Construct and refine atomic models from C-alpha traces
   o) NWalign: Protein sequence alignments by the Needleman-Wunsch algorithm
   p) PSSpred: A program for Protein Secondary Structure PREDiction
   q) ResQ: An algorithm to estimate B-factor and residue-level error of
      models
   r) COACH: A function annotation program based on COFACTOR, TM-SITE and
      S-SITE
   s) COFACTOR: A program for ligand-binding site, EC number & GO term
      prediction
   t) TM-SITE: A structure-based approach for ligand-binding site prediction
   u) S-SITE: A sequence-based approach for ligand-binding site prediction
   v) AlphaFold2: A third-party protein structure prediction software
      developed by DeepMind, used in the D-I-TASSER-AF2 pipeline

2. How to install the D-I-TASSER Suite?

   a) Download the D-I-TASSER Suite 'D-I-TASSER-2.0.tar.bz2' from
      http://zhanglab.dcmb.med.umich.edu/D-I-TASSER/download.html
      and unpack 'D-I-TASSER-2.0.tar.bz2' by
      > tar -xvf D-I-TASSER-2.0.tar.bz2
      The root path of this package is called $pkgdir, e.g.
      /home/yourname/D-I-TASSER-2.0. You should have all the programs
      under this directory. You can install the package at any location
      on your computer.

   b) Download the D-I-TASSER and COACH library files from
      https://zhanglab.dcmb.med.umich.edu/D-I-TASSER/download.html
      http://zhanglab.dcmb.med.umich.edu/BioLiP/
      A script 'download_lib.pl' is provided in the package for automated
      download and update of the libraries. We recommend putting the
      library files under the path /home/yourname/ITLIB.

   c) Third-party software installation:
      While the majority of programs in the package
      'D-I-TASSER-2.0.tar.bz2' were developed in the Zhang Lab, for which
      permission of use is granted herein, some programs and databases
      (including alphafold2, blast, nr, GOparser, uniclust30, uniref90,
      bfd, mgnify and metaclust) were developed by third-party groups.
      Default versions of alphafold2 (modified by our group), blast and nr
      are included in the package. It is the user's obligation to obtain
      license permission from the developers of all third-party software
      before using it.

      In addition, your system needs to have Java, python2 and python3
      installed. The python3 installation must support pytorch 1.9.0 for
      AttentionPotential, DeepPotential and PotentialFold;
      Anaconda3+pytorch is recommended.
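The interpreter requirements above can be checked before a first run. The following is a minimal sketch and not part of the package: the helper name check_tool is hypothetical, and the tool list is simply the interpreters named in this section.

```shell
# Pre-flight check (illustrative sketch; check_tool is a hypothetical helper,
# not a script shipped with the D-I-TASSER Suite).

# Print whether a command is on PATH; return non-zero if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in java python2 python3; do
    check_tool "$tool" || missing=$((missing + 1))
done

# PyTorch is needed by the deep-learning predictors; report its version
# if the python3 found on PATH can import it.
if command -v python3 >/dev/null 2>&1; then
    python3 -c 'import torch; print("pytorch", torch.__version__)' 2>/dev/null \
        || echo "missing: pytorch (for the python3 found on PATH)"
fi

echo "$missing required tool(s) missing"
```

If an interpreter lives outside PATH (for example inside an Anaconda3 environment), the check reports it as missing even though it can still be passed to the main script through the -python2, -python3 or -java_home arguments.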
      To use DeepMSA, you need to download uniclust30, uniref90 and
      metaclust from
      http://gwdu111.gwdg.de/~compbiol/uniclust/2017_04/uniclust30_2017_04_hhsuite.tar.gz
      ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
      https://metaclust.mmseqs.org/2017_05/metaclust_2017_05.fasta.gz
      After you unpack them, put each folder into the D-I-TASSER library
      folder (i.e. the folder that holds your PDB, MTX and DEP folders).
      Then rename the folder uniclust30_xxx_xxx to uniclust30, uniref90_xxx
      to uniref90, and metaclust_xxx to metaclust.

      Next, use $pkgdir/contact/DeepMSA/bin/esl-sfetch to create the .ssi
      index for uniref90 and metaclust, where $pkgdir is the path of the
      D-I-TASSER Suite package. For example, if the database in the
      uniref90 folder is named uniref90.fasta, go to the uniref90 folder
      and run
      > $pkgdir/contact/DeepMSA/bin/esl-sfetch --index uniref90.fasta
      When the command finishes you will find a new file named
      uniref90.fasta.ssi. Do the same for the metaclust database.

      If you use a different version of uniclust30, uniref90 or metaclust,
      open $pkgdir/I-TASSERmod/runI-TASSER.pl and change the variables
         $hhbdbdir = "$libdir/uniclust30";
         $jacdbdir = "$libdir/uniref90";
         $hmsdbdir = "$libdir/metaclust";
         $hhbdb    = "$libdir/uniclust30/uniclust30_2017_04";
         $jacdb    = "$libdir/uniref90/uniref90.fasta";
         $hmsdb    = "$libdir/metaclust/metaclust.fasta";

      To use DeepMSA2, you need to download all DeepMSA databases plus the
      bfd and mgnify databases. Please follow the DeepMSA instructions
      above to download uniclust30, uniref90 and metaclust and to create
      their index files.
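The DeepMSA indexing step can be scripted so that it is safe to re-run. This is a minimal sketch under stated assumptions: index_db is a hypothetical helper, the default values of pkgdir and ITLIB are placeholders to be replaced by your own paths, and the FASTA file names are the ones used in the runI-TASSER.pl variables quoted in this section.

```shell
# Index the DeepMSA FASTA databases (illustrative sketch; index_db is a
# hypothetical helper, and the two default paths below are placeholders).
pkgdir="${pkgdir:-/home/yourname/D-I-TASSER-2.0}"
ITLIB="${ITLIB:-/home/yourname/ITLIB}"
sfetch="$pkgdir/contact/DeepMSA/bin/esl-sfetch"

# Build the .ssi index for one FASTA database unless it already exists.
index_db() {
    fasta="$1"
    if [ ! -f "$fasta" ]; then
        echo "skip: $fasta not found"
    elif [ -f "$fasta.ssi" ]; then
        echo "ok:   $fasta.ssi already exists"
    else
        # Run inside the database folder, as in the manual instructions;
        # the index is written next to the FASTA file.
        (cd "$(dirname "$fasta")" && "$sfetch" --index "$(basename "$fasta")")
    fi
}

index_db "$ITLIB/uniref90/uniref90.fasta"
index_db "$ITLIB/metaclust/metaclust.fasta"
```

The helper only reports a missing database rather than failing, so the same lines can be used to confirm which indexes are still outstanding.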
      Then download the bfd and mgnify databases from
      http://wwwuser.gwdg.de/~mmirdit/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
      https://zhanglab.dcmb.med.umich.edu/D-I-TASSER/lib/mgnify/mgy_clusters.clean.fasta
      https://zhanglab.dcmb.med.umich.edu/D-I-TASSER/lib/mgnify/mgy_clusters.clean.fasta.ssi
      Put all mgnify database files into a folder named mgnify inside the
      D-I-TASSER library folder (i.e. the folder that holds your PDB, MTX
      and DEP folders). After you unpack
      bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz, put
      all files in one folder, rename it to bfd, and move the bfd folder
      into the D-I-TASSER library folder.

      If you run the command "ls $ITLIB/mgnify" you should see
         mgy_clusters.clean.fasta
         mgy_clusters.clean.fasta.ssi
      If you run the command "ls $ITLIB/bfd" you should see
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
         bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
      Here, $ITLIB is the path of your D-I-TASSER library.

      If you use a different version of mgnify or bfd, open
      $pkgdir/I-TASSERmod/runI-TASSER.pl and change the variables
         $hh3dbdir = "$libdir/bfd";
         $mgydbdir = "$libdir/mgnify";
         $hh3db    = "$libdir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt";
         $mgydb    = "$libdir/mgnify/mgy_clusters.clean.fasta";
      and open $pkgdir/contact/DeepMSA2/scripts/DeepMSA2_noIMG.pl and
      change the variables
         my $qhhblitsdb="$ITlibdir/uniclust30/uniclust30_2017_04";
         my $qjackhmmerdb="$ITlibdir/uniref90/uniref90.fasta";
         my $qhhblits3db="$ITlibdir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt";
         my $qhmmsearchdb="$ITlibdir/mgnify/mgy_clusters.clean.fasta";
         my $dhhblitsdb="$ITlibdir/uniclust30/uniclust30_2017_04";
         my $djackhmmerdb="$ITlibdir/uniref90/uniref90.fasta";
         my $dhmmsearchdb="$ITlibdir/metaclust/metaclust.fasta:$ITlibdir/mgnify/mgy_clusters.clean.fasta";

      To use DeepMSA2-IMG, you need to download all DeepMSA and DeepMSA2
      databases plus our JGI databases. Please follow the DeepMSA and
      DeepMSA2 instructions above to download uniclust30, uniref90, mgnify,
      bfd and metaclust and to create their index files. Then download the
      IMG/JGI databases from
      https://zhanglab.dcmb.med.umich.edu/D-I-TASSER/lib/index.html
      Put all individual database files into a folder named JGI inside the
      D-I-TASSER library folder (i.e. the folder that holds your PDB, MTX
      and DEP folders).

      If you run the command "ls $ITLIB/JGI" you should see
         list
         DB.fasta.xxxx
         DB.fasta.xxxx.ssi
      Make sure the 'list' file contains one line for each database file.
      For example, if your JGI folder contains
         list
         DB.fasta.aa
         DB.fasta.aa.ssi
         DB.fasta.ab
         DB.fasta.ab.ssi
      then 'list' should contain two lines:
         DB.fasta.aa
         DB.fasta.ab
      Here, $ITLIB is the path of your D-I-TASSER library.

      If you use a different version of JGI, open
      $pkgdir/I-TASSERmod/runI-TASSER.pl and change the variable
         $jgidbdir = "$libdir/JGI";
      and open $pkgdir/contact/DeepMSA2/scripts/DeepMSA2_IMG.pl and change
      the variable
         my $JGI="$ITlibdir/JGI";
      Also make sure the following variables in
      $pkgdir/contact/DeepMSA2/scripts/DeepMSA2_IMG.pl are set correctly:
         my $hhblitsdb="$ITlibdir/uniclust30/uniclust30_2017_04";
         my $jackhmmerdb="$ITlibdir/uniref90/uniref90.fasta";
         my $hhblits3db="$ITlibdir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt";

      To use D-I-TASSER-AF2, you need to make sure that
      $pkgdir/thirdparty/alphafold2/run_alphafold_msa_benchmark.sh runs
      correctly on your system.
      Please first go to the $pkgdir/thirdparty/alphafold2/ folder and
      follow the readme file there to install python and the libraries for
      AlphaFold2. After you install the python for AlphaFold2, please
      change the following variable in
      $pkgdir/thirdparty/alphafold2/run_alphafold_msa_benchmark.sh
         af2pythondir=/nfs/amino-home/zhng/local_library/miniconda3

      To run AlphaFold2, you need three additional databases in the $ITLIB
      folder:
         params
         pdb_mmcif
         pdb70
      Please download them with
         $pkgdir/thirdparty/alphafold2/scripts/download_alphafold_params.sh
         $pkgdir/thirdparty/alphafold2/scripts/download_pdb70.sh
         $pkgdir/thirdparty/alphafold2/scripts/download_pdb_mmcif.sh

      If you run the command "ls $ITLIB/pdb70" you should see
         md5sum
         pdb70_a3m.ffdata
         pdb70_a3m.ffindex
         pdb70_clu.tsv
         pdb70_cs219.ffdata
         pdb70_cs219.ffindex
         pdb70_from_mmcif_220313.tar.gz
         pdb70_hhm.ffdata
         pdb70_hhm.ffindex
         pdb_filter.dat
         seq (created by you)
      To run the benchmark mode of D-I-TASSER-AF2, go to the $ITLIB/pdb70
      folder and run the command
      "mkdir -p seq ; python2 $pkgdir/thirdparty/alphafold2/get-hhmseq.py pdb70_hhm.ffindex seq"

      If you run the command "ls $ITLIB/pdb_mmcif" you should see
         mmcif_files
         obsolete.dat
      If you run the command "ls $ITLIB/params" you should see
         params_model_1.npz
         params_model_2.npz
         params_model_3.npz
         params_model_4.npz
         params_model_5.npz

   d) Updates:
      (i)  New MSA construction pipelines, DeepMSA2 and DeepMSA2-IMG, are
           included in version 2.0.
      (ii) A new protein folding pipeline, D-I-TASSER-AF2, is included in
           version 2.0.
           The D-I-TASSER-AF2 pipeline combines D-I-TASSER with AlphaFold2
           in two ways:
           (i)  The top AlphaFold2 models, ranked by the default quality
                assessment pipeline included in AlphaFold2, are added to
                D-I-TASSER as additional templates, together with 220
                templates generated by the 11 component servers of LOMETS3,
                where each server generates its 20 top templates sorted by
                the Z-score of that threading algorithm. The top 10
                templates are finally selected from the 240 templates based
                on the scoring function.
           (ii) AlphaFold2-predicted contact and distance maps are combined
                with the DeepPotential- and AttentionPotential-predicted
                contact and distance maps, and the final contacts and
                distances are selected from them using their respective
                scoring functions.

3. Bug report:

   Please report bugs and post suggestions at the D-I-TASSER message board:
   http://zhanglab.dcmb.med.umich.edu/forum

#######################################################
#                                                     #
#  4. Installation and implementation of D-I-TASSER   #
#                                                     #
#######################################################

4.1. Introduction of D-I-TASSER

   D-I-TASSER (Distance-guided Iterative Threading ASSEmbly Refinement) is
   a new method extended from I-TASSER for high-accuracy protein structure
   and function predictions. Starting from a query sequence, D-I-TASSER
   first generates inter-residue distance/contact/hydrogen-bond maps using
   multiple deep neural-network predictors, including AttentionPotential,
   DeepPotential, ResTriplet, ResPLM, DeepPLM, ResPRE, TripletRes, and
   AlphaFold2 (optional). It then identifies structural templates from the
   PDB with the multiple-threading approach LOMETS3, and assembles
   full-length atomic models by replica-exchange Monte Carlo simulations
   guided by the contact/distance/hydrogen-bond maps.
   Large-scale benchmark tests showed that D-I-TASSER generates
   significantly more accurate models than I-TASSER, especially for
   sequences that do not have homologous templates in the PDB.

   For function annotation, the D-I-TASSER structure model is matched
   against the function library (BioLiP) to identify functional templates.
   The biological insights (including ligand binding, enzyme
   classification, and gene ontology) are inferred from the functional
   templates by COACH based on the consensus of predictions from COFACTOR,
   TM-SITE and S-SITE.

4.2. How to run D-I-TASSER?

   a) The main script for running D-I-TASSER is
      $pkgdir/I-TASSERmod/runI-TASSER.pl. Running it without arguments
      prints the help information.

   b) The following arguments must be set (mandatory arguments). One
      example is:
      "$pkgdir/I-TASSERmod/runI-TASSER.pl -libdir /home/yourname/ITLIB -seqname example -datadir /home/yourname/D-I-TASSER-2.0/example"
      -libdir   the path of the template libraries
      -seqname  the unique name of your query sequence
      -datadir  the directory which contains your sequence

   c) Other arguments are optional and have default values. Users can
      reset one or more of them. One example of command line is:

   ==================
   Optional arguments:
   ==================
   -runstyle  default value is "serial", which runs the D-I-TASSER
              simulations sequentially. "parallel" runs the D-I-TASSER
              simulations in parallel, distributed over cluster nodes
              using the PBS/torque job scheduling system. "gnuparallel"
              runs the D-I-TASSER simulations in parallel, distributed
              over multiple cores of one computer using GNU parallel.
   -homoflag  [real, benchmark], "real" will use all templates,
              "benchmark" will exclude homologous templates
   -idcut     sequence identity cutoff for "benchmark" runs, default value
              is 0.3, range is in [0,1]
   -ntemp     number of top templates output for each threading program,
              default is 20, range is in [1,50]
   -nmodel    number of final models output by D-I-TASSER, default value
              is 5, range is in [1,10]
   -LBS       [true or false], whether to predict ligand-binding site
              (default: false)
   -EC        [true or false], whether to predict EC number
              (default: false)
   -GO        [true or false], whether to predict GO terms
              (default: false)
   -traj      [true or false], whether to deposit the trajectory files
              (default: true)
   -light     [true or false], whether to run quick simulations
              (default: false)
   -hours     maximum hours of simulations (default=5 when -light=true)
   -outdir    where the final results should be saved (default value is
              set to datadir)
   -itmode    what kind of simulation is used: "IT" for I-TASSER, "CIT"
              for C-I-TASSER, "DIT" for D-I-TASSER (default), "DIT-AF2"
              for D-I-TASSER-AF2. If DIT-AF2 is selected, please first go
              to the thirdparty folder and follow the readme file to make
              AlphaFold2 (run_alphafold_msa_benchmark.sh) runnable on your
              system.
   -msapipe   what kind of MSA pipeline is used: "DeepMSA" for the DeepMSA
              pipeline, "DeepMSA2" for the DeepMSA2 pipeline without IMG
              database searching, "DeepMSA-IMG" for the DeepMSA2 pipeline
              with IMG database searching (requires downloading or
              building the IMG/JGI database and runs for a very long time
              on a single CPU)
   -Nmsa      how many sequences of the MSA are used for the MSA
              transformer and attention, range is in [1,1024], default=512

   ======================
   Tips for path setting:
   ======================
   -pkgdir:    directory of the D-I-TASSER suite.
               Go to the I-TASSERmod folder and enter the command "pwd".
               You may get a message like
               /home/myname/D-I-TASSER-2.0/I-TASSERmod
               then the path is /home/myname/D-I-TASSER-2.0
   -libdir:    directory of D-I-TASSER library.
               Go to the MTX folder and enter the command "pwd".
               You may get a message like
               /home/myname/ITLIB/MTX
               then the path is /home/myname/ITLIB
   -java_home: enter the command "which java". You may get a path like
               /usr/bin/java, then the path is /usr
   -python2:   path to python 2, for example /bin/python
   -python3:   path to python 3 for contact/distance/HB prediction, which
               needs to support pytorch >=1.7.0, for example /bin/python3
   -seqname:   this name must be different for different targets so that
               you can run multiple jobs at the same time.
   -datadir:   the directory where your input sequence "seq.fasta" is
               located. When you run multiple jobs, different targets
               need to be put under different folders.

   We suggest testing your installation first with a short sequence (e.g.,
   about 30 residues) before running production jobs for your proteins.

   An example command for running D-I-TASSER using a sequence "seq.fasta" under the folder /home/myname/data/example

   NOTE:
   a) Outline of steps for running D-I-TASSER by 'runI-TASSER.pl':
      a1) standardize 'seq.fasta' to 'seq.txt' and get the sequence length
      a2) run 'deepmsa' to generate the deep multiple sequence alignment
          run 'psiblast' to generate 'chk', 'out', 'pssm', 'mtx' files
          run 'PSSpred' to get 'seq.dat', 'seq.dat.ss'
          run 'solve' to get 'exp.dat'
          run 'pairmod' to get 'pair1.dat' and 'pair3.dat'
      a3) run 'alphafold2' (optional), 'attentionpotential',
          'deeppotential', 'restriplet', 'tripletres', 'respre', 'resplm'
          and 'deepplm' to predict contact/distance/HB maps
      a4) run 'LOMETS3' threading programs sequentially
          run 'mkinit.pl' to generate restraints
          run 'prepare.pl' to get additional energy potentials
      a5) run D-I-TASSER simulation
      a6) run SPICKER clustering program
          run 'get_cscore.pl' to get confidence score
          run 'EMrefinement.pl' to get full-atomic models
          run 'get_rsq_bfp.pl' to get local accuracy and B-factor
          estimations
      a7) run 'runCOACH.pl' to generate ligand-binding site, EC number and
          GO term predictions.
   b) 'seq.fasta' is the query sequence file in FASTA format, which is the
      only input file needed for running D-I-TASSER. This file should be
      put in $datadir before running the job.
   c) D-I-TASSER structure assembly simulations contain multiple
      independent runs, the number of which is decided by protein type.
      This number can be modified if the user wants to run more
      simulations, especially for big proteins without good templates.
   d) If working on a cluster with multiple nodes, it is recommended to
      set $runstyle="parallel". You need to have an SBATCH server
      installed on your system. Parallel jobs will run faster since jobs
      are distributed among different nodes. The default setting
      $runstyle="serial" will run all the jobs on a single computer.
   e) If the job has been partially executed and encountered an error, you
      can rerun the main script without modification. It will check the
      existing files and resume from the correct position.

4.3. System requirement:

   a) x86_64 machine, Linux kernel OS, free disk space of more than 60G.
   b) Perl and Java interpreters should be installed. GO:Parser should be
      installed if you want to predict GO terms.
   c) Basic compression and decompression packages should be installed to
      support tar and bunzip2.
   d) If you are using computer clusters, the job management software PBS
      server should support 'qsub' and 'qstat'. If using other job
      management software, such as SGE or Slurm, some changes should be
      made following the instructions at:
      http://zhanglab.dcmb.med.umich.edu/bbs/?q=node/3561

4.4. How to cite D-I-TASSER and D-I-TASSER Suite?

   1. Wei Zheng, Yang Li, Qiqige Wuyun, Xiaogen Zhou, Chengxin Zhang,
      Robin Pearce, Eric W. Bell, Yiheng Zhu, Yang Zhang. Integrating deep
      neural network models with I-TASSER for accurate protein structure
      prediction, in preparation.
   2. Wei Zheng, Yang Li, Chengxin Zhang, Xiaogen Zhou, Robin Pearce,
      Eric W. Bell, Xiaoqiang Huang, Yang Zhang. Protein structure
      prediction using deep learning distance and hydrogen-bonding
      restraints in CASP14. Proteins. 2021; 1-18.
   3. Wei Zheng, Chengxin Zhang, Yang Li, Robin Pearce, Eric W. Bell,
      Yang Zhang. Folding non-homology proteins by coupling deep-learning
      contact maps with I-TASSER assembly simulations. Cell Reports
      Methods, 1: 100014 (2021).
   4. Wei Zheng, Yang Li, Chengxin Zhang, Robin Pearce, S. M. Mortuza,
      Yang Zhang. Deep-learning contact-map guided protein structure
      prediction in CASP13. Proteins: Structure, Function, and
      Bioinformatics, 87: 1149-1164 (2019).
   5. Y Zhang. I-TASSER server for protein 3D structure prediction. BMC
      Bioinformatics, 9: 40 (2008).
   6. A Roy, A Kucukural, Y Zhang. I-TASSER: a unified platform for
      automated protein structure and function prediction. Nature
      Protocols, 5: 725-738 (2010).
   7. J Yang, R Yan, A Roy, D Xu, J Poisson, Y Zhang. The I-TASSER Suite:
      Protein structure and function prediction. Nature Methods, 12: 7-8
      (2015).

###############################################################################
#                                                                             #
#    5. Installation and implementation of contact/distance/HB predictors     #
#                                                                             #
###############################################################################

5.1. Introduction of DeepPotential, AttentionPotential, ResTriplet,
     TripletRes, ResPRE, ResPLM and DeepPLM

   DeepPotential is a deep-learning-based contact/distance/hydrogen-bond
   predictor using three co-evolutionary features: the covariance matrix
   (COV) proposed by DeepCov; the precision matrix (PRE) formulated by
   ResPRE; and the coupling parameters of the inverse Potts model obtained
   through pseudolikelihood maximization (PLM).

   AttentionPotential is an improved model that can predict various
   inter-residue geometry potentials. In the AttentionPotential model, the
   co-evolutionary information is extracted directly by an attention
   mechanism that models the interactions between residues, instead of
   using the precomputed evolutionary coefficients of DeepPotential.

   TripletRes and ResTriplet are deep-learning-based contact predictors
   using three co-evolutionary features: the covariance matrix (COV)
   proposed by DeepCov; the precision matrix (PRE) formulated by ResPRE;
   and the coupling parameters of the inverse Potts model obtained through
   pseudolikelihood maximization (PLM).

   ResPRE is our in-house contact-map predictor, which consists of two
   consecutive steps: precision-matrix-based feature generation and deep
   residual neural-network-based contact inference.

   ResPLM is also an in-house contact-map predictor similar to ResPRE. The
   only difference is that ResPLM was trained using the PLM feature.

   DeepPLM is our in-house contact-map prediction approach that has the
   same deep-learning architecture as ResPRE, except that it uses
   different features, which are generated by CCMpred.

5.2. How to install those programs?

   When you unpack the D-I-TASSER Suite, the AttentionPotential,
   DeepPotential, ResTriplet, TripletRes, ResPRE, ResPLM and DeepPLM
   programs are already installed.

5.3. How to cite the contact predictors?

   If you are using the TripletRes program, you can cite:
   Yang Li, Chengxin Zhang, Eric W Bell, Wei Zheng, Dongjun Yu, Yang
   Zhang. Deducing high-accuracy protein contact-maps from a triplet of
   coevolutionary matrices through deep residual convolutional networks.
   Submitted, 2020.

   If you are using the ResTriplet program, you can cite:
   Yang Li, Chengxin Zhang, Eric W. Bell, Dongjun Yu, Yang Zhang.
   Ensembling multiple raw coevolutionary features with deep residual
   neural networks for contact-map prediction in CASP13. Proteins:
   Structure, Function, and Bioinformatics, 87: 1082-1091 (2019).

   If you are using the ResPRE program, you can cite:
   Yang Li, Jun Hu, Chengxin Zhang, Dong-Jun Yu, and Yang Zhang. ResPRE:
   high-accuracy protein contact prediction by coupling precision matrix
   with deep residual neural networks. Bioinformatics, 35: 4647-4655
   (2019).

   If you are using the ResPLM and DeepPLM programs, you can cite:
   Wei Zheng, Yang Li, Chengxin Zhang, Robin Pearce, S. M. Mortuza, Yang
   Zhang. Deep-learning contact-map guided protein structure prediction in
   CASP13. Proteins: Structure, Function, and Bioinformatics, 87:
   1149-1164 (2019).

#######################################################################
#                                                                     #
#  6. Installation and implementation of DeepFold and PotentialFold   #
#                                                                     #
#######################################################################

6.1. Introduction of DeepFold and PotentialFold

   DeepFold is a deep-learning-based method for ab initio protein
   structure prediction. Starting from a query sequence, it first collects
   multiple sequence alignments (MSAs) from whole- and meta-genome
   sequence libraries. Spatial restraints (contact/distance maps and
   inter-residue orientations) are then predicted by DeepPotential.
   Finally, full-length structural models are constructed using an L-BFGS
   folding algorithm.

   PotentialFold is a program for protein structure prediction based on
   predicted inter-residue geometry, and is similar to DeepFold.

6.2. How to install DeepFold and PotentialFold program?

   When you unpack the D-I-TASSER Suite, the DeepFold and PotentialFold
   programs are already installed.

6.3. How to cite DeepFold and PotentialFold?

   If you are using the DeepFold program, you can cite:
   Robin Pearce, Yang Li, Gilbert S. Omenn, Yang Zhang. Fast and Accurate
   Ab Initio Protein Structure Prediction Using Deep Learning Potentials.
   Submitted, 2021.

   If you are using the PotentialFold program, you can cite:
   Yang Li, Chengxin Zhang, Dong-Jun Yu, Yang Zhang. Deep learning
   geometrical potential for high-accuracy ab initio protein structure
   prediction. Submitted, 2021.

#######################################################
#                                                     #
#    7. Installation and implementation of MUSTER     #
#                                                     #
#######################################################

7.1. Introduction of MUSTER

   MUSTER (MUlti-Sources ThreadER) is a protein threading algorithm to
   identify template structures from the PDB library. It generates
   sequence-template alignments by combining sequence profile-profile
   alignment with multiple sources of structural information.

7.2. How to install MUSTER program?

   When you unpack the D-I-TASSER Suite, the MUSTER program is already
   installed.

7.3. How to cite MUSTER?

   If you are using the MUSTER program, you can cite:
   S Wu, Y Zhang. MUSTER: Improving protein sequence profile-profile
   alignments by using multiple sources of structure information.
   Proteins, 72: 547-556 (2008).

#######################################################
#                                                     #
#  8. Installation and implementation of CEthreader   #
#                                                     #
#######################################################

8.1. Introduction of CEthreader

   CEthreader is a novel threading algorithm, which first predicts
   residue-residue contacts by coupling evolutionary precision matrices
   with deep residual convolutional neural networks. The predicted contact
   maps are then integrated with sequence profile alignments to recognize
   structural templates from the PDB.

8.2. How to install CEthreader program?

   When you unpack the D-I-TASSER Suite, the CEthreader program is already
   installed.

8.3. How to cite CEthreader?

   If you are using the CEthreader program, you can cite:
   W Zheng, Q Wuyun, Y Li, SM Mortuza, C Zhang, R Pearce, J Ruan, Y Zhang.
   Detecting distant-homology protein structures by aligning deep
   neural-network based contact maps. PLOS Computational Biology, 15:
   e1007411 (2019).

#######################################################
#                                                     #
#    9. Installation and implementation of LOMETS3    #
#                                                     #
#######################################################

9.1. Introduction of LOMETS3

   LOMETS3 (Local Meta-Threading-Server) is a meta-server approach to
   protein fold-recognition.
   It consists of 15 individual threading programs: DeepFold2
   (DeepFold+AttentionPotential), PotentialFold2
   (PotentialFold+AttentionPotential), DeepFold (DeepFold+DeepPotential),
   PotentialFold (PotentialFold+DeepPotential), CEthreader, mCEthreader,
   eCEthreader, MUSTER, PPA, dPPA, dPPA2, sPPA, wPPA, wdPPA, wMUSTER.
   mCEthreader and eCEthreader are variants of CEthreader that use
   different scoring functions. The last 7 programs are variants of MUSTER
   that use different optimized energy terms.

9.2. How to install LOMETS3 program?

   When you unpack the D-I-TASSER Suite, the LOMETS3 programs are already
   installed.

9.3. How to run LOMETS3 program?

   The LOMETS3 main script is $pkgdir/I-TASSERmod/runLOMETS.pl. The
   running options of this program are similar to those of
   'runI-TASSER.pl'. Running the program without arguments prints all the
   running options.

9.4. How to cite LOMETS3?

   If you are using the LOMETS3 program, you can cite:
   Wei Zheng, Chengxin Zhang, Qiqige Wuyun, Robin Pearce, Yang Li, Yang
   Zhang. LOMETS2: improved meta-threading server for fold-recognition and
   structure-based function annotation for distant-homology proteins.
   Nucleic Acids Research, 47: W429-W436 (2019).
   S Wu, Y Zhang. LOMETS: A local meta-threading-server for protein
   structure prediction. Nucleic Acids Research, 35: 3375-3382 (2007).

#######################################################
#                                                     #
#   10. Installation and implementation of DeepMSA    #
#                                                     #
#######################################################

10.1. Introduction of DeepMSA

   DeepMSA is a new open-source method for sensitive MSA construction, in
   which homologous sequences and alignments are created from multiple
   sources of whole-genome and metagenome databases through complementary
   hidden Markov model algorithms.

10.2. How to install DeepMSA program?

   When you unpack the D-I-TASSER Suite, the DeepMSA program is already
   installed.

10.3. How to run DeepMSA program?

   The DeepMSA main script is $pkgdir/contact/DeepMSA/scripts/build_MSA.py.
   The running options of this program are similar to those of
   runI-TASSER.pl. Running the program without arguments prints all the
   running options.

10.4. How to cite DeepMSA?

   If you are using the DeepMSA program, you can cite:
   C Zhang, W Zheng, S M Mortuza, Y Li, Y Zhang. DeepMSA: constructing
   deep multiple sequence alignment to improve contact prediction and
   fold-recognition for distant-homology proteins. Bioinformatics, 36:
   2105-2112 (2020).

#######################################################
#                                                     #
#   11. Installation and implementation of SPICKER    #
#                                                     #
#######################################################

11.1. Introduction of SPICKER

   SPICKER is a clustering algorithm to identify the near-native models
   from a pool of protein structure decoys.

11.2. How to install SPICKER program?

   When you unpack the D-I-TASSER Suite, the SPICKER program is already
   installed at $pkgdir/I-TASSERmod/spicker50

11.3. How to run SPICKER program?

   To run SPICKER, you need to prepare the following input files:
   'rmsinp'---Mandatory, length of protein & piece for RMSD calculation;
   'seq.dat'--Mandatory, sequence file, for output of PDB models.
   'tra.in'---Mandatory, list of trajectory names used for clustering.
              In the first line of 'tra.in', there are 3 parameters:
              par1: number of decoy files
              par2: 1, default cutoff, best for decoys from template-based
                       modeling;
                   -1, cutoff based on variation, best for decoys from
                       ab initio modeling.
              par3: 1, closc from all decoys;
                   -1, closc from clustered decoys
              From the second line onward are the names of the files that
              contain the coordinates of the 3D structure decoys. All
              these files are mandatory. See the attached 'rep1.tra1' for
              the format of decoys.
   'CA'-------Optional, native structure, for comparison to native.
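The layout of 'tra.in' described above can be generated with a few lines of shell. This is a minimal sketch: write_tra_in is a hypothetical helper, not a script shipped with SPICKER; it assumes the three first-line parameters are whitespace-separated, and 'rep2.tra1' is a placeholder file name.

```shell
# Write a SPICKER 'tra.in' for a set of decoy trajectory files
# (illustrative sketch; write_tra_in is a hypothetical helper).
# Usage: write_tra_in PAR2 PAR3 FILE [FILE ...]
write_tra_in() {
    par2="$1"
    par3="$2"
    shift 2
    # First line: par1 (number of decoy files), then par2 and par3.
    echo "$# $par2 $par3"
    # Following lines: one trajectory file name per line.
    for f in "$@"; do
        echo "$f"
    done
}

# Example: two trajectory files from template-based modeling (par2=1),
# taking closc from all decoys (par3=1).
write_tra_in 1 1 rep1.tra1 rep2.tra1 > tra.in
cat tra.in
```

With the example call, the first line of 'tra.in' is "2 1 1", followed by the two file names. For decoys from ab initio modeling, pass -1 as the first argument instead.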

   Output files of SPICKER include:
   'str.txt'-----list of structures in each cluster;
   'combo*.pdb'--PDB format of cluster centroids;
   'closc*.pdb'--PDB format of structures closest to centroids;
   'rst.dat'-----summary of clustering results;

   A detailed readme file can be found at
   http://zhanglab.dcmb.med.umich.edu/SPICKER/readme

11.4. How to cite SPICKER?

   If you are using the SPICKER program, you can cite:
   Y Zhang, J Skolnick, SPICKER: Approach to clustering protein structures
   for near-native model selection, Journal of Computational Chemistry,
   25: 865-871 (2004).

#######################################################
#                                                     #
#     12. Installation and implementation of HAAD     #
#                                                     #
#######################################################

12.1. Introduction of HAAD

   HAAD is a computer algorithm for constructing hydrogen atoms from
   protein heavy-atom structures. The hydrogen atoms are added by
   minimizing atomic overlap and encouraging hydrogen bonding.

12.2. How to install HAAD program?

   When you unpack the D-I-TASSER Suite, the HAAD program is already
   installed at $pkgdir/abs/mybin/HAAD

12.3. How to run HAAD program?

   Hydrogen atoms can be added to a PDB file (xx.pdb) by running
   "./HAAD xx.pdb"; the output is "xx.pdb.h". In "xx.pdb.h", column 57
   holds a label for the atoms that have been added by HAAD. When the
   value of the label is less than 2, the position of the added atom has
   higher confidence.

12.4. How to cite HAAD?

   If you are using the HAAD program, you can cite:
   Y Li, A Roy, Y Zhang, HAAD: A Quick Algorithm for Accurate Prediction
   of Hydrogen Atoms in Protein Structures, PLoS One, 4: e6701 (2009).

#######################################################
#                                                     #
#   13. Installation and implementation of EDTSurf    #
#                                                     #
#######################################################

13.1. Introduction of EDTSurf

   EDTSurf is a program to construct triangulated surfaces for
   macromolecules.
It generates three major macromolecular surfaces: van der Waals surface, solvent-accessible surface and molecular surface (solvent-excluded surface). EDTSurf also identifies cavities inside macromolecules.

13.2. How to install EDTSurf program?

When you unpack the D-I-TASSER Suite, the EDTSurf program is already installed at $pkgdir/bin/EDTSurf

13.3. How to use EDTSurf program?

    EDTSurf -i inputfile ...

Specific options:

    -o  prefix of output files (default is the prefix of inputfile)
    -t  triangulation type, 1-MC 2-VCMC (default is 2)
    -s  surface type, 1-VWS 2-SAS 3-MS (default is 3)
    -c  color mode, 1-pure 2-atom 3-chain (default is 2)
    -p  probe radius, float point in [0,2.0] (default is 1.4)
    -h  inner or outer surface for output, 1-inner and outer 2-outer 3-inner (default is 1)
    -f  scale factor, float point in (0,20.0] (default is 4.0)
        The molecule is scaled by this factor to fit in a bounding box. The larger the scale factor the better, but a larger value increases memory use. Our strategy is to first enlarge the molecule and check whether it exceeds the maximum bounding box; if it does, the scale factor is reset so that the molecule fits in the maximum bounding box.

Running EDTSurf by itself prints a brief description of how to use the program. A detailed description of EDTSurf is available at http://zhanglab.dcmb.med.umich.edu/EDTSurf/

13.4. How to cite EDTSurf?

If you are using the EDTSurf program, you can cite:

    D Xu, Y Zhang, Generating Triangulated Macromolecular Surfaces by Euclidean Distance Transform. PLoS ONE 4: e8140 (2009).

#######################################################
#                                                     #
#  14. Installation and implementation of ModRefiner  #
#                                                     #
#######################################################

14.1. Introduction of ModRefiner

ModRefiner is a standalone program for atomic-level protein structure construction and refinement. It includes two steps: (1) construct main-chain models from C-alpha trace; (2) build side-chain models and atomic-level structure refinement.
14.2. How to install ModRefiner program?

When you unpack the D-I-TASSER Suite, the ModRefiner program is already installed at $pkgdir/I-TASSERmod/ModRefiner.pl

14.3. How to use ModRefiner program?

ModRefiner supports the following four options:

    a) add side-chain heavy atoms to main-chain model without refinement
       > ModRefiner.pl 1 ID MD IM ON
    b) build main-chain model from C-alpha trace model
       > ModRefiner.pl 2 ID MD IM RM ON
    c) build full-atomic model from main-chain model
       > ModRefiner.pl 3 ID MD IM RM ON
    d) build full-atomic model from C-alpha trace model
       > ModRefiner.pl 4 ID MD IM RM ON

    ID: the path of the D-I-TASSER package, e.g. '/home/yourname/D-I-TASSER-1.0'
    MD: directory which contains the initial model, e.g. '/home/yourname/D-I-TASSER/5.0/example'
    IM: the initial model to be refined, e.g. 'model1.pdb'
    RM: reference model that the refined model is driven to, e.g. 'combo1.pdb'. Only the CA trace is needed, and the reference need not cover the full length; regions missing from the reference are refined flexibly. If you don't have a reference model, use the name of IM instead.
    ON: the output name of the refined model, e.g. 'model1_ref.pdb'

Running the program without arguments prints a brief description of how to use the program.

14.4. How to cite ModRefiner?

If you are using the ModRefiner program, you can cite:

    D Xu, Y Zhang. Improving the Physical Realism and Structural Accuracy of Protein Models by a Two-step Atomic-level Energy Minimization. Biophysical Journal, 101: 2525-2534 (2011).

#######################################################
#                                                     #
#  15. Installation and implementation of NWalign     #
#                                                     #
#######################################################

15.1. Introduction of NWalign

NW-align is a simple and robust alignment program for protein sequence-to-sequence alignments based on the standard Needleman-Wunsch dynamic programming algorithm. The mutation matrix is from BLOSUM62 with gap opening penalty=-11 and gap extension penalty=-1.

15.2. How to install NWalign program?
When you unpack the D-I-TASSER Suite, the NWalign program is already installed at $pkgdir/bin/align.

15.3. How to use NWalign program?

    > align F1.fasta F2.fasta      (align two sequences in fasta file)
    > align F1.pdb F2.pdb 1        (align two sequences in PDB file)
    > align F1.fasta F2.pdb 2      (align Sequence 1 in fasta and 2 in pdb)
    > align GKDGL EVADELVSE 3      (align sequences typed by keyboard)
    > align GKDGL F.fasta 4        (align Seq-1 by keyboard and 2 in fasta)
    > align GKDGL F.pdb 5          (align Seq-1 by keyboard and 2 in pdb)

Running the program by itself prints its usage options.

15.4. How to cite NWalign?

There is no published paper associated with this program. If you are using the NWalign program, you can cite it as

    Y Zhang, http://zhanglab.dcmb.med.umich.edu/NW-align

#######################################################
#                                                     #
#  16. Installation and implementation of PSSpred     #
#                                                     #
#######################################################

16.1 Introduction of PSSpred

PSSpred (Protein Secondary Structure PREDiction) is a simple neural network training algorithm for accurate protein secondary structure prediction. It first collects multiple sequence alignments using PSI-BLAST. Amino-acid frequency and log-odds data with Henikoff weights are then used, separately, to train secondary structure predictors based on the Rumelhart error back propagation method. The final secondary structure prediction result is a combination of 7 neural network predictors from different profile data and parameters.

16.2 How to install PSSpred program?

When you unpack the D-I-TASSER Suite, the PSSpred program is already installed at $pkgdir/PSSpred

16.3 How to use PSSpred program?

    $pkgdir/PSSpred/mPSSpred.pl seq.txt $pkgdir $libdir

Please note that 'seq.txt' should be in the current directory, and the script will generate two files, 'seq.dat' and 'seq.dat.ss', in the current folder. Here, $pkgdir is the root path of the D-I-TASSER package.

16.4 How to cite PSSpred?
If you are using the PSSpred program, you can cite:

    http://zhanglab.dcmb.med.umich.edu/PSSpred

#######################################################
#                                                     #
#  17. Installation and implementation of COFACTOR    #
#                                                     #
#######################################################

17.1 Introduction of COFACTOR

COFACTOR is a structure-based method for biological function annotation of protein molecules. COFACTOR threads the structure through three comprehensive function libraries by local and global structure matches to identify functional sites and homology. Functional insights, including ligand-binding site, gene-ontology terms and enzyme classification, will be derived from the best functional homology template. The COFACTOR algorithm was ranked as the best method for function prediction in the community-wide CASP9 experiments.

17.2 How to install COFACTOR program?

When you unpack the D-I-TASSER Suite, the COFACTOR program is already installed at $pkgdir/COFACTOR

17.3 How to use COFACTOR program?

    $pkgdir/I-TASSERmod/runCOFACTOR.pl

17.4 How to interpret the results

If your input data is at $datadir/model1.pdb, the output of COFACTOR will be at $datadir/model1/cofactor:

    (1) List of similar structures in PDB: similarpdb_model1.lst.
        The columns are (PDB_ID, TM-score, RMSD, Cov, Seq_id)
    (2) Ligand-binding sites: BSITE_model1/Bsites_model1.dat.
        The columns are (Rank, C-score, PDB_ID, TM-score, RMSD, Seq_id, Cov, Lig_name, SITE_num, BS-score, LTM, BS_ID, BS_cov, BS_err, BS_ID1, BS_ID2, Binding residues)
    (3) EC number: ECsearchresult_model1.dat.
        The columns are (PDB_ID, TM-score, RMSD, Seq_ID, Cov, EC-score, EC number, Active site residues)
    (4) GO terms: GOsearchresult_model1.dat.
        The columns are (PDB_ID, TM-score, RMSD, Seq_ID, Cov, GO-score, GO terms)

17.5 How to cite COFACTOR?

If you are using the COFACTOR program, you can cite:

    1. A Roy, J Yang, Y Zhang. COFACTOR: An accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Research, 40:W471-W477 (2012).
    2. J Yang, A Roy, Y Zhang. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41: D1096-D1103 (2013).

#######################################################
#                                                     #
#  18. Installation and implementation of COACH       #
#                                                     #
#######################################################

18.1 Introduction of COACH

COACH is a meta-server approach to protein function annotations. Starting from a given structure of the target protein, COACH will generate complementary ligand binding site predictions using two comparative methods: TM-SITE and S-SITE, which recognize ligand-binding templates from the BioLiP protein function database by binding-specific substructure and sequence profile comparisons. These predictions will be combined with results from COFACTOR to generate multiple function annotations, including ligand-binding sites, enzyme commission and gene ontology terms.

18.2 How to install COACH program?

When you unpack the D-I-TASSER Suite, the COACH program is already installed at $pkgdir/COACH

18.3 How to use COACH program?

    $pkgdir/I-TASSERmod/runCOACH.pl

18.4 How to interpret the results

If your input data is at $datadir/model1.pdb, the output of COACH will be at $datadir/model1/coach:

    (1) Ligand-binding sites: Bsites.dat.
        The columns are (C-score, cluster_density, product_of_top_templates_zscore, Binding residues)
    (2) Detailed clustering information: Bsites.inf, Bsites.clr, which list the templates used in the cluster that generates the prediction in (1).
    (3) Ligand-protein complex structures are named CH_complex*.pdb
    (4) Predictions from COFACTOR, TM-SITE, and S-SITE are at, respectively:
        $datadir/model1/cofactor
        $datadir/model1/tmsite
        $datadir/ssite

18.5 How to cite COACH?

If you are using the COACH program, you can cite:

    1. J Yang, A Roy, Y Zhang. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics, 29:2588-2595 (2013).
    2. J Yang, A Roy, Y Zhang. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41: D1096-D1103 (2013).

#######################################################
#                                                     #
#  19. Installation and implementation of TM-SITE     #
#                                                     #
#######################################################

19.1 Introduction of TM-SITE

TM-SITE is a structure-based approach to protein-ligand binding site prediction. Structure alignment between query and BioLiP templates is performed on binding-specific substructure using TM-align. The final ligand-binding sites are collected based on the clustering of multiple templates.

19.2 How to install TM-SITE program?

When you unpack the D-I-TASSER Suite, the TM-SITE program is already installed at $pkgdir/COACH

19.3 How to interpret the results

If your input data is at $datadir/model1.pdb, the output of TM-SITE will be at $datadir/model1/tmsite:

    (1) Ligand-binding sites: Bsites.dat.
        The columns are (C-score, top_templates_zscore, JSD_score, cluster_density, Binding residues)
    (2) Detailed clustering information: Bsites.inf, Bsites_lig.clr, which list the templates used in the cluster that generates the prediction in (1).
    (3) Ligand-protein complex structures are named complex*.pdb

19.4 How to cite TM-SITE?

If you are using the TM-SITE program, you can cite:

    1. J Yang, A Roy, Y Zhang. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics, 29:2588-2595 (2013).
    2. J Yang, A Roy, Y Zhang. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41: D1096-D1103 (2013).

#######################################################
#                                                     #
#  20. Installation and implementation of S-SITE      #
#                                                     #
#######################################################

20.1 Introduction of S-SITE

S-SITE is a sequence-based approach to protein-ligand binding site prediction.
Binding-specific sequence profile-profile alignment is used to recognize homologous templates in BioLiP. The ligand-binding site predictions are collected from the clustering of multiple homologous templates.

20.2 How to install S-SITE program?

When you unpack the D-I-TASSER Suite, the S-SITE program is already installed at $pkgdir/COACH

20.3 How to interpret the results

If your input data is at $datadir/seq.fasta, then the output of S-SITE will be at $datadir/ssite:

    (1) Ligand-binding sites: Bsites_fpt.dat.
        The columns are (C-score, top_templates_zscore, cluster_density, cluster_density1, JSD_score, Binding residues)
    (2) Detailed clustering information: Bsites_fpt.clr, which lists the templates used in the cluster that generates the prediction in (1).

20.4 How to cite S-SITE?

If you are using the S-SITE program, you can cite:

    1. J Yang, A Roy, Y Zhang. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics, 29:2588-2595 (2013).
    2. J Yang, A Roy, Y Zhang. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41: D1096-D1103 (2013).

#######################################################
#                                                     #
#  21. Installation and implementation of ResQ        #
#                                                     #
#######################################################

21.1 Introduction of ResQ

ResQ is a method for estimating B-factor and residue-level quality in protein structure prediction, based on local variations of modelling simulations and the uncertainty of homologous alignments. Given a protein structure model, ResQ first identifies a set of homologous and/or analogous templates from the PDB by threading and structure alignment techniques.
The residue-level modeling errors are then derived by support vector regression that was trained on the local structural and alignment variations of the templates, with the B-factor of each residue deduced from the experimental records of the top homologous proteins.

21.2 How to install ResQ program?

When you unpack the D-I-TASSER Suite, the ResQ program is already installed at $pkgdir/ResQ.

21.3 How to use ResQ program?

There are two methods to run ResQ depending on how your models were generated.

    1) If your models were generated by D-I-TASSER, you can run the script $pkgdir/ResQ/runResQ_IT.pl to predict B-factor and local structure errors. The only argument required is the directory of the D-I-TASSER decoys. The header of this script gives more information about its input.
    2) If your models were not generated by D-I-TASSER, you can run the script $pkgdir/ResQ/runResQ.pl to predict B-factor and local structure errors. It will automatically run LOMETS2 to generate the threading alignment file 'init.dat'. LOMETS2 is included in this package.

21.4 What is the output of ResQ?

For D-I-TASSER models, the output of ResQ is:

    rsq_bfp_new.dat

For other models, the output of ResQ is:

    1) global.txt for global accuracy estimation
    2) local.txt for local error and B-factor estimation

21.5 How to cite ResQ?

If you are using the ResQ program, you can cite:

    1. J Yang, Y Wang, Y Zhang. ResQ: Approach to unified estimation of B-factor and residue-specific error in protein structure prediction, Journal of Molecular Biology, 428: 693-701 (2016).