Retrospective Docking ===================== The retrospective docking process is crucial for validating and refining docking setups before embarking on large-scale docking campaigns. We start with a set of known active molecules that bind our protein. We generate a set of "property-matched decoys" that have similar properties to the actives but are predicted to be inactive. We then optimize our docking setup to score the actives better than the decoys. The general optimizations that we do are: * Electrostatic and desolvation thin sphere scan * Matching sphere perturbations * Tarting (modifying atom partial charges) Typical Workflow ---------------- Compiling a List of Actives ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Try and compile a list of 20-50 of different molecules. We usually shoot for Tanimoto Similarity <0.35. Some places to look for known binders: * Chembl - check the pKi and only take those in the 7+ range * https://www.guidetopharmacology.org * PDB * Literature Generating Property-Matched Decoys ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (AFAIK these instructions will only work on normal gimel - working on that. This all really needs to be condensed into a nice github) Before running this, make sure you are in a ``sh`` shell and source the following: .. code-block:: bash source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh This procedure generates decoys for your actives by finding SMILES in ZINC20 that are property-matched. While not suggested, if you want to specifically look for already built molecules, please see `the wiki page `_. Steps to generate the property-matched decoys: 1. **Setting up directories** Before starting, you need a SMILES file with the format (SMILES, ID) and no header: .. code-block:: bash S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O active_1 You also need an input file named ``decoy_generation.in`` with the following lines: .. code-block:: bash SMILES YES PROTONATE YES MWT 0 125 LOGP 0 3.6 RB 0 5 HBA 0 4 HBD 0 3 CHARGE 0 2 LIGAND TC RANGE 0.0 0.35 MINIMUM DECOYS PER LIGAND 20 DECOYS PER LIGAND 50 MAXIMUM TC BETWEEN DECOYS 0.8 TANIMOTO YES GENERATE DECOYS 750 If your input ligand SMILES are already protonated, set ``PROTONATE NO``. ``SMILES YES`` incidactes that you want to query ZINC20 for SMILES, not already built molecules from ZINC15. The ``decoy_generation.in`` file specifies the maximum property windows allowable for decoys, but the procedure attempts to find and assign decoys with the closest properties to the actives. The example here means that decoys will be found that are .. code-block:: bash +/- 125 Daltons +/- 3.6 logP +/- 5 rotatable bonds +/- 4 hydrogen bond acceptors +/- 3 hydrogen bond donors +/- 2 charge +/- 0.35 or less Tanimoto The code will attempt to find ``GENERATE DECOYS`` decoys for each active and then will perform an assignment procedure to attempt to arrive at between ``MINIMUM DECOYS PER LIGAND`` and ``DECOYS PER LIGAND`` for each active. ``MAXIMUM TC BETWEEN DECOYS`` refers to the maximum Tanimoto Coefficient allowed between decoys (the higher, the more similar your decoys can be to each other). Once you have created the ``decoy_generation.in``, run the following command to create the decoy generation directory: .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME} Provide a directory name that you want in place of ``{NEW_DIR_NAME}``. This will create the directory with subdirectories named ``"ligand_${number}"`` for each of the active in the ``{SMILES_FILE}`` you input. 2. **Retrieving SMILES decoys from ZINC20** Run the following command to retrieve decoy SMILES for each active: .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME} Note that this procedure may take some time (potentially a few hours). *Known Bug:* Occassionally, this script may hang and be running for many hours. I'm not sure why this happens, but I typically just move onto the next step. Generally you will be able to assign decoys retrieved from other actives for actives where this procedure fails. 3. **Assigning decoys to actives** Once we retrieve decoys, we assign them to each active. *Note:* There is no scheduler submission script for the following code yet, and it may take a little bit to run. I recommend running it in a screen/tmux session so that you don't disconnect while it is running. .. code-block:: bash cd {NEW_DIR_NAME} source /nfs/home/bwhall61/.python_envs/pulp/bin/activate python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/filter_decoys.py If this step has completed successfully, you should see files in ``NEW_DIR_NAME`` with the format ``{LIGAND_ID}_final_property_matched_decoys.txt``. These files have the format: .. code-block:: none SMILES ZINC_ID logP #Rotatable_Bonds #HBond_Donors #HBond_Acceptors Charge Protomer_ID TC_TO_LIG. There should also be files with the format ``{LIGAND_ID}_replacements.txt`` which include extra property-matched decoys that were assigned to that active. If you don't get enough decoys, the ``decoy_generation.in`` file can be modified to change ``MAXIMUM TC BETWEEN DECOYS``, ``MINIMUM DECOYS PER LIGAND``, etc. To avoid rerunning some of the time consuming calculations change ``TANIMOTO YES`` to ``TANIMOTO NO`` before rerunning this step. 4. **Building the decoys + actives** *Note*: This must run on epyc or a slurm node (working on it... - docker on gimel/SGE nodes is too old) If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the `docker` group on the slurm nodes or building will not work. First connect to epyc, go back to your decoy directory, and resource the enviornment: .. code-block:: bash ssh epyc cd {DIR_YOU_WERE_IN} source /mnt/nfs/soft/dock/versions/dock385/env.sh Once decoys have been assigned, the decoys and actives need to be built for docking. First run: .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR} This will create ``COPY_TO_DIR`` with the SMILES files: .. code-block:: bash ligands.smi - this includes the input actives which have property matched decoys decoys.smi - this includes the canonicalized property-matched decoy SMILES decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES To build the decoys and actives run (usually we use a pH of 7.4): .. code-block:: bash bash /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/qsub_build_ligands.sh {COPY_TO_DIR} If you need more decoys, you can take more from the ``{LIGAND_ID}_replacements.txt`` files. 5. **Packaging decoys and actives for dockopt** Once your actives and decoys are done building, they need to packaged into .tgz files for docking. Run: .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/format_db2_for_dockopt.py {COPY_TO_DIR} The output will be two files inside ``{COPY_TO_DIR}`` : ``actives.tgz`` and ``decoys.tgz`` which will be used for retrospective docking. 6. **Visualizing decoy properties** *Note:* This needs a little bit of updating. Some of the plots treat discrete variables (HDB for ex) as continuous and make it difficult to compare the distributions. To visualize the distributions of molecular properties: .. code-block:: bash python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME} To visualize the Tanimoto similarity between the decoys and the input ligands run: .. code-block:: bash python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME} DockOpt ~~~~~~~ DockOpt generates many different docking configurations which are evaluated in parallel. The goal is to enhance the ability to score the actives better than the property-matched decoys. By default, DockOpt is used to try different combinations of matching spheres, desolvation thin spheres, and electrostatic thin spheres. For more info visit `the wiki page `_ 1. **Source environments:**: **Wynton** (Wynton environment needs to be updated with Brendan's pydock updates) .. code-block:: bash source /wynton/group/bks/soft/python_envs/env.sh **Gimel** (Any node other than ``gimel`` itself) .. code-block:: bash source /mnt/nfs/soft/dock/versions/dock385/env.sh *Note*: I've only ever tried this on gimel so if someone runs into issues on wynton please let me know. 2. **Create DockOpt Job** Ensure you are in the directory containing the required input files: - ``rec.pdb`` or ``rec.crg.pdb`` - ``xtal-lig.pdb`` - ``actives.tgz`` - ``decoys.tgz`` The ``actives.tgz`` and ``decoys.tgz`` are each tarballs which contain the DB2 files for your actives and decoys, respectively. So it should look something like: .. code-block:: bash - actives.tgz - lig1.db2 - lig2.db2 - etc Note that the ligands can be ``.db2`` or ``.db2.gz``. To create a new dockopt job run: .. code-block:: bash pydock3 dockopt - new The default job directory is named ``dockopt_job``. To change the name, use ``--job_dir_name``: .. code-block:: bash pydock3 dockopt - new --job_dir_path=my_job_name The job directory contains two sub-directories: 1. ``working``: input files, intermediate blaster files, sub-directories for individual blastermaster subroutines 2. ``retrodock_jobs``: individual retrodock jobs for each docking configuration If your current directory has any of the following files, they will be copied and used. Otherwise default versions / generated versions will be made: - ``rec.pdb`` - ``rec.crg.pdb`` - ``xtal-lig.pdb`` - ``reduce_wwPDB_het_dict.txt`` - ``filt.params`` - ``radii`` - ``amb.crg.oxt`` - ``vdw.siz`` - ``delphi.def`` - ``vdw.parms.amb.mindock`` - ``prot.table.ambcrg.ambH`` 3. **Configure the DockOpt Job** The ``dockopt_config.yaml`` file specifies the search values that dockopt will use. We typically are interested in trying different combinations of matching spheres, electrostatic thin spheres, and desolvation thin spheres. Larger ligand desolvation thin sphere radius increases the desolvation penalty at a point in space. Larger electrostatic thin sphere radius increase the low dielectric boundary so the electrostatic interactions become stronger. The values to check are specified in the ``dockopt_config.yaml`` in lines like so: .. code-block:: bash thin_spheres_elec: use: true molecular_surface_density: 10.0 distance_to_surface: - 1.0 - 1.1 - 1.2 - 1.3 - 1.4 - 1.5 - 1.6 - 1.7 - 1.8 - 1.9 thin_spheres_desolv: use: true molecular_surface_density: 1.0 distance_to_surface: - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1.0 matching_spheres_perturbation: use: true num_samples_per_matching_spheres_file: 20 max_deviation_angstroms: 0.5 perturb_xtal_spheres: false This configuration tries the 10 different listed distances for the electrostatic thin spheres and the desolvation thin spheres and also tries 20 sets of matching spheres. In total, this configuration will produce and test 2000 different docking setups. **TARTING** Tarting is yet to be implemented in dockopt but we are working on making it easy 4. **Set Environment Variables** **Wynton** (dev node) .. code-block:: bash export TMPDIR=/scratch export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh **Gimel** (Any node other than ``gimel`` itself) .. code-block:: bash export TMPDIR=/scratch export SBATCH_EXEC=/usr/bin/sbatch export SQUEUE_EXEC=/usr/bin/squeue 5. **Running DockOpt** Once you have configured your ``dockopt_config.yaml`` you can run your dockopt job with: .. code-block:: bash cd pydock3 dockopt - run --retrodock_job_max_reattempts=5 where ```` is ``sge`` for wynton and ``slurm`` for gimel. Once the dockopt job is complete, the following files will be generated in the job directory: - *report.html*: contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters. - *results.csv*: parameter values, criterion values, and other information about each docking configuration. Within each sub-directory of *best_retrodock_jobs* there are: - *dockfiles/*: parameter files and *INDOCK* for given docking configuration - *output/* contains *actives* and *decoys* subdirectories which contain the OUTDOCK files for the actives and decoys - plot images Assuming you are happy with the results, you *could* simply proceed to using the dockfiles for a large-scale docking campaign. However, we suggest one more retrospective test which is to do an extrema set docking to check for charge preference. Extrema Set ~~~~~~~~~~~ The idea of an extrema set is that you provide a list of active molecules and in return get a list of molecules with -2, -1, 0, +1, +2 charge that are in the 25th-75th percentile of MW and logP of your actives. The idea is not to find property matched molecules, but instead to look at the charge preference of your docking setup. 1. **Source Environments** .. code-block:: bash source /mnt/nfs/soft/dock/versions/dock385/env.sh 2. **Get Extrema Molecules** .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/extrema_improvement/smiles_0001_map_tranche_collect_db2_gz.py {SMILES_FILE} {NUM_MOLS} This will get {NUM_MOLS} molecules from each tranche for each charge. 3. **Build the extrema molecules** *Note*: This must run on epyc or a slurm node (working on it... - docker on gimel/SGE nodes is too old) If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the `docker` group on the slurm nodes or building will not work. .. code-block:: bash bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_build_ligands.sh 4. **DOCK the extrema molecules** You just need to provide the path to the dockfiles that you want to use .. code-block:: bash export DOCKFILES=[/absolute/path/to/dockfiles] bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_dock_ligands.sh The output will be a folder named after your dockopt setup (ex. rank=1-step=1_step-conf=419) and will contain docking directories for each charge. This makes it easy to re-run this step with other dockopt setups and have them named differently and easily see their rank. 5. **Plot charge preferences** .. code-block:: bash python /mnt/nfs/exk/work/bwhall61/extrema_improvement/plot_charge_preference.py --input_folder [docking_folder] This will put `charge_preference.png` in the docking_folder that will show the score distributions for each of the charges. Advanced Uses ------------- Tarting ~~~~~~~ https://wiki.docking.org/index.php?title=DOCK_3.7_tart Tarting refers to changing the partial charges of specific atoms in the protein receptor to modify (enhance/decrease) ligand preferences for specific parts of the binding site. Generally one would redefine the partial charges distribution of a specific amino acid, and then use this modified residue when calculating the electrostatic potential for the receptor. * charged residues are not tarted usually, especially Asp. * backbone is better for tarting instead of hanging groups * usually 0.2 is enough. .. Manual modification of Matching Spheres .. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. (This is not for DOCK 3.8 - this won't make sense for right now) .. First convert generated spheres to .pdb format (directory “working”): .. ``csh $DOCKBASE/proteins/showsphere/doshowsph.csh matching_spheres.sph 1 spheres.pdb`` .. Open spheres in Chimera and move as needed (Tools – Movement – Movement .. Mouse Mode, with middle button). Save as pdb. Make a new directories .. (/spheres/ and /spheres/working/) and copy rec.pdb and xtal-lig.pdb to .. /spheres/. Copy spheres-new.pdb to /working/. Since .. useExistingMatchingSphflag option seems to turn off spheres module at .. all, you will need to copy additional files. .. .. code:: bash .. cd working .. cp ../../working/box . .. cp ../../working/lowdielectric.sph . .. $DOCKBASE/proteins/pdbtosph/bin/pdbtosph spheres-new.pdb matching_spheres.sph .. cd .. .. $DOCKBASE/proteins/blastermaster/blastermaster.py --addhOptions=" -HIS -FLIPs " -v --useExistingMatchingSphflag .. matching_spheres_v1.sph change to the new file name in INDOCK .. ``receptor_sphere_file ../dockfiles/matching_spheres.sph``