Retrospective Docking

The retrospective docking process is crucial for validating and refining docking setups before embarking on large-scale docking campaigns.

We start with a set of known active molecules that bind our protein. We generate a set of “property-matched decoys” that have similar properties to the actives but are predicted to be inactive.

We then optimize our docking setup to score the actives better than the decoys.

The general optimizations that we do are:

  • Electrostatic and desolvation thin sphere scan

  • Matching sphere perturbations

  • Tarting (modifying atom partial charges)

Typical Workflow

Compiling a List of Actives

Try and compile a list of 20-50 of different molecules. We usually shoot for Tanimoto Similarity <0.35.

Some places to look for known binders:

Generating Property-Matched Decoys

(AFAIK these instructions will only work on normal gimel - working on that. This all really needs to be condensed into a nice github)

Before running this, make sure you are in a sh shell and source the following:

source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh

This procedure generates decoys for your actives by finding SMILES in ZINC20 that are property-matched.

While not suggested, if you want to specifically look for already built molecules, please see the wiki page.

Steps to generate the property-matched decoys:

  1. Setting up directories

    Before starting, you need a SMILES file with the format (SMILES, ID) and no header:

    S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O active_1
    

    You also need an input file named decoy_generation.in with the following lines:

    SMILES YES
    PROTONATE YES
    MWT 0 125
    LOGP 0 3.6
    RB 0 5
    HBA 0 4
    HBD 0 3
    CHARGE 0 2
    LIGAND TC RANGE 0.0 0.35
    MINIMUM DECOYS PER LIGAND 20
    DECOYS PER LIGAND 50
    MAXIMUM TC BETWEEN DECOYS 0.8
    TANIMOTO YES
    GENERATE DECOYS 750
    

    If your input ligand SMILES are already protonated, set PROTONATE NO.

    SMILES YES incidactes that you want to query ZINC20 for SMILES, not already built molecules from ZINC15.

    The decoy_generation.in file specifies the maximum property windows allowable for decoys, but the procedure attempts to find and assign decoys with the closest properties to the actives.

    The example here means that decoys will be found that are

    +/- 125 Daltons
    +/- 3.6 logP
    +/- 5 rotatable bonds
    +/- 4 hydrogen bond acceptors
    +/- 3 hydrogen bond donors
    +/- 2 charge
    +/- 0.35 or less Tanimoto
    

    The code will attempt to find GENERATE DECOYS decoys for each active and then will perform an assignment procedure to attempt to arrive at between MINIMUM DECOYS PER LIGAND and DECOYS PER LIGAND for each active.

    MAXIMUM TC BETWEEN DECOYS refers to the maximum Tanimoto Coefficient allowed between decoys (the higher, the more similar your decoys can be to each other).

    Once you have created the decoy_generation.in, run the following command to create the decoy generation directory:

    python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}
    

    Provide a directory name that you want in place of {NEW_DIR_NAME}. This will create the directory with subdirectories named "ligand_${number}" for each of the active in the {SMILES_FILE} you input.

  2. Retrieving SMILES decoys from ZINC20

    Run the following command to retrieve decoy SMILES for each active:

    python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME}
    

    Note that this procedure may take some time (potentially a few hours).

    Known Bug: Occassionally, this script may hang and be running for many hours. I’m not sure why this happens, but I typically just move onto the next step. Generally you will be able to assign decoys retrieved from other actives for actives where this procedure fails.

  3. Assigning decoys to actives

    Once we retrieve decoys, we assign them to each active.

    Note: There is no scheduler submission script for the following code yet, and it may take a little bit to run. I recommend running it in a screen/tmux session so that you don’t disconnect while it is running.

    cd {NEW_DIR_NAME}
    source /nfs/home/bwhall61/.python_envs/pulp/bin/activate
    python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/filter_decoys.py
    

    If this step has completed successfully, you should see files in NEW_DIR_NAME with the format {LIGAND_ID}_final_property_matched_decoys.txt. These files have the format:

    SMILES ZINC_ID logP #Rotatable_Bonds #HBond_Donors #HBond_Acceptors Charge Protomer_ID TC_TO_LIG.
    

    There should also be files with the format {LIGAND_ID}_replacements.txt which include extra property-matched decoys that were assigned to that active.

    If you don’t get enough decoys, the decoy_generation.in file can be modified to change MAXIMUM TC BETWEEN DECOYS, MINIMUM DECOYS PER LIGAND, etc.

    To avoid rerunning some of the time consuming calculations change TANIMOTO YES to TANIMOTO NO before rerunning this step.

  4. Building the decoys + actives

    Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)

    If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.

    First connect to epyc, go back to your decoy directory, and resource the enviornment:

    ssh epyc
    cd {DIR_YOU_WERE_IN}
    source /mnt/nfs/soft/dock/versions/dock385/env.sh
    

    Once decoys have been assigned, the decoys and actives need to be built for docking.

    First run:

    python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR}
    

    This will create COPY_TO_DIR with the SMILES files:

    ligands.smi - this includes the input actives which have property matched decoys
    decoys.smi  - this includes the canonicalized property-matched decoy SMILES
    decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES
    

    To build the decoys and actives run (usually we use a pH of 7.4):

    bash /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/qsub_build_ligands.sh {COPY_TO_DIR}
    

    If you need more decoys, you can take more from the {LIGAND_ID}_replacements.txt files.

  5. Packaging decoys and actives for dockopt

    Once your actives and decoys are done building, they need to packaged into .tgz files for docking. Run:

    python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/format_db2_for_dockopt.py {COPY_TO_DIR}
    

    The output will be two files inside {COPY_TO_DIR} : actives.tgz and decoys.tgz which will be used for retrospective docking.

  6. Visualizing decoy properties

    Note: This needs a little bit of updating. Some of the plots treat discrete variables (HDB for ex) as continuous and make it difficult to compare the distributions.

    To visualize the distributions of molecular properties:

    python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME}
    

    To visualize the Tanimoto similarity between the decoys and the input ligands run:

    python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME}
    

DockOpt

DockOpt generates many different docking configurations which are evaluated in parallel.

The goal is to enhance the ability to score the actives better than the property-matched decoys.

By default, DockOpt is used to try different combinations of matching spheres, desolvation thin spheres, and electrostatic thin spheres.

For more info visit the wiki page

  1. Source environments::

    Wynton (Wynton environment needs to be updated with Brendan’s pydock updates)

    source /wynton/group/bks/soft/python_envs/env.sh
    

    Gimel (Any node other than gimel itself)

    source /mnt/nfs/soft/dock/versions/dock385/env.sh
    

    Note: I’ve only ever tried this on gimel so if someone runs into issues on wynton please let me know.

  2. Create DockOpt Job

    Ensure you are in the directory containing the required input files:

    • rec.pdb or rec.crg.pdb

    • xtal-lig.pdb

    • actives.tgz

    • decoys.tgz

    The actives.tgz and decoys.tgz are each tarballs which contain the DB2 files for your actives and decoys, respectively.

    So it should look something like:

    - actives.tgz
        - lig1.db2
        - lig2.db2
        - etc
    

    Note that the ligands can be .db2 or .db2.gz.

    To create a new dockopt job run:

    pydock3 dockopt - new
    

    The default job directory is named dockopt_job. To change the name, use --job_dir_name:

    pydock3 dockopt - new --job_dir_path=my_job_name
    
    The job directory contains two sub-directories:
    1. working: input files, intermediate blaster files, sub-directories for individual blastermaster subroutines

    2. retrodock_jobs: individual retrodock jobs for each docking configuration

    If your current directory has any of the following files, they will be copied and used. Otherwise default versions / generated versions will be made:
    • rec.pdb

    • rec.crg.pdb

    • xtal-lig.pdb

    • reduce_wwPDB_het_dict.txt

    • filt.params

    • radii

    • amb.crg.oxt

    • vdw.siz

    • delphi.def

    • vdw.parms.amb.mindock

    • prot.table.ambcrg.ambH

  3. Configure the DockOpt Job

    The dockopt_config.yaml file specifies the search values that dockopt will use. We typically are interested in trying different combinations of matching spheres, electrostatic thin spheres, and desolvation thin spheres. Larger ligand desolvation thin sphere radius increases the desolvation penalty at a point in space. Larger electrostatic thin sphere radius increase the low dielectric boundary so the electrostatic interactions become stronger. The values to check are specified in the dockopt_config.yaml in lines like so:

    thin_spheres_elec:
        use: true
        molecular_surface_density: 10.0
        distance_to_surface:
        - 1.0
        - 1.1
        - 1.2
        - 1.3
        - 1.4
        - 1.5
        - 1.6
        - 1.7
        - 1.8
        - 1.9
    
    thin_spheres_desolv:
        use: true
        molecular_surface_density: 1.0
        distance_to_surface:
        - 0.1
        - 0.2
        - 0.3
        - 0.4
        - 0.5
        - 0.6
        - 0.7
        - 0.8
        - 0.9
        - 1.0
    
    matching_spheres_perturbation:
        use: true
        num_samples_per_matching_spheres_file: 20
        max_deviation_angstroms: 0.5
        perturb_xtal_spheres: false
    

    This configuration tries the 10 different listed distances for the electrostatic thin spheres and the desolvation thin spheres and also tries 20 sets of matching spheres. In total, this configuration will produce and test 2000 different docking setups.

    TARTING

    Tarting is yet to be implemented in dockopt but we are working on making it easy

  4. Set Environment Variables

    Wynton (dev node)

    export TMPDIR=/scratch
    export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
    export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
    export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh
    

    Gimel (Any node other than gimel itself)

    export TMPDIR=/scratch
    export SBATCH_EXEC=/usr/bin/sbatch
    export SQUEUE_EXEC=/usr/bin/squeue
    
  5. Running DockOpt

    Once you have configured your dockopt_config.yaml you can run your dockopt job with:

    cd <JOB_DIR_NAME>
    pydock3 dockopt - run <JOB_SCHEDULER_NAME> --retrodock_job_max_reattempts=5
    

    where <JOB_SCHEDULER_NAME> is sge for wynton and slurm for gimel.

    Once the dockopt job is complete, the following files will be generated in the job directory:

    • report.html: contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters.

    • results.csv: parameter values, criterion values, and other information about each docking configuration.

    Within each sub-directory of best_retrodock_jobs there are:
    • dockfiles/: parameter files and INDOCK for given docking configuration

    • output/ contains actives and decoys subdirectories which contain the OUTDOCK files for the actives and decoys

    • plot images

Assuming you are happy with the results, you could simply proceed to using the dockfiles for a large-scale docking campaign.

However, we suggest one more retrospective test which is to do an extrema set docking to check for charge preference.

Extrema Set

The idea of an extrema set is that you provide a list of active molecules and in return get a list of molecules with -2, -1, 0, +1, +2 charge that are in the 25th-75th percentile of MW and logP of your actives. The idea is not to find property matched molecules, but instead to look at the charge preference of your docking setup.

  1. Source Environments

    source /mnt/nfs/soft/dock/versions/dock385/env.sh
    
  2. Get Extrema Molecules

    python /mnt/nfs/exk/work/bwhall61/extrema_improvement/smiles_0001_map_tranche_collect_db2_gz.py {SMILES_FILE} {NUM_MOLS}
    

    This will get {NUM_MOLS} molecules from each tranche for each charge.

  3. Build the extrema molecules

    Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)

    If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.

    bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_build_ligands.sh
    
  4. DOCK the extrema molecules

    You just need to provide the path to the dockfiles that you want to use

    export DOCKFILES=[/absolute/path/to/dockfiles]
    bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_dock_ligands.sh
    

    The output will be a folder named after your dockopt setup (ex. rank=1-step=1_step-conf=419) and will contain docking directories for each charge. This makes it easy to re-run this step with other dockopt setups and have them named differently and easily see their rank.

  5. Plot charge preferences

    python /mnt/nfs/exk/work/bwhall61/extrema_improvement/plot_charge_preference.py --input_folder [docking_folder]
    

    This will put charge_preference.png in the docking_folder that will show the score distributions for each of the charges.

Advanced Uses

Tarting

https://wiki.docking.org/index.php?title=DOCK_3.7_tart

Tarting refers to changing the partial charges of specific atoms in the protein receptor to modify (enhance/decrease) ligand preferences for specific parts of the binding site. Generally one would redefine the partial charges distribution of a specific amino acid, and then use this modified residue when calculating the electrostatic potential for the receptor.

  • charged residues are not tarted usually, especially Asp.

  • backbone is better for tarting instead of hanging groups

  • usually 0.2 is enough.