Retrospective Docking

The retrospective docking process is crucial for validating and refining docking setups before embarking on large-scale docking campaigns.

We start with a set of known active molecules that bind our protein. We generate a set of “property-matched decoys” that have similar properties to the actives but are predicted to be inactive.

We then optimize our docking setup to score the actives better than the decoys.

The general optimizations that we do are:

Electrostatic and desolvation thin sphere scan
Matching sphere perturbations
Tarting (modifying atom partial charges)

Typical Workflow

Compiling a List of Actives

Try and compile a list of 20-50 of different molecules. We usually shoot for Tanimoto Similarity <0.35.

Some places to look for known binders:

Chembl - check the pKi and only take those in the 7+ range
https://www.guidetopharmacology.org
PDB
Literature

Generating Property-Matched Decoys

(AFAIK these instructions will only work on normal gimel - working on that. This all really needs to be condensed into a nice github)

Before running this, make sure you are in a sh shell and source the following:

source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh

This procedure generates decoys for your actives by finding SMILES in ZINC20 that are property-matched.

While not suggested, if you want to specifically look for already built molecules, please see the wiki page.

Steps to generate the property-matched decoys:

Setting up directories
Before starting, you need a SMILES file with the format (SMILES, ID) and no header:
S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O active_1
You also need an input file named decoy_generation.in with the following lines:
SMILES YES PROTONATE YES MWT 0 125 LOGP 0 3.6 RB 0 5 HBA 0 4 HBD 0 3 CHARGE 0 2 LIGAND TC RANGE 0.0 0.35 MINIMUM DECOYS PER LIGAND 20 DECOYS PER LIGAND 50 MAXIMUM TC BETWEEN DECOYS 0.8 TANIMOTO YES GENERATE DECOYS 750
If your input ligand SMILES are already protonated, set PROTONATE NO.

SMILES YES incidactes that you want to query ZINC20 for SMILES, not already built molecules from ZINC15.

The decoy_generation.in file specifies the maximum property windows allowable for decoys, but the procedure attempts to find and assign decoys with the closest properties to the actives.

The example here means that decoys will be found that are
+/- 125 Daltons +/- 3.6 logP +/- 5 rotatable bonds +/- 4 hydrogen bond acceptors +/- 3 hydrogen bond donors +/- 2 charge +/- 0.35 or less Tanimoto
The code will attempt to find GENERATE DECOYS decoys for each active and then will perform an assignment procedure to attempt to arrive at between MINIMUM DECOYS PER LIGAND and DECOYS PER LIGAND for each active.

MAXIMUM TC BETWEEN DECOYS refers to the maximum Tanimoto Coefficient allowed between decoys (the higher, the more similar your decoys can be to each other).

Once you have created the decoy_generation.in, run the following command to create the decoy generation directory:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}
Provide a directory name that you want in place of {NEW_DIR_NAME}. This will create the directory with subdirectories named "ligand_${number}" for each of the active in the {SMILES_FILE} you input.
Retrieving SMILES decoys from ZINC20
Run the following command to retrieve decoy SMILES for each active:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME}
Note that this procedure may take some time (potentially a few hours).

Known Bug: Occassionally, this script may hang and be running for many hours. I’m not sure why this happens, but I typically just move onto the next step. Generally you will be able to assign decoys retrieved from other actives for actives where this procedure fails.
Assigning decoys to actives
Once we retrieve decoys, we assign them to each active.

Note: There is no scheduler submission script for the following code yet, and it may take a little bit to run. I recommend running it in a screen/tmux session so that you don’t disconnect while it is running.
cd {NEW_DIR_NAME} source /nfs/home/bwhall61/.python_envs/pulp/bin/activate python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/filter_decoys.py
If this step has completed successfully, you should see files in NEW_DIR_NAME with the format {LIGAND_ID}_final_property_matched_decoys.txt. These files have the format:
SMILES ZINC_ID logP #Rotatable_Bonds #HBond_Donors #HBond_Acceptors Charge Protomer_ID TC_TO_LIG.
There should also be files with the format {LIGAND_ID}_replacements.txt which include extra property-matched decoys that were assigned to that active.

If you don’t get enough decoys, the decoy_generation.in file can be modified to change MAXIMUM TC BETWEEN DECOYS, MINIMUM DECOYS PER LIGAND, etc.

To avoid rerunning some of the time consuming calculations change TANIMOTO YES to TANIMOTO NO before rerunning this step.
Building the decoys + actives
Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)

If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.

First connect to epyc, go back to your decoy directory, and resource the enviornment:
ssh epyc cd {DIR_YOU_WERE_IN} source /mnt/nfs/soft/dock/versions/dock385/env.sh
Once decoys have been assigned, the decoys and actives need to be built for docking.

First run:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR}
This will create COPY_TO_DIR with the SMILES files:
ligands.smi - this includes the input actives which have property matched decoys decoys.smi - this includes the canonicalized property-matched decoy SMILES decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES
To build the decoys and actives run (usually we use a pH of 7.4):
bash /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/qsub_build_ligands.sh {COPY_TO_DIR}
If you need more decoys, you can take more from the {LIGAND_ID}_replacements.txt files.
Packaging decoys and actives for dockopt
Once your actives and decoys are done building, they need to packaged into .tgz files for docking. Run:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/format_db2_for_dockopt.py {COPY_TO_DIR}
The output will be two files inside {COPY_TO_DIR} : actives.tgz and decoys.tgz which will be used for retrospective docking.
Visualizing decoy properties
Note: This needs a little bit of updating. Some of the plots treat discrete variables (HDB for ex) as continuous and make it difficult to compare the distributions.

To visualize the distributions of molecular properties:
python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME}
To visualize the Tanimoto similarity between the decoys and the input ligands run:
python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME}

DockOpt

DockOpt generates many different docking configurations which are evaluated in parallel.

The goal is to enhance the ability to score the actives better than the property-matched decoys.

By default, DockOpt is used to try different combinations of matching spheres, desolvation thin spheres, and electrostatic thin spheres.

For more info visit the wiki page

Source environments::
Wynton (Wynton environment needs to be updated with Brendan’s pydock updates)
source /wynton/group/bks/soft/python_envs/env.sh
Gimel (Any node other than gimel itself)
source /mnt/nfs/soft/dock/versions/dock385/env.sh
Note: I’ve only ever tried this on gimel so if someone runs into issues on wynton please let me know.
Create DockOpt Job
Ensure you are in the directory containing the required input files:
rec.pdb or rec.crg.pdb

xtal-lig.pdb

actives.tgz

decoys.tgz
The actives.tgz and decoys.tgz are each tarballs which contain the DB2 files for your actives and decoys, respectively.

So it should look something like:
- actives.tgz - lig1.db2 - lig2.db2 - etc
Note that the ligands can be .db2 or .db2.gz.

To create a new dockopt job run:
pydock3 dockopt - new
The default job directory is named dockopt_job. To change the name, use --job_dir_name:
pydock3 dockopt - new --job_dir_path=my_job_name
The job directory contains two sub-directories:
working: input files, intermediate blaster files, sub-directories for individual blastermaster subroutines

retrodock_jobs: individual retrodock jobs for each docking configuration
If your current directory has any of the following files, they will be copied and used. Otherwise default versions / generated versions will be made:
rec.pdb

rec.crg.pdb

xtal-lig.pdb

reduce_wwPDB_het_dict.txt

filt.params

radii

amb.crg.oxt

vdw.siz

delphi.def

vdw.parms.amb.mindock

prot.table.ambcrg.ambH
Configure the DockOpt Job
The dockopt_config.yaml file specifies the search values that dockopt will use. We typically are interested in trying different combinations of matching spheres, electrostatic thin spheres, and desolvation thin spheres. Larger ligand desolvation thin sphere radius increases the desolvation penalty at a point in space. Larger electrostatic thin sphere radius increase the low dielectric boundary so the electrostatic interactions become stronger. The values to check are specified in the dockopt_config.yaml in lines like so:
thin_spheres_elec: use: true molecular_surface_density: 10.0 distance_to_surface: - 1.0 - 1.1 - 1.2 - 1.3 - 1.4 - 1.5 - 1.6 - 1.7 - 1.8 - 1.9 thin_spheres_desolv: use: true molecular_surface_density: 1.0 distance_to_surface: - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1.0 matching_spheres_perturbation: use: true num_samples_per_matching_spheres_file: 20 max_deviation_angstroms: 0.5 perturb_xtal_spheres: false
This configuration tries the 10 different listed distances for the electrostatic thin spheres and the desolvation thin spheres and also tries 20 sets of matching spheres. In total, this configuration will produce and test 2000 different docking setups.

TARTING

Tarting is yet to be implemented in dockopt but we are working on making it easy

Set Environment Variables

Wynton (dev node)

export TMPDIR=/scratch
export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh

Gimel (Any node other than gimel itself)

export TMPDIR=/scratch
export SBATCH_EXEC=/usr/bin/sbatch
export SQUEUE_EXEC=/usr/bin/squeue

Running DockOpt
Once you have configured your dockopt_config.yaml you can run your dockopt job with:
cd <JOB_DIR_NAME> pydock3 dockopt - run <JOB_SCHEDULER_NAME> --retrodock_job_max_reattempts=5
where <JOB_SCHEDULER_NAME> is sge for wynton and slurm for gimel.

Once the dockopt job is complete, the following files will be generated in the job directory:
report.html: contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters.

results.csv: parameter values, criterion values, and other information about each docking configuration.
Within each sub-directory of best_retrodock_jobs there are:
dockfiles/: parameter files and INDOCK for given docking configuration

output/ contains actives and decoys subdirectories which contain the OUTDOCK files for the actives and decoys

plot images

Assuming you are happy with the results, you could simply proceed to using the dockfiles for a large-scale docking campaign.

However, we suggest one more retrospective test which is to do an extrema set docking to check for charge preference.

Extrema Set

The idea of an extrema set is that you provide a list of active molecules and in return get a list of molecules with -2, -1, 0, +1, +2 charge that are in the 25th-75th percentile of MW and logP of your actives. The idea is not to find property matched molecules, but instead to look at the charge preference of your docking setup.

Source Environments

source /mnt/nfs/soft/dock/versions/dock385/env.sh

Get Extrema Molecules

python /mnt/nfs/exk/work/bwhall61/extrema_improvement/smiles_0001_map_tranche_collect_db2_gz.py {SMILES_FILE} {NUM_MOLS}
This will get {NUM_MOLS} molecules from each tranche for each charge.

Build the extrema molecules
Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)

If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.
bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_build_ligands.sh
DOCK the extrema molecules
You just need to provide the path to the dockfiles that you want to use
export DOCKFILES=[/absolute/path/to/dockfiles] bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_dock_ligands.sh
The output will be a folder named after your dockopt setup (ex. rank=1-step=1_step-conf=419) and will contain docking directories for each charge. This makes it easy to re-run this step with other dockopt setups and have them named differently and easily see their rank.
Plot charge preferences
python /mnt/nfs/exk/work/bwhall61/extrema_improvement/plot_charge_preference.py --input_folder [docking_folder]
This will put charge_preference.png in the docking_folder that will show the score distributions for each of the charges.

Advanced Uses

Tarting

https://wiki.docking.org/index.php?title=DOCK_3.7_tart

Tarting refers to changing the partial charges of specific atoms in the protein receptor to modify (enhance/decrease) ligand preferences for specific parts of the binding site. Generally one would redefine the partial charges distribution of a specific amino acid, and then use this modified residue when calculating the electrostatic potential for the receptor.

charged residues are not tarted usually, especially Asp.
backbone is better for tarting instead of hanging groups
usually 0.2 is enough.