Retrospective Docking
The retrospective docking process is crucial for validating and refining docking setups before embarking on large-scale docking campaigns.
We start with a set of known active molecules that bind our protein. We generate a set of “property-matched decoys” that have similar properties to the actives but are predicted to be inactive.
We then optimize our docking setup to score the actives better than the decoys.
The general optimizations that we do are:
Electrostatic and desolvation thin sphere scan
Matching sphere perturbations
Tarting (modifying atom partial charges)
Typical Workflow
Compiling a List of Actives
Try and compile a list of 20-50 of different molecules. We usually shoot for Tanimoto Similarity <0.35.
Some places to look for known binders:
Chembl - check the pKi and only take those in the 7+ range
PDB
Literature
Generating Property-Matched Decoys
(AFAIK these instructions will only work on normal gimel - working on that. This all really needs to be condensed into a nice github)
Before running this, make sure you are in a sh
shell and source the following:
source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh
This procedure generates decoys for your actives by finding SMILES in ZINC20 that are property-matched.
While not suggested, if you want to specifically look for already built molecules, please see the wiki page.
Steps to generate the property-matched decoys:
Setting up directories
Before starting, you need a SMILES file with the format (SMILES, ID) and no header:
S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O active_1
You also need an input file named
decoy_generation.in
with the following lines:SMILES YES PROTONATE YES MWT 0 125 LOGP 0 3.6 RB 0 5 HBA 0 4 HBD 0 3 CHARGE 0 2 LIGAND TC RANGE 0.0 0.35 MINIMUM DECOYS PER LIGAND 20 DECOYS PER LIGAND 50 MAXIMUM TC BETWEEN DECOYS 0.8 TANIMOTO YES GENERATE DECOYS 750
If your input ligand SMILES are already protonated, set
PROTONATE NO
.SMILES YES
incidactes that you want to query ZINC20 for SMILES, not already built molecules from ZINC15.The
decoy_generation.in
file specifies the maximum property windows allowable for decoys, but the procedure attempts to find and assign decoys with the closest properties to the actives.The example here means that decoys will be found that are
+/- 125 Daltons +/- 3.6 logP +/- 5 rotatable bonds +/- 4 hydrogen bond acceptors +/- 3 hydrogen bond donors +/- 2 charge +/- 0.35 or less Tanimoto
The code will attempt to find
GENERATE DECOYS
decoys for each active and then will perform an assignment procedure to attempt to arrive at betweenMINIMUM DECOYS PER LIGAND
andDECOYS PER LIGAND
for each active.MAXIMUM TC BETWEEN DECOYS
refers to the maximum Tanimoto Coefficient allowed between decoys (the higher, the more similar your decoys can be to each other).Once you have created the
decoy_generation.in
, run the following command to create the decoy generation directory:python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}
Provide a directory name that you want in place of
{NEW_DIR_NAME}
. This will create the directory with subdirectories named"ligand_${number}"
for each of the active in the{SMILES_FILE}
you input.Retrieving SMILES decoys from ZINC20
Run the following command to retrieve decoy SMILES for each active:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME}
Note that this procedure may take some time (potentially a few hours).
Known Bug: Occassionally, this script may hang and be running for many hours. I’m not sure why this happens, but I typically just move onto the next step. Generally you will be able to assign decoys retrieved from other actives for actives where this procedure fails.
Assigning decoys to actives
Once we retrieve decoys, we assign them to each active.
Note: There is no scheduler submission script for the following code yet, and it may take a little bit to run. I recommend running it in a screen/tmux session so that you don’t disconnect while it is running.
cd {NEW_DIR_NAME} source /nfs/home/bwhall61/.python_envs/pulp/bin/activate python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/filter_decoys.py
If this step has completed successfully, you should see files in
NEW_DIR_NAME
with the format{LIGAND_ID}_final_property_matched_decoys.txt
. These files have the format:SMILES ZINC_ID logP #Rotatable_Bonds #HBond_Donors #HBond_Acceptors Charge Protomer_ID TC_TO_LIG.
There should also be files with the format
{LIGAND_ID}_replacements.txt
which include extra property-matched decoys that were assigned to that active.If you don’t get enough decoys, the
decoy_generation.in
file can be modified to changeMAXIMUM TC BETWEEN DECOYS
,MINIMUM DECOYS PER LIGAND
, etc.To avoid rerunning some of the time consuming calculations change
TANIMOTO YES
toTANIMOTO NO
before rerunning this step.Building the decoys + actives
Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)
If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.
First connect to epyc, go back to your decoy directory, and resource the enviornment:
ssh epyc cd {DIR_YOU_WERE_IN} source /mnt/nfs/soft/dock/versions/dock385/env.sh
Once decoys have been assigned, the decoys and actives need to be built for docking.
First run:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR}
This will create
COPY_TO_DIR
with the SMILES files:ligands.smi - this includes the input actives which have property matched decoys decoys.smi - this includes the canonicalized property-matched decoy SMILES decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES
To build the decoys and actives run (usually we use a pH of 7.4):
bash /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/qsub_build_ligands.sh {COPY_TO_DIR}
If you need more decoys, you can take more from the
{LIGAND_ID}_replacements.txt
files.Packaging decoys and actives for dockopt
Once your actives and decoys are done building, they need to packaged into .tgz files for docking. Run:
python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/format_db2_for_dockopt.py {COPY_TO_DIR}
The output will be two files inside
{COPY_TO_DIR}
:actives.tgz
anddecoys.tgz
which will be used for retrospective docking.Visualizing decoy properties
Note: This needs a little bit of updating. Some of the plots treat discrete variables (HDB for ex) as continuous and make it difficult to compare the distributions.
To visualize the distributions of molecular properties:
python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME}
To visualize the Tanimoto similarity between the decoys and the input ligands run:
python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME}
DockOpt
DockOpt generates many different docking configurations which are evaluated in parallel.
The goal is to enhance the ability to score the actives better than the property-matched decoys.
By default, DockOpt is used to try different combinations of matching spheres, desolvation thin spheres, and electrostatic thin spheres.
For more info visit the wiki page
Source environments::
Wynton (Wynton environment needs to be updated with Brendan’s pydock updates)
source /wynton/group/bks/soft/python_envs/env.sh
Gimel (Any node other than
gimel
itself)source /mnt/nfs/soft/dock/versions/dock385/env.sh
Note: I’ve only ever tried this on gimel so if someone runs into issues on wynton please let me know.
Create DockOpt Job
Ensure you are in the directory containing the required input files:
rec.pdb
orrec.crg.pdb
xtal-lig.pdb
actives.tgz
decoys.tgz
The
actives.tgz
anddecoys.tgz
are each tarballs which contain the DB2 files for your actives and decoys, respectively.So it should look something like:
- actives.tgz - lig1.db2 - lig2.db2 - etc
Note that the ligands can be
.db2
or.db2.gz
.To create a new dockopt job run:
pydock3 dockopt - new
The default job directory is named
dockopt_job
. To change the name, use--job_dir_name
:pydock3 dockopt - new --job_dir_path=my_job_name
- The job directory contains two sub-directories:
working
: input files, intermediate blaster files, sub-directories for individual blastermaster subroutinesretrodock_jobs
: individual retrodock jobs for each docking configuration
- If your current directory has any of the following files, they will be copied and used. Otherwise default versions / generated versions will be made:
rec.pdb
rec.crg.pdb
xtal-lig.pdb
reduce_wwPDB_het_dict.txt
filt.params
radii
amb.crg.oxt
vdw.siz
delphi.def
vdw.parms.amb.mindock
prot.table.ambcrg.ambH
Configure the DockOpt Job
The
dockopt_config.yaml
file specifies the search values that dockopt will use. We typically are interested in trying different combinations of matching spheres, electrostatic thin spheres, and desolvation thin spheres. Larger ligand desolvation thin sphere radius increases the desolvation penalty at a point in space. Larger electrostatic thin sphere radius increase the low dielectric boundary so the electrostatic interactions become stronger. The values to check are specified in thedockopt_config.yaml
in lines like so:thin_spheres_elec: use: true molecular_surface_density: 10.0 distance_to_surface: - 1.0 - 1.1 - 1.2 - 1.3 - 1.4 - 1.5 - 1.6 - 1.7 - 1.8 - 1.9 thin_spheres_desolv: use: true molecular_surface_density: 1.0 distance_to_surface: - 0.1 - 0.2 - 0.3 - 0.4 - 0.5 - 0.6 - 0.7 - 0.8 - 0.9 - 1.0 matching_spheres_perturbation: use: true num_samples_per_matching_spheres_file: 20 max_deviation_angstroms: 0.5 perturb_xtal_spheres: false
This configuration tries the 10 different listed distances for the electrostatic thin spheres and the desolvation thin spheres and also tries 20 sets of matching spheres. In total, this configuration will produce and test 2000 different docking setups.
TARTING
Tarting is yet to be implemented in dockopt but we are working on making it easy
Set Environment Variables
Wynton (dev node)
export TMPDIR=/scratch export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh
Gimel (Any node other than
gimel
itself)export TMPDIR=/scratch export SBATCH_EXEC=/usr/bin/sbatch export SQUEUE_EXEC=/usr/bin/squeue
Running DockOpt
Once you have configured your
dockopt_config.yaml
you can run your dockopt job with:cd <JOB_DIR_NAME> pydock3 dockopt - run <JOB_SCHEDULER_NAME> --retrodock_job_max_reattempts=5
where
<JOB_SCHEDULER_NAME>
issge
for wynton andslurm
for gimel.Once the dockopt job is complete, the following files will be generated in the job directory:
report.html: contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters.
results.csv: parameter values, criterion values, and other information about each docking configuration.
- Within each sub-directory of best_retrodock_jobs there are:
dockfiles/: parameter files and INDOCK for given docking configuration
output/ contains actives and decoys subdirectories which contain the OUTDOCK files for the actives and decoys
plot images
Assuming you are happy with the results, you could simply proceed to using the dockfiles for a large-scale docking campaign.
However, we suggest one more retrospective test which is to do an extrema set docking to check for charge preference.
Extrema Set
The idea of an extrema set is that you provide a list of active molecules and in return get a list of molecules with -2, -1, 0, +1, +2 charge that are in the 25th-75th percentile of MW and logP of your actives. The idea is not to find property matched molecules, but instead to look at the charge preference of your docking setup.
Source Environments
source /mnt/nfs/soft/dock/versions/dock385/env.sh
Get Extrema Molecules
python /mnt/nfs/exk/work/bwhall61/extrema_improvement/smiles_0001_map_tranche_collect_db2_gz.py {SMILES_FILE} {NUM_MOLS}
This will get {NUM_MOLS} molecules from each tranche for each charge.
Build the extrema molecules
Note: This must run on epyc or a slurm node (working on it… - docker on gimel/SGE nodes is too old)
If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the docker group on the slurm nodes or building will not work.
bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_build_ligands.sh
DOCK the extrema molecules
You just need to provide the path to the dockfiles that you want to use
export DOCKFILES=[/absolute/path/to/dockfiles] bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_dock_ligands.sh
The output will be a folder named after your dockopt setup (ex. rank=1-step=1_step-conf=419) and will contain docking directories for each charge. This makes it easy to re-run this step with other dockopt setups and have them named differently and easily see their rank.
Plot charge preferences
python /mnt/nfs/exk/work/bwhall61/extrema_improvement/plot_charge_preference.py --input_folder [docking_folder]
This will put charge_preference.png in the docking_folder that will show the score distributions for each of the charges.
Advanced Uses
Tarting
https://wiki.docking.org/index.php?title=DOCK_3.7_tart
Tarting refers to changing the partial charges of specific atoms in the protein receptor to modify (enhance/decrease) ligand preferences for specific parts of the binding site. Generally one would redefine the partial charges distribution of a specific amino acid, and then use this modified residue when calculating the electrostatic potential for the receptor.
charged residues are not tarted usually, especially Asp.
backbone is better for tarting instead of hanging groups
usually 0.2 is enough.