Filtering and Hit Picking
Following the completion of a large-scale docking (LSD) campaign, a critical step is to filter the resulting compound list to identify the most promising candidates. This document outlines the successive filtering layers applied, including novelty checks, interaction filtering (IFP), and clustering. The process culminates in hit picking, where the most viable compounds are selected for experimental testing.
Typical Workflow
Extracting Top Molecules and Poses
First, extract the best-scoring molecules from your screen. You need to provide the path to your LSD output and the number of molecules you want to retrieve:
source /wynton/group/bks/soft/python_envs/python3.9.0.sh
python /wynton/group/bks/work/bwhall61/needs_github/top_poses.py -n [num_mols] [LSD_output]
Interaction and Novelty Filtering
The first step is to filter for the interactions we want (for example, an H-bond with a specific residue) and to make sure that our molecules are novel compared to known actives.
Copy the scripts, your receptor, and known actives to your directory:
cp -r /wynton/group/bks/work/bwhall61/ifp_dev ./filter_scripts
cp [path/to/rec.crg.pdb] .
cp [path/to/knowns.smi] .
- You must make the following changes to your rec.crg.pdb:
Change HIE/HID back to HIS
sed -i 's/HIE/HIS/g; s/HID/HIS/g' rec.crg.pdb
Revert the names of tarted residues (you can use a sed command similar to the one above)
Change waters back to HETATM from ATOM
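The water fix can also be scripted with sed. A minimal sketch on a toy file, assuming your waters carry the residue name HOH (adjust the pattern if your receptor uses WAT or TIP3); sample.pdb here is illustrative, and for the real receptor you would run the same sed on rec.crg.pdb:

```shell
# Toy two-line PDB: one protein atom, one water written as ATOM
cat > sample.pdb <<'EOF'
ATOM      1  N   GLY A   1
ATOM    900  O   HOH A 200
EOF
# Convert water (HOH) records from ATOM back to HETATM in place.
# 'ATOM  ' is 6 characters in PDB records, so the swap keeps columns aligned.
sed -i '/HOH/ s/^ATOM  /HETATM/' sample.pdb
```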
Define the interactions to look for:
Modify line 205 of filter_scripts/ifp_interactions.py to define the interactions that you want to require. For example:
filters = [['Hydrogen bond','GLY-333'],['Hydrogen bond','ALA-353'],['Hydrogen bond','TYR-368']]
Split the top_poses.mol2 and top_poses.scores into smaller files for submission
python filter_scripts/lc_blazing_fast_separate_mol2_into_smaller_files_called_filter-XXX.py top_poses.mol2 top_poses.scores [num_poses_per_file]
ls filtered*.mol2 > dirlist
Submit IFP to wynton
csh filter_scripts/submit.csh $(pwd) $(pwd)/filter_scripts rec.crg $(pwd)/knowns.smi
Check that everything finished and ran correctly:
ls -d --color=never [0-9]* > dirlist_combine
python filter_scripts/check_finished_notfinished_ifp.py dirlist_combine
If any jobs failed to run, they will be put into NOT_FINISHED_dirlist and can be re-run with:
csh filter_scripts/resubmit.csh $(pwd) $(pwd)/filter_scripts rec.crg $(pwd)/knowns.smi NOT_FINISHED_dirlist
Combine the results from the jobs
python filter_scripts/combine_ifp.py dirlist_combine combined
Perform the filtering
The lines of combined.interactions.csv are organized as follows:
$1: ZINC ID
$2: SMILES
$3: DOCK score
$4: # of H-bond donors
$5: # of H-bond acceptors
$6: # of unsatisfied H-bond donors
$7: # of unsatisfied H-bond acceptors
$8...: Additional interactions you defined
$(NF-1): Name of the known active the ZINC molecule is most similar to
$NF (last column): Similarity to that known active
The typical protocol is to remove compounds that have any unsatisfied H-bond donors, more than 5 unsatisfied H-bond acceptors, or a similarity >0.35 to a known active.
To do this filtering (and additionally to filter molecules to keep interaction $8):
awk -F "," '$6==0 && $7<=5 && $NF<=0.35 && $8==1 {print NR}' combined.interactions.csv > pass_ifp_lines.txt
(head -n 1 combined.fp; awk 'NR==FNR {lines[$1]; next} (FNR in lines)' pass_ifp_lines.txt combined.fp) > ifp_novelty_filtered.fp
(head -n 1 combined.interactions.csv; awk 'NR==FNR {lines[$1]; next} (FNR in lines)' pass_ifp_lines.txt combined.interactions.csv) > ifp_novelty_filtered.interactions.csv
Best First Clustering
Once we have filtered for novel molecules that have our desired interactions, we do one last clustering step to ensure that the remaining molecules are diverse. Here we use a TC cutoff of 0.35 (molecules with TC >0.35 to each other are placed in the same cluster); you can increase this cutoff so that only more similar molecules cluster together, leaving more cluster heads.
python filter_scripts/bfc.py --fp_file ifp_novelty_filtered.fp --interactions_file ifp_novelty_filtered.interactions.csv --tc_cutoff 0.35
The output will be two files: cluster_head.list and cluster_details.list, containing the cluster heads and all of the molecules and their assigned clusters respectively.
If you don’t have enough molecules, you can do a few things: start with more molecules at the beginning of filtering, loosen your interaction filtering, loosen the novelty filtering, or loosen the clustering TC cutoff.
We can then collect the poses for the cluster heads to do hit picking.
python filter_scripts/collect_poses.py --mol2_file top_poses.mol2 --name_file cluster_head.list --outname poses_for_hitpicking.mol2 --name_column 1
This will give you the final poses_for_hitpicking.mol2 which you can then load into Chimera to visualize.
Hit Picking in Chimera with ViewDock
Finally, hit picking is performed using the ViewDock utility in Chimera.
A basic ViewDock tutorial and more information can be found in the Chimera documentation.
This involves visual inspection of docking poses to prioritize compounds based on their predicted interactions.
In Chimera, go to Tools > Surface/Binding Analysis > ViewDock.
You can then load your poses_for_hitpicking.mol2 and flip through the poses, marking molecules as Viable, Deleted, or Purged.
Some groups that you may want to avoid in initial hit picking, based on lab/Brian wisdom:
* Br and I
* NO2
* Azides and diazo compounds
* Thiophenes?
* 6-membered rings with >2 nitrogens
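Some of these groups can be crudely pre-screened by text-matching the SMILES column of the filtered csv. This is only a hedged sketch: demo_filtered.csv and its layout (SMILES in column 2, as in the interaction filtering section) are illustrative, and plain-text matching on SMILES is approximate; a SMARTS-based filter (e.g., in RDKit) is the rigorous way to do this.

```shell
# Toy stand-in for the filtered interactions csv (SMILES in column 2)
cat > demo_filtered.csv <<'EOF'
ZINC000001,CCOc1ccccc1,-45.2
ZINC000002,CCBr,-44.8
ZINC000003,CC[N+](=O)[O-],-43.1
EOF
# Drop rows whose SMILES contains Br, I (capital I only ever encodes
# iodine in SMILES), or a nitro group written as [N+](=O)[O-]
awk -F"," '$2 !~ /Br|I|\[N\+\]\(=O\)\[O-\]/' demo_filtered.csv
# → keeps only the ZINC000001 row
```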
Ordering Compounds
To order molecules, send an Excel sheet with the following columns to Kateryna (k.opimakh@enamine.net), kindly asking for a quote:
* Customer Code (ZINC ID)
* Enamine Code (if available)
* SMILES
* Amount (10 mg to begin)
* Stereochemistry (racemic OK for initial ordering)
* Purity (>90% for initial ordering)
She will send back a draft quote that includes the price for each molecule, at which point you will probably throw some out (we try to keep under ~$120-140 for initial hits). It's also not a bad idea to double-check the SMILES and other details at this point.
I highlight the rows of the Excel sheet I want to keep and send it back asking for a final quote. At this point you can also ask for specific formatting. The most common request is two separate vials (8 mg + 2 mg): one for testing/collaborators, and it is nice to have extra on hand for aggregation experiments or in case they run out and need a little more. If the collaborators want, you can also ask for DMSO stocks. For example: "Could we dissolve each in DMSO to get 300 uL of 10 mM stock solution and receive the rest as powder?"
Advanced Uses
Docking stats
cp /wynton/group/bks/work/elisfink/5HT2AR/leads1/docking_stat_parallel.csh .
csh docking_stat_parallel.csh /FULL/PATH
For DOCK 3.7:
python ~/DOCK/ucsfdock/analysis/get_docking_statistics.py $PWD/output $PWD/output/dirlist aldeh-20.txt
Get total time
grep --color total_hours *docking_stat.out | awk '{print $3}' | awk '{s+=$1}END{print s}'
Removing Strained Molecules
Our methodology for calculating torsion strain for small molecules is documented in Gu S, Smith MSS, et al. Ligand Strain Energy in Large Library Docking. J. Chem. Inf. Model. 2021, 61, 4331−4341. https://doi.org/10.1021/acs.jcim.1c00368. There is also plenty of math written in the comments of the Python scripts.
Usually, we run DOCK3.8 with strain cutoffs already set in the INDOCK file, so we only ever dock conformations that are not strained. That procedure relies on the .db2 files in ZINC22 having the strain values written for each set of confs.
If you want to filter more strictly after docking or calculate strain for a set of molecules not in ZINC, you can follow these steps.
1. What You Will Need
A .mol2 file, which I have called poses.mol2, although there are options in the Python scripts for .db2 files.
2. Location of the Scripts
All the scripts are on the gimel cluster. The files for actually calculating strain are in the directory
/nfs/home/matt715/Strain
and the Bash scripts for running the calculation for many molecules are in
/nfs/home/matt715/Filter
If you copy any of these scripts into your own directory, make sure to update the lines pointing to the correct Filter and Strain directories.
3. Running on the Cluster
Submit a job of calculating strain to the gimel cluster using SGE:
sh {PATH}/submit_strain.sh poses.mol2
Once the jobs have finished running, combine the results:
cd STRAIN
find . -name "*_Torsion_Strain.csv" -exec cat {} >> torsion.csv \;
4. Processing the Results
The Python scripts create .csv files for each chunk of the molecules in the .mol2, which then get combined into one .csv file. The .csv outputs are described in detail in the header comments of Strain/Torsion_Strain.py, but what matters for simple filtering is:
One line per molecule in the .mol2 file
The first column is the molecule name
The second column is the total strain, or the sum of the point estimates for the strain for each torsion pattern (dihedral angle)
The sixth column is the max strain, or the largest single point estimate for strain over all the torsion patterns (dihedral angles)
We want to extract from the .csv file a list of all the unstrained molecules. Let's say we prospectively docked with cutoffs of total strain less than 8 TEU and max strain less than 3 TEU (Torsion Energy Units), but now we would like to filter the docked molecules for total strain less than 7 TEU and max strain less than 2.5 TEU. We do that with the following commands:
awk -F"," '$2 > 0 && $2 < 7.0 && $6 < 2.5 {print $1 "," $2 "," $6}' torsion.csv > torsion_unstrained.csv
awk -F"," '{print $1}' torsion_unstrained.csv > torsion_unstrained.names
Lastly, we use the lc_blazing_fast_collect_mol2.py script (I have a copy in my Filter directory) to identify the molecules in the .mol2 file whose names are in torsion_unstrained.names and make a new .mol2 with just those. Make sure you are on epyc, gimel2, or gimel5!
cd .. #If still in the STRAIN subdirectory
python {PATH}/lc_blazing_fast_collect_mol2.py STRAIN/torsion_unstrained.names poses.mol2 poses_unstrained.mol2
Alternate Strain (Andrii)
From the directory containing your .mol2 file:
qsub ~ak87/exa/PROGRAM/LSD/STRAIN/submit-strain-onefile.tcsh
awk -F"," '$2>0 && $2<7.0 && $6<2.5 {print $1" "$2" "$6}' *_Torsion_Strain.csv > RdRp-Shuo-nitriles-20-strain-filtered.csv
awk -F"," '$2>0 && $2<6.5 && $6<1.8' test2_Torsion_Strain.csv > filtered.csv
$2 - total strain < 8.0 TEU; $6 - maximum torsional strain < 3.5 TEU.
Other recommended values:
DISI 6.5 & 1.8
Elissa: 7 and 2.5
Comments in the script itself (TEU is an arbitrary energy unit):
# we recommend using 8 TEU as a cutoff
# for considering the entire structure to be strained. We recommend using
# 2.5 TEU as a cutoff for when using the non-zero-cutoff-sum and then
# considering any structure with a sum greater than 0 to be strained. That
# means we would be considering any structure with at least one torsion pattern
# with an estimated energy over 2.5 TEU to be strained overall.
Columns in the csv file:
1: Molecule or pose ID
2: Total strain
3, 4: Apparently the upper and lower bounds of the 95% confidence interval for the total strain
5: Torsion ID (in the library) with the largest strain
6: Largest strain of a single torsion
Remaining columns, repeated per torsion: torsion ID | energy | energy_lower_bound | energy_upper_bound | atom IDs in the torsion | angle | SMARTS | type of torsion rule (specific or general) | type of energy calculation method (exact or approximate) | a Boolean indicating whether the torsion pattern has an approximate method and a dihedral angle not within the second tolerance of any peak (whether it is "flagged")
Strain for covalent docking
For aldehydes, $6 is always high (3.70-3.75), so filtering on it gives almost nothing. The SMARTS pattern for the problematic angle is [*:1]~[CX3:2]!@[OX2:3]~[*:4], which is for a C-O bond, presumably the one to the protein. Use $19 to filter instead: it holds the value for the torsion angle with the next-largest strain (nominally $16, but the atom-ID fields contain commas, which shifts the columns).
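A hedged sketch of that $19 filter on a toy csv: demo_strain.csv and its field positions are illustrative (the $19 index follows the note above about comma-shifted columns), so verify the column index against your own file before trusting it.

```shell
# Toy strain csv with 19 fields: $2 = total strain, $6 = largest torsion
# strain (saturated for covalent aldehydes), $19 = assumed position of the
# next-largest torsion strain after comma-containing atom-ID fields shift $16
cat > demo_strain.csv <<'EOF'
MOL1,5.0,4.5,5.5,t01,3.72,x,x,x,x,x,x,x,x,x,x,x,x,1.1
MOL2,5.5,5.0,6.0,t02,3.71,x,x,x,x,x,x,x,x,x,x,x,x,2.9
EOF
# Filter on total strain and on $19 instead of the saturated $6
awk -F"," '$2 > 0 && $2 < 7.0 && $19 < 2.5 {print $1}' demo_strain.csv
# → prints MOL1
```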
If there are too many poses
mkdir STRAIN
cd STRAIN
ln -s ../top_poses.mol2 .
python ~ttummino/zzz.scripts/misc/lc_blazing_fast_separate_mol2_into_smaller_files.py top_poses.mol2 split 2000
files=$(ls *mol2); for file in $files; do echo $file; mkdir $file-dir; mv $file $file-dir; done
dirs=$(ls -d *mol2-dir); for dir in $dirs; do echo $dir; cd $dir || exit; qsub ~/exa/PROGRAM/LSD/STRAIN/submit-strain-onefile.tcsh; cd ../ || exit; done
# processing
find . -name "*_Torsion_Strain.csv" -exec cat {} >> cm-ifd-1-l2_strain.csv \;
Bemis-Murcko Scaffold Analysis
https://wiki.docking.org/index.php/Bemis-Murcko_Scaffold_Analysis
Alternatively, use DataWarrior to perform scaffold analysis.