Filtering and Hit Picking
=========================

Following the completion of a large-scale docking (LSD) campaign,
a critical step is to filter the resulting compound list to identify the most promising candidates.
This document outlines the successive filtering layers applied,
including novelty checks, interaction filtering (IFP), and clustering.
The process culminates in hit picking, where the most viable compounds are selected for experimental testing.


Typical Workflow
----------------

Extracting Top Molecules and Poses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First extract the best molecules from your screen. You need to provide the path to your LSD output and the number of molecules you want to retrieve

   .. code-block:: bash

      source /wynton/group/bks/soft/python_envs/python3.9.0.sh
      python /wynton/group/bks/work/bwhall61/needs_github/top_poses.py -n [num_mols] [LSD_output]

Interaction and Novelty Filtering 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The first step is to filter for interactions that we want (maybe an H-bond with a specific residue) and to make sure that our molecules are novel compared to known actives.

1. Copy the scripts, your receptor, and known actives to your directory:

   .. code-block:: bash

      cp -r /wynton/group/bks/work/bwhall61/ifp_dev ./filter_scripts
      cp [path/to/rec.crg.pdb] .
      cp [path/to/knowns.smi] .

   You must do the following for your rec.crg.pdb:
      - Change HIE/HID back to HIS

      .. code-block:: bash

         sed -i 's/HIE/HIS/g; s/HID/HIS/g' rec.crg.pdb

      - Revert names of tarted residues (you can use a similar sed command to above)
      - Change waters back to HETATM from ATOM

2. Define the interactions to look for:

   Modify line 205 of `filter_scripts/ifp_interactions.py` to include the interactions that you want to include. An example is:

   .. code-block:: bash

      filters = [['Hydrogen bond','GLY-333'],['Hydrogen bond','ALA-353'],['Hydrogen bond','TYR-368']]

3. Split the top_poses.mol2 and top_poses.scores into smaller files for submission

   .. code-block:: bash

      python filter_scripts/lc_blazing_fast_separate_mol2_into_smaller_files_called_filter-XXX.py top_poses.mol2  top_poses.scores [num_poses_per_file]
      ls filtered*.mol2 > dirlist

4. Submit IFP to wynton

   .. code-block:: bash

      csh filter_scripts/submit.csh $(pwd) $(pwd)/filter_scripts rec.crg $(pwd)/knowns.smi

5. Check if everything finished and ran correctly

   .. code-block:: bash

      ls -d --color=never [0-9]* > dirlist_combine
      python filter_scripts/check_finished_notfinished_ifp.py dirlist_combine

   If any jobs failed to run, they will be put into `NOT_FINISHED_dirlist` and can be re-run with:

   .. code-block:: bash

        csh filter_scripts/resubmit.csh $(pwd) $(pwd)/filter_scripts rec.crg $(pwd)/knowns.smi NOT_FINISHED_dirlist

6. Combine the results from the jobs

   .. code-block:: bash 

      python filter_scripts/combine_ifp.py dirlist_combine combined

7. Perform the filtering

   The lines of the combined.interactions.csv are organized like:

   .. code-block:: bash

      $1: ZINC ID
      $2: Smiles
      $3: DOCK score
      $4: # of H-bond donors
      $5: # of H-bond acceptors
      $6: # of unsatisfied H-bond donors
      $7: # of unsatisfied H-bond acceptors
      $8...: Additional interactions you defined
      $(NF-1): Name of the known active the ZINC molecule is most similar to
      $NF (last column): Similarity to the known active

   The typical protocol is to remove compounds with any unsatisfied H-bond donors, more than 5 unsatisfied H-bond acceptors, and that have a similarity >0.35 to a known molecule.

   To do this filtering (and additionally to filter molecules to keep interaction $8):

   .. code-block:: bash

      awk -F "," '$6==0 && $7<=5 && $NF<=0.35 && $8==1 {print NR}' combined.interactions.csv > pass_ifp_lines.txt
      (head -n 1 combined.fp; awk 'NR==FNR {lines[$1]; next} (FNR in lines)' pass_ifp_lines.txt combined.fp) > ifp_novelty_filtered.fp
      (head -n 1 combined.interactions.csv; awk 'NR==FNR {lines[$1]; next} (FNR in lines)' pass_ifp_lines.txt combined.interactions.csv) > ifp_novelty_filtered.interactions.csv


Best First Clustering
~~~~~~~~~~~~~~~~~~~~~

Once we have filtered for novel molecules that have our desired interactions, we do one last clustering step to ensure that the remaining molecules are diverse.
Here we use a TC cutoff of 0.35 (molecules with TC >0.35 will be put in the same cluster) but you can increase this to have less strict clusters.

   .. code-block:: bash

      python filter_scripts/bfc.py --fp_file ifp_novelty_filtered.fp --interactions_file ifp_novelty_filtered.interactions.csv --tc_cutoff 0.35

The output will be two files: `cluster_head.list` and `cluster_details.list`, containing the cluster heads and all of the molecules and their assigned clusters respectively.

If you don't have enough molecules, you can do a few things: start with more molecules at the beginning of filtering, loosen your interaction filtering, loosen the novelty filtering, or loosen the clustering TC cutoff. 

We can then collect the poses for the cluster heads to do hit picking.

   .. code-block:: bash

      python filter_scripts/collect_poses.py --mol2_file top_poses.mol2 --name_file cluster_head.list --outname poses_for_hitpicking.mol2 --name_column 1

This will give you the final `poses_for_hitpicking.mol2` which you can then load into Chimera to visualize.


Hit Picking in Chimera with ViewDock
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Finally, hit picking is performed using the ViewDock utility in Chimera.

A basic ViewDock tutorial and more info can be found `here <https://www.cgl.ucsf.edu/chimera/current/docs/UsersGuide/tutorials/vdtut.html>`_. 

This involves visual inspection of docking poses to prioritize compounds based on their predicted interactions.

In Chimera, go to ``Tools > Surface/Binding Analysis > ViewDock``.

You can then load your `poses_for_hitpicking.mol2` and flip through them, marking molecules as viable, deleted, and purged.

Some groups that you *may* want to avoid in initial hitpicking based on lab/Brian wisdom:
* Br and I
* NO2
* Azides and diazo-compounds
* Thiophenes?
* 6 member rings with >2 nitrogens

Ordering Compounds
~~~~~~~~~~~~~~~~~~

To order molecules, send an excel sheet with the following columns to Kateryna (k.opimakh@enamine.net) kindly asking for a quote:
* Customer Code (ZINC ID)
* Enamine Code (if available)
* SMILES
* Amount (10mg to begin)
* Stereochemistry (racemic ok for initial ordering)
* Purity (>90% for initial ordering)

She will send back a draft quote which includes the price for each molecule, at which point you will probably throw some out (we try and keep under ~$120-140 for initial hits).
It's also not a bad idea to double check the SMILES and everything. 

I highlight the rows of the excel I want to keep and send it back asking for a final quote. At this point you can also ask for specific formatting.
The most common ways are to ask for two separate vials: 8mg + 2mg - one for testing/collaborators and its nice to have extra for aggregation or if they run out and need a little more.
If the collaborators want, you can also ask for DMSO stocks. For example: Could we dissolve each in DMSO to get 300uL of 10mM stock solution and receive the rest as powder? 


Advanced Uses
-------------

Docking stats
~~~~~~~~~~~~~

.. code:: bash

   cp
   /wynton/group/bks/work/elisfink/5HT2AR/leads1/docking_stat_parallel.csh
   csh docking_stat_parallel.sh /FULL/PATH

for DOCK 3.7
``python ~/DOCK/ucsfdock/analysis/get_docking_statistics.py $PWD/output $PWD/output/dirlist aldeh-20.txt``

Get total time
``grep --color total_hours *docking_stat.out | awk '{print $3}' | awk '{s+=$1}END{print s}'``


Removing Strained Molecules
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Our methodology for calculating torsion strain for small molecules is documented in Gu S, Smith MSS, et al. Ligand Strain Energy in Large Library Docking. *J. Chem. Inf. Model.* 2021, 61, 4331−4341. https://doi.org/10.1021/acs.jcim.1c00368. There is also plenty of math written in the comments of the Python scripts.

Usually, we run ``DOCK3.8`` with strain cutoffs already in the ``INDOCK`` file, so we only ever dock conformations that are not strained. That procedure relies on the ``.db2`` files in ZINC22 having the strain values written for each set of confs.
If you want to filter more strictly after docking or calculate strain for a set of molecules not in ZINC, you can follow these steps.

1. What You Will Need
^^^^^^^^^^^^^^^^^^^^^

A ``.mol2`` file, which I have called ``poses.mol2``, although there are options in the Python scripts for ``.db2`` files

2. Location of the Scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^

All the scripts are on the gimel cluster. The files for actually calculating strain are in the directory

.. code-block:: bash

    /nfs/home/matt715/Strain

and the Bash scripts for running the calculation for many molecules are in

.. code-block:: bash

  /nfs/home/matt715/Filter

If you copy any of these scripts into your own directory, makes sure to update the lines pointing to the correct directories for ``Filter`` and ``Strain``.

3. Running on the Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^

Submit a job of calculating strain to the gimel cluster using SGE:

.. code-block:: bash

  sh {PATH}/submit_strain.sh poses.mol2

Once the jobs have finished running, combine the results:

.. code-block:: bash

  cd STRAIN
  find . -name "*_Torsion_Strain.csv" -exec cat {} >> torsion.csv \;

4. Processing the Results
^^^^^^^^^^^^^^^^^^^^^^^^^

The Python scripts create ``.csv`` files for each chunk of the molecules in the ``.mol2``, which then get combined into one ``.csv`` file. The ``.csv`` file outputs are described in detail in the header comments of ``Strain/Torsion_Strain.py``, but what matters for simple filtering is:

* One line per molecule in the ``.mol2`` file
* The first column is the molecule name
* The second column is the total strain, or the sum of the point estimates for the strain for each torsion pattern (dihedral angle)
* The sixth column is the max strain, or the largest single point estimate for strain over all the torsion patterns (dihedral angles)

We want to extract from the ``.csv`` file a list of all the unstrained molecules. Let's say we prospectively docked with cutoffs of total strain less than 8 TEU and max strain less than 3 TEU (Torsion Energy Units), but now we would like to filter the docked molecules for total strain less than 7 TEU and max strain less than 2.5 TEU. We do that with the following commands:

.. code-block:: bash

  awk -F"," '$2 > 0 && $2 < 7.0 && $6 < 2.5 {print $1 "," $2 "," $6}' torsion.csv > torsion_unstrained.csv
  awk -F"," '{print $1}' torsion_unstrained.csv > torsion_unstrained.names

Lastly, we use the ``lc_blazing_fast_collect_mol2.py`` script (I have a copy in my ``Filter`` directory) to identify the molecules in the ``.mol2`` file whose names are in ``torsion_unstrained.names`` and make a new ``.mol2`` with just those. Make sure you are on epyc, gimel2, or gimel5!

.. code-block:: bash

  cd .. #If still in the STRAIN subdirectory
  python {PATH}/lc_blazing_fast_collect_mol2.py STRAIN/torsion_unstrained.names poses.mol2 poses_unstrained.mol2

Alternate Strain (Andrii)
~~~~~~~~~~~~~~~~~~~~~~~~~


from the directory with .mol2

.. code:: bash

   qsub ~ak87/exa/PROGRAM/LSD/STRAIN/submit-strain-onefile.tcsh
   awk -F"," '$2>0 && $2<7.0 && $6<2.5 {print $1" "$2" "$6}' *_Torsion_Strain.csv > RdRp-Shuo-nitriles-20-strain-filtered.csv
   awk -F"," '$2>0 && $2<6.5 && $6<1.8' test2_Torsion_Strain.csv > filtered.csv


$2 - Total Strain < 8.0 kcal/mol, $6 - maximum Torsional Strain < 3.5 kcal/mol. 

Other recommended values:

* DISI 6.5 & 1.8 
* Elissa:  7 and 2.5 
* Comments in the script itself (TEU is an arbitrary energy unit):

::

   # we recommend using 8 TEU as a cutoff
   # for considering the entire structure to be strained. We recommend using
   # 2.5 TEU as a cutoff for when using the non-zero-cutoff-sum and then
   # considering any structure with a sum greater than 0 to be strained. That
   # means we would be considering any structure with at least one torsion pattern
   # with an estimated energy over 2.5 TEU to be strained overall.


Columns in the csv file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* 1 Molecule or pose ID 
* 2 Total strain 
* 3,4 seems like upper and lower bounds for 95% confidence interval of the total strain 
* 5 torsion ID in the library, largest strain 
* 6 largest strain of a torsion 
* next columns– torsion ID \| energy \| energy_lower_bound \| energy_upper_bound \| atom IDs in the torsion \| angle \| SMARTS \| type of torsion rule (specific or general) \| the type of energy calculation method (exact or approximate) \| a Boolean indicating whether or not the torsion pattern has an approximate method and a dihedral angle not within the second tolerance of any peak (whether it is “flagged”).


Strain for covalent docking*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For aldehydes $6 is always high, 3.75-3.70, so filtering gives almost nothing. The SMARTS pattern for the problematic angle is ``[*:1]~[CX3:2]!@[OX2:3]~[*:4]``, which is for a C-O bond, presumably one to the protein. Use $19 to filter as the value for the next torsion angle with the largest strain (should be $16, but atom indices have commas too).

If there are too many poses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   mkdir STRAIN
   cd STRAIN
   ln -s ../top_poses.mol2 .
   python ~ttummino/zzz.scripts/misc/lc_blazing_fast_separate_mol2_into_smaller_files.py top_poses.mol2 split 2000
   files=$(ls *mol2); for file in $files; do echo $file; mkdir $file-dir; mv $file $file-dir; done
   dirs=$(ls -d *mol2-dir); for dir in $dirs; do echo $dir; cd $dir || exit; qsub ~/exa/PROGRAM/LSD/STRAIN/submit-strain-onefile.tcsh; cd ../ || exit; done
   # processing
   find . -name "*_Torsion_Strain.csv" -exec cat {} >> cm-ifd-1-l2_strain.csv \;

.. _scripts-1:

Bemis-Murcko Scaffold Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
https://wiki.docking.org/index.php/Bemis-Murcko_Scaffold_Analysis


Alternatively use `DataWarrior <https://openmolecules.org/help/chemistry.html#ScaffoldAnalysis>`_ to perform scaffold analysis