Retrospective Docking
=====================

The retrospective docking process is crucial for validating and refining docking setups before embarking on large-scale docking campaigns. 

We start with a set of known active molecules that bind our protein. We generate a set of "property-matched decoys" that have similar properties to the actives but are predicted to be inactive. 

We then optimize our docking setup to score the actives better than the decoys.

The general optimizations that we do are: 

* Electrostatic and desolvation thin sphere scan

* Matching sphere perturbations

* Tarting (modifying atom partial charges)

Typical Workflow
----------------

Compiling a List of Actives
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Try and compile a list of 20-50 of different molecules. We usually shoot for Tanimoto Similarity <0.35. 

Some places to look for known binders:

* Chembl - check the pKi and only take those in the 7+ range

* https://www.guidetopharmacology.org

* PDB

* Literature

Generating Property-Matched Decoys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(AFAIK these instructions will only work on normal gimel - working on that. This all really needs to be condensed into a nice github)

Before running this, make sure you are in a ``sh`` shell and source the following:

    .. code-block:: bash

        source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.sh

This procedure generates decoys for your actives by finding SMILES in ZINC20 that are property-matched.

While not suggested, if you want to specifically look for already built molecules, please see `the wiki page <https://wiki.docking.org/index.php?title=Generating_decoys_(Reed%27s_way)>`_.

Steps to generate the property-matched decoys:

1. **Setting up directories**

    Before starting, you need a SMILES file with the format (SMILES, ID) and no header:

    .. code-block:: bash

        S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O active_1

    You also need an input file named ``decoy_generation.in`` with the following lines:

    .. code-block:: bash

        SMILES YES
        PROTONATE YES
        MWT 0 125
        LOGP 0 3.6
        RB 0 5
        HBA 0 4
        HBD 0 3
        CHARGE 0 2
        LIGAND TC RANGE 0.0 0.35
        MINIMUM DECOYS PER LIGAND 20
        DECOYS PER LIGAND 50
        MAXIMUM TC BETWEEN DECOYS 0.8
        TANIMOTO YES
        GENERATE DECOYS 750

    If your input ligand SMILES are already protonated, set ``PROTONATE NO``. 
    
    ``SMILES YES`` incidactes that you want to query ZINC20 for SMILES, not already built molecules from ZINC15.
    
    The ``decoy_generation.in`` file specifies the maximum property windows allowable for decoys, but the procedure attempts to find and assign decoys with the closest properties to the actives.

    The example here means that decoys will be found that are

    .. code-block:: bash

        +/- 125 Daltons
        +/- 3.6 logP
        +/- 5 rotatable bonds
        +/- 4 hydrogen bond acceptors
        +/- 3 hydrogen bond donors
        +/- 2 charge
        +/- 0.35 or less Tanimoto

    The code will attempt to find ``GENERATE DECOYS`` decoys for each active and then will perform an assignment procedure to attempt 
    to arrive at between ``MINIMUM DECOYS PER LIGAND`` and ``DECOYS PER LIGAND`` for each active.

    ``MAXIMUM TC BETWEEN DECOYS`` refers to the maximum Tanimoto Coefficient allowed between decoys (the higher, the more similar your decoys can be to each other).

    Once you have created the ``decoy_generation.in``, run the following command to create the decoy generation directory:

    .. code-block:: bash

        python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}

    Provide a directory name that you want in place of ``{NEW_DIR_NAME}``. 
    This will create the directory with subdirectories named ``"ligand_${number}"`` for each of the active  in the ``{SMILES_FILE}`` you input.

2. **Retrieving SMILES decoys from ZINC20**

    Run the following command to retrieve decoy SMILES for each active:

    .. code-block:: bash

        python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME}

    Note that this procedure may take some time (potentially a few hours).

    *Known Bug:* Occassionally, this script may hang and be running for many hours.
    I'm not sure why this happens, but I typically just move onto the next step. 
    Generally you will be able to assign decoys retrieved from other actives for actives where this procedure fails. 

3. **Assigning decoys to actives**

    Once we retrieve decoys, we assign them to each active.

    *Note:* There is no scheduler submission script for the following code yet, and it may take a little bit to run.
    I recommend running it in a screen/tmux session so that you don't disconnect while it is running.

    .. code-block:: bash

        cd {NEW_DIR_NAME}
        source /nfs/home/bwhall61/.python_envs/pulp/bin/activate
        python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/filter_decoys.py


    If this step has completed successfully, you should see files in ``NEW_DIR_NAME`` with the format ``{LIGAND_ID}_final_property_matched_decoys.txt``.
    These files have the format:
    
    .. code-block:: none

        SMILES ZINC_ID logP #Rotatable_Bonds #HBond_Donors #HBond_Acceptors Charge Protomer_ID TC_TO_LIG.

    There should also be files with the format ``{LIGAND_ID}_replacements.txt`` which include extra property-matched decoys that were assigned to that active.

    If you don't get enough decoys, the ``decoy_generation.in`` file can be modified to change ``MAXIMUM TC BETWEEN DECOYS``, ``MINIMUM DECOYS PER LIGAND``, etc.
    
    To avoid rerunning some of the time consuming calculations change ``TANIMOTO YES`` to ``TANIMOTO NO`` before rerunning this step.


4. **Building the decoys + actives**

    *Note*: This must run on epyc or a slurm node (working on it... - docker on gimel/SGE nodes is too old)

    If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the `docker` group on the slurm nodes or building will not work.

    First connect to epyc, go back to your decoy directory, and resource the enviornment:

    .. code-block:: bash

        ssh epyc
        cd {DIR_YOU_WERE_IN} 
        source /mnt/nfs/soft/dock/versions/dock385/env.sh

    Once decoys have been assigned, the decoys and actives need to be built for docking.

    First run:

    .. code-block:: bash
        
        python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR}

    This will create ``COPY_TO_DIR`` with the SMILES files:

    .. code-block:: bash
        
        ligands.smi - this includes the input actives which have property matched decoys
        decoys.smi  - this includes the canonicalized property-matched decoy SMILES
        decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES

    To build the decoys and actives run (usually we use a pH of 7.4):

    .. code-block:: bash

        bash /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/qsub_build_ligands.sh {COPY_TO_DIR}

    If you need more decoys, you can take more from the ``{LIGAND_ID}_replacements.txt`` files.

5. **Packaging decoys and actives for dockopt**

    Once your actives and decoys are done building, they need to packaged into .tgz files for docking. Run:

    .. code-block:: bash

        python /mnt/nfs/exk/work/bwhall61/decoy_gen_improvement/format_db2_for_dockopt.py {COPY_TO_DIR}
    
    The output will be two files inside ``{COPY_TO_DIR}`` : ``actives.tgz`` and ``decoys.tgz`` which will be used for retrospective docking.

6. **Visualizing decoy properties**

    *Note:* This needs a little bit of updating. Some of the plots treat discrete variables (HDB for ex) as continuous and make it difficult to compare the distributions.

    To visualize the distributions of molecular properties:

    .. code-block:: bash

        python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME}

    To visualize the Tanimoto similarity between the decoys and the input ligands run:

    .. code-block:: bash

        python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME}


DockOpt
~~~~~~~

DockOpt generates many different docking configurations which are evaluated in parallel.

The goal is to enhance the ability to score the actives better than the property-matched decoys.

By default, DockOpt is used to try different combinations of matching spheres, desolvation thin spheres, and electrostatic thin spheres.

For more info visit `the wiki page <https://wiki.docking.org/index.php?title=Dockopt_(pydock3_script)>`_


1. **Source environments:**:

    **Wynton** (Wynton environment needs to be updated with Brendan's pydock updates)

    .. code-block:: bash
        
        source /wynton/group/bks/soft/python_envs/env.sh

    **Gimel** (Any node other than ``gimel`` itself)

    .. code-block:: bash

        source /mnt/nfs/soft/dock/versions/dock385/env.sh

    *Note*: I've only ever tried this on gimel so if someone runs into issues on wynton please let me know.

2. **Create DockOpt Job**

    Ensure you are in the directory containing the required input files:

        - ``rec.pdb`` or ``rec.crg.pdb``
        - ``xtal-lig.pdb``
        - ``actives.tgz``
        - ``decoys.tgz``

    The ``actives.tgz`` and ``decoys.tgz`` are each tarballs which contain the DB2 files for your actives and decoys, respectively. 
    
    So it should look something like:

    .. code-block:: bash

        - actives.tgz
            - lig1.db2 
            - lig2.db2
            - etc

    Note that the ligands can be ``.db2`` or ``.db2.gz``.

    To create a new dockopt job run:

    .. code-block:: bash

        pydock3 dockopt - new

    The default job directory is named ``dockopt_job``. To change the name, use ``--job_dir_name``:

    .. code-block:: bash

        pydock3 dockopt - new --job_dir_path=my_job_name

    The job directory contains two sub-directories:
        1. ``working``: input files, intermediate blaster files, sub-directories for individual blastermaster subroutines
        2. ``retrodock_jobs``: individual retrodock jobs for each docking configuration

    If your current directory has any of the following files, they will be copied and used. Otherwise default versions / generated versions will be made:
        - ``rec.pdb``
        - ``rec.crg.pdb``
        - ``xtal-lig.pdb``
        - ``reduce_wwPDB_het_dict.txt``
        - ``filt.params``
        - ``radii``
        - ``amb.crg.oxt`` 
        - ``vdw.siz``
        - ``delphi.def``
        - ``vdw.parms.amb.mindock``
        - ``prot.table.ambcrg.ambH``

3. **Configure the DockOpt Job**

    The ``dockopt_config.yaml`` file specifies the search values that dockopt will use.
    We typically are interested in trying different combinations of matching spheres, electrostatic thin spheres, and desolvation thin spheres.
    Larger ligand desolvation thin sphere radius increases the desolvation penalty at a point in space.
    Larger electrostatic thin sphere radius increase the low dielectric boundary so the electrostatic interactions become stronger.
    The values to check are specified in the ``dockopt_config.yaml`` in lines like so:

    .. code-block:: bash

        thin_spheres_elec:
            use: true
            molecular_surface_density: 10.0
            distance_to_surface:
            - 1.0
            - 1.1
            - 1.2
            - 1.3
            - 1.4
            - 1.5
            - 1.6
            - 1.7
            - 1.8
            - 1.9
  
        thin_spheres_desolv:
            use: true
            molecular_surface_density: 1.0
            distance_to_surface:
            - 0.1
            - 0.2
            - 0.3
            - 0.4
            - 0.5
            - 0.6
            - 0.7
            - 0.8
            - 0.9
            - 1.0

        matching_spheres_perturbation:
            use: true
            num_samples_per_matching_spheres_file: 20
            max_deviation_angstroms: 0.5
            perturb_xtal_spheres: false


    This configuration tries the 10 different listed distances for the electrostatic thin spheres and the desolvation thin spheres and also tries 20 sets of matching spheres.
    In total, this configuration will produce and test 2000 different docking setups.

    **TARTING**

        Tarting is yet to be implemented in dockopt but we are working on making it easy


4. **Set Environment Variables**

    **Wynton** (dev node)

    .. code-block:: bash

        export TMPDIR=/scratch
        export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
        export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
        export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh

    **Gimel** (Any node other than ``gimel`` itself)

    .. code-block:: bash

        export TMPDIR=/scratch
        export SBATCH_EXEC=/usr/bin/sbatch
        export SQUEUE_EXEC=/usr/bin/squeue

5. **Running DockOpt**

    Once you have configured your ``dockopt_config.yaml`` you can run your dockopt job with:

    .. code-block:: bash

        cd <JOB_DIR_NAME>
        pydock3 dockopt - run <JOB_SCHEDULER_NAME> --retrodock_job_max_reattempts=5

    where ``<JOB_SCHEDULER_NAME>`` is ``sge`` for wynton and ``slurm`` for gimel.

    Once the dockopt job is complete, the following files will be generated in the job directory:

        - *report.html*: contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters.
        - *results.csv*: parameter values, criterion values, and other information about each docking configuration.
  
    Within each sub-directory of *best_retrodock_jobs* there are:
        - *dockfiles/*: parameter files and *INDOCK* for given docking configuration
        - *output/* contains *actives* and *decoys* subdirectories which contain the OUTDOCK files for the actives and decoys
        - plot images

Assuming you are happy with the results, you *could* simply proceed to using the dockfiles for a large-scale docking campaign.

However, we suggest one more retrospective test which is to do an extrema set docking to check for charge preference.

Extrema Set
~~~~~~~~~~~

The idea of an extrema set is that you provide a list of active molecules and in return get a list of molecules with -2, -1, 0, +1, +2 charge that
are in the 25th-75th percentile of MW and logP of your actives. The idea is not to find property matched molecules, but instead to look at the
charge preference of your docking setup.   

1. **Source Environments**

    .. code-block:: bash

        source /mnt/nfs/soft/dock/versions/dock385/env.sh

2. **Get Extrema Molecules**

    .. code-block:: bash

        python /mnt/nfs/exk/work/bwhall61/extrema_improvement/smiles_0001_map_tranche_collect_db2_gz.py {SMILES_FILE} {NUM_MOLS}

    This will get {NUM_MOLS} molecules from each tranche for each charge. 


3. **Build the extrema molecules**

    *Note*: This must run on epyc or a slurm node (working on it... - docker on gimel/SGE nodes is too old)

    If this is your first time building molecules with the new pipeline, you must first ask the Irwin group to add you to the `docker` group on the slurm nodes or building will not work.

    .. code-block:: bash

        bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_build_ligands.sh

4. **DOCK the extrema molecules**

    You just need to provide the path to the dockfiles that you want to use

    .. code-block:: bash

        export DOCKFILES=[/absolute/path/to/dockfiles]
        bash /mnt/nfs/exk/work/bwhall61/extrema_improvement/qsub_dock_ligands.sh

    The output will be a folder named after your dockopt setup (ex. rank=1-step=1_step-conf=419) and will contain docking directories for each charge. This makes it easy to re-run this step with
    other dockopt setups and have them named differently and easily see their rank.

5. **Plot charge preferences**

    .. code-block:: bash

        python /mnt/nfs/exk/work/bwhall61/extrema_improvement/plot_charge_preference.py --input_folder [docking_folder]

    This will put `charge_preference.png` in the docking_folder that will show the score distributions for each of the charges.

Advanced Uses
-------------

Tarting
~~~~~~~
https://wiki.docking.org/index.php?title=DOCK_3.7_tart

Tarting refers to changing the partial charges of specific atoms in the protein receptor to modify (enhance/decrease) ligand preferences for specific parts of the binding site. Generally one would redefine the partial charges distribution of a specific amino acid, and then use this modified residue when calculating the  electrostatic potential for the receptor. 

*  charged residues are not tarted usually, especially Asp.
*  backbone is better for tarting instead of hanging groups
*  usually 0.2 is enough.  

.. Manual modification of Matching Spheres
.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. (This is not for DOCK 3.8 - this won't make sense for right now)

.. First convert generated spheres to .pdb format (directory “working”):
.. ``csh $DOCKBASE/proteins/showsphere/doshowsph.csh matching_spheres.sph 1 spheres.pdb``

.. Open spheres in Chimera and move as needed (Tools – Movement – Movement
.. Mouse Mode, with middle button). Save as pdb. Make a new directories
.. (/spheres/ and /spheres/working/) and copy rec.pdb and xtal-lig.pdb to
.. /spheres/. Copy spheres-new.pdb to /working/. Since
.. useExistingMatchingSphflag option seems to turn off spheres module at
.. all, you will need to copy additional files.

.. .. code:: bash

..    cd working
..    cp ../../working/box  .
..    cp ../../working/lowdielectric.sph .
..    $DOCKBASE/proteins/pdbtosph/bin/pdbtosph spheres-new.pdb matching_spheres.sph
..    cd ..
..    $DOCKBASE/proteins/blastermaster/blastermaster.py --addhOptions=" -HIS -FLIPs " -v --useExistingMatchingSphflag

.. matching_spheres_v1.sph change to the new file name in INDOCK

.. ``receptor_sphere_file ../dockfiles/matching_spheres.sph``