Running ChemSTEP (Auto DOCK and Build)
=======================================

ChemSTEP is currently set up to run on Wynton with the 13B and 22B libraries. The instructions below cover running ChemSTEP with automatic submission of building and docking jobs.

1. **Source Environment**

   .. code-block:: text

       source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate

2. **Dock the Seed Set**

   Copy the .sdi file for the library you want to use:

   - 13B: ``/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/13M_seeds.sdi``
   - 22B: ``/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22M_seeds.sdi``

   Then, DOCK the seed set. See Large-Scale Docking (LSD) directions.

3. **Gather Scores for the Seed Set**

   Once docking is complete, run the following from the directory one level above your docking output (MOLECULES_DIR_TO_BIND).

   For the 22B library::

       python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores.py 0

   For the 13B library::

       python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores_13B.py 0 MOL

   *Note:* You must specify the molecule ID prefix for the 13B library (``MOL``).

   Verify that ``scores_round_0.txt`` was correctly written::

       wc -l scores_round_0.txt

4. **Convert Scores to .npy Files**

   Convert scores to ChemSTEP-readable ``.npy`` files::

       python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/convert_scores_to_npy.py 0 <mol_id_prefix>

   The ``mol_id_prefix`` must match the library: ``CSLB`` (the default) for 22B/72B, or ``MOL`` for 13B.

5. **Set Up the ChemSTEP Run Directory**

   Make a directory to run ChemSTEP in, cd into it, and copy in necessary files by running:

   .. code-block:: text

       mkdir chemstep_run
       cd chemstep_run
       chemstep-run-new

   You should now have the following files in your ChemSTEP run directory: ``params.txt``, ``run_chemstep.py``, and ``launch_chemstep_as_job.sh``.

   If running with integrated IFP for beacon selection, also run:

   .. code-block:: text

       chemstep-run-ifp

   This will copy in the additional files needed to run IFP, including ``ifp_acceptance_criteria.txt`` and ``interactions.txt``.
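
Before editing anything, it can help to confirm that the files listed above actually landed in the run directory. The following is a minimal sketch, not part of ChemSTEP: ``missing_run_files`` is a hypothetical helper, and the file names are taken from the step above.

```python
from pathlib import Path

# File names as listed in the step above; the helper itself is illustrative only.
BASE_FILES = ("params.txt", "run_chemstep.py", "launch_chemstep_as_job.sh")
IFP_FILES = ("ifp_acceptance_criteria.txt", "interactions.txt")


def missing_run_files(run_dir, with_ifp=False):
    """Return the expected file names that are absent from run_dir."""
    expected = BASE_FILES + (IFP_FILES if with_ifp else ())
    root = Path(run_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Check the current directory; pass with_ifp=True after chemstep-run-ifp.
    missing = missing_run_files(".", with_ifp=False)
    print("missing files:", ", ".join(missing) if missing else "none")
```

An empty list means every expected file is present.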

6. **Edit params.txt**

   Add the absolute paths to the ChemSTEP-readable score and indices numpy arrays generated in Step 4. The remaining values are left to the user's discretion, with considerations below.

   .. code-block:: text

       seed_indices_file:  /path/to/your/indices_round_0.npy
       seed_scores_file:   /path/to/your/scores_round_0.npy
       hit_pprop:          5.5
       n_docked_per_round: 2000000
       bundle_size:        1000
       max_beacons:        100
       max_n_rounds:       250

   **hit_pprop**: Defines what is considered a "virtual hit." pProp is the negative base-10 logarithm of a molecule's fractional rank within the total library score distribution. For example, a pProp of 4 in the 13B space corresponds to the top 0.01% of the library (~1.3M molecules), and a pProp of 5 to the top 0.001% (~132K molecules). ChemSTEP estimates a DOCK score threshold from the seed set and flags anything scoring better as a virtual hit. Be mindful of seed set size: we suggest a seed set of at least 10^(pProp+2) molecules.

   **n_docked_per_round**: Number of molecules prioritized per round. Note that these molecules must all be built and docked between rounds — too many will slow throughput and may reduce diversity; too few may slow virtual hit recovery. Round size does not significantly impact algorithm runtime.

   **max_beacons**: Number of diverse, well-scoring molecules used to guide prioritization per round. All molecules scoring above the pProp threshold are candidates. ChemSTEP selects beacons to maximize diversity by default. Too many beacons reduce inter-beacon diversity; too few can hinder space exploration. Fewer beacons than specified may be assigned if not enough molecules clear the pProp threshold.

   **bundle_size**: In auto docking mode, the number of molecules submitted to build as a single job.

   **max_n_rounds**: No need to adjust this when running ChemSTEP prospectively as outlined here.
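
The arithmetic behind ``hit_pprop``, ``n_docked_per_round``, and ``bundle_size`` can be sketched in a few lines of Python. This is an illustrative sketch only, not part of ChemSTEP: the function names are hypothetical, the library size is a nominal 13 billion, and the job count assumes one build job per full or partial bundle, as the ``bundle_size`` description suggests.

```python
import math


def pprop_to_fraction(pprop):
    """Fraction of the library at or above a pProp cutoff: 10**(-pProp)."""
    return 10.0 ** (-pprop)


def expected_hits(library_size, pprop):
    """Approximate number of virtual hits in a library of the given size."""
    return library_size * pprop_to_fraction(pprop)


def min_seed_set_size(pprop):
    """Suggested minimum seed set size, 10**(pProp + 2), rounded up."""
    return math.ceil(10.0 ** (pprop + 2))


def build_jobs_per_round(n_docked_per_round, bundle_size):
    """Number of bundles per round (assumed: one build job per bundle)."""
    return (n_docked_per_round + bundle_size - 1) // bundle_size


# Nominal 13B library size (assumption for illustration).
LIBRARY_SIZE = 13_000_000_000

for pprop in (4, 5):
    print(
        f"pProp {pprop}: top {pprop_to_fraction(pprop):.3%} of the library, "
        f"about {expected_hits(LIBRARY_SIZE, pprop):,.0f} molecules, "
        f"suggested seed set >= {min_seed_set_size(pprop):,}"
    )

print("build jobs per round:", build_jobs_per_round(2_000_000, 1000))
```

With the example ``params.txt`` values (``n_docked_per_round`` of 2,000,000 and ``bundle_size`` of 1000), this works out to 2000 bundles per round under the stated assumption.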

7. **Edit run_chemstep.py** with your text editor of choice.

   Update ``lib_path`` to the library pickle for your library:

   - 13B: ``/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/boltz_fplib.pickle``
   - 22B: ``/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle``

   .. code-block:: text

       lib_path = '/full/path/to/library.pickle'

   Input the path to your dockfiles:

   .. code-block:: text

       dockfiles_path="/full/path/to/dockfiles"

   *Note:* All paths must be absolute.

   **Optional parameters:**

   a) **minTD exclusion zone**: molecules within a given Tanimoto distance of a beacon will not be prioritized. Uncomment the relevant lines and update the value. Consider also setting ``enforce_n_docked_per_round=True`` when using this option:

      .. code-block:: text

          min_td_search=0.5,
          enforce_n_docked_per_round=True,

   b) **Integrated IFP**: only selects beacons that satisfy user-defined interaction criteria. Uncomment the following lines and update the paths to the necessary files (copied in Step 5 if you ran ``chemstep-run-ifp``).

      .. code-block:: text

          use_IFP=True,
          ifp_pdb_path='/full/path/to/rec.crg.pdb',
          interactions_file='/full/path/to/interactions.txt',
          ifp_acceptance_criteria_file='/full/path/to/ifp_acceptance_criteria.txt',


      The two IFP input files are configured as follows:

      - ``interactions.txt``: one interaction per line, columns separated by commas. Format: ``interaction_type, residue_name_and_number``. Example:

      .. code-block:: text

          Hydrogen bond, GLY19
          Ionic, ASP149

      Supported interaction types include Proximal, Hydrogen bond, Ionic, Cation-pi, Hydrophobic, and Halogen bond. See the LUNA and IFP documentation for the full list.


      - ``ifp_acceptance_criteria.txt``: defines the criteria (counts of unsatisfied donors, unsatisfied acceptors, and specific interactions) that a molecule must meet to pass IFP and be considered for beacon selection. Example:

      .. code-block:: text

          #_donors
          #_acceptors
          #_unstatisfied_donors == 0
          #_unstatisfied_acceptors <= 4
          Ionic/ASP-149 > 0


   Below is an example instantiation for AmpC on the 22B library with a minTD exclusion zone of 0.50 and no IFP:

   .. code-block:: text

       lib_path = '/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle'
       lib = load_library_from_pickle(lib_path)
       algo = CSAlgo(lib, 'params.txt', 'output', 16, verbose=True,
           scheduler='sge', smi_id_prefix='CSLB',
           python_exec="/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/python",
           dockfiles_path="/wynton/group/bks/work/kholland/chemstep_ampc_22B/seed_docking/dockfiles",
           min_td_search=0.5,
           enforce_n_docked_per_round=True,
           #use_IFP=True,
           #ifp_pdb_path='/path/to/your/reference/rec.crg.pdb',
           #interactions_file='/path/to/your/interactions.txt',
           #ifp_acceptance_criteria_file='/path/to/your/ifp_acceptance_criteria.txt',
           docking_method="auto", track_beacon_orig=True)
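
The ``interactions.txt`` format described under option (b) is simple enough to check before launching a run. The following is a minimal sketch, not ChemSTEP's own parser: ``parse_interactions`` is a hypothetical helper, and it assumes each residue is written as a residue name followed directly by its number (as in ``GLY19``), with no chain identifier.

```python
import re


def parse_interactions(text):
    """Parse 'interaction_type, RESIDUE' lines into (type, name, number) tuples."""
    parsed = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        interaction, sep, residue = line.partition(",")
        if not sep:
            raise ValueError(f"expected 'type, residue': {line!r}")
        # Assumed residue token: letters followed by digits, e.g. ASP149.
        match = re.fullmatch(r"([A-Za-z]+)(\d+)", residue.strip())
        if match is None:
            raise ValueError(f"unrecognized residue: {residue.strip()!r}")
        parsed.append((interaction.strip(), match.group(1), int(match.group(2))))
    return parsed


example = """\
Hydrogen bond, GLY19
Ionic, ASP149
"""

for entry in parse_interactions(example):
    print(entry)
```

A malformed line raises ``ValueError``, so a typo surfaces before the job is submitted.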

8. **Launch the Job**

   Submit the main ChemSTEP job::

       qsub launch_chemstep_as_job.sh

9. **Monitor Job Status**

   Check job status with ``qstat``. The main job will run for up to 2 weeks given no errors. ChemSTEP will launch search, building, and docking jobs in successive rounds.

   *Note:* If any building or docking subjobs hang, the main job will not proceed until they are canceled or finished. Check job statuses regularly, and periodically confirm that the docking output files (``scores_round_*.txt``) are being populated.
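
One way to confirm that the ``scores_round_*.txt`` files are being populated is to list their line counts. The following is a minimal sketch: ``score_file_line_counts`` is a hypothetical helper, and it assumes the score files sit directly in the directory you point it at, so adjust ``run_dir`` to wherever your run writes them.

```python
import glob
import os


def score_file_line_counts(run_dir=".", pattern="scores_round_*.txt"):
    """Map each matching score file in run_dir to its number of lines."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(run_dir, pattern))):
        with open(path) as handle:
            counts[path] = sum(1 for _ in handle)
    return counts


if __name__ == "__main__":
    counts = score_file_line_counts(".")
    if not counts:
        print("no scores_round_*.txt files found")
    for path, n_lines in counts.items():
        flag = "" if n_lines else "  <-- empty"
        print(f"{n_lines:>12}  {path}{flag}")
```

Running it at intervals and comparing the counts shows whether a round is still making progress.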

10. **View Beacon SMILES and IDs**

    From the ChemSTEP run directory, run the following in a screen session on a dev node::

        python /wynton/group/bks/work/shared/kholland/scripts/get_beacon_smiles.py /path/to/library/pickle chemstep_algo.log

    Use the library pickle path from Step 7.

11. **Get Poses After Docking**

    Make a list of ``test.mol2.gz.0`` files from docking::

        find /round_*_docking/bundle_paths -maxdepth 2 -name "test.mol2.gz.0" > docked_poses.txt

    Then extract top poses::

        python /wynton/group/bks/work/bwhall61/for_beau/top_poses.py \
            -t <pProp_threshold> \
            -s <num_poses_per_file> \
            -dock_results_path docked_poses.txt