Introduction
What follows are notes on getting VirtualFlow running on a single machine, in order to understand the mechanics and scaling behavior of in silico drug discovery. VirtualFlow is an open-source drug candidate screening platform designed to screen millions of compounds against a protein or receptor target.
Background
Bringing a new drug to market is an expensive endeavor. Cost estimates vary widely, but one study found the median cost of bringing a drug to market was $1.1 billion (in 2018 dollars). Many components contribute to this cost, so it’s important to find ways to reduce costs along the development pipeline. One area that’s received a lot of attention recently is the development of in silico drug discovery systems that reduce the cost of “hit discovery”: finding the small-molecule compounds (i.e. ligands) that show high affinity for a target (i.e. a receptor or protein).
At the core of in silico drug discovery are molecular docking simulations (e.g. AutoDock Vina) that predict the binding affinity between a small-molecule compound and the target. The docking is usually scored by the predicted binding energy between the compound and the target, where a lower (more negative) energy indicates a stronger predicted interaction.
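To make the scoring concrete, here is a minimal sketch of a single docking run with AutoDock Vina; the receptor/ligand filenames and the search-box coordinates are placeholders, not part of the VirtualFlow tutorial.

```bash
# Dock one ligand against one receptor with AutoDock Vina.
# receptor.pdbqt, ligand.pdbqt, and the box values are placeholders.
vina --receptor receptor.pdbqt --ligand ligand.pdbqt \
     --center_x 10 --center_y 12 --center_z -5 \
     --size_x 20 --size_y 20 --size_z 20 \
     --out ligand_docked.pdbqt

# The predicted binding energies (kcal/mol) are printed to the terminal and
# recorded in the output file; more negative means stronger predicted binding.
grep "REMARK VINA RESULT" ligand_docked.pdbqt
```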
VirtualFlow
A very high-level overview of a drug screening system such as VirtualFlow is as follows:
```mermaid
graph TD
A(Compound Database) --> B(Generate Docking Conformations)
B --> C(Dock conformation with Receptor)
C --> D(Measure Binding Energy)
D --> E(Store results)
```
While VirtualFlow has been designed to scale across multiple machines, these notes will step through running VFVS (VirtualFlow for Virtual Screening) on one Ubuntu 22.04 machine.
SLURM Batch System Setup
In order to use VirtualFlow, a batch system needs to be set up first. Here’s how I set up the Simple Linux Utility for Resource Management (SLURM):
- Install both slurmd and slurmctld, since both the controller and the node daemon will be running on the same machine:

  ```
  sudo apt update -y
  sudo apt install slurmd slurmctld -y
  ```
- Add the slurm config file:

  ```
  sudo touch /etc/slurm/slurm.conf
  sudo chmod 755 /etc/slurm/slurm.conf
  ```
- Add the following to the slurm.conf file, customizing it as necessary for the machine hardware (see the sketch after this list for a way to read these values from the machine):

  ```
  # slurm.conf file for Ubuntu with debug logging in /var/log/slurm

  # Control machine configuration
  SlurmctldHost=localhost
  SlurmctldPort=6817
  SlurmdPort=6818

  # Node configuration
  # A single-CPU computer with 16 cores and 32 GB RAM with hyperthreading on
  NodeName=localhost Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=32000 State=UNKNOWN

  # Partition configuration
  PartitionName=test Nodes=localhost Default=YES MaxTime=INFINITE State=UP

  # Accounting
  AccountingStorageType=accounting_storage/none
  JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/none

  # Daemons
  SlurmUser=slurm
  SlurmdSpoolDir=/var/lib/slurm/slurmd
  StateSaveLocation=/var/lib/slurm/slurmctld

  # User configuration
  ClusterName=localcluster

  # Timers
  InactiveLimit=0
  KillWait=30
  MinJobAge=300
  SlurmctldTimeout=60
  SlurmdTimeout=150
  Waittime=0

  # SCHEDULING
  SchedulerType=sched/backfill
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core

  # Logging configuration
  SlurmctldDebug=info
  SlurmctldLogFile=/var/log/slurm/slurmctld.log
  SlurmdDebug=info
  SlurmdLogFile=/var/log/slurm/slurmd.log
  SlurmSchedLogFile=/var/log/slurm/slurmsched.log

  # Job completion logging (used later for spotting failed jobs)
  JobCompType=jobcomp/filetxt
  JobCompLoc=/var/log/slurm/jobacct.log
  ```
- Start slurmd and slurmctld:

  ```
  sudo systemctl start slurmd
  sudo systemctl start slurmctld
  ```
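As referenced above, the node values for the `NodeName` line in slurm.conf (sockets, cores, threads, memory) can be read from the machine itself; a small sketch:

```bash
# Print this machine's hardware in slurm.conf format; copy the CPUs, Sockets,
# CoresPerSocket, ThreadsPerCore, and RealMemory values into the NodeName line.
sudo slurmd -C
```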
Useful SLURM commands for VirtualFlow:

- `sinfo` - check the state of the queue (partition)
- `squeue` - check for jobs in the queue
- `scancel -u <user>` - cancel all the jobs belonging to a particular user. Jobs can also be stopped in other ways in VirtualFlow.
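Before moving on, it can be worth confirming that the local SLURM setup actually accepts and runs jobs. Here is a minimal sketch, assuming the `test` partition from the config above; the script name and contents are just a throwaway test job.

```bash
# Create a minimal batch script that records the hostname it ran on.
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=slurm-test
#SBATCH --partition=test
#SBATCH --ntasks=1
#SBATCH --output=slurm-test-%j.out
hostname
EOF

sbatch test_job.sh      # submit the job
squeue                  # the job should appear briefly, then complete
cat slurm-test-*.out    # output should contain this machine's hostname
```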
VirtualFlow Setup
For these notes, I’ll be using the setup described in the VirtualFlow tutorial.
- Get the VFVS_GK tutorial files:

  ```
  cd ~/dev/
  wget https://virtual-flow.org/sites/virtual-flow.org/files/tutorials/VFVS_GK.tar
  tar -xvf VFVS_GK.tar
  ```
- Select the compounds according to these instructions (optional, if testing beyond the tutorial files). Run `source tranches.sh` from the `VFVS_GK/input-files/ligand-library` folder to download the compound files.
- Edit `tools/templates/all.ctrl` and set these values according to the number of cores on the machine (e.g. 16); a quick way to check the core count is shown after this list:

  ```
  cpus_per_step=16
  queues_per_step=16
  cpus_per_queue=16
  ...
  ```
- Install Open Babel:

  ```
  sudo apt-get install openbabel
  ```
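As mentioned in the `all.ctrl` step above, a quick sanity check before editing the control file (paths assume the tutorial layout from the earlier steps):

```bash
# Number of logical cores visible to the OS (with hyperthreading on a 16-core
# CPU this reports 32; the tutorial values above use 16 per step/queue).
nproc

# Confirm the values currently set in all.ctrl.
grep -E '^(cpus_per_step|queues_per_step|cpus_per_queue)=' ~/dev/VFVS_GK/tools/templates/all.ctrl
```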
VirtualFlow Run
After the setup is complete, follow these steps for each run:
- Prepare the output folders:

  ```
  cd ~/dev/VFVS_GK/tools
  ./vf_prepare_folders.sh
  ```
- Start VirtualFlow:

  The example below starts one jobline with one queue. Note that the number of cores used by each queue is set by the `cpus_per_queue` value in `all.ctrl` (a sketch for starting multiple joblines follows this list).

  ```
  ./vf_start_jobline.sh 1 1 templates/template1.slurm.sh submit 1
  ```
- Check VirtualFlow status:

  ```
  ./vf_report.sh -c workflow
  ```

  Example output:

  ```
  Total number of ligands: 1123
  Number of ligands started: 21
  Number of ligands successfully completed: 21
  Number of ligands failed: 0
  ...
  Docking runs per ligand: 2
  Number of dockings started: 42
  Number of dockings successfully completed: 42
  Number of dockings failed: 0
  ```
- Check a particular docking method (with the top 10 compounds by binding energy):

  ```
  ./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10
  ```

  Example output:

  ```
  Binding affinity - statistics
  ............................................................
  Number of ligands screened with binding affinity between 0 and inf kcal/mole: 26
  Number of ligands screened with binding affinity between -0.1 and -5.0 kcal/mole: 119
  ...

  Binding affinity - highest scoring compounds
  ............................................................
  Rank   Ligand       Collection     Highest-Score
  1      ABC-1234_1   XXXXXX_00000   -7.6
  2      ABC-1234_2   XXXXXX_00000   -7.6
  3      XYZ-4321_4   XXXXXX_00000   -7.4
  ...
  ```
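For reference, the sketch mentioned in the start step above. As I understand the `vf_start_jobline.sh` arguments (an assumption on my part, not something stated in these notes), they are the first jobline, the last jobline, the job template, the mode, and a delay in seconds between submissions, so several joblines can be started in one call:

```bash
# Assumption: ./vf_start_jobline.sh <first jobline> <last jobline> <job template> <mode> <delay seconds>
# Start joblines 1 through 4, submitting them one second apart.
cd ~/dev/VFVS_GK/tools
./vf_start_jobline.sh 1 4 templates/template1.slurm.sh submit 1
```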
Monitoring and Debugging
In addition to running `vf_report.sh`, it can be useful to also monitor the SLURM logs:

```
sudo tail -f /var/log/slurm/*.log
```
Typically, failed jobs will appear in `/var/log/slurm/jobacct.log` with a FAILED JobState. For example:

```
JobId=3391 ... Name=t-1.1 JobState=FAILED
```
To view error messages, try looking in the logs under:

- `workflow/output-files/queues/`
- `workflow/output-files/jobs`
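A couple of greps I found handy for narrowing things down; the second one assumes the `workflow` folder sits in the VFVS_GK root, so adjust the path to your layout:

```bash
# List failed jobs recorded by SLURM's job completion log.
grep "JobState=FAILED" /var/log/slurm/jobacct.log

# Find which VirtualFlow queue/job logs mention an error
# (assumption: workflow/ lives under the VFVS_GK tutorial root).
cd ~/dev/VFVS_GK
grep -ril "error" workflow/output-files/queues/ workflow/output-files/jobs/
```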
Related Note: These logs were useful for debugging a particular issue with leading zeroes not being removed when doing a date calculation. In `one-queue.sh`, I had to change the start/end date calculations (e.g. `docking_start_time_s`) from `$(($(date +'%s * 1000 + %-N / 1000000')))` to `$(($(date +'%s')))`.
VirtualFlow Run Completion
Once the VirtualFlow run is complete, rank all the ligands and extract the docking poses:
- Rank the ligands

  Add VFTools to your path:

  ```
  export PATH=$PATH:/home/<user>/dev/VFTools/bin
  ```

  Then run:

  ```
  cd ~/dev/VFVS_GK
  mkdir -p pp/ranking
  cd pp/ranking
  vfvs_pp_ranking_all.sh ../../output-files/complete/ 2 meta_tranche
  ```
- Get the top 100 docking poses

  ```
  cd ~/dev/VFVS_GK/pp/ranking/qvina02_rigid_receptor1
  head -100 firstposes.all.minindex.sorted.clean > compounds
  ```
- Extract the docking poses

  ```
  cd ~/dev/VFVS_GK
  mkdir docking_poses
  cd docking_poses
  vfvs_pp_prepare_dockingposes.sh ../output-files/complete/qvina02_rigid_receptor1/results/ meta_tranche ../pp/ranking/qvina02_rigid_receptor1/compounds dockingsposes overwrite
  ```
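To eyeball one of the extracted poses in a viewer that doesn’t read PDBQT, Open Babel (installed earlier) can convert it. The folder and filename below are placeholders; use a file produced by the extraction step above.

```bash
# Convert one extracted pose from PDBQT to PDB for viewing (e.g. in PyMOL).
# ABC-1234_1.pdbqt is a placeholder filename.
cd ~/dev/VFVS_GK/docking_poses
obabel ABC-1234_1.pdbqt -O ABC-1234_1.pdb
```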
Follow-ups
- Why do `./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10` and `head -10 qvina02_rigid_receptor1/firstposes.all.minindex.sorted.clean > compounds` not produce the same top 10 compounds?
- The compound screening process is much slower than expected (~60 min for 1,000 compounds on a Ryzen 5900X CPU). Possible things to try:
  - A) Program docking to run on a GPU
  - B) Experiment with other batching systems
  - C) Spread jobs across a cluster (e.g. AWS ParallelCluster)
  - D) Try and benchmark other docking programs (e.g. QuickVina)
- In order to reduce computation cost, can serverless compute (e.g. AWS Lambda) be used with the docking executables/binaries?
- Can we replace SLURM with other batching or queueing systems (e.g. Kafka, RabbitMQ)?
- I’m not a fan of bash as a scripting language due to the lack of modularity and the difficulty of debugging/logging. Can this be re-implemented in Ruby or Python instead?
Useful References: