Introduction
What follows are notes on getting VirtualFlow running on a single machine, in order to understand the mechanics and scaling behavior of in silico drug discovery. VirtualFlow is an open-source drug candidate screening platform designed to screen millions of compounds against a protein or receptor target.
Background
Bringing a new drug to market is an expensive endeavor. Cost estimates vary widely, but one study found the median cost of bringing a drug to market was $1.1 billion (in 2018 dollars). Many components contribute to this cost, so it’s important to find ways to reduce costs along the development pipeline. One area that’s received a lot of attention recently is the development of in silico drug discovery systems that reduce the cost of “hit discovery”: finding the small-molecule compounds (i.e. ligands) that show high affinity for a target (i.e. a receptor or protein).
At the core of in silico drug discovery are molecular docking simulations (e.g. AutoDock Vina) that predict the binding affinity between a small-molecule compound and the target. The docking is usually scored by the predicted binding energy between the compound and the target, where a lower (more negative) energy indicates a stronger predicted interaction.
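To make the scoring concrete, here is a minimal sketch of a single docking run with AutoDock Vina; the receptor/ligand filenames and the search-box coordinates are placeholders, not part of the VirtualFlow tutorial.

```bash
# Dock one ligand against one receptor with AutoDock Vina.
# receptor.pdbqt, ligand.pdbqt, and the box values are placeholders.
vina --receptor receptor.pdbqt --ligand ligand.pdbqt \
     --center_x 10 --center_y 12 --center_z -5 \
     --size_x 20 --size_y 20 --size_z 20 \
     --out ligand_docked.pdbqt

# The predicted binding energies (kcal/mol) are printed to the terminal and
# recorded in the output file; more negative means stronger predicted binding.
grep "REMARK VINA RESULT" ligand_docked.pdbqt
```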
VirtualFlow
A very high-level overview of a drug screening system such as VirtualFlow is as follows:
```mermaid
graph TD
A(Compound Database) --> B(Generate Docking Conformations)
B --> C(Dock conformation with Receptor)
C --> D(Measure Binding Energy)
D --> E(Store results)
```
While VirtualFlow has been designed to scale across multiple machines, these notes will step through running VFVS (VirtualFlow for Virtual Screening) on one Ubuntu 22.04 machine.
SLURM Batch System Setup
In order to use VirtualFlow, a batch system needs to be set up first. Here’s how I set up the Simple Linux Utility for Resource Management (SLURM):
- Install both slurmd and slurmctld, since both the controller and the node daemon will be running on the same machine:

  ```
  sudo apt update -y
  sudo apt install slurmd slurmctld -y
  ```
- Add the slurm config file:

  ```
  sudo touch /etc/slurm/slurm.conf
  sudo chmod 755 /etc/slurm/slurm.conf
  ```
- Add the following to the slurm.conf file, customizing it as necessary for the machine hardware (see the sketch after this list for a way to read these values from the machine):

  ```
  # slurm.conf file for Ubuntu with debug logging in /var/log/slurm

  # Control machine configuration
  SlurmctldHost=localhost
  SlurmctldPort=6817
  SlurmdPort=6818

  # Node configuration
  # A single-CPU computer with 16 cores and 32 GB RAM with hyperthreading on
  NodeName=localhost Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=32000 State=UNKNOWN

  # Partition configuration
  PartitionName=test Nodes=localhost Default=YES MaxTime=INFINITE State=UP

  # Accounting
  AccountingStorageType=accounting_storage/none
  JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/none

  # Daemons
  SlurmUser=slurm
  SlurmdSpoolDir=/var/lib/slurm/slurmd
  StateSaveLocation=/var/lib/slurm/slurmctld

  # User configuration
  ClusterName=localcluster

  # Timers
  InactiveLimit=0
  KillWait=30
  MinJobAge=300
  SlurmctldTimeout=60
  SlurmdTimeout=150
  Waittime=0

  # SCHEDULING
  SchedulerType=sched/backfill
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core

  # Logging configuration
  SlurmctldDebug=info
  SlurmctldLogFile=/var/log/slurm/slurmctld.log
  SlurmdDebug=info
  SlurmdLogFile=/var/log/slurm/slurmd.log
  SlurmSchedLogFile=/var/log/slurm/slurmsched.log

  # Job completion logging (used later for spotting failed jobs)
  JobCompType=jobcomp/filetxt
  JobCompLoc=/var/log/slurm/jobacct.log
  ```
- Start slurmd and slurmctld:

  ```
  sudo systemctl start slurmd
  sudo systemctl start slurmctld
  ```
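As referenced above, the node values for the `NodeName` line in slurm.conf (sockets, cores, threads, memory) can be read from the machine itself; a small sketch:

```bash
# Print this machine's hardware in slurm.conf format; copy the CPUs, Sockets,
# CoresPerSocket, ThreadsPerCore, and RealMemory values into the NodeName line.
sudo slurmd -C
```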
Useful SLURM commands for VirtualFlow:

- `sinfo` - check the state of the queue (partition)
- `squeue` - check for jobs in the queue
- `scancel -u <user>` - cancel all the jobs belonging to a particular user. Jobs can also be stopped in other ways in VirtualFlow.
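Before moving on, it can be worth confirming that the local SLURM setup actually accepts and runs jobs. Here is a minimal sketch, assuming the `test` partition from the config above; the script name and contents are just a throwaway test job.

```bash
# Create a minimal batch script that records the hostname it ran on.
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=slurm-test
#SBATCH --partition=test
#SBATCH --ntasks=1
#SBATCH --output=slurm-test-%j.out
hostname
EOF

sbatch test_job.sh      # submit the job
squeue                  # the job should appear briefly, then complete
cat slurm-test-*.out    # output should contain this machine's hostname
```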
VirtualFlow Setup
For these notes, I’ll be using the setup described in the VirtualFlow tutorial.
- Get the VFVS_GK tutorial files:

  ```
  cd ~/dev/
  wget https://virtual-flow.org/sites/virtual-flow.org/files/tutorials/VFVS_GK.tar
  tar -xvf VFVS_GK.tar
  ```
- Select the compounds according to these instructions (optional, if testing beyond the tutorial files). Run `source tranches.sh` from the `VFVS_GK/input-files/ligand-library` folder to download the compound files.
- Edit `tools/templates/all.ctrl` and set these values according to the number of cores on the machine (e.g. 16); a quick way to check the core count is shown after this list:

  ```
  cpus_per_step=16
  queues_per_step=16
  cpus_per_queue=16
  ...
  ```
- Install Open Babel:

  ```
  sudo apt-get install openbabel
  ```
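As mentioned in the `all.ctrl` step above, a quick sanity check before editing the control file (paths assume the tutorial layout from the earlier steps):

```bash
# Number of logical cores visible to the OS (with hyperthreading on a 16-core
# CPU this reports 32; the tutorial values above use 16 per step/queue).
nproc

# Confirm the values currently set in all.ctrl.
grep -E '^(cpus_per_step|queues_per_step|cpus_per_queue)=' ~/dev/VFVS_GK/tools/templates/all.ctrl
```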
VirtualFlow Run
After the setup is complete, follow these steps for each run:
- Prepare the output folders:

  ```
  cd ~/dev/VFVS_GK/tools
  ./vf_prepare_folders.sh
  ```
- Start VirtualFlow:

  The example below starts one jobline with one queue. Note that the number of cores used by each queue is set by the `cpus_per_queue` value in `all.ctrl` (a sketch for starting multiple joblines follows this list).

  ```
  ./vf_start_jobline.sh 1 1 templates/template1.slurm.sh submit 1
  ```
- Check VirtualFlow status:

  ```
  ./vf_report.sh -c workflow
  ```

  Example output:

  ```
  Total number of ligands: 1123
  Number of ligands started: 21
  Number of ligands successfully completed: 21
  Number of ligands failed: 0
  ...
  Docking runs per ligand: 2
  Number of dockings started: 42
  Number of dockings successfully completed: 42
  Number of dockings failed: 0
  ```
- Check a particular docking method (with the top 10 compounds by binding energy):

  ```
  ./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10
  ```

  Example output:

  ```
  Binding affinity - statistics
  ............................................................
  Number of ligands screened with binding affinity between 0 and inf kcal/mole: 26
  Number of ligands screened with binding affinity between -0.1 and -5.0 kcal/mole: 119
  ...

  Binding affinity - highest scoring compounds
  ............................................................
  Rank   Ligand       Collection     Highest-Score
  1      ABC-1234_1   XXXXXX_00000   -7.6
  2      ABC-1234_2   XXXXXX_00000   -7.6
  3      XYZ-4321_4   XXXXXX_00000   -7.4
  ...
  ```
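For reference, the sketch mentioned in the start step above. As I understand the `vf_start_jobline.sh` arguments (an assumption on my part, not something stated in these notes), they are the first jobline, the last jobline, the job template, the mode, and a delay in seconds between submissions, so several joblines can be started in one call:

```bash
# Assumption: ./vf_start_jobline.sh <first jobline> <last jobline> <job template> <mode> <delay seconds>
# Start joblines 1 through 4, submitting them one second apart.
cd ~/dev/VFVS_GK/tools
./vf_start_jobline.sh 1 4 templates/template1.slurm.sh submit 1
```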
Monitoring and Debugging
In addition to running `vf_report.sh`, it can be useful to also monitor the SLURM logs:

```
sudo tail -f /var/log/slurm/*.log
```
Typically, failed jobs will appear in `/var/log/slurm/jobacct.log` with a FAILED JobState. For example:

```
JobId=3391 ... Name=t-1.1 JobState=FAILED
```
To view error messages, try looking in the logs under:

- `workflow/output-files/queues/`
- `workflow/output-files/jobs`
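A couple of greps I found handy for narrowing things down; the second one assumes the `workflow` folder sits in the VFVS_GK root, so adjust the path to your layout:

```bash
# List failed jobs recorded by SLURM's job completion log.
grep "JobState=FAILED" /var/log/slurm/jobacct.log

# Find which VirtualFlow queue/job logs mention an error
# (assumption: workflow/ lives under the VFVS_GK tutorial root).
cd ~/dev/VFVS_GK
grep -ril "error" workflow/output-files/queues/ workflow/output-files/jobs/
```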
Related Note: These logs were useful for debugging a particular issue with leading zeroes not being removed when doing a date calculation. In `one-queue.sh`, I had to change the start/end date calculations (e.g. `docking_start_time_s`) from `$(($(date +'%s * 1000 + %-N / 1000000')))` to `$(($(date +'%s')))`.
VirtualFlow Run Completion
Once the VirtualFlow run is complete, rank all the ligands and extract the docking poses:
- Rank the ligands

  Add VFTools to your path:

  ```
  export PATH=$PATH:/home/<user>/dev/VFTools/bin
  ```

  Then run:

  ```
  cd ~/dev/VFVS_GK
  mkdir -p pp/ranking
  cd pp/ranking
  vfvs_pp_ranking_all.sh ../../output-files/complete/ 2 meta_tranche
  ```
- Get the top 100 docking poses

  ```
  cd ~/dev/VFVS_GK/pp/ranking/qvina02_rigid_receptor1
  head -100 firstposes.all.minindex.sorted.clean > compounds
  ```
- Extract the docking poses

  ```
  cd ~/dev/VFVS_GK
  mkdir docking_poses
  cd docking_poses
  vfvs_pp_prepare_dockingposes.sh ../output-files/complete/qvina02_rigid_receptor1/results/ meta_tranche ../pp/ranking/qvina02_rigid_receptor1/compounds dockingsposes overwrite
  ```
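To eyeball one of the extracted poses in a viewer that doesn’t read PDBQT, Open Babel (installed earlier) can convert it. The folder and filename below are placeholders; use a file produced by the extraction step above.

```bash
# Convert one extracted pose from PDBQT to PDB for viewing (e.g. in PyMOL).
# ABC-1234_1.pdbqt is a placeholder filename.
cd ~/dev/VFVS_GK/docking_poses
obabel ABC-1234_1.pdbqt -O ABC-1234_1.pdb
```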
Follow-ups
- Why do `./vf_report.sh -c vs -d qvina02_rigid_receptor1 -n 10` and `head -10 qvina02_rigid_receptor1/firstposes.all.minindex.sorted.clean > compounds` not produce the same top 10 compounds?
- The compound screening process is much slower than expected (~60 min for 1,000 compounds on a Ryzen 5900X CPU). Possible things to try:
  - A) Program docking to run on a GPU
  - B) Experiment with other batching systems
  - C) Spread jobs across a cluster (e.g. AWS ParallelCluster)
  - D) Try and benchmark other docking programs (e.g. QuickVina)
- In order to reduce computation cost, can serverless compute (e.g. AWS Lambda) be used with the docking executables/binaries?
- Can we replace SLURM with other batching or queueing systems (e.g. Kafka, RabbitMQ)?
- I’m not a fan of bash as a scripting language due to the lack of modularity and the difficulty of debugging/logging. Can this be re-implemented in Ruby or Python instead?
Useful References: