Previous Up

Computer-Aided Drug Design: 2. Homework: Molecular Descriptors in QSAR

Drug Toxicity

Medicinal chemists face many challenges toward developing a new drug. First, they typically try to increase the potency of the drug by maximizing the affinity of drug candidates toward the target. This process if often guided performing structure-activity studies on a variety of ligands to identify the pharmacophore. If the three-dimensional structure of the target is available, the optimization can be also based on such structure. In cases where the potency is controlled by the drug's bioavailability, quantitative structure-activity studies are valuable in guiding toward more bioavailable compounds. Second, medicinal chemists try to minimize the drug's toxicity. However, there are a multitude of factors that can possibly lead to the drug's toxicity. For example, specific binding to another target molecule, non-specific binding to nucleic acids, general reactivity towards cellular nucleophiles (protein sulfhydryls, DNA bases), production of free-radicals, or effects on membranes and ion potentials across membranes could contribute to drugs toxicity. Because of this multitude of effects, toxicity studies are typically done in vivo, using model organisms. One of the simple model organisms for toxicity studies is a ciliated protozoan Tetrahymena. The advantages of Tetrahymena are it can be easily cultured, and it is biochemically and genetically well studied. For example, one could use Tetrahymena to identify the mechanism of toxicity by controlling the expression of genes that may mediate the toxicity of a compound. Tetrahymena model is currently widely used to predict environmental toxicology; due to differences between humans and ciliated protozoa, the Tetrahymena is less-than-ideal model for prediction of drug toxicity in humans. Toxicity is determined as the concentration of the compound that gives 50% growth impairment of Tetrahymena in aqueous medium (IGC50). For mathematical analysis, the logarithm of inverse of the toxic concentration (log[IGC50-1]) is commonly used. Toxic compounds have small IGC50 and large log[IGC50-1].

Correlation of Drug Structure and Toxicity

In order to predict toxicity of drugs before they are synthesized, the factors that correlate with toxicity in the class of compounds must be determined. In the absence of a priori known mechanism of toxicity, such factors can be elucidated from QSAR in which correlation of various factors to toxicity is assessed via statistical data analysis. The data sets used for elucidation of molecular descriptors that contribute to toxicity tend to be large (hundreds of compounds) for two reasons. First, it is possible that many factors simultaneously contribute to the toxicity, and thus it is expected that the QSAR model contains many independent variables, possibly relating to the toxicity in a non-linear manner. Second, for valid statistical analysis, one needs a significant number of data points per descriptor. For example, the TETRATOX database currently contains Tetrahymena growth impairment data for over 2,400 organic compounds. In this homework, you will simulate a QSAR analysis of Tetrahymena toxicity using a small training data set that has been obtained from a larger data set by (i) eliminating compounds that were too similar to retained compounds, (ii) eliminating gross outliers, (ii) eliminating molecules for which different hydrophobicity measures diverged too much. The large data set is published in "Identification of reactive toxicants: Structure–activity relationships for amides" by T.W. Schultz, J.W. Yarbrough and S.K. Koss. The molecules here are smaller than real drugs, allowing for rapid computational analysis. Students in the course may access this paper from here.

In their work, Schultz and co-workers linked the toxicity to two molecular descriptors: hydrophobicity and electrophilicity. The hydrophobicity is numerically characterized by the logarithm of 1-octanol/water partition coefficient (logP, or logKow. This property can be determined experimentally or predicted with some reliability using various fragment-based prediction schemes. Free online services (e.g. Virtual Computational Chemistry Laboratory (VCCL) ) can provide hydrophobicity estimates for many molecules. In this assignment you will use the same hydrophobicity estimates as were used in the original paper. You can either look these up from Shultz's paper or use the VCCL service (pick the experimental values if available, or use KOWWIN estimates). The electrophilicity is numerically characterized by the energy of the lowest unoccupied molecular orbital (LUMO). The LUMO energy can be predicted using quantum mechanics (QM). However, it is well known that the results of QM calculations depend on approximations that were used to make the calculations possible. The authors had used the AM1 semiempirical model. One of your tasks is to assess how the LUMO energies at AM1 level compare with LUMO energies calculated at ab initio SCF level using moderately large basis sets.

The following table lists 12 amides for which toxicity has been experimentally determined by Schultz and co-workers. The compounds are ranked from the least toxic to the most toxic. IGC50 is given in mM.

Name of the compound		CAS number	SMILES string		log[IGC50-1]
Acetamide 			60-35-5 	CC(=O)N 		-2.32 	
Propionamide 			79-05-0 	CCC(=O)N 		-2.09 	
n-Butyramide 			541-35-5 	CCCC(=O)N 		-1.79 	
3-Chloropriopionamide 		5875-24-1 	C(Cl)CC(=O)N 		-1.59
N-Methylpropionamide 		1187-58-2 	CCC(=O)NC 		-1.53
Trimethylacetamide 		754-10-9 	CC(C)(C)C(=O)N 		-1.48 	
2-Chloropropionamide 		27816-36-0 	CC(Cl)C(=O)N 		-1.44
N-Isopropylacrylamide 		2210-25-5 	C=CC(=O)NC(C)C 		-1.31
N,N-Dimethylacrylamide 		2680-03-7 	C=CC(=O)N(C)C 		-1.24
2,2-Dichloroacetamide 		683-72-7 	C(Cl)(Cl)C(=O)N 	-0.98
n-Hexanoamide 			628-02-4 	CCCCCC(=O)N 		-0.91 	
Acrylamide 			79-06-1 	C=CC(=O)N 		-0.81 	
2,2,2-Trichloroacetamide 	594-65-0 	C(Cl)(Cl)(Cl)C(=O)N 	-0.29 	

When performing the following tasks, you can either work independently or you can pair up with another student in the class. Each student or pair will perform the following tasks. Workstations drug1, drug2, and drug3 are available for calculations.

  1. Obtain the initial structures of these 12 molecules using a method of your choice. Besides the obviously easy SMILES to structure conversion offered by Online SMILES TRanslator, you may also consider looking into the NIST Computational Chemistry Comparison and Benchmark Database, which provides output files from Gaussian quantum mechanical calculations. The files in the Input and Output files section already contain optimized geometries; you need to find ones that correspond to HF/6-31G(d) (same as HF/6-31G*) level.
  2. Optimize the geometry of each of the molecules using the SCF theory and 6-31G(d) basis set. If you obtained the optimized structures from the NIST database, you do not need to perform this step. If you obtained the unoptimized structure using SMILES or other methods, optimize the molecule with Gaussian through MOLDEN. In the Submit Gaussian Job window, change the Basis Set: to 6-31G*, give a unique name for Job Name and hit Submit. The calculations will take several minutes. You can close the Zmatrix Editor window and reload using the Read button, then click on Geom. Conv. button to monitor the progress of minimization. The optimization stops when the energy does not change much from step to step, when the forces are near zero, and when the structure does not change significantly from step to step. You can also check the end of the file using command tail filename.log to see if the calculation has successfully completed.
  3. Read the optimized structure from the log file into MOLDEN. To load the optimized structure, click on Geom. Conv. button and then click on the last point in the optimization series such that a circle appears around this point. MOLDEN now will use this geometry when you set up subsequent calculations. The next part depends on if you work alone or within a pair. If you work alone, perform a single-point calculation at the HF/6-31+G* level. If you work within a pair, a pair will perform a calculation with the HF method using 6-31+G(2d,p) basis set. These calculations will take a little longer. You can check the end of the file using command tail filename.log to see if the calculation has finished. Repeat this step for each molecule.
  4. Read the result of this calculation into MOLDEN. Go to Density Mode and click on the Orbitals button. A new window opens, listing the molecular orbitals, their occupancies, and energies. The LUMO is the lowest energy orbital with zero occupancy. Write down the LUMO energy of this molecule. Repeat this step for each molecule.
  5. Create a picture that visually shows the electrostatic potential mapped onto electron density for the least and most toxic of these 12 molecules. Discuss most prominent differences.
  6. Create a table showing the logP, calculated LUMO energy, and log[IGC50-1] for each compound. You can create this table first as a comma-separated text file on the SGI workstation with the text editor jot. The file can then be imported into a data analysis program such as Mathematica. To import the file called tablefile.txt from within Mathematica, type data = Import["tablefile.txt", "Table"]. To show the formatted table, type TableForm[data]. A brief introduction to how to work with Mathematica is here. Background and examples of performing linear regression and multivariate regression with Mathematica are here.
  7. Create a plot showing how toxicity depends on logP for these compounds. You can use program of your choice for single-variable linear regression; (xmgrace is installed on SGI computers. When using Mathematica, you can separate the data from the imported file into three lists. For example, if you logP values are in the first column, then Mathematica command Hydrophobicity = data[[All,1]] extracts logP values from the table. You can similarly extract the toxicity values from the third column. To plot the data, combine hydrophobicity and toxicity data into one new table by using DataTox1 = Transpose[{Hydrophobicity, Toxicity}] than plot these using ListPlot function: ListPlot[DataTox1].
  8. Perform linear regression. Mathematica commands << Statistics`LinearRegression`" followed by Regress[DataTox1, {1, x}, x] will perform the linear regression. The values under Estimates are the regression parameters and values under SE are their standard errors. The value in the row with label 1 is the y-intercept and the value in the row x is the slope. Provide the best-fit equation and r2 statistics that characterizes the goodness of fit. Notice that the correlation between hydrophobicity and toxicity is weak because toxicity depends not only on hydrophobicity but also on other molecular descriptors. Plotting the regression curve is more tedious with Mathematica and you are not required to show such curve graphically.
  9. Create a plot showing how toxicity depends on LUMO energy for these compounds. Perform linear regression and provide the best-fit equation. Mathematica, command eLUMO = data[[All,2]] extracts LUMO energies from the imported table.
  10. Perform multivariable regression that shows how toxicity depends on both logP and LUMO energy for these compounds. Provide r2 statistics that characterizes the goodness of fit. When using Mathematica, you can try to see if ListPlot3D creates a useful picture characterizing the effect of logP and LUMO energy on toxicity.
  11. The previous data in the training set established a QSAR that may allow to predict toxicity of new compounds. Use your established QSAR to predict toxicity of one or two new compounds. If you work alone, calculate logP and LUMO energy (at HF/6-31+G* level) for 2-chloroacetamide (C(Cl)C(=O)N). If you work within a group, calculate log P and LUMO energy (at HF/6-31+G(2d,p) level) for 2-chloroacetamide (C(Cl)C(=O)N) and N,N-diethylacetamide (CC(=O)N(CC)CC). Compare your predictions with published toxicity values and discuss your findings.


Previous Up

Tutorial by Dr. Kalju Kahn, Department of Chemistry and Biochemistry, UC Santa Barbara. ©2005-2007.