Predicting Drug Solubility with Machine Learning

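Expressing drug solubility:
Solubility is the maximum amount of drug that dissolves in a given quantity of solvent at saturation, at a given temperature (and, for gases, a given pressure). It is a key indicator of how readily a drug dissolves. It is usually expressed as the maximum number of grams of solute that dissolve in 100 g of solvent (or in 100 g or 100 ml of solution) at a stated temperature. For example, caffeine has a solubility of 1.46% in water at 20℃, meaning that the solution is saturated when 1.46 g of caffeine is dissolved in 100 ml of water. Solubility can also be given as a molar concentration in mol/L. The Chinese Pharmacopoeia (2000 edition) uses seven descriptive terms for drug solubility: very soluble, freely soluble, soluble, sparingly soluble, slightly soluble, very slightly soluble, and practically insoluble or insoluble. These terms only indicate approximate solubility behavior; precise solubility is generally stated as the number of milliliters of solvent needed to dissolve one part of solute (1 g or 1 ml), and the pharmacopoeia records it in each drug's monograph. Solubility data can be looked up in The Merck Index, national pharmacopoeias, and specialized physicochemical handbooks; for drugs with no published data, it can be measured experimentally. Higher solubility is not always better, nor is lower; the appropriate value depends on the actual application.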
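
The caffeine figure above can be turned into the other two forms the paragraph mentions: a molar concentration and a "parts of solvent per part of solute" value. The sketch below does that in plain Python. The function names are hypothetical, the molar mass of caffeine (194.19 g/mol) is supplied here rather than taken from the post, and the threshold table uses the commonly quoted pharmacopoeial ranges, so check it against the edition you actually work with.

```python
# Illustrative helpers only; names and the threshold table are assumptions,
# not part of the original post.
CAFFEINE_MOLAR_MASS = 194.19  # g/mol for C8H10N4O2


def grams_per_100ml_to_mol_per_l(grams_per_100ml, molar_mass):
    """Convert a solubility in g per 100 ml into mol/L."""
    grams_per_litre = grams_per_100ml * 10.0
    return grams_per_litre / molar_mass


def ml_solvent_per_gram(grams_per_100ml):
    """Millilitres of solvent needed to dissolve 1 g of solute."""
    return 100.0 / grams_per_100ml


# (exclusive upper bound in ml of solvent per 1 g of solute, descriptive term)
DESCRIPTIVE_TERMS = [
    (1, "very soluble"),
    (10, "freely soluble"),
    (30, "soluble"),
    (100, "sparingly soluble"),
    (1000, "slightly soluble"),
    (10000, "very slightly soluble"),
]


def descriptive_term(ml_per_gram):
    """Map ml of solvent per gram of solute to a descriptive term."""
    for upper_bound, term in DESCRIPTIVE_TERMS:
        if ml_per_gram < upper_bound:
            return term
    return "practically insoluble or insoluble"


caffeine = 1.46  # g per 100 ml of water at 20 degrees C, from the text above
print("mol/L:", grams_per_100ml_to_mol_per_l(caffeine, CAFFEINE_MOLAR_MASS))
print("ml of water per g:", ml_solvent_per_gram(caffeine))
print("term:", descriptive_term(ml_solvent_per_gram(caffeine)))
```

With these assumptions, 1.46 g per 100 ml works out to roughly 0.075 mol/L and about 68 ml of water per gram.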

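Below, machine learning methods are used to build regression models that predict drug solubility.
The first algorithm is a random forest. With 1000 trees it reaches an r^2 score of 0.9116131032510899 (RMS error 0.589). Note that the script, as written, computes these figures over the whole dataset, training rows included.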
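
For readers unfamiliar with the two numbers reported here, the following is a small plain-Python sketch of how an RMS error and an r^2 score (coefficient of determination, the quantity `r2_score` computes) are obtained from measured and predicted values. The function names and the sample values are made up for illustration.

```python
import math


def rms_error(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up solubility values, purely illustrative.
measured = [-1.0, -2.0, -3.0, -4.0]
close = [-1.1, -1.9, -3.2, -3.8]      # predictions near the measurements
reversed_ = [-4.0, -3.0, -2.0, -1.0]  # predictions worse than guessing the mean

print("close:   ", rms_error(measured, close), r2(measured, close))
print("reversed:", rms_error(measured, reversed_), r2(measured, reversed_))
```

A score of 1 means perfect prediction, 0 means no better than always predicting the mean, and a model that does worse than the mean gets a negative score.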

# -*- coding: utf-8 -*-
"""
Created on Tue Sep  4 09:39:29 2018

@author: 231469242@qq.com
WeChat official account: pythonEducation

Random forest, 100 trees
RF RMS 0.6057144333891424
('RF r^2 score', 0.9114913707148344)

Random forest, 1000 trees
('RF RMS', 0.5891965582822096)
('RF r^2 score', 0.9116131032510899)
"""
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import Descriptors
from rdkit.Chem.EState import Fingerprinter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel, RBF


# Descriptor calculation: E-state fingerprint counts followed by all RDKit descriptors
def get_fps(mol):
    calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
    ds = np.asarray(calc.CalcDescriptors(mol))
    arr = Fingerprinter.FingerprintMol(mol)[0]
    return np.append(arr, ds)

# Read the data
data = pd.read_table('smi_sol.dat', sep=' ')
data.to_excel("all_data.xlsx")

# Add molecule and descriptor columns
data['Mol'] = data['smiles'].apply(Chem.MolFromSmiles)
data['Descriptors'] = data['Mol'].apply(get_fps)
# Show the first five rows
print(data.head(5))

# Convert to a NumPy array
X = np.array(list(data['Descriptors']))

df_x = pd.DataFrame(X)
df_x.to_excel("data.xlsx")

y = data['solubility'].values
df_y = pd.DataFrame(y)
df_y.to_excel("label.xlsx")

st = StandardScaler()
X = st.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

'''
# Gaussian process regression
kernel = 1.0 * RBF(length_scale=1) + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=0, normalize_y=True)
gp.fit(X_train, y_train)

y_pred, sigma = gp.predict(X_test, return_std=True)
rms = (np.mean((y_test - y_pred)**2))**0.5
#s = np.std(y_test -y_pred)
print("GP RMS", rms)
print("GP r^2 score", r2_score(y_test, y_pred))
'''

# Random forest model
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# Note: the metrics below are computed on all rows (training and test together)
y_pred = rf.predict(X)
rms = (np.mean((y - y_pred)**2))**0.5
print("RF RMS", rms)

print("RF r^2 score", r2_score(y, y_pred))
plt.scatter(y_train, rf.predict(X_train), label='Train', c='blue')
plt.title('RF Predictor')
plt.xlabel('Measured Solubility')
plt.ylabel('Predicted Solubility')
plt.scatter(y_test, rf.predict(X_test), c='lightgreen', label='Test', alpha=0.8)
plt.legend(loc=4)
plt.savefig('RF Predictor.png', dpi=600)
plt.show()
# The "test" column holds the measured values for all rows, not only the test split
df_validation = pd.DataFrame({"test": y, "predict": y_pred})
df_validation.to_excel("validation.xlsx")
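
The script above scores the forest on every row, including the rows it was trained on. A random forest usually fits its own training rows very closely, so that score tends to be more optimistic than one measured only on the held-out 30%. The sketch below shows the two side by side. It uses synthetic data as a stand-in for the descriptor matrix (the `smi_sol.dat` file is not reproduced here), and the `random_state` values are arbitrary choices for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for descriptors and solubility (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Score on all rows (what the script above does) and on held-out rows only.
r2_all = r2_score(y, rf.predict(X))
r2_test = r2_score(y_test, rf.predict(X_test))
print("r^2 on all rows:      ", r2_all)
print("r^2 on held-out rows: ", r2_test)
```

Applied to the solubility script, the held-out variant would be `r2_score(y_test, rf.predict(X_test))`; the figures quoted in the script header would then need to be regenerated.
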
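The second algorithm is Gaussian process regression. The author reports a validation r^2 score of 0.99, better than the random forest; the run logs recorded in the script header below, however, show a best r^2 score of 0.909. The Python source follows.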

# -*- coding: utf-8 -*-
"""
Created on Tue Sep  4 15:53:57 2018
@author: 231469242@qq.com; WeChat official account: pythonEducation
Default parameters
GP RMS label    2.98575
GP r^2 score -1.26973055888
Kernel changed to kernel=1.0 * RBF(length_scale=1) + WhiteKernel(noise_level=1)
RMS    0.651688
dtype: float64
GP r^2 score 0.891869923330719
Kernel changed and normalization applied
GP RMS label    0.597042
GP r^2 score 0.9092436176966117
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel, RBF

# Read the data
data = pd.read_excel('data.xlsx')
y = pd.read_excel('label.xlsx')
st = StandardScaler()
X = st.fit_transform(data)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (running this on my work computer froze the machine)
kernel = 1.0 * RBF(length_scale=1) + WhiteKernel(noise_level=1)
#gp = gaussian_process.GaussianProcessRegressor()
#gp = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gp = gaussian_process.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=0, normalize_y=True)
gp.fit(X_train, y_train)

y_pred, sigma = gp.predict(X_test, return_std=True)
rms = (np.mean((y_test - y_pred)**2))**0.5
#s = np.std(y_test -y_pred)
print("GP RMS", rms)
print("GP r^2 score", r2_score(y_test, y_pred))

plt.scatter(y_train, gp.predict(X_train), label='Train', c='blue')
plt.title('GP Predictor')
plt.xlabel('Measured Solubility')
plt.ylabel('Predicted Solubility')
plt.scatter(y_test, gp.predict(X_test), c='lightgreen', label='Test', alpha=0.8)
plt.legend(loc=4)
plt.savefig('GP Predictor.png', dpi=300)
plt.show()
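
The kernel expression `1.0 * RBF(length_scale=1) + WhiteKernel(noise_level=1)` used above describes a covariance between molecules: a constant times a squared-exponential term that decays with the distance between descriptor vectors, plus a noise term that applies only to a point paired with itself. The plain-Python sketch below evaluates that covariance for a few made-up 2-D points using the standard definitions of these kernels. The function name and the points are hypothetical, and the values shown are the starting hyperparameters only, since fitting the regressor normally re-optimizes them.

```python
import math


def rbf_plus_white(points, constant=1.0, length_scale=1.0, noise_level=1.0):
    """Covariance matrix for constant * RBF(length_scale) + WhiteKernel(noise_level)."""
    n = len(points)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            # Squared-exponential (RBF) term: 1 at zero distance, decaying with distance.
            cov[i][j] = constant * math.exp(-sq_dist / (2.0 * length_scale ** 2))
            # White-noise term contributes only on the diagonal.
            if i == j:
                cov[i][j] += noise_level
    return cov


# Made-up descriptor vectors: two close points and one far away.
points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
for row in rbf_plus_white(points):
    print(["%.4f" % value for value in row])
```

Nearby points get a covariance close to the constant, distant points get a covariance near zero, and the diagonal carries the extra noise level.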
