DeepChem disk data to numpy array

Question

I am using the DeepChem wrapper for the GraphConvolution model directly, as follows. My SMILES data is in a .csv file that consists of 5 molecules with their SMILES representations and their respective activities. The data can be accessed from here.

Importing the libraries:

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
import deepchem as dc
from deepchem.models.tensorgraph.models.graph_models import GraphConvModel

Loading the data and featurizing it in a way suitable for graph convolution:

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader_train = dc.data.data_loader.CSVLoader(tasks=['Activity'], smiles_field="smiles", featurizer=graph_featurizer)
dataset_train = loader_train.featurize('./train_smiles_data.csv')

Analyzing the loaded and featurized data (my attempt):

dataset_train.X

array([<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad748>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc367828>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc367208>,
       <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc369c50>],
      dtype=object)


dataset_train.y

array([[2.71],
       [4.41],
       [3.77],
       [4.2 ]])

for x, y, w, id in dataset_train.itersamples():
    print(x, y, w, id)

<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad6a0> [2.71] [1.] CC1=C(O)C=CC=C1
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f518> [4.41] [1.] [O-][N+](=O)C1=CC=C(Br)S1
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f748> [3.77] [1.] CCC/C=C/C=O
<deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc30f940> [4.2] [1.] CCCCCC1=CC=CS1

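What do I want?

From the code above, dataset_train.X gives disk objects like <deepchem.feat.mol_graphs.ConvMol object at 0x7f8bfc3ad6a0> rather than a numpy array of values like dataset_train.y.

How do I find out what type of data is stored in dataset_train.X? How can I see the data stored in dataset_train.X? In other words, how can I convert dataset_train.X into a format in which I can inspect its contents?

I believe there should be some way to do this.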

Answer 1

Because you used the ConvMolFeaturizer, dataset_train.X is an array of ConvMol objects. Each ConvMol object is a container for the features of one of your input molecules. The features are not represented as plain numbers the way your targets in dataset_train.y are, because they are more complex graph features. Take a look at the source code for the featurizer and for the ConvMol object. You can then decide how you want to interpret the features:

# Inspect features for molecule 0
conv_feature = dataset_train.X[0]
# Print the atom features
print(conv_feature.get_atom_features())
# Print the adjacency list
print(conv_feature.get_adjacency_list())
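
If a dense, array-like view of the graph is easier to inspect than the adjacency list, the list can be expanded into an adjacency matrix. The following is a minimal sketch, not part of DeepChem: it assumes the adjacency list is a list in which entry i holds the indices of the atoms bonded to atom i, and it uses a hand-written 4-atom chain in place of real ConvMol output. The helper name adjacency_list_to_matrix is hypothetical.

```python
# Hypothetical adjacency list in the shape the ConvMol adjacency list is
# assumed to have: entry i lists the indices of the atoms bonded to atom i.
# This hand-written example is a 4-atom chain (0-1-2-3), not DeepChem output.
adjacency_list = [[1], [0, 2], [1, 3], [2]]


def adjacency_list_to_matrix(adj_list):
    """Build a dense 0/1 adjacency matrix (a list of rows) from neighbour lists."""
    n_atoms = len(adj_list)
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for atom, neighbours in enumerate(adj_list):
        for neighbour in neighbours:
            # Mark a bond between this atom and each listed neighbour.
            matrix[atom][neighbour] = 1
    return matrix


adjacency_matrix = adjacency_list_to_matrix(adjacency_list)
for row in adjacency_matrix:
    print(row)
```

With real data, the same helper could presumably be called on conv_feature.get_adjacency_list(), and wrapping the result in np.array(...) would give an ordinary numeric numpy array that prints its values directly.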

python

numpy

rdkit

deep-learning