Failed to convert a NumPy array (each whole sequence is a string) to a Tensor in genome sequence classification with a CNN?
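The data is in CSV format and holds FASTA/genome sequences; each sequence is essentially one long string. To feed this data into a CNN model, I convert it to numbers: each genome/FASTA string is encoded as floats so that it is in a format a tensor can accept, e.g. "AACTG,...,AAC.." becomes [[0.25,0.25,0.50,1.00,0.75],....,[0.25,0.25,0.50.......]]. The converted data, however, looks like the output below (see # data show 2). When I run tf.convert_to_tensor(train_data), I get the error: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray). The data has to be a tensor to be passed to the CNN model, but I don't understand why this error occurs. What is the solution?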
# data show 2
array([array([0.25, 0.5 , 0.5 , ..., 0.75, 0.25, 0.25]),
array([0.25, 0.75, 0.25, ..., 1. , 0.5 , 0.5 ]), ...,
array([0.25, 1. , 1. , ..., 0.25, 0.25, 0.25])], dtype=object)
# end of data show
The DataFrame of my genome/FASTA sequences looks like this
Below are the functions used to encode the data.
import re
import numpy as np
from sklearn.preprocessing import LabelEncoder

def string_to_array(my_string):
    # Lowercase the sequence and replace anything that is not a/c/g/t with 'z'
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = list(my_string)
    return my_array

# create a label encoder with 'acgtz' alphabet
label_encoder = LabelEncoder()
label_encoder.fit(['a','c','g','t','z'])

def ordinal_encoder(my_array):
    # Integer codes follow the sorted alphabet: a=0, c=1, g=2, t=3, z=4
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25 # A
    float_encoded[float_encoded == 1] = 0.50 # C
    float_encoded[float_encoded == 2] = 0.75 # G
    float_encoded[float_encoded == 3] = 1.00 # T
    float_encoded[float_encoded == 4] = 0.00 # anything else, z
    return float_encoded

def conversion(tdf):
    # Encode every sequence in the 'seq' column; iterating over the values
    # avoids mixing index labels with positional .iloc lookups
    data = []
    for val in tdf['seq']:
        val = ordinal_encoder(string_to_array(val))
        data.append(val)
    return data

train_data = conversion(df) # calling the function
train_data = np.asarray(train_data)
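One way to see why the result has dtype=object is to compare the lengths of the encoded rows. The sketch below is only an illustration: the `encoded` list and the `describe_lengths` helper are hypothetical stand-ins, not part of the original code. It assumes the sequences in the DataFrame differ in length, which is the usual reason NumPy cannot build a regular 2-D float array.

```python
import numpy as np

# Hypothetical stand-in for the output of conversion(df): rows of unequal length.
encoded = [
    np.array([0.25, 0.25, 0.50, 1.00, 0.75]),
    np.array([0.25, 0.50, 1.00]),
    np.array([0.75, 0.25, 0.25, 0.50]),
]

def describe_lengths(rows):
    """Return (shortest, longest, is_ragged) for a list of 1-D arrays."""
    lengths = [len(r) for r in rows]
    return min(lengths), max(lengths), len(set(lengths)) > 1

shortest, longest, is_ragged = describe_lengths(encoded)
print(shortest, longest, is_ragged)

# Rows of unequal length can only be held as a 1-D array of Python objects.
ragged = np.array(encoded, dtype=object)
print(ragged.dtype, ragged.shape)

# Rows of equal length stack into an ordinary 2-D float array instead.
equal = [np.array([0.25, 0.50]), np.array([0.75, 1.00])]
stacked = np.asarray(equal)
print(stacked.dtype, stacked.shape)
```

If `is_ragged` comes out true for the real data, the rows need to be brought to a common length before they can become a tensor.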
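The problem is probably the dtype of your NumPy array: dtype=object means NumPy is holding a collection of separate arrays rather than one numeric array, and TensorFlow cannot convert that.
Using an array with dtype float32 should fix the problem: tf.convert_to_tensor(train_data.astype(np.float32))
Note that astype only succeeds on a regular numeric array. If the encoded sequences have different lengths, they have to be padded or truncated to a common length first.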
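If the sequences do differ in length, one possible approach is to pad them to a common length before converting. The following is a minimal sketch under that assumption, using only NumPy. The `pad_sequences_to_matrix` helper and the sample `encoded` rows are hypothetical, and padding with 0.0 is an assumption that reuses the value the encoder already assigns to unknown bases.

```python
import numpy as np

def pad_sequences_to_matrix(rows, max_len=None, pad_value=0.0):
    """Pad (or truncate) 1-D arrays to one length; return a 2-D float32 array."""
    if max_len is None:
        max_len = max(len(r) for r in rows)
    # Start from a matrix filled with the pad value, then copy each row in.
    out = np.full((len(rows), max_len), pad_value, dtype=np.float32)
    for i, row in enumerate(rows):
        n = min(len(row), max_len)  # truncate rows longer than max_len
        out[i, :n] = row[:n]
    return out

# Hypothetical encoded sequences of unequal length.
encoded = [
    np.array([0.25, 0.25, 0.50, 1.00, 0.75]),
    np.array([0.25, 0.50, 1.00]),
    np.array([0.75, 0.25, 0.25, 0.50]),
]

matrix = pad_sequences_to_matrix(encoded)
print(matrix.shape, matrix.dtype)
```

The resulting `matrix` is a regular float32 array, so tf.convert_to_tensor(matrix) should accept it. For a Conv1D layer, which expects input shaped (batch, steps, channels), a channel axis can be added with matrix[..., np.newaxis]. Keras also provides a pad_sequences utility that may serve the same purpose; if you use it, pass dtype='float32', since its default dtype is integer.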