Removing accents from an input string in a tf.data.Dataset pipeline
I am trying to build a tf.data.Dataset pipeline (TF2) that reads a text file and applies some preprocessing to it.
Contents of mytext.txt:
Para este projeto fizemos questão de ter uma equipe formada por mulheres, desde o catering, passando pela maquiagem até a produção, iluminação e direção. Abaixo reunimos algumas histórias dos bastidores:
My Python code:
import tensorflow as tf
import unicodedata
# Strip accents from input string.
def unicode_to_ascii(s):
    return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
# Text files
files = tf.data.Dataset.list_files('/data/tmp/mytext.txt', shuffle=True, seed=None)
# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
dataset = dataset.map(unicode_to_ascii)
for d in dataset:
    print(d.numpy().decode('utf8'))
But I get the following error:
/data/dev/python/dlbox/examples/preprocess_text copy.py:6 unicode_to_ascii *
return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
/home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py:396 converted_call
return py_builtins.overload_of(f)(*args)
TypeError: normalize() argument 2 must be str, not Tensor
I can't find a way to convert s, which is a Tensor, into a Python string.
How can I make this work?
Edit 1
I tried using tf.py_function:
# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
# dataset = dataset.map(unicode_to_ascii)
dataset = dataset.map(lambda x: tf.py_function(unicode_to_ascii, x, tf.string))
But that also raises an error:
/data/dev/python/dlbox/examples/preprocess_text copy.py:14 None *
dataset = dataset.map(lambda x: tf.py_function(unicode_to_ascii, x, tf.string))
/home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py:407 eager_py_func
return _internal_py_func(func=func, inp=inp, Tout=Tout, eager=True, name=name)
/home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py:296 _internal_py_func
input=inp, token=token, Tout=Tout, name=name)
/home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_script_ops.py:74 eager_py_func
"EagerPyFunc", input=input, token=token, Tout=Tout, name=name)
/home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py:442 _apply_op_helper
(input_name, op_type_name, values))
TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor("args_0:0", shape=(), dtype=string).
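I managed to get it working with tf.py_function. Two changes were needed: the input has to be passed as a list ([line]), and inside the function the tensor is eager, so s.numpy().decode('utf8') gives a Python str that unicodedata.normalize accepts: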
# Strip accents from input string.
def unicode_to_ascii(s):
    return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', s.numpy().decode('utf8')) if unicodedata.category(c) != 'Mn'))
# Text files
files = tf.data.Dataset.list_files('/data/tmp/mytext.txt', shuffle=True, seed=None)
# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
dataset = dataset.map(lambda line: tf.py_function(unicode_to_ascii, [line], tf.string))
for d in dataset:
    print(d.numpy().decode('utf8'))
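For anyone who wants to check the accent-stripping step on its own, the pure-Python part of unicode_to_ascii can be run without TensorFlow. This is only a minimal sketch: strip_accents is a hypothetical name (not part of the code above), and it assumes the input is UTF-8 bytes, which is what s.numpy() returns for a scalar string tensor inside tf.py_function.

```python
import unicodedata


def strip_accents(raw: bytes) -> bytes:
    """Hypothetical helper: drop combining marks from UTF-8 encoded text."""
    text = raw.decode("utf-8")
    # NFD splits each accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (Mark, nonspacing) covers those combining marks; drop them.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Trim surrounding whitespace, mirroring tf.strings.strip in the pipeline.
    return stripped.strip().encode("utf-8")


sample = "produção, iluminação e direção".encode("utf-8")
print(strip_accents(sample).decode("utf-8"))  # prints: producao, iluminacao e direcao
```

Returning bytes keeps the helper compatible with a tf.string output if it is later wrapped in tf.py_function, though that wiring is not shown here.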