Removing accents from an input string in a tf.data pipeline

Question

I am trying to create a tf.data pipeline (TF2) that reads a text file and applies some preprocessing to each line.

Contents of the mytext.txt file:

Para este projeto fizemos questão de ter uma equipe formada por mulheres, desde o catering, passando pela maquiagem até a produção, iluminação e direção. Abaixo reunimos algumas histórias dos bastidores:

My Python code:

import tensorflow as tf
import unicodedata

# Strip accents from input string.
def unicode_to_ascii(s):
    return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

# Text files
files = tf.data.Dataset.list_files('/data/tmp/mytext.txt', shuffle=True, seed=None)

# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
dataset = dataset.map(unicode_to_ascii)

for d in dataset:
    print(d.numpy().decode('utf8'))

But I get the following error:

    /data/dev/python/dlbox/examples/preprocess_text copy.py:6 unicode_to_ascii  *
        return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
    /home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py:396 converted_call
        return py_builtins.overload_of(f)(*args)

    TypeError: normalize() argument 2 must be str, not Tensor

I could not find a way to convert s (a Tensor) into a Python string.

How can I make this work?

Edit 1

Trying to use tf.py_function:

# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
# dataset = dataset.map(unicode_to_ascii)
dataset = dataset.map(lambda x: tf.py_function(unicode_to_ascii, x, tf.string))

But that also raises an error:

    /data/dev/python/dlbox/examples/preprocess_text copy.py:14 None  *
        dataset = dataset.map(lambda x: tf.py_function(unicode_to_ascii, x, tf.string))
    /home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py:407 eager_py_func
        return _internal_py_func(func=func, inp=inp, Tout=Tout, eager=True, name=name)
    /home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py:296 _internal_py_func
        input=inp, token=token, Tout=Tout, name=name)
    /home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_script_ops.py:74 eager_py_func
        "EagerPyFunc", input=input, token=token, Tout=Tout, name=name)
    /home/kleysonr/.virtualenvs/tf2/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py:442 _apply_op_helper
        (input_name, op_type_name, values))

    TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor("args_0:0", shape=(), dtype=string).

Answer 1

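I managed to get it working using tf.py_function. Two changes were needed: the input must be passed to tf.py_function as a list ([line]), and inside the function the tensor has to be converted to a Python string with s.numpy().decode('utf8') before calling unicodedata.normalize: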

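import tensorflow as tf
import unicodedata

# Strip accents from input string.
# Called through tf.py_function, so s is an eager tensor and s.numpy() returns UTF-8 bytes.
def unicode_to_ascii(s):
    text = s.numpy().decode('utf8')
    return tf.strings.strip(''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn'))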

# Text files
files = tf.data.Dataset.list_files('/data/tmp/mytext.txt', shuffle=True, seed=None)

# Pipeline
dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
dataset = dataset.map(lambda line: tf.py_function(unicode_to_ascii, [line], tf.string))

for d in dataset:
    print(d.numpy().decode('utf8'))
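
The text-handling step inside unicode_to_ascii can also be tried on its own, without TensorFlow. This is only a minimal sketch, assuming the input arrives as UTF-8 bytes (which is what s.numpy() gives for a scalar tf.string tensor). The helper name strip_accents is hypothetical and not part of the code above:

```python
import unicodedata


def strip_accents(raw: bytes) -> str:
    """Decode UTF-8 bytes, drop combining marks, and trim whitespace."""
    text = raw.decode('utf8')
    # NFD splits each accented character into a base character followed by
    # combining marks, which have Unicode category 'Mn'.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep every character except the combining marks, then trim the ends.
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn').strip()


sample = 'produção, iluminação e direção'.encode('utf8')
print(strip_accents(sample))
```

Checking the function this way makes it easier to tell whether a problem comes from the string logic or from how the function is wired into dataset.map.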
