Lowercase first element of tuple in list of tuples

Question

I have a list of documents, each labeled with the appropriate category:

documents = [(list(corpus.words(fileid)), category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

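This gives me the following list of tuples, where the first element of each tuple is a list of words (the tokens of a sentence). For instance: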

[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', 
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with', 
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'], 
'cancer'), 
([u'A', u'Systematic', u'Review', u'of', u'the', u'Effectiveness', 
u'of', u'Medical', u'Cannabis', u'for', u'Psychiatric', u',', 
u'Movement', u'and', u'Neurodegenerative', u'Disorders', u'.'], 'hd')]

I want to apply some text processing techniques, but I want to keep the list-of-tuples format.

I know that if I had just a list of words, this would do it:

[w.lower() for w in words]

But in this case I want to apply .lower() to the first element (the list of strings) of every tuple in the list, and after trying various options such as:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents], or
[x[0].lower() for x in documents]

I always get this error:

AttributeError: 'list' object has no attribute 'lower'

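I also tried applying what I need before creating the list, but .categories() and .fileids() are attributes of the corpus, and calling .lower() on their results gives the same error (they are lists as well).

Any help would be greatly appreciated.

SOLVED:

Both @Adam Smith's and @vasia's answers were right: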

[([s.lower() for s in item[0]], item[1]) for item in documents]

@Adam's answer above keeps the tuple structure; @vasia's does the lowercasing right at the point where the list of tuples is created:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

Thanks, everyone :)

Answer 1

You're close. You're looking for a construct like this:

[([s.lower() for s in ls], cat) for ls, cat in documents]

It essentially combines these two attempts:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents]
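To make the difference concrete, here is a minimal sketch on made-up sample data in the same shape as the question. The helper names (`attempt_1`, `attempt_2`, `combined`, `fails_with_attribute_error`) are hypothetical and only serve the illustration.

```python
# Hypothetical sample data shaped like the question: [([str], str)]
documents = [
    (["A", "pilot", "investigation", "."], "cancer"),
    (["A", "Systematic", "Review", "."], "hd"),
]

def attempt_1(docs):
    # element is a whole tuple, so the first x is the word list itself
    return [[x.lower() for x in element] for element in docs]

def attempt_2(docs):
    # unpacking is right, but x is still the word list, not a string
    return [(x.lower(), y) for x, y in docs]

def combined(docs):
    # unpack each tuple, then lowercase every word inside the list
    return [([s.lower() for s in ls], cat) for ls, cat in docs]

def fails_with_attribute_error(func, docs):
    """Return True if func(docs) raises AttributeError."""
    try:
        func(docs)
    except AttributeError:
        return True
    return False

attempt_1_fails = fails_with_attribute_error(attempt_1, documents)
attempt_2_fails = fails_with_attribute_error(attempt_2, documents)
lowered = combined(documents)
print(attempt_1_fails, attempt_2_fails)
print(lowered)
```

Both attempts call .lower() on a list, which is where the AttributeError comes from; the combined form only ever calls it on the strings.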

Answer 2

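So your data structure is [([str], str)]: a list of tuples, where each tuple is (list of strings, string). It's important to understand what that means before you try to pull data out of it.

That means `for item in documents` iterates over the list, and each item is one tuple.

That means item[0] is the list inside each tuple.

That means `for item in documents: for s in item[0]:` iterates over every string in that list. Let's try it!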

[s.lower() for item in documents for s in item[0]]

Given your example data, this should give a single flat list of lowercased words:

[u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...]

If you want to keep the tuple format, you can do this:

[([s.lower() for s in item[0]], item[1]) for item in documents]

# or perhaps more readably
[([s.lower() for s in lst], val) for lst, val in documents]

Both statements give:

[([u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...], 'cancer'), ... ]
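As a rough side-by-side of the flat and the tuple-preserving forms, here is a small sketch on shortened, made-up sample data. It also shows that both comprehensions build new lists and leave the original word lists alone.

```python
# Hypothetical, shortened sample data shaped like the question
documents = [
    (["A", "pilot", "investigation"], "cancer"),
    (["A", "Systematic", "Review"], "hd"),
]

# Flat: one list of lowercased words; the categories are dropped
flat = [s.lower() for lst, _ in documents for s in lst]

# Structured: same shape as the input, (list of words, category)
structured = [([s.lower() for s in lst], val) for lst, val in documents]

print(flat)
print(structured)
print(documents[0][0])  # the original word list is untouched
```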

Answer 3

Try this:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]
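To try this without a real corpus installed, the sketch below uses a hypothetical stand-in class, `TinyCorpus`, that only mimics the three calls used above (`categories()`, `fileids(category)`, `words(fileid)`). It is an assumption for illustration, not the actual corpus reader from the question, and `build_documents` is likewise a made-up wrapper.

```python
class TinyCorpus:
    """Hypothetical stand-in exposing categories(), fileids(), and words()."""

    def __init__(self, files):
        # files maps fileid -> (category, list of tokens)
        self._files = files

    def categories(self):
        return sorted({cat for cat, _ in self._files.values()})

    def fileids(self, category):
        return [fid for fid, (cat, _) in self._files.items() if cat == category]

    def words(self, fileid):
        return self._files[fileid][1]


def build_documents(corpus):
    # Lowercase each word while the (words, category) tuples are being built
    return [([word.lower() for word in corpus.words(fileid)], category)
            for category in corpus.categories()
            for fileid in corpus.fileids(category)]


corpus = TinyCorpus({
    "a.txt": ("cancer", ["A", "pilot", "investigation"]),
    "b.txt": ("hd", ["A", "Systematic", "Review"]),
})
documents = build_documents(corpus)
print(documents)
```

Lowercasing at construction time avoids a second pass over the data, at the cost of never having the original casing available.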

Answer 4

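In general, tuples are immutable. But since the first element of each tuple is a list, and lists are mutable, you can modify the list's contents without changing which list the tuple holds: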

documents = [(...what you originally posted...) ... etc. ...]

for d in documents:
    # to lowercase all strings in the list
    # trailing '[:]' is important, need to modify list in place using slice
    d[0][:] = [w.lower() for w in d[0]]

    # or to just lower-case the first element of the list (which is what you asked for)
    d[0][0] = d[0][0].lower()

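You can't just call lower() on a string and have it updated: lower() returns a new string. So to replace a string with its lowercase version, you have to assign the result back. That would not be possible if the string itself were a member of the tuple, but since the strings you are changing live in a list inside the tuple, you can modify the list's contents without changing which list the tuple holds.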
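A small sketch of the in-place approach on made-up sample data, checking that the tuple still holds the very same list object afterwards and that rebinding the tuple slot itself is rejected. The variable names are hypothetical.

```python
# Hypothetical, shortened sample data shaped like the question
documents = [
    (["A", "pilot", "investigation"], "cancer"),
    (["A", "Systematic", "Review"], "hd"),
]

first_list = documents[0][0]  # keep a reference to compare identity later

for d in documents:
    # Slice assignment replaces the contents of the existing list object
    d[0][:] = [w.lower() for w in d[0]]

# The tuple still points at the same list; only its contents changed
same_object = documents[0][0] is first_list

# Rebinding the tuple slot itself is what tuples forbid
try:
    documents[0][0] = ["x"]
    rebinding_allowed = True
except TypeError:
    rebinding_allowed = False

print(documents)
print(same_object, rebinding_allowed)
```

Because this mutates the existing lists, any other reference to them sees the lowercased words as well; the comprehension-based answers avoid that by building new lists.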
