python str.format with utf-8 characters that take more than 1 position

Question

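I'm trying to print Japanese characters in Python, aligned in columns. It seems that a Japanese character is as wide as two spaces, so the alignment doesn't work.

The code is the following: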

def print_kanji(s, k):
    print('{:<20}{:<10}{:<10}{:<10}'
        .format(s, k['reading'][0], k['reading'][1], k['kanji']))

# 's' is some input string and 'k' is a map containing the readings in the three Japanese scripts.

The output I get is the following:

decir               いう        イウ        言う        

pequeño             すくない      スクナイ      少ない       

niño                こども       コドモ       子供        

ya [ha hecho X]     もう        モウ

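The left column is in Spanish, but that doesn't matter. What matters is that the three columns on the right are not aligned. I counted the number of positions and it is correct, that is, the first Japanese column is always 10 'positions' long. The problem is that Japanese characters are 2 positions wide while the space is only 1.

I also checked that a blank typed with Japanese input is two positions wide as well, so I should be able to fix this by replacing the 'latin' space (1 position wide) with the Japanese one.

How can I change the character that format uses to align strings?

Edit

I found that the str.format spec has a fill parameter. I tried replacing it with the Japanese blank (two positions wide), and the result was worse.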
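
For reference, a minimal sketch of what such a fill attempt can look like. The names WIDE_SPACE and padded are made up for illustration, and it assumes the Japanese blank is U+3000 IDEOGRAPHIC SPACE, which may not be exactly what was typed here:

```python
# U+3000 IDEOGRAPHIC SPACE, the full-width space produced by Japanese input
# (assumed; the original attempt is not shown).
WIDE_SPACE = '\u3000'

# The fill character goes right before the alignment sign in the format spec.
padded = '{:{fill}<10}'.format('いう', fill=WIDE_SPACE)

print(repr(padded))
print(len(padded))  # str.format pads to 10 characters, not 10 terminal columns
```

Since the field is measured in characters, a two-column fill makes the field roughly 20 columns wide on a terminal that draws these characters double width, which would explain why mixing it with the Latin column makes things worse.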

Edit 2

I have solved it by implementing this function:

def get_formatted_kanji(h, k, kn):
    # Pad each string to 10 columns, assuming every character is 2 columns wide.
    h2 = h + ' ' * (10 - 2 * len(h))
    k2 = k + ' ' * (10 - 2 * len(k))
    kn2 = kn + ' ' * (10 - 2 * len(kn))
    return h2 + k2 + kn2

# h, k and kn are the three Japanese strings to be formatted in columns

However, is there a better (built-in) way to achieve this?

Answer 1

You should be able to change the locale used for formatting like this:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'ja-JP') # or 'jpn'

Answer 2

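In terminals it is common for some characters to occupy two columns while others occupy one. You can find out which is which with Python's unicodedata module, which provides east_asian_width().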
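
As a quick illustration of what east_asian_width() returns, here is a small sketch that looks up the width class of a few sample characters. The samples and classes names are chosen just for this example:

```python
import unicodedata

# A few sample characters and a human-readable label for each.
samples = {
    'a': 'ASCII letter',
    'い': 'hiragana',
    'イ': 'katakana',
    '言': 'kanji',
    '\uff72': 'halfwidth katakana',
    '\uff21': 'fullwidth Latin letter',
}

# Map each character to its East Asian Width class ('Na', 'W', 'H', 'F', ...).
classes = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, label in samples.items():
    print(repr(ch), label, classes[ch])
```

The 'W' (wide) and 'F' (fullwidth) classes are the ones that normally take two terminal columns.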

Here is an example of how to use it to pad text:

import unicodedata
table = [
    ('decir', 'いう', 'イウ', '言う'), 
    ('pequeño', 'すくない', 'スクナイ', '少ない'), 
    ('niño', 'こども', 'コドモ', '子供'), 
    ('ya [ha hecho X]', 'もう', 'モウ', ''),
]

WIDTHS = {
    'F': 2,
    'H': 1,
    'W': 2,
    'N': 1,
    'A': 1, # Not really correct...
    'Na': 1,
}

def pad(text, width):
    text_width = 0
    for ch in text:
        width_class = unicodedata.east_asian_width(ch)
        text_width += WIDTHS[width_class]
    if width <= text_width:
        return text
    return text + ' ' * (width - text_width)

for s, reading1, reading2, kanji in table:
    print('{}{}{}{}'.format(
        pad(s, 20),
        pad(reading1, 10),
        pad(reading2, 10),
        pad(kanji, 10),
    ))

Here is a screenshot of the output on my system (macOS): [screenshot]

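Limitations

The code above does not handle Unicode combining characters. A more complete implementation would perform Unicode text segmentation and then work out the width of each grapheme cluster. I'm sure there are libraries that will do this for you.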
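
As a partial step in that direction, here is a rough sketch that treats combining marks as zero-width and then lets str.format do the padding. It is not full grapheme-cluster segmentation: it only skips characters with a non-zero canonical combining class, and the helper name display_width is invented for this example.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns, treating combining marks as zero-width."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining mark: drawn on top of the previous character
        # 'W' (wide) and 'F' (fullwidth) take two columns; everything else one.
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

def pad(text, width):
    """Left-align text in `width` terminal columns using str.format."""
    # str.format counts code points, so adjust the field width by the
    # difference between displayed columns and code points.
    extra = display_width(text) - len(text)
    return '{:<{w}}'.format(text, w=max(width - extra, 0))

print(pad('nin\u0303o', 20) + pad('こども', 10) + pad('コドモ', 10) + '子供')
```

Like the WIDTHS table above, this counts ambiguous-width ('A') characters as one column, which may not match every terminal.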

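Language note

Note that I don't think "少ない" and "pequeño" are equivalent. Spanish "pequeño" refers to the size of something, whereas "少ない" refers to quantity.

I think these are more likely: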

poco: 少ない
pequeño: 小さい
