Converting multi-language date-time formats to "%Y-%m-%d"
I am scraping the references at the bottom of Wikipedia pages. Each citation contains an OpenURL link that I can parse. Here is an example:
<span
title="ctx_ver=Z39.88-2004&
rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&
rft.genre=unknown&
rft.jtitle=The+Tennessean&
rft.atitle=Belmont+University+awarded+final+2020+presidential+debate&
rft.date=2019-10-11&
rft.aulast=Tamburin&
rft.aufirst=Adam&
rft_id=https%3A%2F%2Fwww.tennessean.com%2Fstory%2Fnews%2F2019%2F10%2F11%2Fbelmont-university-nashville-hosts-presidential-debate-2020%2F3941983002%2F&
rfr_id=info%3Asid%2Fen.wikipedia.org%3A2020+United+States+presidential+election"
class="Z3988">
</span>
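For readers following along, here is a minimal sketch of how the rft.date value can be pulled out of such a title attribute with the standard library alone. The title string is an abbreviated copy of the example above, and the variable names are illustrative rather than taken from the original post.

```python
from urllib.parse import parse_qs

# Abbreviated copy of the title attribute above, line breaks included.
title = """ctx_ver=Z39.88-2004&
rft.genre=unknown&
rft.jtitle=The+Tennessean&
rft.date=2019-10-11&
rft.aulast=Tamburin&
rft.aufirst=Adam"""

# Spaces inside values are encoded as '+', so any literal whitespace is only
# layout and can be dropped before parsing.
query = "".join(title.split())

fields = parse_qs(query)  # maps each key to a list of decoded values
rft_date = fields.get("rft.date", [None])[0]
print(rft_date)  # 2019-10-11
```

Because parse_qs returns a list per key, a citation without a date simply yields None here instead of raising.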
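I can successfully extract the rft.date value. However, the values come in different formats. I am trying to do two things:
- 'Guess' the language and translate (if possible)
- Determine the format and reformat to "%Y-%m-%d"
Without the language issue, I could use dateutil (see about halfway down the page). The language issue, however, has me completely stumped.
Does anyone have suggestions for how to handle translating examples like these?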
0 "մայիսի 8, 2019"
1 "մայիսի 6, 2019"
2 "մայիսի 10, 2019"
3 "June 20, 2019"
4 "January 16, 2019"
5 "Aug 8, 2019"
6 "Aug 4, 2019"
...
12 "9 August 2019"
13 "8 May 2019"
14 "8 July 2020"
15 "8 July 2019"
16 "8 January 2020"
17 "8 de enero de 2020"
18 "7 tháng 8 năm 2019"
19 "7 May 2020"
...
33 "31 de diciembre de 2019"
...
40 "28 December 2019"
41 "28 de diciembre de 2019"
42 "27 de septiembre de 2019"
43 "26 November 2019"
44 "25 tháng 6 năm 2019"
45 "25 May 2019"
46 "25 March 2020"
47 "25 June 2019"
48 "24 June 2019"
49 "23 July 2019"
50 "22 tháng 7 năm 2019"
51 "22 July 2020"
52 "22 de abril de 2019"
53 "21 August 2019"
54 "2020-10-18"
55 "2020-09-21"
56 "2020-09-19"
57 "2020-09-16"
You can use the googletrans Python library to achieve this.
I tried it locally and it seems to work well.
The code is as follows:
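import pandas as pd
from googletrans import Translator

translator = Translator()

# Read the sample data: column 0 is the row index, column 1 the raw date string
df = pd.read_csv('input_file.tsv', sep='\t', header=None, index_col=0)
df.columns = ['date']

# Translate each date string to English; the source language is auto-detected
df['translated'] = df['date'].map(lambda x: translator.translate(x, dest='en').text)
print(df)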
Output:
    date                      translated
0   մայիսի 8, 2019            May 8, 2019
1   մայիսի 6, 2019            May 6, 2019
2   մայիսի 10, 2019           May 10, 2019
3   June 20, 2019             June 20, 2019
4   January 16, 2019          January 16, 2019
5   Aug 8, 2019               Aug 8, 2019
6   Aug 4, 2019               Aug 4, 2019
12  9 August 2019             9 August 2019
13  8 May 2019                8 May 2019
14  8 July 2020               8 July 2020
15  8 July 2019               8 July 2019
16  8 January 2020            8 January 2020
17  8 de enero de 2020        January 8, 2020
18  7 tháng 8 năm 2019        August 7, 2019
19  7 May 2020                7 May 2020
33  31 de diciembre de 2019   December 31, 2019
40  28 December 2019          28 December 2019
41  28 de diciembre de 2019   Dec 28, 2019
42  27 de septiembre de 2019  Sep 27, 2019
43  26 November 2019          26 November 2019
44  25 tháng 6 năm 2019       June 25, 2019
45  25 May 2019               25 May 2019
46  25 March 2020             25 March 2020
47  25 June 2019              25 June 2019
48  24 June 2019              24 June 2019
49  23 July 2019              23 July 2019
50  22 tháng 7 năm 2019       July 22, 2019
51  22 July 2020              22 July 2020
52  22 de abril de 2019       Apr 22, 2019
53  21 August 2019            21 August 2019
54  2020-10-18                2020-10-18
55  2020-09-21                2020-09-21
56  2020-09-19                2020-09-19
57  2020-09-16                2020-09-16
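The translated column still mixes several English layouts ("May 8, 2019", "9 August 2019", "Dec 28, 2019", "2020-10-18"), so the second goal in the question, reformatting to "%Y-%m-%d", remains. Below is a minimal sketch of that step using only datetime.strptime. The list of candidate formats is an assumption based on the rows shown above, the helper name to_iso is hypothetical, and month names are matched in the current locale (assumed to be English). dateutil, which the question mentions, could replace the hand-written format list.

```python
from datetime import datetime

# Layouts seen in the translated column above. This list is an assumption
# and would need extending if other layouts turn up.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 2020-10-18
    "%B %d, %Y",  # May 8, 2019
    "%b %d, %Y",  # Aug 8, 2019
    "%d %B %Y",   # 9 August 2019
    "%d %b %Y",   # 9 Aug 2019
)

def to_iso(text):
    """Return text as 'YYYY-MM-DD', or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next layout
    return None

samples = ["May 8, 2019", "Aug 8, 2019", "9 August 2019", "Dec 28, 2019", "2020-10-18"]
for s in samples:
    print(s, "->", to_iso(s))
```

With the DataFrame from the answer, this could be applied as df['translated'].map(to_iso). Returning None for unmatched strings keeps unparsed rows easy to spot.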