我想将以下文本文件转换为以 headers 为列、以行为相应数据的文件
I want to convert the following text file into one having the headers as columns and the corresponding data in rows
下载后我所有的数据都在文本文件 (.txt) 中。我有多个这样的文件(至少 70000 个),我想我会先使用 cmd 中的 copy.txt 合并到一个主文本文件中。然后我想分析 mastertext 文件,这样理想的输出将是 .csv 文件。
需要将数据转换为 table 类结构以便于分析。每个文件(在我拥有的 70000 个 .txt 文件中)都以“['and ends with ']”开头,每一行新数据都以“{”开头并以“}”结尾,我有多行数据相同的格式。我希望转换后的数据具有 headers,例如 'campaignId',以便我可以使用数据透视分析列(就像您在 excel 中所做的那样)。我是编程新手,不确定 Python 中的哪些库可以帮助我实现目标。
这是我拥有的数据示例。 (注意 - 这些只是一个文件中的两行,我有大约 70000 个这样的文件)
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
I have made some assumptions as the data you have provided seems like a list of JSON data. This is because there is a "," in between your entries.
这是我的 input.json
文件。请注意,我在顶部和底部添加了 [ 和 ],因为这将为您的数据提供适当的 JSON 结构。
[
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
]
下面是使用 pandas
库将数据处理成数据帧然后将其写入 CSV 文件的代码。
import json # Available by default, no install required.
import glob # Available by default, no install required.
import pandas as pd # Requires installation via pip.
# Initialise a list to store our results.
combined_json = []
# Set a glob pattern to *.txt since your files are txt files.
# You can also write the full path e.g. /home/user/textfiles/*.txt
text_files = glob.glob("*.txt")
# Loop through all the text files and combine them into a single JSON list.
# As for 70,000 files, I am unsure how the performance will turn out.
for json_text in text_files:
with open(json_text, 'r') as text_file:
combined_json.extend(json.load(text_file))
# Write all the files to a JSON file. For your future usage.
# You also can read directly from the combed_json variable.
with open('input.json', 'w') as json_file:
json.dump(combined_json, json_file, indent=2)
# Convert the JSON data into a dataframe, using the combined_json variable.
json_df = pd.json_normalize(combined_json)
# Write the data from the dataframe to the CSV file.
# Mode "w" will always overwrite the CSV file, use mode "a" to append text instead of overwriting.
json_df.to_csv("dataframe.csv", mode="w")
更多关于pd.json_normalize
的工作原理,可以参考here.
要开始使用pandas
库,可以参考here。
如果你想从字符串而不是文件中加载 JSON,你可以参考 here.
要了解更多关于glob
的信息,您可以参考here。
{}
中包含的每个文本实际上都是 dict
类型。这种类型可以使用 pd.DataFrame
轻松转换为 pandas,最后使用 to_excel
.
转换为 excel
那么完整的代码可以这样写:
import pandas as pd
my_dict_1={
"campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Computers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":389, "clicks":3, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":0.142884, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":3, "searchImprShare":0.0, "contImprShare":5.0, "impressionShare":5.0
}
my_dict_2={
"campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Mobile devices with full browsers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":6101, "clicks":90, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":4.342799, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":90, "searchImprShare":0.0, "contImprShare":5.0077566465021217, "impressionShare":5.0077566465021217
}
all_dict=[my_dict_1,my_dict_2]
df1 = pd.DataFrame(all_dict)
df1.to_excel("output.xlsx",index = False)
下载后我所有的数据都在文本文件 (.txt) 中。我有多个这样的文件(至少 70000 个),我想我会先使用 cmd 中的 copy.txt 合并到一个主文本文件中。然后我想分析 mastertext 文件,这样理想的输出将是 .csv 文件。
需要将数据转换为 table 类结构以便于分析。每个文件(在我拥有的 70000 个 .txt 文件中)都以“['and ends with ']”开头,每一行新数据都以“{”开头并以“}”结尾,我有多行数据相同的格式。我希望转换后的数据具有 headers,例如 'campaignId',以便我可以使用数据透视分析列(就像您在 excel 中所做的那样)。我是编程新手,不确定 Python 中的哪些库可以帮助我实现目标。
这是我拥有的数据示例。 (注意 - 这些只是一个文件中的两行,我有大约 70000 个这样的文件)
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
I have made some assumptions as the data you have provided seems like a list of JSON data. This is because there is a "," in between your entries.
这是我的 input.json
文件。请注意,我在顶部和底部添加了 [ 和 ],因为这将为您的数据提供适当的 JSON 结构。
[
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Computers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 389,
"clicks": 3,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 0.142884,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 3,
"searchImprShare": 0.0,
"contImprShare": 5.0,
"impressionShare": 5.0
},
{
"campaignId": "all",
"startDate": "2020-06-11",
"endDate": "2020-06-11",
"device": "Mobile devices with full browsers",
"network": "Display Network",
"channel": "all",
"accLevelQS": -1.0,
"impressions": 6101,
"clicks": 90,
"ctr": 0.0,
"avgCPC": 0.0,
"convValuePerClick": 0.0,
"convValuePerCost": 0.0,
"costConv1PerClick": 0.0,
"convRate1PerClick": 0.0,
"cost": 4.342799,
"conv1PerClick": 0.0,
"totalConvValue": 0.00,
"allConversions": 0.0,
"allConversionValue": 0.00,
"avgPosition": 0.0,
"intr": 90,
"searchImprShare": 0.0,
"contImprShare": 5.0077566465021217,
"impressionShare": 5.0077566465021217
}
]
下面是使用 pandas
库将数据处理成数据帧然后将其写入 CSV 文件的代码。
import json # Available by default, no install required.
import glob # Available by default, no install required.
import pandas as pd # Requires installation via pip.
# Initialise a list to store our results.
combined_json = []
# Set a glob pattern to *.txt since your files are txt files.
# You can also write the full path e.g. /home/user/textfiles/*.txt
text_files = glob.glob("*.txt")
# Loop through all the text files and combine them into a single JSON list.
# As for 70,000 files, I am unsure how the performance will turn out.
for json_text in text_files:
with open(json_text, 'r') as text_file:
combined_json.extend(json.load(text_file))
# Write all the files to a JSON file. For your future usage.
# You also can read directly from the combed_json variable.
with open('input.json', 'w') as json_file:
json.dump(combined_json, json_file, indent=2)
# Convert the JSON data into a dataframe, using the combined_json variable.
json_df = pd.json_normalize(combined_json)
# Write the data from the dataframe to the CSV file.
# Mode "w" will always overwrite the CSV file, use mode "a" to append text instead of overwriting.
json_df.to_csv("dataframe.csv", mode="w")
更多关于pd.json_normalize
的工作原理,可以参考here.
要开始使用pandas
库,可以参考here。
如果你想从字符串而不是文件中加载 JSON,你可以参考 here.
要了解更多关于glob
的信息,您可以参考here。
{}
中包含的每个文本实际上都是 dict
类型。这种类型可以使用 pd.DataFrame
轻松转换为 pandas,最后使用 to_excel
.
那么完整的代码可以这样写:
import pandas as pd
my_dict_1={
"campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Computers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":389, "clicks":3, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":0.142884, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":3, "searchImprShare":0.0, "contImprShare":5.0, "impressionShare":5.0
}
my_dict_2={
"campaignId": "all", "startDate":"2020-06-11", "endDate":"2020-06-11", "device":"Mobile devices with full browsers", "network":"Display Network", "channel":"all", "accLevelQS":-1.0, "impressions":6101, "clicks":90, "ctr":0.0, "avgCPC":0.0, "convValuePerClick":0.0, "convValuePerCost":0.0, "costConv1PerClick":0.0, "convRate1PerClick":0.0, "cost":4.342799, "conv1PerClick":0.0, "totalConvValue":0.00, "allConversions":0.0, "allConversionValue":0.00, "avgPosition":0.0, "intr":90, "searchImprShare":0.0, "contImprShare":5.0077566465021217, "impressionShare":5.0077566465021217
}
all_dict=[my_dict_1,my_dict_2]
df1 = pd.DataFrame(all_dict)
df1.to_excel("output.xlsx",index = False)