Extract text from file over separate/separated lines in Python

Question

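Hoping someone can help, as I'm new to Python and struggling a bit.

I need to take the contents of a text file and extract parts of it into another file, capturing some lines but ignoring others. For example, this is part of the original file: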

input1   
   name "Bob"   
   Always_active  
next_input  
input2   
   name "Alice"   
   Sometimes_active   
next_input  
input3   
   name "Ted"   
   Always_active   
next_input  
input4    
   name "Albert"   
   Never_active   
next_input  
input5   
   name "Sue"   
   Always_active   
next_input  
input6   
   name "David"   
   Never_active   
next_input  
input7   
   name "Building1"   
   Always_active   
next_input  
input8   
   name "Building2"   
   Always_active   
next_input  
input9   
   name "Building3"   
   Always_active   
next_input  
input10   
   name "Building4"     
   Always_active     
next_input

This is what I would like to capture:

input1   
   name "Bob"   
input2   
   name "Alice"   
input3   
   name "Ted"   
input4   
   name "Albert"   
input5   
   name "Sue"   
input6   
   name "David"   
input7   
   name "Building1"   
input8    
   name "Building2"   
input9   
   name "Building3"   
input10   
   name "Building4"

So basically I need to ignore some lines and capture the rest. How can I do this?

Answer 1

You can start by reading the file:

with open(r'C:\your_file_path\your_file_name.txt', 'r') as f:
    text = f.read().splitlines()

You will then have the variable text, which is a list in which each element is one line of the file. Next, check each line against a condition:

new_text = []
for i in text:
    line = i.strip()  # ignore the indentation and trailing spaces in the file
    if line.startswith('input'):
        new_text.append(i)
    elif line.startswith('name "'):
        new_text.append(i)

The variable new_text is a list containing the lines you want.
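
The steps in this answer can be put together and tried on a small in-memory sample. This is only a sketch: the function name keep_wanted_lines and the sample string are made up for illustration, and it assumes that indentation should be ignored when testing a line, since the name lines in the question are indented.

```python
def keep_wanted_lines(lines):
    """Return the lines that start with 'input' or 'name "' once indentation is ignored."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # 'next_input' does not start with 'input', so it is dropped here
        if stripped.startswith(('input', 'name "')):
            kept.append(line.rstrip())  # keep the indentation, drop trailing spaces
    return kept


# Hypothetical sample mirroring the first two records of the question
sample = '''input1
   name "Bob"
   Always_active
next_input
input2
   name "Alice"
   Sometimes_active
next_input'''

result = keep_wanted_lines(sample.splitlines())
print('\n'.join(result))
```

To save the result, the joined string could be written to a file opened in 'w' mode.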

Answer 2

You should read the file line by line and filter the lines you want to write to the new file.

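# First read the file
with open(filename, 'r') as f:
    lines = f.readlines()

# Lines you don't want, compared after stripping whitespace and the newline
unwanted = ('Always_active', 'Sometimes_active', 'Never_active', 'next_input')

# Create the new file; output_filename is a placeholder for a different path
# than filename, so the original file is not modified
with open(output_filename, 'w') as f2:
    for line in lines:
        if line.strip() not in unwanted:  # Filter lines you don't want
            f2.write(line)  # Each line already ends with its own newline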

That should work.
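
The same idea can be tried end to end without touching real data. The sketch below is an illustration under assumptions: filter_file, UNWANTED and the temporary file names are hypothetical, and the set of unwanted lines is inferred from the sample in the question.

```python
import os
import tempfile

# Assumed from the question's sample: every line that should not be copied
UNWANTED = {'Always_active', 'Sometimes_active', 'Never_active', 'next_input'}


def filter_file(src_path, dst_path, unwanted=UNWANTED):
    """Copy src_path to dst_path, skipping lines whose stripped text is in `unwanted`."""
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:  # iterating over the file reads one line at a time
            if line.strip() not in unwanted:
                dst.write(line)  # the line keeps its own trailing newline


# Small demo using temporary files (the paths are placeholders)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'original.txt')
dst_path = os.path.join(tmp_dir, 'filtered.txt')

with open(src_path, 'w') as f:
    f.write('input1\n   name "Bob"\n   Always_active\nnext_input\n')

filter_file(src_path, dst_path)

with open(dst_path) as f:
    filtered = f.read()
print(filtered)
```

Iterating over the file object directly avoids loading the whole file into memory, which matters only for large inputs.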

Answer 3

You can use pandas to read and edit the file.

Suppose you saved the text in temp_txt_file.txt, in the same folder as your Python script.

import pandas as pd
import csv

data = pd.read_csv("./temp_txt_file.txt", header=None)
# strip() first so that indented lines such as '   name "Bob"' still match
data_filtered = data.loc[data.loc[:, 0].apply(lambda x: x.strip().find("name") == 0 or x.strip().find("input") == 0)]
data_filtered.to_csv(path_or_buf='./filtered_data.txt', index=False, header=False, quoting=csv.QUOTE_NONE)

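You first load the data into a pandas.DataFrame. In the second statement, you use loc to select all rows whose string starts with name or input (this is done by .find(), which returns 0 if the string starts with the query).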

Note: With .apply(lambda x: function) you can apply a function to every row of the pandas.DataFrame; x holds the row as a string.

The code:

data.loc[:, 0].apply(lambda x: x.strip().find("name") == 0 or x.strip().find("input") == 0)

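produces a Series of True/False values, which you can think of as a mask that filters out the unwanted rows. You then apply it to the pandas.DataFrame with loc and save the result to a new file, "filtered_data.txt", without any header or the pandas.DataFrame index.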
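
To see what such a mask looks like, here is a minimal sketch on an in-memory pandas.Series instead of a file. The sample values are hypothetical, and the lines are stripped before the .find() test on the assumption that indented lines should still match.

```python
import pandas as pd

# Hypothetical in-memory sample standing in for column 0 of the DataFrame
lines = pd.Series(['input1', '   name "Bob"', '   Always_active', 'next_input'])

# True where the stripped line starts with "name" or "input"
mask = lines.apply(
    lambda x: x.strip().find("name") == 0 or x.strip().find("input") == 0
)

print(mask.tolist())         # one boolean per line
print(lines[mask].tolist())  # only the lines where the mask is True
```

Note that 'next_input' is rejected because .find("input") returns 5 for it, not 0.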

Note: quoting=csv.QUOTE_NONE avoids unnecessary escape characters around the " in the data.

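Suggestion: It would help a lot if you could format your text file differently and use a .csv file, since your data looks somewhat tabular. You could then dig deeper into pandas, which is great for organizing data.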
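
As a rough illustration of that suggestion, the records could be converted to CSV rows with the standard library. This is a sketch under assumptions: parse_records is a made-up helper, the column names input, name and status are invented, and each record is assumed to end with a next_input line as in the question.

```python
import csv
import io


def parse_records(lines):
    """Group the lines between 'inputN' and 'next_input' into one row per record."""
    records, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped == 'next_input':
            if current:
                records.append(current)
            current = []
        elif stripped.startswith('name '):
            # keep only the text between the double quotes
            current.append(stripped[len('name '):].strip('"'))
        else:
            current.append(stripped)
    if current:  # last record without a closing 'next_input'
        records.append(current)
    return records


# Hypothetical sample mirroring the first two records of the question
sample = ['input1', '   name "Bob"', '   Always_active', 'next_input',
          'input2', '   name "Alice"', '   Sometimes_active', 'next_input']

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['input', 'name', 'status'])  # column names are an assumption
writer.writerows(parse_records(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

A file in this shape can be loaded directly with pd.read_csv, after which filtering becomes a matter of selecting columns.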
