Using Python PyMuPDF (fitz) to iterate through lines, check each line's length, and add a period if it meets the criteria
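I'm trying to loop through each line of a page with the PyMuPDF library and check the length of the sentence. If a line has fewer than 10 words, I want to add a period.
The pseudocode is: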
#loop through the lines of the PDF
#check number of words in line
#if line has less than 10 words
#add period
Here is the actual code:
import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
page = doc[0]
for page in doc:
    # getText() is the older camelCase name; newer PyMuPDF releases spell it get_text()
    text = page.getText("text")
    print(text)
When I add another for loop, such as
    for line in page:
I get an error message saying that the page is not iterable. Is there another way I can do this?
Thanks
To iterate over the lines of a page, you can use getDisplayList:
page_display = page.getDisplayList()
dictionary_elements = page_display.getTextPage().extractDICT()
# The dict is nested: blocks -> lines -> spans, and each span holds a piece of text
for block in dictionary_elements['blocks']:
    for line in block['lines']:
        line_text = ''
        for span in line['spans']:
            line_text += ' ' + span['text']
        print(line_text)
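Building on the blocks/lines/spans loop above, here is a minimal sketch of the word-count check from the question. The helper names (iter_line_texts, add_period_if_short, process_pdf) are made up for illustration and are not part of PyMuPDF. It assumes a recent PyMuPDF release where the call is page.get_text("dict") (older releases use getText), it joins spans without an extra space, and it skips lines that already end in ".", "!" or "?", which the question does not ask for.

```python
def iter_line_texts(page_dict):
    """Yield the text of each line in a PyMuPDF-style "dict" extraction."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            # Spans usually carry their own spacing, so join without a separator.
            yield "".join(span["text"] for span in line["spans"]).strip()


def add_period_if_short(text, max_words=10):
    """Append a period to non-empty lines with fewer than max_words words."""
    words = text.split()
    # Assumption: leave lines alone if they already end a sentence.
    if words and len(words) < max_words and not text.endswith((".", "!", "?")):
        return text + "."
    return text


def process_pdf(path):
    """Return all lines of the PDF, with short lines terminated by a period."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    result = []
    doc = fitz.open(path)
    for page in doc:
        page_dict = page.get_text("dict")  # getText("dict") on older releases
        result.extend(add_period_if_short(t) for t in iter_line_texts(page_dict))
    doc.close()
    return result
```

Calling process_pdf("my.pdf") returns a list of strings. It only changes the extracted text; writing the periods back into the PDF itself is a separate problem that this sketch does not attempt.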