Reading in parts of file, stopping and starting with certain words

Question

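I am using python 2.7. I have been assigned (a self-directed assignment, I wrote these instructions) to write a small static html generator, and I would like help finding new-to-python oriented resources on reading parts of a file at a time. If someone provides a code answer, that's great, but I want to understand the why and how of the way python works. I can buy a book, but not an expensive one: right now I can afford thirty, maybe forty dollars for this particular piece of research.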

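The way this program is supposed to work is that there is a template.html file, a message.txt file, image files, an archive.html file, and an output.html file. This is more information than you need, but my basic idea is to "go back and forth reading from template and message, putting their contents in output and then writing in archive that output exists". I haven't gotten that far yet, though, and I'm not asking you to solve the whole problem, as I detail below: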

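The program reads in html from template.html, stopping at the opening <title> tag, and then reads in the page title from message.txt. That is where I am now. It works! I was so happy... until a few hours ago, when I realized that wasn't the final boss.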

#doctype to title
copyLine = False
for line in template.readlines():
    if not '<title>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

#read name of message
titleOut = message.readline()
print titleOut, " is the title of the new page"
#--------
##5. Put the title from the message file in the head>title tag of the output file
#--------
titleOut = str(titleOut)
titleTag = "<title>"+titleOut+"|Circuit Salsa</title>"
outputhtml.write(titleTag)

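My problem is: I don't understand regular expressions, and when I try various forms of for...in code, I get all of the template, none of the template, or some combination of parts of the template that I don't want... Anyway, how do I go back and forth reading these files and pick up where I left off? Any help finding easier-to-understand resources is much appreciated. I have spent about five hours researching this and I have a headache, because I keep finding resources aimed at a more advanced audience and I don't understand them.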

Here are the last two approaches I tried (without success):

block = ""
found = False
print "0"
for line in template:
    if found:
        print "1"
        block += line
        if line.strip() == "<h1>": break
else:
    if line.strip() == "</title>":
        print "2"
        found = True
        block = "</title>"

print block + "3"

Only points 0 and 3 were printed. I put the numbered print statements in because I couldn't tell why my output file wasn't changing.

template.seek(templateSeek)
copyLine = False
for line in template.readlines():
    if not '<a>' in line:
        copyLine = True
        if copyLine:
            outputhtml.write(line)
            copyLine = False
    else:
        templateSeek = template.tell()
        break

As for the other one, I'm pretty sure I'm just doing it wrong.

Answer 1

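I would use BeautifulSoup for this. An alternative is to use regular expressions, which are good to know anyway. I know they look scary, but they are actually not hard to learn (it took me an hour or so). For example, to get all link tags, you could do something like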

from re import findall, DOTALL

html = '''
<!DOCTYPE html>
<html>

<head>
    <title>My awesome web page!</title>
</head>

<body>
    <h2>Sites I like</h2>
    <ul>
        <li><a href="https://www.google.com/">Google</a></li>
        <li><a href="https://www.facebook.com">Facebook</a></li>
        <li><a href="http://www.amazon.com">Amazon</a></li>
    </ul>

    <h2>My favorite foods</h2>
    <ol>
        <li>Pizza</li>
        <li>French Fries</li>
    </ol>
</body>

</html>
'''

def find_tag(src, tag):
    return findall(r'<{0}.*?>.*?</{0}>'.format(tag), src, DOTALL)

print find_tag(html, 'a')
# ['<a href="https://www.google.com/">Google</a>', '<a href="https://www.facebook.com">Facebook</a>', '<a href="http://www.amazon.com">Amazon</a>']
print find_tag(html, 'li')
# ['<li><a href="https://www.google.com/">Google</a></li>', '<li><a href="https://www.facebook.com">Facebook</a></li>', '<li><a href="http://www.amazon.com">Amazon</a></li>', '<li>Pizza</li>', '<li>French Fries</li>']
print find_tag(html, 'body')
# ['<body>\n    <h2>Sites I like</h2>\n    <ul>\n        <li><a href="https://www.google.com/">Google</a></li>\n        <li><a href="https://www.facebook.com">Facebook</a></li>\n        <li><a href="http://www.amazon.com">Amazon</a></li>\n    </ul>\n\n    <h2>My favorite foods</h2>\n    <ol>\n        <li>Pizza</li>\n        <li>French Fries</li>\n    </ol>\n</body>']
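
Applied to the template in the question, the same regular-expression idea might be sketched as follows. This is only an illustration: set_title is a hypothetical helper, and it assumes the template contains a single <title>...</title> pair, which the question does not confirm. It is written to run on both Python 2.7 and Python 3.

```python
import re

def set_title(template_html, new_title):
    """Return template_html with the text inside <title>...</title> replaced.

    Hypothetical helper: assumes exactly one <title>...</title> pair exists.
    """
    pattern = re.compile(r'(<title>).*?(</title>)', re.DOTALL)
    # Using a function as the replacement keeps any backslashes in
    # new_title from being interpreted as regex escape sequences.
    return pattern.sub(lambda m: m.group(1) + new_title + m.group(2),
                       template_html, count=1)

# Made-up template text, standing in for the contents of template.html.
template = ("<html>\n<head>\n<title>placeholder</title>\n</head>\n"
            "<body></body>\n</html>\n")
page = set_title(template, "My first post | Circuit Salsa")
```

Here page is the whole template as one string, with only the title text swapped out; it could then be written to the output file in a single write() call.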

I hope you find at least some of this useful. If you have any follow-up questions, please comment on my answer. Good luck!

Answer 2

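You have an indentation problem in your first attempt. The else clause is at the same indentation level as the for statement, so together they form the compound for:else: control structure. New Python programmers are often confused by this. The else: clause executes only when the for loop runs to completion without hitting a break statement. In your case the "if found:" branch is never taken, so break is never reached; the else: clause therefore runs exactly once, after the loop has finished, and tests only the last line of the file. Because the else: clause is outside the loop, "found" is never set to True while the loop is running. I think you will like the result if you indent the else: clause so that it pairs with "if found:". Also, I think you can drop the calls to strip() and instead use statements like "if '<h1>' in line:".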
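
To make the for:else: behaviour concrete, here is a minimal sketch. The function names find_first and block_between, and the sample lines, are made up for illustration; block_between is one possible corrected form of the first attempt, with the else: indented so that it belongs to the if.

```python
def find_first(lines, needle):
    """Return the 1-based number of the first line containing needle."""
    for number, line in enumerate(lines, 1):
        if needle in line:
            result = number
            break
    else:
        # This else belongs to the for: it runs only when the loop
        # finished without executing break.
        result = None
    return result


def block_between(lines, start_marker, stop_marker):
    """Collect the lines from start_marker through stop_marker, inclusive."""
    block = ""
    found = False
    for line in lines:
        if found:
            block += line
            if stop_marker in line:
                break
        else:
            # This else belongs to "if found:", so it is checked on
            # every line until the start marker has been seen.
            if start_marker in line:
                found = True
                block = line
    return block


# Made-up template lines, standing in for the lines of template.html.
sample = ["<head>\n", "<title>\n", "</title>\n", "</head>\n",
          "<body>\n", "<h1>\n", "heading\n"]
print(find_first(sample, "<h1>"))
print(find_first(sample, "<table>"))
print(block_between(sample, "</title>", "<h1>"))
```

The only difference between the two else: clauses is their indentation, which is what decides whether Python pairs them with the for or with the if.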

I suspect you are right about the second snippet. It makes no sense to me.
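
One likely reason the second snippet misbehaves: readlines() reads the whole file before the loop starts, so by the time tell() is called the position is already at the end of the file, and a later seek() to that position leaves nothing to read. A file object remembers its own position, so a simpler approach is to keep it open and call readline() repeatedly. The sketch below is only an illustration: copy_until is a hypothetical helper, the template text is made up, and io.StringIO stands in for real files opened with open().

```python
import io

def copy_until(src, dst, marker):
    """Copy lines from src to dst until a line containing marker is read.

    Returns the marker line (which is not copied), or None if the end of
    the file is reached first. src is left positioned just after the
    marker line, so the next call continues from there.
    """
    while True:
        line = src.readline()
        if not line:          # readline() returns "" only at end of file
            return None
        if marker in line:
            return line
        dst.write(line)


# Stand-ins for template.html and output.html.
template = io.StringIO(u"<!DOCTYPE html>\n<html>\n<head>\n<title>\n"
                       u"</head>\n<body>\n<h1>\n</body>\n")
output = io.StringIO()

copy_until(template, output, u"<title>")      # copy up to the title line
output.write(u"<title>Hello | Circuit Salsa</title>\n")
copy_until(template, output, u"<h1>")         # resume, copy up to <h1>
output.write(u"<h1>Hello</h1>\n")
output.write(template.read())                 # copy whatever is left

result = output.getvalue()
```

No tell() or seek() is needed here, because each call picks up at the line after the one where the previous call stopped.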

Answer 3

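Late last night I came across a solution that does what I want. While learning regular expressions would be a useful skill, and one I will definitely work on over the summer, regex is overkill for this particular application. I ended up using linecache to read specific lines, since the information I want from these files is separated by newlines.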
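
A minimal sketch of what that can look like, assuming a message file with the title on the first line. The second line holding an image name is a made-up layout for illustration, and a temporary file stands in for message.txt.

```python
import linecache
import os
import tempfile

# Create a stand-in for message.txt (hypothetical contents and layout).
handle, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "w") as f:
    f.write("My first post\nimage.png\nBody text of the post.\n")

# linecache.getline(filename, lineno) counts lines from 1 and returns
# the line including its trailing newline.
title = linecache.getline(path, 1).rstrip("\n")
image = linecache.getline(path, 2).rstrip("\n")

# A line number past the end of the file gives an empty string
# rather than raising an exception.
missing = linecache.getline(path, 99)

title_tag = "<title>" + title + " | Circuit Salsa</title>"

linecache.clearcache()   # drop the cached copy of the file
os.remove(path)
```

Because linecache keeps the file's lines in memory, this suits small files where each piece of information sits on a known line number.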
