table.decompose(): AttributeError: 'str' object has no attribute 'decompose'

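I am trying to use BeautifulSoup to parse an HTML document. I want to write code that parses the document, finds all tables, and deletes those whose ratio of numeric characters to alphanumeric characters is greater than 15%. I am using code that was given as an answer to a previous question.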

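But for some reason, the table.decompose() call is flagged as an error. I would appreciate any help I can get. Please note that I am a beginner, so, although I try, I don't always understand more complex solutions!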

The code is as follows:

import re

from bs4 import BeautifulSoup

test_file = 'locationoftestfile.html'


# Define a function to remove tables which have numeric characters/ alphabetic and numeric characters > 15%
def remove_table(table):
        table = re.sub('<[^>]*>', ' ', str(table))
        numeric = sum(c.isdigit() for c in table)
        print('numeric: ' + str(numeric))
        alphabetic = sum(c.isalpha() for c in table)
        print('alpha: ' + str(alphabetic))
        try:
                ratio = numeric / float(numeric + alphabetic)
                print('ratio: '+ str(ratio))
        except ZeroDivisionError as err:
                ratio = 1
        if ratio > 0.15: 
            table.decompose()


# Define a function to create our Soup object and then extract text
def file_to_text(file):
    soup_file = open(file, 'r')
    soup = BeautifulSoup(soup_file, 'html.parser')
    for table in soup.find_all('table'):
        remove_table(table)
    text = soup.get_text()
    return text


file_to_text(test_file)

This is the output/error I receive:

numeric: 1
alpha: 55
ratio: 0.017857142857142856
numeric: 9
alpha: 88
ratio: 0.09278350515463918
numeric: 20
alpha: 84
ratio: 0.19230769230769232
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-c7e380df4fdc> in <module>
----> 1 file_to_text(test_file)

<ipython-input-27-9fb65cec1313> in file_to_text(file)
     16                 ratio = 1
     17         if ratio > 0.15:
---> 18             table.decompose()
     19     text = soup.get_text()
     20     return text

AttributeError: 'str' object has no attribute 'decompose'

Please note that the table.decompose() call differs from what is given in the solution I linked. That solution uses

   return True
else:
   return False

But, perhaps naively, I don't understand how this would delete the table.

table = re.sub('<[^>]*>', ' ', str(table))

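This overwrites the parameter 'table' with a string. By the time table.decompose() runs, 'table' is no longer the BeautifulSoup tag but a plain str, and str has no decompose() method. You probably want to use a different name for the variable here. For example: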

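import re


def remove_table(table):
    # Keep the tag in 'table'; store the tag-stripped text under a new name.
    table_as_str = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table_as_str)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table_as_str)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError:
        # No letters or digits at all: treat the table as fully numeric.
        ratio = 1
    if ratio > 0.15:
        # 'table' is still the BeautifulSoup tag, so decompose() is available.
        table.decompose()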
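
As for the return True / return False version: the linked answer is not shown here, so this is only a guess at how it was meant to be used. A function that returns a boolean does not delete anything by itself. It only answers "should this table go?", and the caller then calls decompose() on the tag. A minimal sketch of that split follows. The names is_mostly_numeric and strip_numeric_tables are made up for illustration, and it uses table.get_text() instead of the regex to get the text without tags:

```python
def is_mostly_numeric(text, threshold=0.15):
    """Return True when digits make up more than `threshold` of all letters and digits."""
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    if total == 0:
        # No letters or digits at all: flag it, matching the ratio = 1 fallback above.
        return True
    return numeric / total > threshold


def strip_numeric_tables(html, threshold=0.15):
    """Parse `html`, drop every table the predicate flags, and return the remaining text."""
    # Imported here so the predicate above can be used without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    for table in soup.find_all('table'):
        # The predicate only sees a string; 'table' itself is never rebound.
        if is_mostly_numeric(table.get_text(' '), threshold):
            table.decompose()
    return soup.get_text()
```

Calling strip_numeric_tables(open(test_file).read()) would then play the role of file_to_text(test_file). Because the predicate only ever receives text, the tag can never be overwritten by accident. Nested tables get no special treatment in this sketch.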