Remove HTML tags from strings
I am trying to extract the text data from a string containing HTML tags using rvest.
Data:
[vc_row css_animation="" row_type="row" use_row_as_full_screen_section="no" type="full_width" angled_section="no" text_align="left" background_image_as_pattern="without_pattern"][vc_column][vc_column_text]\n\n<h6 class="button" style="padding: 0px 42%;">Description</h6>\n\n<ol>\n\n \t<li>Ideal for : Women</li>\n\n \t<li>Package Contents : 1 Pcs</li>\n\n \t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>\n\n \t<li>Care Instructions : Machine Wash and Normal Wash.</li>\n\n \t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>\n\n \t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>\n\n \t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>\n\n \t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>\n\n</ol>\n\n<img class="alignnone wp-image-858" src="https://justelite.in/wp-content/uploads/2020/06/art-size.jpg" alt="" />\n\n<h6 class="button" style="padding: 0px 42%;">Reviews</h6>\n\n[/vc_column_text][/vc_column][/vc_row]
What I did:
html_text(read_html(as.character(data)))
I still get vc_row css_animation and some other tags that were not removed.
dput of the data:
structure(2L, .Label = c("", "[vc_row css_animation=\"\" row_type=\"row\" use_row_as_full_screen_section=\"no\" type=\"full_width\" angled_section=\"no\" text_align=\"left\" background_image_as_pattern=\"without_pattern\"][vc_column][vc_column_text]\n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Description</h6>\n\n<ol>\n\n \t<li>Ideal for : Women</li>\n\n \t<li>Package Contents : 1 Pcs</li>\n\n \t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>\n\n \t<li>Care Instructions : Machine Wash and Normal Wash.</li>\n\n \t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>\n\n \t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>\n\n \t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>\n\n \t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>\n\n</ol>\n\n<img class=\"alignnone wp-image-858\" src=\"https://justelite.in/wp-content/uploads/2020/06/art-size.jpg\" alt=\"\" />\n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Reviews</h6>\n\n[/vc_column_text][/vc_column][/vc_row]"
), class = "factor")
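As far as I can tell, the tags that are left over are not real HTML tags: HTML tags are delimited by "<" and ">" (for example, <h1>), whereas yours are enclosed in square brackets, like [h1], so html_text() treats them as plain text. Adapting the function linked above, you can do this: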
s <- structure(2L, .Label = c("", "[vc_row css_animation=\"\" row_type=\"row\" use_row_as_full_screen_section=\"no\" type=\"full_width\" angled_section=\"no\" text_align=\"left\" background_image_as_pattern=\"without_pattern\"][vc_column][vc_column_text]\n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Description</h6>\n\n<ol>\n\n \t<li>Ideal for : Women</li>\n\n \t<li>Package Contents : 1 Pcs</li>\n\n \t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>\n\n \t<li>Care Instructions : Machine Wash and Normal Wash.</li>\n\n \t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>\n\n \t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>\n\n \t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>\n\n \t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>\n\n</ol>\n\n<img class=\"alignnone wp-image-858\" src=\"https://justelite.in/wp-content/uploads/2020/06/art-size.jpg\" alt=\"\" />\n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Reviews</h6>\n\n[/vc_column_text][/vc_column][/vc_row]"
), class = "factor")
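# Remove both HTML tags (<...>) and square-bracket tags ([...]) from a string.
# The brackets must be escaped with a double backslash inside an R string.
cleanFun <- function(htmlString) {
  return(gsub("<.*?>|\\[.*?\\]", "", htmlString))
}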
cleanFun(s)
#> [1] "\n\nDescription\n\n\n\n \tIdeal for : Women\n\n \tPackage Contents : 1 Pcs\n\n \tFit Type : Regular, Relaxed, Classic and Slim Fit.\n\n \tCare Instructions : Machine Wash and Normal Wash.\n\n \tOccasion : Lough, Smart, Dressy, Business, Casual and Formal.\n\n \tSleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.\n\n \tBrowse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.\n\n \tCare Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.\n\n\n\n\n\nReviews\n\n"
Created on 2020-09-16 by the reprex package (v0.3.0)
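If you also want to get rid of the leftover newlines and tabs, here is a minimal sketch in base R that builds on the same regular expression. Note that clean_lines is a hypothetical helper name and the shortened example string is made up for illustration; neither comes from the original data. It assumes that no tag spans more than one line and that the text itself contains no square brackets you want to keep.

```r
# Hypothetical helper: strip <...> and [...] tags, then return one
# trimmed, non-empty line per remaining text fragment.
clean_lines <- function(x) {
  x <- as.character(x)                                # works for factors too
  no_tags <- gsub("<.*?>|\\[.*?\\]", "", x)           # same pattern as cleanFun
  parts <- unlist(strsplit(no_tags, "\n", fixed = TRUE))
  parts <- trimws(parts)                              # drop surrounding spaces and tabs
  parts[nzchar(parts)]                                # discard empty lines
}

# Shortened, made-up example in the same shape as the data above
example <- "[vc_row type=\"full_width\"]\n\n<h6 class=\"button\">Description</h6>\n\n<ol>\n\n \t<li>Ideal for : Women</li>\n\n</ol>\n\n[/vc_row]"
clean_lines(example)
#> [1] "Description"       "Ideal for : Women"
```

Returning a character vector instead of one long string makes it easier to post-process individual entries, for example splitting each one on " : " to get key/value pairs.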