Improve performance of elasticsearch signals
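I'm working with a system called commonsearch. In this post I'll focus on the back-end part, which is written in Python.
The back end streams WARC files and indexes their content into 2 Elasticsearch clusters: 1) a text Elasticsearch cluster and 2) a document Elasticsearch cluster.
Before my changes, indexing took about 0.02 seconds per document on average.
After my changes it takes ~1.00 seconds (on AWS it is 0.4).
So here is what I did.
I strip the HTML from each WARC body with html2text. That didn't really cost much time (maybe +0.02 seconds), but it did make the timings spikier: the more content a page has, the more time stripping the HTML takes.
I added 2 textblob text classifiers (Naive Bayes) that check every indexed value. Their training is serialized (pickle) and loaded before the loop.
The first was trained on 33,000 examples, the second on a few hundred (I will add more to the second).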
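To see which step dominates, it can help to time each stage separately rather than the whole indexing call. The following is only a minimal sketch: `strip_html` and `classify` are hypothetical stand-ins for the real html2text and textblob calls, and `time_stages` is a helper name made up for this example.

```python
import time


def time_stages(document, stages):
    """Run each (name, func) stage in order and record its wall-clock time."""
    timings = {}
    value = document
    for name, func in stages:
        start = time.perf_counter()
        value = func(value)
        timings[name] = time.perf_counter() - start
    return value, timings


# Hypothetical stand-in for the html2text step.
def strip_html(body):
    return body.replace("<p>", "").replace("</p>", "")


# Hypothetical stand-in for a classifier that was loaded before the loop.
def classify(text):
    return {"text": text, "label": "pos" if "good" in text else "neg"}


stages = [("strip_html", strip_html), ("classify", classify)]
result, timings = time_stages("<p>a good page</p>", stages)
for name, seconds in timings.items():
    print(f"--- {name}: {seconds:.6f} seconds ---")
```

Swapping the stand-ins for the real functions should show directly how much of the ~1 second per document is spent in each classifier.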
Performance analysis
About 10 samples for each configuration.
Before my changes:
Indexing http://2sao.vn/p1004c1007n20110413113841718/mau-vay-du-tiec-cho-quy-co-hoan-hao.vnn [64/1817]
--- 0.0224668979645 seconds ---
Indexing http://2sidesoftheocean.blogspot.com/2012/04/my-first-family-in-1940-us-census_02.html
--- 0.0367019176483 seconds ---
Indexing http://3.pulsitemeter.com/exbii/exbii-photos-aunties-bath-.html
--- 0.00342702865601 seconds ---
Indexing http://303cycling.com/Meredith-Miller-USGP-Cyclocross-Video-Specialized-bikes
--- 0.0187289714813 seconds ---
Indexing http://303magazine.com/2012/10/undead-mans-party-casselmans-hosts-zombie-crawl-aftermath-featuring-celldweller/
--- 0.0460560321808 seconds ---
Indexing http://38-avg.blogspot.com/2008/05/birdheart.html
--- 0.0178949832916 seconds ---
Indexing http://3docean.net/item/motorola-droid-razr-low-poly-/3712487?sso
--- 0.0468878746033 seconds ---
Indexing http://4.bp.blogspot.com/_hZs38tqNXns/StdbQyR_zGI/AAAAAAAAEyw/VvNCalngDbY/s1600-h/Vanderwood
--- 0.00142908096313 seconds ---
Indexing http://411mania.com/sports/young-firpo-the-best-light-heavywieght-to-never-win-a-title/
--- 0.0295450687408 seconds ---
After adding html2text:
Indexing http://17hmr.net/index.php?action=profile;area=showposts;u=994
--- 0.0240960121155 seconds ---
Indexing http://17hmr.net/index.php?board=1.3060;sort=last_post
--- 0.0262401103973 seconds ---
Indexing http://17hmr.net/index.php?topic=12827.msg177073
--- 0.0259499549866 seconds ---
Indexing http://17hmr.net/index.php?topic=6751.45
--- 0.0249440670013 seconds ---
Indexing http://1889.ca/2012/11/interview-with-horror-author-mike-kearby/
--- 0.0152020454407 seconds ---
Indexing http://1980s.fm/modules.php?name=Forums&file=profile&mode=viewprofile&u=94
--- 0.151058912277 seconds ---
Indexing http://1n73r.net/category/microsoft/windows-microsoft/xp/
--- 0.0693669319153 seconds ---
Indexing http://2013missworld.com/
--- 0.0448951721191 seconds ---
Indexing http://24demayito.blogspot.com/
--- 0.111493110657 seconds ---
Indexing http://24kadra.com/2009/03/04/serial-bratany/
--- 0.145864963531 seconds ---
After adding html2text and one classifier (the small one):
Indexing http://102theriver.iheart.com/articles
--- 0.333050012589 seconds ---
Indexing http://1035kissfm.iheart.com/articles/trending-104650/reading-rainbow-campaign-nets-1-million-12410738
--- 0.334407091141 seconds ---
Indexing http://1037theq.iheart.com/articles/trending-465498/tiesto-celebrates-a-town-called-paradise-12478486/
--- 0.34556388855 seconds ---
Indexing http://1065ctq.iheart.com/articles/national-news-104668/new-electronic-license-plates-could-be-11383289/
--- 0.330471038818 seconds ---
Indexing http://10kbullets.com/reviews/neon-nights/
--- 0.328196048737 seconds ---
Indexing http://12160.info/group/gunsandtactics/forum/topic/show?id=2649739%3ATopic%3A1105218&xg_source=msg
--- 0.353976011276 seconds ---
Indexing http://12under12under2012.blogspot.com/2012/04/aprils-forsta-vinnare-blev.html
--- 0.363568067551 seconds ---
Indexing http://1350kman.com/settlement-reached-in-salina-contamination-cleanup/
--- 0.367321968079 seconds ---
Indexing http://14ers.com/php14ers/loginviaforum.php?prgm=peakstatus_main
--- 0.309129953384 seconds ---
Indexing http://16sarkisozleri.blogspot.com/2012/12/nasip-degilmis-demet-akaln-ftozcan-deniz.html
--- 0.361335992813 seconds ---
After adding html2text and one classifier (the big one):
Indexing http://10000birds.com/white-crested-laughingthrush.htm
--- 2.16983008385 seconds ---
Indexing http://1012lounge.com/
--- 1.48357391357 seconds ---
Indexing http://1015store.com/dresses-by-colors/coral-dresses.html
--- 1.85999703407 seconds ---
Indexing http://1019ampradio.cbslocal.com/tag/happy-holidays/
--- 1.24361300468 seconds ---
Indexing http://102theriver.iheart.com/articles
--- 1.25308895111 seconds ---
Indexing http://1035kissfm.iheart.com/articles/trending-104650/reading-rainbow-campaign-nets-1-million-12410738
--- 1.19226098061 seconds ---
Indexing http://1037theq.iheart.com/articles/trending-465498/tiesto-celebrates-a-town-called-paradise-12478486/
--- 1.14514183998 seconds ---
Indexing http://1065ctq.iheart.com/articles/national-news-104668/new-electronic-license-plates-could-be-11383289/
--- 1.09987902641 seconds ---
Indexing http://10kbullets.com/reviews/neon-nights/
--- 1.07253599167 seconds ---
Indexing http://12160.info/group/gunsandtactics/forum/topic/show?id=2649739%3ATopic%3A1105218&xg_source=msg
--- 1.1537129879 seconds ---
After adding html2text and both classifiers:
Indexing http://12under12under2012.blogspot.com/2012/04/aprils-forsta-vinnare-blev.html
--- 1.43961000443 seconds ---
Indexing http://1350kman.com/settlement-reached-in-salina-contamination-cleanup/
--- 1.37341785431 seconds ---
Indexing http://14ers.com/php14ers/loginviaforum.php?prgm=peakstatus_main
--- 1.26939201355 seconds ---
Indexing http://16sarkisozleri.blogspot.com/2012/12/nasip-degilmis-demet-akaln-ftozcan-deniz.html
--- 1.36402606964 seconds ---
Indexing http://17hmr.net/index.php?action=profile;area=showposts;u=994
--- 1.23323822021 seconds ---
Indexing http://17hmr.net/index.php?board=1.3060;sort=last_post
--- 1.22554993629 seconds ---
Indexing http://17hmr.net/index.php?topic=12827.msg177073
--- 1.23036003113 seconds ---
Indexing http://17hmr.net/index.php?topic=6751.45
--- 1.20131611824 seconds ---
Indexing http://1889.ca/2012/11/interview-with-horror-author-mike-kearby/
--- 1.1732749939 seconds ---
Indexing http://1980s.fm/modules.php?name=Forums&file=profile&mode=viewprofile&u=94
--- 1.36015105247 seconds ---
Indexing http://1n73r.net/category/microsoft/windows-microsoft/xp/
--- 1.2988049984 seconds ---
A few notes
This project is also deployed on AWS. When I run it there, it shows about 0.4 seconds per document (on my own machine it is about 1.3).
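The per-configuration averages quoted above can be recomputed from the raw log output instead of being estimated by eye. This is a small sketch under the assumption that timing lines always have the `--- N seconds ---` shape shown in the logs; `mean_indexing_time` is a helper name invented for this example, and the sample uses three entries from the "before my changes" run.

```python
import re
from statistics import mean

# Matches lines such as "--- 0.0367019176483 seconds ---".
TIMING_LINE = re.compile(r"^--- ([0-9.]+) seconds ---$")


def mean_indexing_time(log_text):
    """Return the mean of all '--- N seconds ---' values found in the log."""
    values = [
        float(match.group(1))
        for line in log_text.splitlines()
        if (match := TIMING_LINE.match(line.strip()))
    ]
    if not values:
        raise ValueError("no timing lines found")
    return mean(values)


sample_log = """\
Indexing http://2sidesoftheocean.blogspot.com/2012/04/my-first-family-in-1940-us-census_02.html
--- 0.0367019176483 seconds ---
Indexing http://3.pulsitemeter.com/exbii/exbii-photos-aunties-bath-.html
--- 0.00342702865601 seconds ---
Indexing http://38-avg.blogspot.com/2008/05/birdheart.html
--- 0.0178949832916 seconds ---
"""

print(f"mean: {mean_indexing_time(sample_log):.4f} seconds per document")
```

Running it over each full log block gives comparable numbers for every configuration, which makes the cost of each added step easier to see.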
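Questions
How can I improve the performance of all this?
Should I make the classifiers' training lighter but more precise?
Why is there such a big difference between AWS and my computer?
Do you need the code to understand this? I can add it if needed.
All ideas are welcome!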
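Question by question:
How can I improve the performance of all this?
There are several approaches. Depending on the model you use for training (e.g. Bag of Words), try feature selection on the text and the classes, or try LSA (also known as LSI) to reduce the number of dimensions.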
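As an illustration of feature selection for a bag-of-words model, here is a minimal sketch that keeps only the `k` tokens appearing in the most training documents and builds a feature extractor limited to that vocabulary. Document frequency is a deliberately crude criterion chosen to keep the example short (chi-squared or mutual information scores are common alternatives), and `select_features`, `make_extractor` and the toy training set are all made up for this example.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def select_features(train_set, k):
    """Keep the k tokens that occur in the largest number of documents."""
    doc_freq = Counter()
    for text, _label in train_set:
        doc_freq.update(set(tokenize(text)))
    # Sort by descending document frequency, then alphabetically for stability.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _count in ranked[:k]]


def make_extractor(vocabulary):
    """Build an extractor that only emits features for the kept tokens."""
    vocab = sorted(set(vocabulary))

    def extract(text):
        tokens = set(tokenize(text))
        return {f"contains({word})": (word in tokens) for word in vocab}

    return extract


# Toy training data, invented for illustration.
train_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting notes for monday", "ham"),
    ("notes from the monday meeting", "ham"),
]

vocabulary = select_features(train_set, k=4)
extract = make_extractor(vocabulary)
features = extract("buy cheap notes")
print(vocabulary)
print(features)
```

With fewer features, each document produces a smaller feature dictionary, so a classifier that accepts a custom feature extractor has less work to do per document.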
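Should I make my classifiers' training lighter but more precise? Depending on what you mean by precise, mostly yes. Some text representation models are high-dimensional, so the curse of dimensionality can occur; feature selection helps here. You can also use a sampling method to reduce the number of training tuples, see:
http://searchbusinessanalytics.techtarget.com/definition/data-sampling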
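One simple sampling method is stratified random sampling, which shrinks the training set while keeping the class proportions roughly intact. The sketch below assumes the training data is a list of `(text, label)` tuples; `stratified_sample` and the generated data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(train_set, fraction, seed=0):
    """Keep roughly `fraction` of the examples from each class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for text, label in train_set:
        by_label[label].append((text, label))

    sample = []
    for label in sorted(by_label):
        examples = by_label[label]
        # Always keep at least one example so no class disappears.
        keep = max(1, round(len(examples) * fraction))
        sample.extend(rng.sample(examples, keep))
    return sample


# Generated toy data: 300 "pos" and 100 "neg" examples.
train_set = [(f"pos example {i}", "pos") for i in range(300)]
train_set += [(f"neg example {i}", "neg") for i in range(100)]

small = stratified_sample(train_set, fraction=0.1)
print(len(train_set), "->", len(small))
```

Whether the smaller set still classifies well enough has to be checked on held-out data; sampling trades some accuracy for speed.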
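Why is there such a big difference between AWS and my computer? The simple explanation is that the AWS machine has more powerful resources (CPU, memory) than your own computer. The code is the same on both, so the gap most likely comes from the hardware.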
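One rough way to check the hardware explanation is to run the same small CPU-bound loop on both machines and compare the results. This is only a sketch: `machine_report` is a made-up helper, and a single pure-Python loop is a coarse proxy for classifier speed, not a real benchmark.

```python
import os
import platform
import time


def machine_report(loops=200_000):
    """Time a fixed CPU-bound loop and collect basic machine details."""
    start = time.perf_counter()
    total = 0
    for i in range(loops):
        total += i * i
    elapsed = time.perf_counter() - start
    return {
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
        "loop_seconds": elapsed,
    }


for key, value in machine_report().items():
    print(f"{key}: {value}")
```

If `loop_seconds` differs between the two machines by about the same ratio as the indexing times (0.4 vs 1.3), the CPU alone probably accounts for the difference.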