JOIN in Apache Pig
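I have two files of JSON objects, stored in two different locations on my HDFS, and I need to join them on a common field.
The first file consists of tweet data and has 34 fields (I counted). It looks like: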
{"contributors": null, "truncated": false, "text": "US Bank Loans And credit card capitol one business", "avl_brand_all": ["US Bank"], "is_quote_status": false, "in_reply_to_status_id": null, "id": 770150015968825344, "favorite_count": 0, "avl_num_sentences": 1, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [], "urls": [{"url": "<link>", "indices": [51, 74], "expanded_url": "http://usbanklogins.com/bank/", "display_url": "usbanklogins.com/bank/"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "avl_word_tags": [{"distance": 1, "word": "u", "pos": "OTHER"}, {"distance": 1, "word": "bank", "pos": "NOUN"}, {"distance": 1, "word": "loan", "pos": "NOUN"}, {"distance": 1, "word": "credit", "pos": "NOUN"}, {"distance": 1, "word": "card", "pos": "NOUN"}, {"distance": 1, "word": "capitol", "pos": "VERB"}, {"distance": 1, "word": "one", "pos": "OTHER"}, {"distance": 1, "word": "business", "pos": "NOUN"}], "avl_brand_1": "US Bank", "retweet_count": 0, "avl_lexicon_text": "us bank loans and credit card capitol one business", "id_str": "770150015968825344", "favorited": false, "avl_sentences": ["us bank loans and credit card capitol one business"], "user": {"follow_request_sent": false, "has_extended_profile": false, "profile_use_background_image": true, "id": 485610502, "verified": false, "profile_text_color": "0C3E53", "profile_image_url_https": "<link>", "profile_sidebar_fill_color": "FFF7CC", "geo_enabled": false, "entities": {"url": {"urls": [{"url": "link", "indices": [0, 22], "expanded_url": "http://www.seowithme.com", "display_url": "seowithme.com"}]}, "description": {"urls": []}}, "followers_count": 347, "profile_sidebar_border_color": "F2E195", "location": "", "default_profile_image": false, "id_str": "485610502", "is_translation_enabled": false, "utc_offset": null, "statuses_count": 117, "description": "seowithme", "friends_count": 959, "profile_link_color": "FF0000", "profile_image_url": "http://pbs.twimg.com/profile_images/2334489262/qyznw08zjrgv3vlxtdvt_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme12/bg.gif", "profile_background_color": "BADFCD", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme12/bg.gif", "screen_name": "sajanshrestha22", "lang": "en", "following": false, "profile_background_tile": false, "favourites_count": 2, "name": "sajan shrestha", "url": "<link>", "created_at": "Tue Feb 07 11:40:39 +0000 2012", "contributors_enabled": false, "time_zone": null, "protected": false, "default_profile": false, "is_translator": false, "listed_count": 0}, "avl_num_paragraphs": 1, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Mon Aug 29 06:44:07 +0000 2016", "avl_source": "individual", "in_reply_to_status_id_str": null, "place": null, "metadata": {"iso_language_code": "en", "result_type": "recent"}, "avl_num_words": 8}
The second file also contains JSON objects, each with only two fields. It looks like:
{"avl_syntaxnet_tags": [{"pos_tag": "PRP", "position": "1", "dep_rel": "dep", "parent": "3", "word": "us"}, {"pos_tag": "NN", "position": "2", "dep_rel": "nn", "parent": "3", "word": "bank"}, {"pos_tag": "NNS", "position": "3", "dep_rel": "nsubj", "parent": "7", "word": "loans"}, {"pos_tag": "CC", "position": "4", "dep_rel": "cc", "parent": "3", "word": "and"}, {"pos_tag": "NN", "position": "5", "dep_rel": "nn", "parent": "6", "word": "credit"}, {"pos_tag": "NN", "position": "6", "dep_rel": "conj", "parent": "3", "word": "card"}, {"pos_tag": "VBP", "position": "7", "dep_rel": "ROOT", "parent": "0", "word": "capitol"}, {"pos_tag": "CD", "position": "8", "dep_rel": "num", "parent": "9", "word": "one"}, {"pos_tag": "NN", "position": "9", "dep_rel": "dobj", "parent": "7", "word": "business"}], "avl_lexicon_text": "us bank loans and credit card capitol one business"}
Both kinds of JSON object have a common field named avl_lexicon_text, and I want to join the two sets of objects on that field.
I wrote the following Pig script for the join:
a = LOAD file1 as (a1, a2);
b = LOAD file2 as (b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17, b18, b19, b20, b21, b22, b23, b24, b25, b26, b27, b28, b29, b30, b31, b32, b33, b34);
x = JOIN b BY b19 FULL, a BY a2;
STORE x INTO '$SYNTAXNET_OUTPUT';
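I checked that b19 is the avl_lexicon_text field in b, and that a2 is the same field in a. The result I get is really strange. When I dump x, I don't get new JSON objects containing all the fields from both a and b. Instead, I get all the objects from b, followed by all the objects from a.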
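Can anyone suggest the correct way to do this?
Edit: Also, is there a way to do this without declaring a schema at load time? If the format of either file changes in the future (a field is added or an existing one is removed), I don't want to have to change the Pig script. Is there a way to do the JOIN by referring to fields by name rather than by position? Thanks!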
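This behavior is expected, because you specified a FULL outer join. A FULL outer join returns the unmatched records from both relations as well as the matched ones, filling the other relation's fields with nulls.
Remove FULL to get only the matching records: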
x = JOIN b BY b19, a BY a2;
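On the follow-up about joining by field name: a LOAD with no USING clause falls back to PigStorage, which splits each line on tabs and does not parse JSON, so positional fields such as b19 are likely to be null for this input. One possible approach, sketched below, is to load each line as a map and pull the key out by name. It assumes the third-party Elephant Bird library is available. The jar paths and the $TWEETS_INPUT and $TAGS_INPUT parameters are hypothetical placeholders, not something taken from the question.

```pig
-- Hypothetical jar locations; adjust to wherever Elephant Bird is installed.
REGISTER '/path/to/elephant-bird-core.jar';
REGISTER '/path/to/elephant-bird-hadoop-compat.jar';
REGISTER '/path/to/elephant-bird-pig.jar';
REGISTER '/path/to/json-simple.jar';

-- Load every JSON line as a single map, so no positional schema is declared.
-- $TWEETS_INPUT and $TAGS_INPUT are assumed script parameters.
tweets = LOAD '$TWEETS_INPUT'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (json:map[]);
tags   = LOAD '$TAGS_INPUT'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (json:map[]);

-- Extract the join key by name instead of by position.
tweets_keyed = FOREACH tweets GENERATE
                   (chararray) json#'avl_lexicon_text' AS lexicon_text,
                   json AS tweet;
tags_keyed   = FOREACH tags GENERATE
                   (chararray) json#'avl_lexicon_text' AS lexicon_text,
                   json AS syntaxnet;

-- Inner join: only records whose avl_lexicon_text matches on both sides.
x = JOIN tweets_keyed BY lexicon_text, tags_keyed BY lexicon_text;
STORE x INTO '$SYNTAXNET_OUTPUT';
```

Because each record is a map, adding or removing other fields in either file should not require changing the script, as long as avl_lexicon_text is still present. Note that STORE with the default storage writes the maps in Pig's own text format rather than as JSON.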