加入 Apache Pig

JOIN in Apache Pig

我有两个文件,由 json 个对象组成,位于我的 hdfs 的两个不同位置,我需要根据公共字段加入这两个文件。

第一个文件由推文数据组成,有 34 个字段(我数过)。看起来像:

{"contributors": null, "truncated": false, "text": "US Bank Loans And credit card capitol one business", "avl_brand_all": ["US Bank"], "is_quote_status": false    , "in_reply_to_status_id": null, "id": 770150015968825344, "favorite_count": 0, "avl_num_sentences": 1, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</    a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [], "urls": [{"url": "<link>": [51, 74], "expand    ed_url": "http://usbanklogins.com/bank/", "display_url": "usbanklogins.com/bank/"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "avl_word_tags": [{"distance": 1, "    word": "u", "pos": "OTHER"}, {"distance": 1, "word": "bank", "pos": "NOUN"}, {"distance": 1, "word": "loan", "pos": "NOUN"}, {"distance": 1, "word": "credit", "pos": "NOUN"}, {"distan    ce": 1, "word": "card", "pos": "NOUN"}, {"distance": 1, "word": "capitol", "pos": "VERB"}, {"distance": 1, "word": "one", "pos": "OTHER"}, {"distance": 1, "word": "business", "pos": "    NOUN"}], "avl_brand_1": "US Bank", "retweet_count": 0, "avl_lexicon_text": "us bank loans and credit card capitol one business", "id_str": "770150015968825344", "favorited": false, "a    vl_sentences": ["us bank loans and credit card capitol one business"], "user": {"follow_request_sent": false, "has_extended_profile": false, "profile_use_background_image": true, "id"    : 485610502, "verified": false, "profile_text_color": "0C3E53", "profile_image_url_https": "<link>", "profile    _sidebar_fill_color": "FFF7CC", "geo_enabled": false, "entities": {"url": {"urls": [{"url": "link", "indices": [0, 22], "expanded_url": "http://www.seowithme.com", "    display_url": "seowithme.com"}]}, "description": {"urls": []}}, "followers_count": 347, "profile_sidebar_border_color": "F2E195", "location": "", "default_profile_image": false, "id_s    tr": "485610502", "is_translation_enabled": false, "utc_offset": null, "statuses_count": 117, "description": "seowithme", "friends_count": 959, "profile_link_color": "FF0000", "profil    e_image_url": "http://pbs.twimg.com/profile_images/2334489262/qyznw08zjrgv3vlxtdvt_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://abs.twimg.com/i    mages/themes/theme12/bg.gif", "profile_background_color": "BADFCD", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme12/bg.gif", "screen_name": "sajanshrestha2    2", "lang": "en", "following": false, "profile_background_tile": false, "favourites_count": 2, "name": "sajan shrestha", "url": "<link>", "created_at": "Tue Feb 07 11:    40:39 +0000 2012", "contributors_enabled": false, "time_zone": null, "protected": false, "default_profile": false, "is_translator": false, "listed_count": 0}, "avl_num_paragraphs": 1,     "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Mon Aug 29 06:44:07 +0000 2016", "avl_source": "individual", "in_reply_to_stat    us_id_str": null, "place": null, "metadata": {"iso_language_code": "en", "result_type": "recent"}, "avl_num_words": 8}

第二个文件有 json 个对象,每个对象只有两个字段。看起来像:

{"avl_syntaxnet_tags": [{"pos_tag": "PRP", "position": "1", "dep_rel": "dep", "parent": "3", "word": "us"}, {"pos_tag": "NN", "position": "2", "dep_rel": "nn", "parent": "3", "word":     "bank"}, {"pos_tag": "NNS", "position": "3", "dep_rel": "nsubj", "parent": "7", "word": "loans"}, {"pos_tag": "CC", "position": "4", "dep_rel": "cc", "parent": "3", "word": "and"}, {"    pos_tag": "NN", "position": "5", "dep_rel": "nn", "parent": "6", "word": "credit"}, {"pos_tag": "NN", "position": "6", "dep_rel": "conj", "parent": "3", "word": "card"}, {"pos_tag": "    VBP", "position": "7", "dep_rel": "ROOT", "parent": "0", "word": "capitol"}, {"pos_tag": "CD", "position": "8", "dep_rel": "num", "parent": "9", "word": "one"}, {"pos_tag": "NN", "pos    ition": "9", "dep_rel": "dobj", "parent": "7", "word": "business"}], "avl_lexicon_text": "us bank loans and credit card capitol one business"}

现在,json_objects 中都有一个名为 avl_lexicon_text 的公共字段,我想使用公共字段连接这两个对象。

我为加入编写了以下 Pig 脚本:

a = LOAD file1 as (a1, a2);
b = LOAD file2 as (b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17, b18, b19, b20, b21, b22, b23, b24, b25, b26, b27, b28, b29, b30, b31, b32, b33, b34);
x = JOIN b BY b19 FULL, a BY a2;
STORE x INTO '$SYNTAXNET_OUTPUT';

我检查了 b19b 中的 avl_lexicon_text 字段,a2a 中的相同。我得到的结果真的很奇怪。当我 dump x 时,我没有得到包含 ab 中所有字段的新 json_object。我得到 b 中的所有对象,然后是 a 中的所有对象。

有人可以建议我正确的方法吗?

编辑: 另外,有没有一种方法可以在不加载架构的情况下执行此操作?因为将来某个时候,如果任何文件的格式发生变化(添加新字段或删除现有字段),我不想更改 pig 脚本。有没有一种方法可以在不引用字段位置但通过访问字段名称的情况下执行 JOIN?谢谢! )

该行为是预期的,因为您指定了 FULL 外部联接。 删除 FULL 以仅匹配 records.See here 用于 FULL 外部连接。

x = JOIN b BY b19, a BY a2;