草庐IT

regex - 具有多行记录的文本文件的 Hive 外部表定义

coder 2024-01-10 原文

我需要将此文件解析为一个配置单元表,该表是来自亚马逊的电影评论数据集。我在构建正则表达式以解析 .txt 文件并创建具有正确列类型的表时遇到问题。

.txt

product/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3
review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
review/summary: Good Movie. Wonderful Visuals. A Great Way to SHOW OFF you Hi-Def System
review/text: Last Samurai is no masterpiece<br /><br />but technically it is<br /><br />the visuals, the sound effects, the music.<br /><br />If you want to show off to your friends what a great hi-def system you got, purchase this movie.<br /><br />If you want a classic, but lord of the rings or gone with the wind instead.

product/productId: B0001G6PZC
review/userId: A3F3B6HY9RJI04
review/profileName: James Duckett
review/helpfulness: 3/3
review/score: 5.0
review/time: 1192060800
review/summary: Great Movie, Fantastic HD Quality
review/text: After picking up my HD DVD player I've had troubles watching regular DVD movies.  I had heard some good things about this movie but couldn't pass it up once it was in high definition.<br /><br />The story is pretty good.  This is the story of Captain Algren who has been sent to Japan in the late 1800's in order to help them modernize the Japanese army as they go from fighting with swords and arrows to machine guns and cannons.<br /><br />After the "modern" Japanese army prematurely attacks the Samurai and lose horribly, Captain Algren is taken captive by the Samurai and introduced to their way of life and refusal to lay down the sword in the name of compliance.  In time, Captain Algren finds himself wanting to become one of the Samurai and learning more of their way of life.<br /><br />The story is pretty good but what raises this up to the level of being outstanding is the high definition quality of the movie.  It was fantastic, especially seeing the colorful Japanese landscape in all of its magnificence.<br /><br />If you like Tom Cruise action movies, this is one to pick up especially in high definition (whether it be Blu-Ray or HD DVD).  The violence can be extremely graphic (hey, this is war) so if you are sensitive to that you may want to look for something else.  Otherwise, the pacing of the movie is pretty good.  It isn't an all out gore-fest... there is action and then it breaks and lets you relax and catch up a little bit and then goes back to action and so on and so forth.

这是我的 SQL:

CREATE EXTERNAL TABLE movies(id string, uId string, profileName string, helpfulness string, score float, time int, summary string, text string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH serdeproperties( "input.regex" = "[ ].*", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"")
location '/user/hduser/moviesTest';

但是配置单元没有正确解析它并且:SELECT * FROM movies 给我这个结果:

NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

谁能告诉我我做错了什么?

最佳答案

这可以通过 Hive UDF 轻松完成;

你的数据在表 suppose temp 中,单列命名为 line;

create table temp(line String);
load data local inpath 'review.txt' into table temp;
select line from temp;

roduct/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3

review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
............

............

在java中创建一个Hive Udf。来源在这里

package HiveUDF;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ReviewDataUdf extends UDF {
    String s = " ";
    String structuredFormat;
    int inds = 0;
    int inde = 0;

    public String evaluate(String t) {
        s = s + " " + t;
        if (t.contains("review/text:")) {
            String productId = "";
            try {
                if (s.contains("product/productId:")) {
                    inds = s.indexOf("product/productId:");
                    inde = s.indexOf("review/userId:", inds);
                    productId = s.substring(inds + 18, inde);
                } else {
                    productId = "N/A";
                }

            } catch (Exception e) {
                productId = "";
            }
            String userId = "";
            try {
                if (s.contains("review/userId:")) {

                    inds = s.indexOf("review/userId:");
                    inde = s.indexOf("review/profileName:", inds);
                    userId = s.substring(inds + 14, inde);
                } else {
                    userId = "N/A";
                }
            } catch (Exception e) {
                userId = "";
            }

            String profileName = "";
            try {
                if (s.contains("review/profileName:")) {
                    inds = s.indexOf("review/profileName:");
                    inde = s.indexOf("review/helpfulness:", inds);
                    profileName = s.substring(inds + 19, inde);
                } else {
                    profileName = "N/A";
                }
            } catch (Exception e) {
                profileName = "";
            }

            String helpfulness = "";
            try {
                if (s.contains("review/helpfulness:")) {
                    inds = s.indexOf("review/helpfulness:");
                    inde = s.indexOf("review/score:", inds);
                    helpfulness = s.substring(inds + 20, inde);
                } else {
                    helpfulness = "N/A";
                }
            } catch (Exception e) {
                helpfulness = "";
            }

            String score = "";
            try {
                if (s.contains("review/score:")) {
                    inds = s.indexOf("review/score:");
                    inde = s.indexOf("review/time:", inds);
                    score = s.substring(inds + 14, inde);
                } else {
                    score = "N/A";
                }
            } catch (Exception e) {
                score = "";
            }

            String time = "";
            try {
                if (s.contains("review/time:")) {
                    inds = s.indexOf("review/time:");
                    inde = s.indexOf("review/summary:", inds);
                    time = s.substring(inds + 14, inde);
                } else {
                    time = "N/A";
                }
            } catch (Exception e) {
                time = "";
            }

            String summary = "";
            try {
                if (s.contains("review/summary:")) {
                    inds = s.indexOf("review/summary:");
                    inde = s.indexOf("review/text:", inds);
                    summary = s.substring(inds + 16, inde);
                } else {
                    summary = "N/A";
                }
            } catch (Exception e) {
                summary = "";
            }

            String text = "";
            try {
                if (s.contains("review/text:")) {
                    inds = s.indexOf("review/text:");
                    inde = s.indexOf(s.length(), inds);
                    text = s.substring(inds + 14);
                } else {
                    text = "N/A";
                }
            } catch (Exception e) {
                text = "";
            }
            structuredFormat = productId + "\t" + userId + "\t" + profileName + "\t" + helpfulness + "\t" + score
                    + "\t" + time + "\t" + summary + "\t" + text;
            s = "";
            return structuredFormat.trim();
        } else {
            return null;
        }
    }
}

导出ReviewDataUdf.jar,在hive中注册并创建函数。

hive> ADD JAR /home/Kishore/ReviewDataUdf.jar;

hive> create temporary FUNCTION structReview as 'HiveUDF.ReviewDataUdf';

使用structReview函数获取结构化数据。

Create table AmazonReview as
select split(review,"\t")[0] as productId, split(review,"\t")[1] as userId, split(review,"\t")[2] as profileName,split(review,"\t")[3] as helpfulness, split(review,"\t")[4] as score,split(review,"\t")[5] as time,split(review,"\t")[6] as summary,split(review,"\t")[7] as text from(
select structReview(line) As review from temp ) b
where review != "NULL";

数据在 AmazonReview 表中采用结构化格式

select productId, userId, profileName from AmazonReview;
OK
B0001G6PZC   A3F3THLLZXURQN      A. Y 
B0001G6PZC   A3J78KAIPW6KAH      Joan Paolo De Bastos "conde_almasy" 
B0001G6PZC   A3F3B6HY9RJI04      James Duckett

关于regex - 具有多行记录的文本文件的 Hive 外部表定义,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/30387682/

有关regex - 具有多行记录的文本文件的 Hive 外部表定义的更多相关文章

  1. ruby - Facter::Util::Uptime:Module 的未定义方法 get_uptime (NoMethodError) - 2

    我正在尝试设置一个puppet节点,但ruby​​gems似乎不正常。如果我通过它自己的二进制文件(/usr/lib/ruby/gems/1.8/gems/facter-1.5.8/bin/facter)在cli上运行facter,它工作正常,但如果我通过由ruby​​gems(/usr/bin/facter)安装的二进制文件,它抛出:/usr/lib/ruby/1.8/facter/uptime.rb:11:undefinedmethod`get_uptime'forFacter::Util::Uptime:Module(NoMethodError)from/usr/lib/ruby

  2. ruby - 具有身份验证的私有(private) Ruby Gem 服务器 - 2

    我想安装一个带有一些身份验证的私有(private)Rubygem服务器。我希望能够使用公共(public)Ubuntu服务器托管内部gem。我读到了http://docs.rubygems.org/read/chapter/18.但是那个没有身份验证-如我所见。然后我读到了https://github.com/cwninja/geminabox.但是当我使用基本身份验证(他们在他们的Wiki中有)时,它会提示从我的服务器获取源。所以。如何制作带有身份验证的私有(private)Rubygem服务器?这是不可能的吗?谢谢。编辑:Geminabox问题。我尝试“捆绑”以安装新的gem..

  3. ruby-on-rails - Rails 3.2.1 中 ActionMailer 中的未定义方法 'default_content_type=' - 2

    我在我的项目中添加了一个系统来重置用户密码并通过电子邮件将密码发送给他,以防他忘记密码。昨天它运行良好(当我实现它时)。当我今天尝试启动服务器时,出现以下错误。=>BootingWEBrick=>Rails3.2.1applicationstartingindevelopmentonhttp://0.0.0.0:3000=>Callwith-dtodetach=>Ctrl-CtoshutdownserverExiting/Users/vinayshenoy/.rvm/gems/ruby-1.9.3-p0/gems/actionmailer-3.2.1/lib/action_mailer

  4. ruby-on-rails - form_for 中不在模型中的自定义字段 - 2

    我想向我的Controller传递一个参数,它是一个简单的复选框,但我不知道如何在模型的form_for中引入它,这是我的观点:{:id=>'go_finance'}do|f|%>Transferirde:para:Entrada:"input",:placeholder=>"Quantofoiganho?"%>Saída:"output",:placeholder=>"Quantofoigasto?"%>Nota:我想做一个额外的复选框,但我该怎么做,模型中没有一个对象,而是一个要检查的对象,以便在Controller中创建一个ifelse,如果没有检查,请帮助我,非常感谢,谢谢

  5. ruby - 主要 :Object when running build from sublime 的未定义方法 `require_relative' - 2

    我已经从我的命令行中获得了一切,所以我可以运行rubymyfile并且它可以正常工作。但是当我尝试从sublime中运行它时,我得到了undefinedmethod`require_relative'formain:Object有人知道我的sublime设置中缺少什么吗?我正在使用OSX并安装了rvm。 最佳答案 或者,您可以只使用“require”,它应该可以正常工作。我认为“require_relative”仅适用于ruby​​1.9+ 关于ruby-主要:Objectwhenrun

  6. ruby - Sinatra:运行 rspec 测试时记录噪音 - 2

    Sinatra新手;我正在运行一些rspec测试,但在日志中收到了一堆不需要的噪音。如何消除日志中过多的噪音?我仔细检查了环境是否设置为:test,这意味着记录器级别应设置为WARN而不是DEBUG。spec_helper:require"./app"require"sinatra"require"rspec"require"rack/test"require"database_cleaner"require"factory_girl"set:environment,:testFactoryGirl.definition_file_paths=%w{./factories./test/

  7. ruby-on-rails - 如何在 ruby​​ 交互式 shell 中有多行? - 2

    这可能是个愚蠢的问题。但是,我是一个新手......你怎么能在交互式ruby​​shell中有多行代码?好像你只能有一条长线。按回车键运行代码。无论如何我可以在不运行代码的情况下跳到下一行吗?再次抱歉,如果这是一个愚蠢的问题。谢谢。 最佳答案 这是一个例子:2.1.2:053>a=1=>12.1.2:054>b=2=>22.1.2:055>a+b=>32.1.2:056>ifa>b#Thecode‘if..."startsthedefinitionoftheconditionalstatement.2.1.2:057?>puts"f

  8. ruby - 在 Ruby 中有条件地定义函数 - 2

    我有一些代码在几个不同的位置之一运行:作为具有调试输出的命令行工具,作为不接受任何输出的更大程序的一部分,以及在Rails环境中。有时我需要根据代码的位置对代码进行细微的更改,我意识到以下样式似乎可行:print"Testingnestedfunctionsdefined\n"CLI=trueifCLIdeftest_printprint"CommandLineVersion\n"endelsedeftest_printprint"ReleaseVersion\n"endendtest_print()这导致:TestingnestedfunctionsdefinedCommandLin

  9. ruby - 定义方法参数的条件 - 2

    我有一个只接受一个参数的方法:defmy_method(number)end如果使用number调用方法,我该如何引发错误??通常,我如何定义方法参数的条件?比如我想在调用的时候报错:my_method(1) 最佳答案 您可以添加guard在函数的开头,如果参数无效则引发异常。例如:defmy_method(number)failArgumentError,"Inputshouldbegreaterthanorequalto2"ifnumbereputse.messageend#=>Inputshouldbegreaterthano

  10. ruby - 如何在 Grape 中定义哈希数组? - 2

    我使用Ember作为我的前端和GrapeAPI来为我的API提供服务。前端发送类似:{"service"=>{"name"=>"Name","duration"=>"30","user"=>nil,"organization"=>"org","category"=>nil,"description"=>"description","disabled"=>true,"color"=>nil,"availabilities"=>[{"day"=>"Saturday","enabled"=>false,"timeSlots"=>[{"startAt"=>"09:00AM","endAt"=>

随机推荐