MapReduce WordCount: A Hands-On Example

小唐同学(๑>؂<๑） 2023-07-18

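Preparation:

MapReduce comes with quite a few example cases, so it is worth creating a separate project for them.

After creating the project, first change the Maven repository settings, because a new project defaults to the Maven bundled with IDEA. (Note: if you open another project and then reopen this one, the setting reverts to IDEA's bundled Maven, so check it again.)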

Next, create a new file named "log4j.properties" in the project's src/main/resources directory.

(It prints log messages at the INFO level.)

Fill it with:

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
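
One detail in the configuration above is easy to miss: the rootLogger line lists only stdout, so the logfile appender is defined but not attached to any logger. In log4j 1.x an appender receives events only when a logger references it, so writing to target/spring.log as well would presumably need `log4j.rootLogger=INFO, stdout, logfile`. The sketch below is a minimal illustration of that reading. The class `Log4jRootAppenders` and its method are hypothetical helpers that only inspect the properties text; they are not part of log4j.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; it does not use or configure log4j.
final class Log4jRootAppenders {

    // Returns the appender names listed after the level in log4j.rootLogger.
    static List<String> attachedAppenders(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String[] parts = props.getProperty("log4j.rootLogger", "").split(",");
        List<String> names = new ArrayList<>();
        // parts[0] is the level (INFO); the remaining entries are appender names.
        for (int i = 1; i < parts.length; i++) {
            String name = parts[i].trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
                "log4j.rootLogger=INFO, stdout",
                "log4j.appender.stdout=org.apache.log4j.ConsoleAppender",
                "log4j.appender.logfile=org.apache.log4j.FileAppender",
                "log4j.appender.logfile.File=target/spring.log");
        // Only the appenders named on the rootLogger line are reported.
        System.out.println(attachedAppenders(config));
    }
}
```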

Then, under the java source directory, create a three-level package and three classes in it (the mapper, the reducer, and the driver).

Local test:

Mapper stage:

The mapper class extends Mapper.

Override the map method in the class, and instantiate the Text and IntWritable objects outside the map method.

Code:

package com.tangxiaocong.mapreduce.wordcount2;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

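/*
 * Type parameters of Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
 * KEYIN     key type of the map input: LongWritable (the offset of the line)
 * VALUEIN   value type of the map input: Text (one line)
 * KEYOUT    key type of the map output: Text
 * VALUEOUT  value type of the map output: IntWritable
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Instantiate the output objects once as fields and reuse them.
    // Creating them inside the loop below would allocate new objects on every iteration.
    private Text text = new Text();
    // The map stage does no counting: every word is emitted with the value 1.
    private IntWritable intWritable = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Get one line of input and convert it to a String.
        String s = value.toString();
        // Split the line on spaces; the words are stored in an array.
        String[] s1 = s.split(" ");
        // Write the words out in a loop. The context object is the bridge to the framework.
        for (String s2 : s1) {
            // The array holds Strings, so wrap each word in the reusable Text object.
            // Note: set here is a method of Text, not the Java Set interface (which holds no duplicates).
            text.set(s2);
            // Write the converted pair out through the context.
            context.write(text, intWritable);
        }
    }
}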
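
To see what the map method produces without running Hadoop, the same split-and-emit logic can be sketched in plain Java. This is only an illustration: `MapStepSketch` is a hypothetical class, and `String` and `Integer` stand in for Text and IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapper; String/Integer replace Text/IntWritable.
final class MapStepSketch {

    // Emits one (word, 1) pair per token, mirroring the loop in WordCountMapper.map.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello=1, hadoop=1, hello=1]
        System.out.println(map("hello hadoop hello"));
    }
}
```

Note that repeated words are not merged at this point; merging is left to the reduce stage. Also, because the split is on a single space, two consecutive spaces in a line yield an empty-string word. Splitting on "\\s+" is one possible way to avoid that if the input is not cleanly formatted.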

Reduce stage:

package com.tangxiaocong.mapreduce.wordcount2;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reusable output value, instantiated once.
    private IntWritable outv = new IntWritable();

    /*
     * Iterable<IntWritable> values: Iterable is the root interface of the Java collections.
     * Before reduce is called, the values that share a key are grouped together,
     * so two <tangxiaocong, 1> pairs arrive here as tangxiaocong, (1, 1).
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // values now holds (1, 1); add them up.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // get() returns the wrapped int value
        }
        outv.set(sum);
        context.write(key, outv);
    }
}
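
The grouping described in the comment above, followed by the summation, can be sketched in plain Java as well. This is a simplified illustration, assuming a single reducer and an in-memory sorted map in place of the real shuffle; `ReduceStepSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and reduce steps of WordCount.
final class ReduceStepSketch {

    // Groups the 1s emitted by the map step under their word, with keys kept sorted.
    static Map<String, List<Integer>> group(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Sums the grouped values, mirroring the loop in WordCountReduce.reduce.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped =
                group(List.of("tangxiaocong", "hadoop", "tangxiaocong"));
        // prints "hadoop 1" and then "tangxiaocong 2", separated by a tab
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getValue()));
        }
    }
}
```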

Driver class:

package com.tangxiaocong.mapreduce.wordcount2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

// The driver follows a fixed template.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // 1. Get the job.
        // Configuration is the class that holds the job's configuration.
        Configuration entries = new Configuration();
        Job job = Job.getInstance(entries);

        // 2. Set the jar path. The jar location is usually found by reflection from this class.
        job.setJarByClass(WordCountDriver.class);

        // 3. Associate the mapper and the reducer.
        job.setReducerClass(WordCountReduce.class);
        job.setMapperClass(WordCountMapper.class);

        // 4. Set the key/value types of the map output.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input path and the output path.
        FileInputFormat.setInputPaths(job, new Path("D:\\hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\hellocount"));

        // 7. Submit the job.
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1); // exit code: 0 on success, 1 on failure
    }
}

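This run is a local test: the Hadoop libraries are pulled in through Maven, and the input and output paths are local paths. In real-world development the code is usually written on Windows, packaged, and sent to Linux. If there are many jobs to run, scripts are written later to launch them.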

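Cluster test:

Rewrite the driver class next. The only change is to replace the hard-coded input and output paths with paths supplied by hand when the job is launched.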
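
A minimal sketch of that change, assuming the two paths arrive as the first and second command-line arguments. `DriverArgsSketch`, `parsePaths`, and the sample paths `/input` and `/output` are hypothetical; the Hadoop calls are shown only as comments so that the sketch runs without the Hadoop libraries.

```java
// Hypothetical helper showing how the driver could read its paths from args.
final class DriverArgsSketch {

    // Returns {inputPath, outputPath}, or fails fast when the argument count is wrong.
    static String[] parsePaths(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                    "Usage: WordCountDriver <input path> <output path>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        // Sample arguments; on a cluster these would come from the command line.
        String[] paths = parsePaths(new String[] {"/input", "/output"});

        // In the real driver, step 6 would then read:
        //   FileInputFormat.setInputPaths(job, new Path(paths[0]));
        //   FileOutputFormat.setOutputPath(job, new Path(paths[1]));
        System.out.println("input=" + paths[0] + ", output=" + paths[1]);
    }
}
```

Checking the argument count up front is a design choice rather than a requirement. Without it, a missing argument would surface as an ArrayIndexOutOfBoundsException instead of a usage message.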

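Package the project.

Copy the jar without dependencies to the Linux system (into the Hadoop directory).

Use the hadoop jar command to run the job from the jar, which sits on the local file system. The general form is: hadoop jar <jar file> <fully qualified driver class> <input path> <output path>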

(The output directory must not already exist.)
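
For the local test, where the output path is an ordinary directory, this rule can be checked before the job is submitted. The sketch below uses only java.nio and therefore applies to local paths only. On a cluster the output path lives in HDFS, and the equivalent check would go through Hadoop's FileSystem API instead. `OutputDirCheckSketch` and `canUseAsOutput` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical pre-flight check for a local output directory.
final class OutputDirCheckSketch {

    // An output directory is usable only if nothing exists at that path yet.
    static boolean canUseAsOutput(Path outputDir) {
        return !Files.exists(outputDir);
    }

    public static void main(String[] args) {
        // The system temp directory already exists, so it cannot be an output directory.
        Path existing = Paths.get(System.getProperty("java.io.tmpdir"));
        // A randomly named child of it is assumed not to exist yet.
        Path fresh = existing.resolve("hellocount-" + UUID.randomUUID());

        System.out.println(existing + " usable: " + canUseAsOutput(existing));
        System.out.println(fresh + " usable: " + canUseAsOutput(fresh));
    }
}
```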

While the job is running, you can watch the resource scheduling for the computation on the Hadoop YARN web page.
