
RHIPE test


I ran this test last night, but the notes were lost to a Windows 10 update. Today I rewrote everything from memory, so if something is wrong it most likely crept in during this second write-up. Corrections are welcome.

1. System environment

CentOS 7.0 (user:group = hadoop:hadoop)

uname -a
Linux s200 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

2. Install R

Install epel-release

sudo yum install -y epel-release

Install R

yum search R-core
R-core.x86_64 : The minimal R components necessary for a functional runtime
R-core-devel.x86_64 : Core files for development of R packages (no Java)
sudo yum install -y R-core.x86_64 R-core-devel.x86_64
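
Not part of the original notes, but you can confirm the install with a version check (the exact version depends on the current EPEL build):

R --version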

3. Create the /soft directory and change its owner and group

sudo mkdir /soft
sudo chown hadoop:hadoop /soft

4. Install the JDK

tar zxvf jdk-8u121-linux-x64.tar.gz -C /soft
cd /soft
ln -s jdk1.8.0_121/ jdk
sudo vim /etc/profile
export JAVA_HOME=/soft/jdk
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile

Check the Java version

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

5. Configure passwordless SSH login

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 710 ~/.ssh/authorized_keys
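
A quick check (my addition) that the key setup works: SSH to the local machine, which should log you in without a password prompt after you accept the host key the first time:

ssh localhost
exit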

6. Install Hadoop (pseudo-distributed)

cd ~
tar -zxvf hadoop-2.7.2.tar.gz  -C /soft
cd /soft
ln -s hadoop-2.7.2/ hadoop
sudo vim /etc/profile
export JAVA_HOME=/soft/jdk
export HADOOP_HOME=/soft/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile

Check the Hadoop version

hadoop version

Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /soft/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

Edit the hosts file

sudo vim /etc/hosts

Add:

192.168.163.200 s200    # s200 is the hostname, as reported by the hostname command

Edit the configuration files

cd /soft/hadoop/etc/hadoop
vim hadoop-env.sh

Change:

export JAVA_HOME=/soft/jdk

vim core-site.xml

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://s200/</value>
</property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/soft/data</value>
</property>

vim hdfs-site.xml

<property>                    
        <name>dfs.replication</name>
        <value>1</value>            
</property>

cp mapred-site.xml.template mapred-site.xml

vim  mapred-site.xml

<property>                               
        <name>mapreduce.framework.name</name>
        <value>yarn</value>                  
</property>   

vim yarn-site.xml 

<property>                                   
       <name>yarn.resourcemanager.hostname</name> 
       <value>s200</value>                   
</property>                                  
<property>                                   
       <name>yarn.nodemanager.aux-services</name> 
       <value>mapreduce_shuffle</value>           
</property>  

Format the NameNode

hadoop namenode -format

Start Hadoop

start-all.sh

jps
----------------------------
4150 SecondaryNameNode
4300 ResourceManager
4685 Jps
3870 NameNode
3966 DataNode
4398 NodeManager
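
As an extra sanity check (not in the original run), you can confirm HDFS is answering before continuing; both commands are standard HDFS client calls:

hdfs dfsadmin -report
hdfs dfs -ls /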

7. Install RHIPE

Refresh R's Java configuration (javareconf)

su root
R CMD javareconf
exit

Add the Hadoop dependency paths to the environment variables

sudo vim /etc/profile 

export HADOOP_LIBS=/soft/hadoop-2.7.2/etc/hadoop:/soft/hadoop-2.7.2/share/hadoop/common/lib/:/soft/hadoop-2.7.2/share/hadoop/common/:/soft/hadoop-2.7.2/share/hadoop/hdfs:/soft/hadoop-2.7.2/share/hadoop/hdfs/lib/:/soft/hadoop-2.7.2/share/hadoop/hdfs/:/soft/hadoop-2.7.2/share/hadoop/yarn/lib/:/soft/hadoop-2.7.2/share/hadoop/yarn/:/soft/hadoop-2.7.2/share/hadoop/mapreduce/lib/:/soft/hadoop-2.7.2/share/hadoop/mapreduce/:/soft/hadoop/contrib/capacity-scheduler
export HADOOP_BIN=/soft/hadoop/sbin
export HADOOP_CONF_DIR=/soft/hadoop/etc/hadoop

Note: the HADOOP_LIBS path above can be produced with export HADOOP_LIBS=`hadoop classpath | tr -d '*'`. Since that path is very long, an alternative is to copy every jar under Hadoop's share directory and its subdirectories into a single folder, e.g. /soft/libs, and then simply set export HADOOP_LIBS=/soft/libs.

mkdir /soft/libs
find /soft/hadoop/share/hadoop/ -name "*.jar" -type f -exec cp {} /soft/libs/ \;

Reload the profile

source /etc/profile

Install protobuf-2.5.0

cd ~
tar -zxvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr/local --libdir=/usr/lib64
make 
sudo make install
protoc --version
libprotoc 2.5.0

Note: installation on Ubuntu is a little different; the Ubuntu steps are attached below.

./configure
make 
sudo make install
cd /usr/local/lib
pkg-config --modversion protobuf
2.5.0
pkg-config --libs protobuf
-L/usr/lib64 -lprotobuf -pthread -lpthread
sudo vim /etc/ld.so.conf.d/Protobuf-x86.conf
/usr/local/lib    # add /usr/local/lib, then save and exit
sudo /sbin/ldconfig

Add the environment variable

sudo vim /etc/profile
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/:/usr/local/lib64/pkgconfig/
source /etc/profile

Start R and install the dependency packages

R
install.packages("digest")
install.packages("rJava")
q()

Install RHIPE

cd ~
R CMD INSTALL Rhipe_0.75.1.7_hadoop-2.tar.gz

8. Testing

R
library(Rhipe)
rhinit()

HDFS interface test

rhls("/",recurse=T)

[1] permission owner      group      size       modtime    file      
<0 rows> (or 0-length row.names)
rhmkdir("/user/hadoop/housing")
[1] TRUE
rhls("/")

permission  owner      group size          modtime  file
1 drwxr-xr-x hadoop supergroup    0 2017-03-18 00:55 /user
rhput("~/housing.txt","/user/hadoop/housing")
rhls("/user/hadoop/housing")

permission  owner      group     size          modtime	file
1 -rw-r--r-- hadoop supergroup 7.683 mb 2017-03-18 01:04	/user/hadoop/housing/housing.txt

MapReduce test

rhmkdir("/tmp")
q()

vim mrtest.R

#!/usr/bin/Rscript
library(Rhipe)
rhinit()

map1 <- expression({
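        # Map: split each comma-separated input line; fields 1-3 (FIPS, county, state)
        # become the output key, fields 4-7 a one-row data frame (date, units, listing, selling).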
        lapply(seq_along(map.keys), function(r) {
        line = strsplit(map.values[[r]], ",")[[1]]
        outputkey <- line[1:3]

        outputvalue <- data.frame(
                  date = as.numeric(line[4]),
                  units =  as.numeric(line[5]),
                  listing = as.numeric(line[6]),
                  selling = as.numeric(line[7]),
                  stringsAsFactors = FALSE
        )
        
       rhcollect(outputkey, outputvalue)
  })
})

reduce1 <- expression(
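        # Reduce: rbind all rows arriving for the same key into one per-county data frame,
        # then attach (FIPS, county, state) as a "location" attribute before emitting.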
        pre = {
                 reduceoutputvalue <- data.frame()
               },

       reduce = {
                 reduceoutputvalue <- rbind(reduceoutputvalue, do.call(rbind, reduce.values))
               },
       post = {
                 reduceoutputkey <- reduce.key[1]
                 attr(reduceoutputvalue, "location") <- reduce.key[1:3]
                 names(attr(reduceoutputvalue, "location")) <- c("FIPS","county","state")
                 rhcollect(reduceoutputkey, reduceoutputvalue)
               }
            )

mr1 <- rhwatch(
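               # Job spec: plain-text input, RHIPE sequence-file output; readback = FALSE
               # leaves the result in HDFS rather than reading it back into R.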
               map      = map1,
               reduce   = reduce1,
               input    = rhfmt("/user/hadoop/housing/housing.txt", type = "text"),
               output   = rhfmt("/user/hadoop/housing/byCounty", type = "sequence"),
               readback = FALSE
             )

Run the script

chmod a+x mrtest.R
./mrtest.R

View the results

R
library(Rhipe)
rhinit()
result <- rhread("/user/hadoop/housing/byCounty/part-r-00000")
keys <- unlist(lapply(result, "[[", 1))

keys
[1] "01001" "01003" "01005" "01007" "01009" "01011" "01013" "01015" "01017"
........

attributes(result[[1]][[2]])

$names
[1] "date"    "units"   "listing" "selling"
$row.names
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
$class
[1] "data.frame"
$location
FIPS    county     state 
"01001" "Autauga"      "AL" 

q()

9. Pitfalls

1. Because RHIPE writes its output as type = "sequence", viewing the result with hdfs dfs -text xxx reports the following error:

java.lang.ClassNotFoundException: Class org.godhuli.rhipe.RHBytesWritable not found

Read the results with rhread("xxx") instead.

2. When running the MapReduce job, the container gets killed:

Container [pid=61258,containerID=container_1489843711705_0001_01_000002] is running beyond virtual memory limits. 
Current usage: 180.5 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container.

Analysis: this is about physical memory versus virtual memory. Searching yarn-default.xml for "virtual memory" turns up the property below, which conveniently contains the 2.1 from the error message:

<property>
        <description>Ratio between virtual memory to physical memory when setting memory limits for containers.
                     Container allocations are expressed in terms of physical memory, and virtual memory usage
                     is allowed to exceed this allocation by this ratio.
        </description>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
</property>

A ratio of 2.1 is too small here, so the container gets killed. Change it to 3 (adjust to your actual workload), restart Hadoop, and the job runs to completion.
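
To make the change, override the default in yarn-site.xml rather than editing yarn-default.xml itself; with the value 3 chosen above, the entry looks like the other properties in step 6:

<property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>3</value>
</property>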

Note:

yarn.nodemanager.vmem-pmem-ratio defaults to 2.1 (alternatively, you can try increasing the number of reducers the job uses). This ratio caps virtual memory usage: when the virtual memory YARN measures for a container exceeds 2.1 times mapreduce.map.memory.mb or mapreduce.reduce.memory.mb (set in mapred-site.xml, default 1024 MB), the exception above is thrown and the NodeManager daemon kills the container, so the whole MR job fails. With the defaults the limit is 1024 MB × 2.1 ≈ 2.1 GB, which is exactly the "2.1 GB of 2.1 GB virtual memory used" in the error. The fix is simply to raise the ratio; how far depends on your situation.

3. When calling RHIPE from RStudio, restart RStudio after adding the environment variables and running source /etc/profile; otherwise it cannot find HADOOP_LIBS, HADOOP_BIN, and HADOOP_CONF_DIR.
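
Alternatively (an untested sketch on my part, using the paths from this install and the single-directory /soft/libs variant from step 7), the variables can be set inside the R session itself before Rhipe is loaded:

Sys.setenv(HADOOP_LIBS     = "/soft/libs",              # or the full classpath from step 7
           HADOOP_BIN      = "/soft/hadoop/sbin",
           HADOOP_CONF_DIR = "/soft/hadoop/etc/hadoop")
library(Rhipe)
rhinit()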
