Introduction
Apache Hadoop 2.8.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.3.
The following features and improvements are available in Apache Hadoop 2.8.0:
- Common
- Support async call retry and failover, which can be used in async DFS implementations.
- Cross Frame Scripting (XFS) prevention for UIs can be provided through a common servlet filter.
- S3A improvements: add the ability to plug in any AWSCredentialsProvider; support reading S3A credentials from the Hadoop credential provider API in addition to XML configuration files; support Amazon STS temporary credentials.
- WASB improvements: adding append API support
- Build enhancements: replace dev-support with wrappers to Yetus; provide a Docker-based solution to set up a build environment; remove CHANGES.txt and rework the change log and release notes.
- Add posixGroups support for LDAP groups mapping service.
- Support integration with Azure Data Lake (ADL) as an alternative Hadoop-compatible file system.
- HDFS
- WebHDFS enhancements: integrate CSRF prevention filter in WebHDFS, support OAuth2 in WebHDFS, disallow/allow snapshots via WebHDFS
- Allow long-running Balancer to log in with keytab
- Add a ReverseXML processor which reconstructs an fsimage from an XML file. This makes it easy to create fsimages for testing, and to manually edit an fsimage when there is corruption.
- Support nested encryption zones
- DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness. This can prevent the NameNode from incorrectly marking DataNodes as stale or dead in highly overloaded clusters where heartbeat processing is suffering delays.
- Log the caller context of HDFS operations into audit logs.
- A new DataNode command for evicting writers, which is useful when DataNode decommissioning is blocked by slow writers.
- YARN
- NodeManager CPU resource monitoring in Windows.
- More graceful NodeManager shutdown: the NM unregisters with the RM immediately, rather than waiting for the timeout and being marked LOST (when NM work-preserving recovery is not enabled).
- Add the ability to fail a specific AM attempt when an attempt gets stuck.
- CallerContext support in YARN audit log.
- ATS versioning support: a new configuration to indicate timeline service version.
- MAPREDUCE
- Allow node labels to be specified when submitting MR jobs
- Add a new tool to combine aggregated logs into HAR files
Reference: hadoop.apache.org
This blog will help you install Hadoop 2.8.0 on the CentOS operating system, including the basic configuration required to start working with Hadoop. The entire process is explained in simple, easy steps.
Step 1 – Installing Java
Java is required to run Hadoop on any system, so before installing Hadoop make sure Java is installed on your system:
$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.121-b04, mixed mode)
If Java is not installed on the system, install Java OpenJDK 8 using the following command:
$ sudo yum install java-1.8.0-openjdk
After installing Java, configure the Java environment variables in /etc/profile.d/java.sh:
export JAVA_HOME=/usr/lib/jvm/java-openjdk
export JAVA_PATH=$JAVA_HOME
export PATH=$PATH:$JAVA_HOME/bin
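The exports above can be sanity-checked without touching your login shell. A minimal sketch, run in a subshell so nothing leaks into your environment (the JDK path is the one this guide assumes):

```shell
# Run the java.sh exports in a subshell and confirm how PATH composes
(
  export JAVA_HOME=/usr/lib/jvm/java-openjdk
  export PATH=$PATH:$JAVA_HOME/bin
  echo "$JAVA_HOME"
  echo "${PATH##*:}"   # the last PATH entry should now be $JAVA_HOME/bin
)
```

If the last line printed is not `$JAVA_HOME/bin`, the profile script was not sourced or the append was mistyped.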
Step 2 – Setup Hadoop user account
It is recommended to create a non-root user account for the Hadoop environment:
$ adduser hadoop
$ passwd hadoop
Set up key-based SSH to the hadoop user's own account:
$ su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Let's verify the key-based login, then exit from the Hadoop user's session:
$ ssh localhost
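Key-based SSH is picky about file modes; if `ssh localhost` still prompts for a password, permissions are the usual culprit. A small sketch that reproduces the expected permission scheme in a scratch directory (the real commands above target `~/.ssh`; this just demonstrates the modes):

```shell
# Reproduce the ~/.ssh permission scheme in a throwaway directory and print the modes
scratch=$(mktemp -d)
mkdir -p "$scratch/.ssh"
chmod 0700 "$scratch/.ssh"                      # directory must be 700
touch "$scratch/.ssh/authorized_keys"
chmod 0600 "$scratch/.ssh/authorized_keys"      # key file must be 600
stat -c '%a' "$scratch/.ssh" "$scratch/.ssh/authorized_keys"   # prints 700 then 600
rm -rf "$scratch"
```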
Step 3 – Download Hadoop source file
Download the Hadoop 2.8.0 release tarball. For a different version, refer to http://hadoop.apache.org
$ cd /usr/local
$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
$ tar xzf hadoop-2.8.0.tar.gz
$ mv hadoop-2.8.0 hadoop
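It is good practice to verify a downloaded tarball before extracting it; Apache publishes checksum files alongside each release, and you compare the digest you compute locally against the published one. A self-contained sketch of reading a sha256sum digest (a known string stands in for the tarball here, so the example runs without the download):

```shell
# Compute a SHA-256 digest the same way you would for the tarball:
#   sha256sum hadoop-2.8.0.tar.gz
# sha256sum prints "<digest>  <filename>"; cut isolates the digest for comparison
printf 'hello' | sha256sum | cut -d' ' -f1
# → 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

Compare that digest to the value in the release's checksum file; any mismatch means a corrupted or tampered download.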
Step 4 – Configure Hadoop Pseudo-Distributed Mode
- Setup Environment Variables
Edit the ~/.bashrc file and append the following values at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply the changes to the current running environment:
$ source ~/.bashrc
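Every derived variable above resolves from HADOOP_HOME, so a quick echo confirms the file was sourced correctly. A minimal check, run in a subshell so your session is untouched (the install path is the one used in this guide):

```shell
# Confirm the derived Hadoop variables expand from HADOOP_HOME as expected
(
  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  echo "$HADOOP_COMMON_LIB_NATIVE_DIR"   # prints /usr/local/hadoop/lib/native
)
```

After `source ~/.bashrc`, `echo $HADOOP_HOME` in your own shell should print the same base path.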
Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME
# Change Java home path as per java installed on your system
export JAVA_HOME=/usr/lib/jvm/java-openjdk
- Edit Configuration Files
Hadoop has many configuration files, which need to be set up according to the requirements of your Hadoop environment.
$ cd $HADOOP_HOME/etc/hadoop
- i) Edit core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
- ii) Edit hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
- iii) Edit mapred-site.xml
$ cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- iv) Edit yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
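The edits above can also be scripted, which is handy when repeating the setup on several machines. A hedged sketch that writes core-site.xml from a heredoc and sanity-checks the property name (written to a scratch directory here; in practice you would target `$HADOOP_HOME/etc/hadoop`):

```shell
# Write core-site.xml from a heredoc and verify the property landed in the file
conf=$(mktemp -d)   # stand-in for $HADOOP_HOME/etc/hadoop
cat > "$conf/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
grep -c '<name>fs.default.name</name>' "$conf/core-site.xml"   # prints 1
rm -rf "$conf"
```

The other three files can be generated the same way; the quoted `'EOF'` delimiter prevents the shell from expanding anything inside the XML.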
- Format Hadoop Namenode
Once the Hadoop single-node cluster setup is done, it's time to initialize the HDFS filesystem by formatting the NameNode:
$ hdfs namenode -format
Sample output:
17/02/14 08:13:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.8.0
...
...
17/02/14 08:13:30 INFO namenode.FSImage: Allocated new BlockPoolId: BP-415680745-172.31.10.127-1487060010110
17/02/14 08:13:30 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
17/02/14 08:13:30 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/02/14 08:13:30 INFO util.ExitUtil: Exiting with status 0
17/02/14 08:13:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127
************************************************************/
Step 5 – Start Hadoop Cluster
Let's start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to the Hadoop sbin directory and execute the scripts one by one.
$ cd $HADOOP_HOME/sbin/
Run start-dfs.sh to start the NameNode, DataNode, and secondary NameNode:
$ start-dfs.sh
Sample output:
17/02/14 08:16:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-ip-172-31-10-127.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-ip-172-31-10-127.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a2:9b:7c:8f:21:43:6e:ce:18:5e:85:5b:a1:57:d2:99.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-ip-172-31-10-127.out
17/02/14 08:16:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Run start-yarn.sh to start the YARN daemons, ResourceManager and NodeManager:
$ start-yarn.sh
Sample output:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-ip-172-31-10-127.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-ip-172-31-10-127.out

To check the status of the services, run the jps command:
$ jps
Sample output:
12544 NameNode
13001 ResourceManager
13104 NodeManager
12672 DataNode
13993 Jps
12843 SecondaryNameNode
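A successful single-node start shows all five daemons in the jps listing. A small sketch that checks for each expected daemon name; here the sample output above is embedded as a string so the example is self-contained, but in practice you would pipe `jps` itself:

```shell
# Check the sample jps listing for each daemon a single-node cluster should run.
# In a live session, replace the sample string with:  sample=$(jps)
sample="12544 NameNode
13001 ResourceManager
13104 NodeManager
12672 DataNode
12843 SecondaryNameNode"
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  # -w matches whole words, so "NameNode" will not falsely match "SecondaryNameNode"
  echo "$sample" | grep -qw "$d" && echo "$d up" || echo "$d missing"
done
```

A `missing` line usually means the corresponding daemon failed to start; its log file under `$HADOOP_HOME/logs` will say why.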
Step 6 – Check Hadoop Services
Access port 50070 for information about the NameNode:
http://HOST_NAME:50070/
Access port 8088 for information about the cluster:
http://HOST_NAME:8088/
Access port 50090 for information about the secondary NameNode:
http://HOST_NAME:50090/
Access port 50075 for information about the DataNode:
http://HOST_NAME:50075/
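The same checks can be done from the command line. A hedged sketch that polls each UI port and reports the HTTP status (it assumes the cluster from Step 5 is running on localhost; against a stopped cluster every port reports 000):

```shell
# Poll each Hadoop web UI port and print the HTTP status code
# (000 means the connection failed, i.e. the daemon is not listening)
for port in 50070 8088 50090 50075; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "http://localhost:${port}/" 2>/dev/null || true)
  echo "port ${port}: HTTP ${code:-000}"
done
```

A healthy cluster prints `HTTP 200` for each port.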
Step 7 – Test Hadoop Setup
- i) Make the HDFS directories
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hadoop
Manage Hadoop Services
To start all Hadoop services, run the commands below:
$ start-dfs.sh
$ start-yarn.sh
To stop all Hadoop services, run the commands below:
$ stop-yarn.sh
$ stop-dfs.sh
Hope this article helped you easily set up Hadoop 2.8.0 (single-node cluster) on CentOS. If you have any doubts or queries, please comment below. For updates, follow agiratechnologies.