Hadoop is an ecosystem used to process big data that cannot be handled with traditional processing tools. Hadoop stores data in a distributed file system that differs considerably from the Windows file system. Let us look more closely at the Hadoop Distributed File System (HDFS).
A file system manages files and directories; it is the way a machine stores and organizes data. Every file system keeps metadata about its files and directories. For example, when we move a file into a file system, the file is stored on disk and the file system records metadata about the file's location.
Now imagine that your file is so large that it cannot fit on a single disk. You would then need several machines, say three, divide the data among them, and manage the metadata of each partition yourself.
A distributed file system is all about partitioning big data across multiple machines and properly managing the metadata of all these partitions. In Hadoop, the partitions are stored on a cluster made up of several nodes.
Inside the Hadoop cluster, the metadata for all partitions is kept on a separate machine called the Name Node. Whenever a file is moved into the cluster, Hadoop automatically divides it into multiple partitions, loads them onto multiple machines, and manages the metadata of each partition.
In HDFS, the default block size is 128 MB, so if you load 500 MB of data, it is automatically divided into 128 MB partitions, with the remainder going into a final, smaller partition:
partition1 size 128 MB
partition2 size 128 MB
partition3 size 128 MB
partition4 size 116 MB
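The split above is simple arithmetic, and a few lines of Python can reproduce it. This is only an illustrative sketch: the function name `split_into_blocks` is made up for this example and is not part of any Hadoop API.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the partition sizes (in MB) for a file of the given size.

    Hypothetical helper for illustration only; HDFS does this internally.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Number of completely filled blocks, plus whatever is left over.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last partition is smaller
    return sizes


for number, size in enumerate(split_into_blocks(500), start=1):
    print(f"partition{number} size {size} MB")
```

Only the last partition can be smaller than the block size; a file that is an exact multiple of 128 MB produces no remainder partition.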
In short, HDFS is the management of files and directories that are distributed across the Hadoop cluster.
- Name Node
The Name Node is the master machine in HDFS. It keeps the metadata of the directories and files that are distributed across the Hadoop cluster. For example, if you load 500 MB of data into HDFS, Hadoop divides it into four partitions (three of 128 MB and one of 116 MB) and distributes them over the cluster, where each partition occupies a block. The Name Node records the address of every block.
- Data Node
Data Nodes hold the actual data, along with the Hadoop services responsible for storage and data processing. For example, a Task Tracker runs on each Data Node in the Hadoop cluster.
- Job Tracker
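The Job Tracker is responsible for accepting data processing jobs. These jobs, called MapReduce jobs, are typically written in the Java programming language and execute in parallel on various Data Nodes. The Job Tracker assigns the work of a job to the Task Trackers present on the Data Nodes, so that a program can run on several Data Nodes at the same time. (The Job Tracker and Task Tracker belong to Hadoop 1; in Hadoop 2 and later, YARN's ResourceManager and NodeManager take over these roles.)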
- Task Tracker
The Task Tracker is responsible for executing the work of a job on the Hadoop cluster. It uses the resources of its Data Node, such as RAM, CPU, and the JVM, for job execution.
The JVM (Java Virtual Machine) is the runtime that executes Java programs. It is what runs the tasks of a MapReduce job on different Data Nodes at the same time.
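To make the division of roles more concrete, here is a toy Python model of the components described above. It is a sketch under stated assumptions, not how Hadoop is implemented: the class names mirror the roles, but the methods (`add_file`, `assign`), the node names, and the round-robin block placement are all invented for illustration. Real HDFS also replicates each block on several Data Nodes, which this sketch ignores.

```python
from collections import defaultdict
from itertools import cycle

BLOCK_SIZE_MB = 128


class NameNode:
    """Toy master: remembers which Data Node holds each block of each file."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        # file path -> list of (block_id, size_mb, data_node)
        self.block_map = {}

    def add_file(self, path, size_mb):
        # Assumption: blocks are placed round-robin over the Data Nodes.
        nodes = cycle(self.data_nodes)
        blocks, remaining, index = [], size_mb, 0
        while remaining > 0:
            size = min(BLOCK_SIZE_MB, remaining)
            blocks.append((f"{path}#blk{index}", size, next(nodes)))
            remaining -= size
            index += 1
        self.block_map[path] = blocks
        return blocks


class JobTracker:
    """Toy scheduler: hands one task per block to the Task Tracker
    running on the Data Node that stores that block."""

    def __init__(self, name_node):
        self.name_node = name_node

    def assign(self, path):
        assignments = defaultdict(list)
        for block_id, _size, node in self.name_node.block_map[path]:
            assignments[node].append(block_id)
        return dict(assignments)


path = "/user/cloudera/NEWDIR/myfile"
name_node = NameNode(["datanode1", "datanode2", "datanode3"])
name_node.add_file(path, 500)

job_tracker = JobTracker(name_node)
for node, block_ids in job_tracker.assign(path).items():
    print(node, "->", block_ids)
```

With three Data Nodes and a 500 MB file, the first node ends up with two blocks and the other two with one each, so three Task Trackers can work on the same job at the same time.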
USEFUL HDFS COMMANDS
- ls command to list directories and files
hadoop fs -ls /user/cloudera/
- mkdir command to create a new directory in HDFS
hadoop fs -mkdir /user/cloudera/NEWDIR
- rm command to remove directories and files (-r removes a directory and its contents recursively)
hadoop fs -rm -r NEWDIR
- chmod command to change file and directory permissions
hadoop fs -chmod 777 NEWDIR
- put command to copy data from the local file system to HDFS
hadoop fs -put /home/cloudera/Desktop/myfile /user/cloudera/NEWDIR
- cat command to view the content of a file
hadoop fs -cat NEWDIR/myfile
- copyToLocal command to copy data from HDFS to the local file system
hadoop fs -copyToLocal NEWDIR/myfile /home/cloudera/Desktop
- du command to check directory and file sizes
hadoop fs -du NEWDIR
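If you want to script these commands instead of typing them one by one, a small Python wrapper is one option. The sketch below is an assumption-laden example rather than an official client: the helper names `hdfs_command` and `run` are invented here, and the paths are the same sample paths used above. By default it only prints the commands; `run` executes one only when the `hadoop` CLI is found on the PATH.

```python
import shlex
import shutil
import subprocess


def hdfs_command(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


def run(*args):
    """Run one `hadoop fs` command if the hadoop CLI is installed.

    Returns the CompletedProcess, or None when hadoop is not on the PATH.
    """
    command = hdfs_command(*args)
    if shutil.which("hadoop") is None:
        print("hadoop not found, skipping:", shlex.join(command))
        return None
    return subprocess.run(command, capture_output=True, text=True, check=False)


# Sample workflow using the commands listed above.
workflow = [
    ("-mkdir", "/user/cloudera/NEWDIR"),
    ("-put", "/home/cloudera/Desktop/myfile", "/user/cloudera/NEWDIR"),
    ("-ls", "/user/cloudera/NEWDIR"),
    ("-cat", "NEWDIR/myfile"),
]

for step in workflow:
    # Dry run: print each command; call run(*step) instead to execute it.
    print(shlex.join(hdfs_command(*step)))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a path contains spaces.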