HADOOP DISTRIBUTED FILE SYSTEM
Abstract - The Hadoop Distributed File System, a Java-based file system, provides reliable and scalable storage for data. It is the key component for understanding how a Hadoop cluster can be scaled over hundreds or thousands of nodes. The large volume of data in a Hadoop cluster is broken down into smaller blocks and distributed across small, inexpensive servers using HDFS. MapReduce functions are then executed on these smaller blocks of data, providing the scalability needed for big data processing. In this paper I discuss Hadoop in detail, covering the architecture of HDFS, how it functions, and its advantages.
I. INTRODUCTION
Over the years it has become essential to process large amounts of data with high precision and speed. Data that can no longer be processed using traditional systems is called Big Data. Hadoop, a Linux-based framework of tools, addresses three main problems faced when processing Big Data that traditional systems cannot handle. The first problem is the speed of the data flow, the second is the size of the data, and the last is the format of the data. Hadoop divides the data and computation into smaller pieces, sends them to different computers, then gathers and combines the results and returns them to the application. This is done using MapReduce and HDFS, the Hadoop Distributed File System. The DataNode and NameNode parts of the architecture fall under HDFS.
II. ARCHITECTURE
Hadoop works on a master-slave architecture, in which a single master (the NameNode) coordinates a cluster of slaves (the DataNodes).
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, first developed by Yahoo!. It has two core projects: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data and stores it on nodes throughout a cluster, with a number of replicas. It provides an extremely reliable, fault-tolerant, consistent, efficient and cost-effective way to store a large amount of data. The MapReduce model consists of two key functions: Mapper and Reducer. The Mapper processes input data splits in parallel through different map tasks and sends sorted, shuffled outputs to the Reducers, which in turn group and process them using one reduce task per group.
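To make the Mapper and Reducer roles concrete, the following sketch shows a minimal word-count job written against the standard org.apache.hadoop.mapreduce Java API; the class names (WordCount, TokenMapper, SumReducer) are illustrative, not taken from any cited work.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Each map task processes one input split and emits (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // outputs are sorted and shuffled to the reducers
      }
    }
  }

  // Each reduce task receives one group of values per key and aggregates it.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}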
The client requests a file from the cold cache of the server and stores the file on its local disk.
This paper proposes a backup-task mechanism to mitigate straggler tasks, the final set of MapReduce tasks that take an unusually long time to complete. The simplified programming model proposed in the paper opened up the parallel computation field to general-purpose programmers. The paper served as the foundation for the open-source distributed computing software Hadoop; it also tackles various common error scenarios encountered in a compute cluster and provides a fault-tolerance solution within the framework.
HDFS is Hadoop's distributed file system; it provides high-throughput access to data, high availability, and fault tolerance. Data is saved as large blocks, making it suitable for applications that work with large data sets.
Hadoop employs the MapReduce paradigm of computing, which targets batch-job processing. It does not directly support real-time query execution, i.e., OLTP. Hadoop can be integrated with Apache Hive, which supports the HiveQL query language for firing queries, but it still does not provide OLTP operations (such as row-level updates and deletions) and has slow response times (on the order of minutes) due to the absence of pipelining.
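As an illustration of how a HiveQL query is fired from a Java client, the sketch below uses the HiveServer2 JDBC driver; the host, port, table, and column names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver and connect to a (hypothetical) HiveServer2 instance.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // The query is compiled into batch MapReduce jobs, so the response
         // arrives in minutes rather than the milliseconds expected of OLTP.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}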
Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm \cite{dean2008mapreduce}. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
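As an example of that Unix-like interface, the following sketch reads a file through the HDFS Java client API (org.apache.hadoop.fs.FileSystem); the file path is hypothetical and the configuration is assumed to point at an existing cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // connects to the configured NameNode
    Path file = new Path("/user/example/input.txt"); // hypothetical path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                    // block data is streamed from DataNodes
      }
    }
  }
}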
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. It makes use of commodity hardware, and it is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. Hadoop is a widely used big data processing engine with a simple master-slave setup. In most companies, Big Data is processed with Hadoop by submitting jobs to the Master; the Master distributes the job across its cluster and processes the map and reduce tasks sequentially. But nowadays the growing data needs and the competition between service providers lead to ever more jobs being submitted to the Master. This concurrent job submission on a Hadoop cluster forces us to schedule jobs so that the response time remains acceptable for each job.
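To show how such a job reaches the Master, the following driver sketch configures and submits the word-count job from the earlier sketch using the standard org.apache.hadoop.mapreduce.Job API; the input and output paths are supplied as (hypothetical) command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenMapper.class);   // mapper from the earlier sketch
    job.setCombinerClass(WordCount.SumReducer.class);  // local pre-aggregation
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submits the job to the cluster and blocks until the scheduler completes it.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}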
During my curriculum practical training I learned the Hadoop technology; in the initial three weeks I was taught the concepts I needed to be aware of in order to understand Hadoop as a whole. In this regard I was first taught the collections framework in Java.
Hadoop is one of the open-source frameworks used as an extension to the big data analytics frameworks employed by a large group of vendors. This type of framework makes it easier for companies to decide how they are going to store and use the data within digital as well as physical products (James, M. et al. 2011). We can analyze data using Hadoop, which is emerging as a solution to the problem of storing and processing Big Data.
The main purpose of this report is to provide a critical review of the processes and our own experiences with Hadoop within the context of the assignment given to us. The review concentrates on the discussion and evaluation of the overall steps followed during the progress of the project and the reasons for which we chose these particular steps. It also draws attention to the main points that were accomplished, both from the individual and from the group's perspective. Finally, it concentrates on the project's progress in terms of changes for a future implementation.
Inspired in part by MapReduce, Hadoop provides a Java-based software framework for distributed processing of data-intensive transformation and analytics. The top three commercial database suppliers, Oracle, IBM, and Microsoft, have all adopted Hadoop, some within a cloud infrastructure.
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider HDFS to be a data store rather than a file system because of its lack of POSIX compliance and its inability to be mounted, but it provides shell commands and Java API methods similar to those of other file systems. A Hadoop cluster has nominally a single NameNode plus a cluster of DataNodes, although redundancy options are available for the NameNode because of its criticality. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other.
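To make the NameNode/DataNode split concrete, the sketch below asks the NameNode, through the same Java API, which DataNodes hold the blocks of a file; the path is hypothetical and the cluster configuration is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/example/input.txt"));
    // One BlockLocation per block, listing the DataNodes that hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + String.join(",", block.getHosts()));
    }
  }
}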
Big Data is creating great opportunities for businesses, companies, and many large-scale and small-scale industries. Hadoop, as an open-source cloud computing and big data framework, is increasingly used in the IT world. The rapid growth of Hadoop and cloud computing clearly indicates its importance as a Big Data enabling technology. Due to loopholes in its security mechanisms, the security issues introduced through adoption of this technology are also increasing. Hadoop services do not authenticate users or other services; as a result, Hadoop is subject to security risks. Big Data is already a prime target for attacks due to the valuable information it holds. In this paper, these security issues are also discussed.
Modern data centers are always interested in new technology for tasks such as web search analysis, web logs, big data analysis, and social networking. For these tasks, new technology is implemented using parallel processing for large-scale database analysis; MapReduce is one such technology used to take in large amounts of data, perform massive computation, and extract critical knowledge from big data for business intelligence. Proper analysis of large-scale datasets requires adequate input/output capacity from large server systems; analyzing web-log data, for example, is carried out in two steps called mapping and reducing. Between these two steps, MapReduce requires an important phase called the shuffling phase, which exchanges the intermediate data. During data shuffling, physically moving segments of intermediate data across disks causes major I/O contention and creates input/output problems such as high power consumption and heat generation, which account for a large portion of the operating cost of data centers analyzing such big data. In this synopsis we introduce a new virtual shuffling approach that enables well-organized data movement and reduces the I/O burden of MapReduce shuffling, thereby reducing power consumption and conserving energy. Virtual shuffling is achieved through a combination of three techniques: a three-level segment table, near-demand merging, and dynamic and balanced merging subtrees.
Spark is an in-memory cluster computing framework that falls under the open-source Hadoop project but does not follow the two-stage MapReduce paradigm used by Hadoop, and it is designed to be faster. Spark instead supports a wide range of computations and applications, including interactive queries, stream processing, batch processing, iterative algorithms, and streaming, by extending the idea of the MapReduce model. Execution time is the most important factor for every process that handles a large amount of data; with large data sets, the time it usually takes to explore the data and execute queries can be thought of in terms of hours, which in-memory processing aims to reduce.
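As a contrast with the two-stage MapReduce style, the sketch below expresses the same word count with Spark's Java API, keeping intermediate data in memory between stages; the application name and input/output paths are hypothetical, and a Spark 2.x-style API is assumed.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///user/example/input.txt");
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey((a, b) -> a + b);               // intermediate data stays in memory
      counts.saveAsTextFile("hdfs:///user/example/output");
    }
  }
}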