Datakriya

Get More From Your Data

Importance of Data Data Management and Analytics

Hadoop and NoSQL

“In today’s modern computerized era we are adding in every 2 days the same amount of data that mankind has ever produced from the inception till early 21st century.“
 
This clearly explains how with such an accelerated rate we are generating data. Here is the most of these data comes from:

  • Thousands of servers generating multiple lines of log
  • Searches made over search engine
  • Posted data over social networking sites
  • User filling up inquiry forms
  • Increasing size of data warehousing
  • The grown up need of mining and analytics make sure increasing size of data

 
Precisely this data that is very huge in size is known as the Big Data. We need an effective mechanism to access, store and process this data. At the same time we should be able to infer best and most important information out of it. The last point “The grown up need of mining and analytics make sure increasing size of data” is very interesting as currently in today’s era the business competition is very high and it demands lot of data mining and data analytics. Data mining and data analytics demands companies to have as much as data they can afford to store, since it would yield better result when your sample size is of bigger and much bigger. This forced them to not delete or purge data but to store as much as they could/can.
 
Due to the increasing size of data we need good programs/algorithms to process it effectively and to make sure is to use it without losing a bit of data. If someone has to ask to propose a solution to this problem then they would suggest increasing the hardware resource/server capacity to store and process this growing large amount of data. Think for a while that is it really a practical solution? Think over this for how long you can afford to do this? Would that be cost effective? And the biggest question would it be a feasible solution on a particular day in future? Definitely this cannot be a practical approach.
 
This forced people to think differently and they came up with idea of Hadoop. If you notice the big picture considering hardware cost for storage against the data processing cost then you will realize storage is not an issue and the problem lies with data processing. We have a need of something that should be something like ‘Write Once & Read Many Times’. Hadoop works exactly on this model.
 
So, what is Hadoop?
Hadoop is a framework that is Scalable, Fault Tolerant, and High Available and offers Distributed computing over large data sets.
 
Two major components of Hadoop:

HDFS

There is an underline file system of Hadoop known as HDFS. HDFS stands for “Hadoop Distributed File System”. Hadoop has a distributed computing power which enables not to rely on single server for execution but to perform execution over a series of many small commodity computers in a cluster of environment.
 
HDFS is made up of master/slave relationship where it has one master known as ‘Name Node’ and several slaves known as ‘Data Node’. Master node keeps a track of files assigned to data node. Data Node takes part in processing the file, read and writes.
 

Map Reduce

One more component of Hadoop framework is ‘Map Reduce’ which is a programming model framework built up of master/slave relationship where it has one master known as ‘Job Tracker’ and several slaves known as ‘Task Tracker’ that runs per data node machine. Job Tracker is responsible for job scheduling and monitoring on each data node. The data node performs execution as instructed by the Job Tracker.
 
Four salient feature of Hadoop:

Scalable

Scalable means the limit of expansion, Hadoop is highly scalable in a way you can extend it from one server to many machines.
 

Fault Tolerant

Hadoop has good tolerance towards fault, we can find fault tolerance in both:

  • Master Node or Name Node: If master node goes down, we have backup master node running to support the operations
  • Slave Node or Data Node: Data is duplicated among various data nodes, so even if any of the data nodes fails our data would be preserved

 

High Available

Hadoop is high available in a way that data would always be available for processing; absolutely no loss of data. Same data is replicated in more than one data node among all the data nodes. The default limit for replication is 3 means same data is present in 3 different data node.
 

Distributed Computing

In Hadoop framework computation takes place in cluster of commodity computers using some programming logic. Hadoop divides a large dataset among various data node and execute small data set over each individual data node separately but parallel and in real time. This feature boost up the speed of operation.
 
NoSQL also referred to Not Only SQL is a database approach. With the birth of Big Data the traditional RDBMS was not able to fully store and process it. Hence an opportunity to something get evolved was there and NoSQL came. We store data in tabular relations in a traditional RDBMS database approach. However NoSQL has altogether a different mechanism for storage and retrieval of data. And hence some operations are faster in NoSQL and some in RDBMS. NoSQL does not mean that it not supports SQL like language, however it supports SQL like language. Note that NoSQL not fully supports all the ACID transactions. NoSQL is widely used in Big Data and Real Time WEB Applications.
 
Primary reasons for why this approach came to the picture are:

  • Simple Design
  • Non-Relational
  • Distributed
  • Open Source
  • Scaling Horizontal
  • Control over Availability

 
NoSQL databases are further divided into categories:

  • Key Value Store
  • Example – Memcached, Redis

  • Tabular
  • Example – BigTable, HBase

  • Document Oriented
  • Example – MongoDB, Cloudant

Ideally if you notice there is no point in comparing Hadoop to NoSQL. Since Hadoop is a framework that holds and supports many compatible tools, programming language, databases for storing and processing big data in distributed environment. On the other hand NoSQL is a database. Even Hadoop supports one NoSQL based database that is known as HBase. This NoSQL database HBase is a built up over Hadoop framework.


Our Expertise