Difficulties that I faced during the project

In this blog I am going to write about all the difficulties that I faced during the execution of my project.

To be honest, I had no knowledge of big data and Hadoop before this project, but I had some interest in the topic because last semester I did DAT601, in which our teacher Todd Cochrane talked about big data and how it is becoming an important topic in databases.

The first problem that I faced during this project was deciding the exact topic for my project relating to big data. During this whole time even the objective of my project changed to some extent. The first confusion was whether to choose a research-based topic or a practical implementation topic. I decided on a research-based topic in which I would compare different data analysis tools for analysing big data, but when I told my respected supervisor Lars Dam about it, he said that I should also try to focus on the practical implementation of these data analysis tools. So under his guidance I decided on a topic that was both research-based and practical: analysing big data with Hadoop, a data analysis tool, by building a one-node cluster on my laptop.

The next problem I faced was that when I tried to research the practical implementation of Hadoop, I watched many YouTube videos, which discussed Hadoop cluster implementation in different ways, and this left me confused for a couple of weeks. I was taking my time because if I started my project in the wrong way, it would be difficult to finish it off properly.

In one video I watched a Hadoop one-node implementation, but the virtual machine the presenter used was built for VMware Workstation, and he did not specifically say that in the video. I did not have VMware; I had Oracle VM VirtualBox on my desktop. Because it was not compatible with VirtualBox, that virtual machine did not work.

When I watched some other YouTube videos, I realised that the virtual machine we download from Cloudera (which provides an open-source Hadoop distribution) needs to be compatible with the virtualization software installed on the desktop.

This time I wanted to install the Cloudera (CDH) virtual machine that is compatible with Oracle VM VirtualBox. But while watching the video I noticed that the RAM requirement for this virtual machine is 8 GB, which my laptop's specification could not meet.

I told this to my supervisor Mr Lars Dam, and he advised me to get help from Mr Mark Caukill (a networking specialist) so that I could use the Talos room as a host for the CDH virtual machine.

I even got permission from him to use the Talos server room as a host.

So I started to work on this, but one more hurdle came my way.

I thought that I could directly export the virtual machine from my desktop into the virtual environment provided by Mr Mark Caukill, but I was wrong.

For the virtual machine to be exported into the virtual environment, it first needed to be added to the environment's library, which could only be done by the administrator, Mr Mark Caukill. But at that time he was on vacation, so I could not ask him for help.

So I decided to borrow my flatmate's laptop for the Hadoop cluster demonstration, because his laptop has 16 GB of RAM, and finally I was able to run the CDH virtual machine.


Apache Hadoop Ecosystem

Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware. CDH, Cloudera’s open source platform, is the most popular distribution of Hadoop and related projects in the world (with support available via a Cloudera Enterprise subscription).

I have downloaded CDH.

CDH, Cloudera's distribution, provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise. Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.

Apache Hadoop’s core components, which are integrated parts of CDH and supported via a Cloudera Enterprise subscription, allow me to store and process unlimited amounts of data of any type all within a single platform.

QuickStarts for CDH 5.13

Virtualized clusters for easy installation on your desktop.

Cloudera QuickStart VMs (single-node cluster) make it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes, and include Cloudera Manager for managing your cluster. Cloudera QuickStart VM also includes a tutorial, sample data, and scripts for getting started.

Prerequisites

  • These 64-bit VMs require a 64-bit host OS and a virtualization product that can support a 64-bit guest OS.
  • To use a VMware VM, you must use a player compatible with WorkStation 8.x or higher:
    • Player 4.x or higher
    • Fusion 4.x or higher

    You can use older versions of WorkStation to create a new VM using the same virtual disk (VMDK file), but some features in VMware Tools are not available.

  • The amount of RAM required varies by the runtime option you choose.

Advantages and use cases of Apache Hadoop

Apache Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers.

By the term cluster we mean a set of computers where one computer is the master and the others are slaves.

The advantages of Apache Hadoop are:

  • No single point of failure
  • Faster processing, because data is divided into blocks and processed in parallel.
  • Fault tolerance due to replication of data; by default the replication factor is 3 (see the sketch after this list).
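
To make the replication point concrete, here is a minimal sketch, assuming a reachable HDFS cluster and a hypothetical file /data/sample.txt, that reads and changes a file's replication factor through Hadoop's Java FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath;
            // fs.defaultFS must point at the cluster's NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.txt"); // hypothetical example path

            // Read the current replication factor (3 by default).
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Ask HDFS to keep one extra copy of this file's blocks.
            fs.setReplication(file, (short) 4);
        }
    }

HDFS then re-replicates the file's blocks in the background until the requested number of copies exists.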

Use cases for Apache Hadoop are:

1-Use of Big Data in the Retail Industry: With the growth of the retail industry, and millions of transactions spread across multiple disconnected systems, it is impossible to see the full picture of the data being generated, because retail stores typically do not communicate with each other. Daily updates are fed into each system, and in most cases the systems do not interact. On the other hand, the market is growing day by day, which makes it an impossible task for a marketing analyst to understand the strength and health of a product or campaign. Transaction data in its raw form helps a company understand its sales patterns. Retailers can use BIG DATA, combining web browsing patterns, social media, industry forecasts, existing customer records and many other data sources, to predict trends, prepare for demand, pinpoint customers, optimize pricing and promotions, and monitor real-time analytics and results.

2-Big Data and Hadoop helping in wildlife conservation: There are a lot of wildlife projects in progress nowadays to protect our ecosystem and endangered species. A large amount of BIG DATA is being generated, and Apache Hadoop can help in analysing this data.

3-Credit Card Fraud Detection

As millions of people are using credit cards nowadays, it has become very necessary to protect people from fraud. It has become a challenge for credit card companies to identify whether a requested transaction is fraudulent or not.

 

4-Sentiment Analysis

Sentiment analysis provides substance behind social data. A basic task in sentiment analysis is classifying the polarity of a given text at the document or sentence level: whether the opinion expressed in a document or sentence is positive, negative, or neutral.
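
As an illustration only, here is a toy lexicon-based polarity classifier in Java; the word lists are made up for the example, and a real system would use much larger lexicons or a trained model:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class PolarityDemo {
        // Tiny illustrative word lists, invented for this sketch.
        private static final Set<String> POSITIVE =
                new HashSet<>(Arrays.asList("good", "great", "love", "excellent"));
        private static final Set<String> NEGATIVE =
                new HashSet<>(Arrays.asList("bad", "poor", "hate", "terrible"));

        // Count positive and negative words and report the overall polarity.
        static String classify(String sentence) {
            int score = 0;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (POSITIVE.contains(token)) score++;
                if (NEGATIVE.contains(token)) score--;
            }
            return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
        }

        public static void main(String[] args) {
            System.out.println(classify("I love this phone, the camera is great"));
            System.out.println(classify("The battery is terrible"));
        }
    }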


Apache Pig vs Apache Hive

Difference Between Apache Pig and Apache Hive

Apache Pig came into existence in 2006, when researchers at Yahoo were struggling with MapReduce Java code: it was difficult to reuse and maintain. They also observed that MapReduce users were not comfortable with declarative languages such as SQL, so they started work on a new language that would fit a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce. This resulted in the birth of Pig; the first release came in September 2008, and by the end of 2009 about half of the jobs at Yahoo were Pig jobs.

The Apache Hive story begins in 2007, when non-Java programmers had to struggle while using Hadoop MapReduce. IT professionals from a database background were facing challenges working on a Hadoop cluster. Researchers working at Facebook came up with the Hive language, which was very similar to SQL, so it was called Hive Query Language (HQL); later it became an open-source project of the Apache Community, after which there was major development in Apache Hive. Facebook was the first company to come up with Apache Hive.

Let me explain Apache Pig vs Apache Hive in more detail.

Introducing Apache Pig vs Apache Hive

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig is an open-source project of the Apache Community. Apache Pig provides a simple language called Pig Latin for queries and data manipulation; a small sketch of running it from Java follows the list below.

Pig is used by companies like Yahoo, Google and Microsoft for collecting huge amounts of data in the form of click streams, search logs and web crawls.

  • Apache Pig provides nested data types like maps, tuples, and bags
  • Apache Pig follows a multi-query approach to avoid multiple scans of the datasets
  • Programmers familiar with scripting languages prefer Apache Pig
  • Pig is easy if you are well aware of SQL
  • There is no need to create a schema to work in Apache Pig
  • Pig also provides support for major data operations like ordering, filters, and joins
  • The Apache Pig framework translates Pig Latin into sequences of MapReduce programs
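
Here is a minimal sketch of that last point, using Pig's embedded Java API (PigServer). The input file logs.txt and its two-column layout (user_id, url) are made up for the example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // Local mode keeps the sketch self-contained; on a real
            // cluster you would use ExecType.MAPREDUCE instead.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // logs.txt is a hypothetical tab-separated file: user_id, url.
            pig.registerQuery(
                "logs = LOAD 'logs.txt' AS (user_id:chararray, url:chararray);");
            pig.registerQuery("by_user = GROUP logs BY user_id;");
            pig.registerQuery(
                "counts = FOREACH by_user GENERATE group, COUNT(logs);");

            // Nothing executes until a STORE (or DUMP) is reached.
            pig.store("counts", "counts_out");
        }
    }

Notice the procedural style: each statement names an intermediate result, and in MapReduce mode Pig compiles the chain into a sequence of MapReduce jobs.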

 

Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Apache Hive is an Apache open-source project built on top of Hadoop for querying, summarizing and analyzing large data sets using a SQL-like interface. Apache Hive provides the SQL-like language called HiveQL, which transparently converts queries to MapReduce for execution on large datasets stored in the Hadoop Distributed File System (HDFS); a short HiveQL sketch follows the list below.

  • Apache Hive is a data warehouse infrastructure
  • Apache Hive is an ETL tool (Extraction, Transformation, Loading)
  • Apache Hive's query language is similar to SQL
  • Apache Hive enables customized mappers and reducers
  • Apache Hive increases schema design flexibility using data serialization and deserialization
  • Apache Hive is an analytical tool
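
For comparison, here is the same per-user count as the Pig sketch, expressed declaratively in HiveQL and issued over JDBC. This is only a sketch: the HiveServer2 URL, the credentials, and the page_visits table are assumptions, and the Hive JDBC driver (hive-jdbc) must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint; adjust host/port/database.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Plain HiveQL: Hive translates this into MapReduce jobs
                // over files stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS views "
                    + "FROM page_visits GROUP BY user_id");

                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

Here you only say what result you want; Hive decides how to compute it.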


Key differences between Apache Pig vs Apache Hive:

  • Apache Pig is faster than Apache Hive
  • Apache Pig and Apache Hive both run on top of Hadoop MapReduce
  • Apache Pig is best for structured and semi-structured data, while Apache Hive is best for structured data
  • Apache Pig is a procedural language while Apache Hive is a declarative language
  • Apache Pig supports the cogroup feature for outer joins while Apache Hive does not
  • Apache Pig does not have a pre-defined database to store tables/schemas, while Apache Hive has pre-defined tables/schemas and stores their information in a database
  • Apache Pig is also suited to complex and nested data structures, while Apache Hive is less suited to complex data
  • Researchers and programmers use Apache Pig, while data analysts use Apache Hive
When to use Apache Pig:
  • When you are a programmer and know a scripting language
  • When you don't want to create a schema while loading
  • For ETL requirements
  • When you are working on the client side of the Hadoop cluster
  • When you are working with the Avro Hadoop file format
When to use Apache Hive:
  • For data warehousing requirements
  • For analytical queries over historical data
  • By data analysts who are familiar with SQL
  • While working on structured data
  • To visualize and create reports

Apache Pig vs Apache Hive Comparison Table

Below I discuss the major points of difference between Apache Pig and Apache Hive.

  • Data Processing: Apache Pig is a high-level data-flow language; Apache Hive is used for batch processing, i.e. Online Analytical Processing (OLAP).
  • Processing Speed: Apache Pig has higher latency because it executes MapReduce jobs in the background; Apache Hive also has higher latency for the same reason.
  • Compatibility with Hadoop: Apache Pig runs on top of MapReduce; Apache Hive also runs on top of MapReduce.
  • Definition: Apache Pig is an open-source, high-level data-flow system that gives you a simple language platform known as Pig Latin for manipulating and querying data; Apache Hive is open source and offers a SQL-like language used for analytical queries.
  • Language Used: Apache Pig uses a procedural data-flow language called Pig Latin; Apache Hive uses a declarative language called HiveQL.
  • Schema: Apache Pig doesn't have a concept of schema (you store data in an alias); Apache Hive supports schemas for inserting data into tables.
  • Web Interface: Apache Pig does not support a web interface; Apache Hive does.
  • Operations: Apache Pig is used for structured and semi-structured data; Apache Hive is used for structured data.
  • User Specification: Apache Pig is used by researchers and programmers; Apache Hive is used by data analysts.
  • Operates On: Apache Pig operates on the client side of the cluster; Apache Hive operates on the server side.
  • Partition Methods: There is no concept of partitioning in Apache Pig; Apache Hive supports sharding/partitioning features.
  • File Format: Apache Pig supports the Avro file format; Apache Hive does not support Avro directly but can via "org.apache.hadoop.hive.serde2.avro".
  • JDBC/ODBC: Apache Pig does not support them; Apache Hive supports them, but with limitations.
  • Debugging: Pig scripts are easy to debug; Hive can be debugged, but it is a bit more complex.

Conclusion - Apache Pig vs Apache Hive:

Apache Pig and Apache Hive are both commonly used on Hadoop clusters. Both are powerful tools for data analysis, and both are widely used in production environments. A user needs to select a tool based on data types and the expected output. Both tools provide a unique way of analyzing Big Data on a Hadoop cluster, and based on the discussion above a user can choose between Apache Pig and Apache Hive for their requirements.

HADOOP

INTRODUCTION TO HADOOP - Hadoop is an open-source framework that allows distributed processing of large datasets on clusters of commodity hardware. Apache owns it, so from here onwards, wherever I write Hadoop, it automatically means Apache Hadoop.

In simple words, Hadoop is a data management tool and uses scale-out storage.

DEFINING A HADOOP CLUSTER-

It is a system in which Hadoop is installed on many nodes, and each node is connected to the others.

The size of the data is the most important factor when defining a Hadoop cluster.

There are two versions of Hadoop:

  • HADOOP 1
  • HADOOP 2

Let us discuss the components of both Hadoop versions.

HADOOP 1 Components

-HDFS (Hadoop Distributed File System) - It is used to store data.

-MapReduce - This is the framework for processing. In simple terms, it is the processor of Hadoop.

HADOOP 2 COMPONENTS

-HDFS (Hadoop Distributed File System) - It is used to store data.

-YARN/MRv2 - It is the second version of the processor of Hadoop.
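
To make the storage/processor split concrete, here is the classic word-count job, a minimal sketch of the MapReduce programming model: HDFS holds the input and output files, while the MapReduce framework (MRv1 or YARN/MRv2) runs the computation. Input and output paths are passed on the command line:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for each input line, emit (word, 1) pairs.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map phase runs in parallel on each block of the input file (which ties back to the "faster processing" advantage above), and the reduce phase aggregates the per-word counts.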


BIG DATA

In my last blog I wrote about my objective for the upcoming blogs.

In this blog I am going to write about the base of my objective, and that is BIG DATA.

In order to do my research, the first thing I need to understand is BIG DATA.

I started my research by watching some videos about BIG DATA on YouTube. I watched many videos, but I found a YouTube channel named ‘TECHNOLOGICAL GEEKS’ very helpful.

From the knowledge gained from this YouTube channel, I came to understand that BIG DATA is a term used to express the exponential growth of data around us. I also came to understand that it is very difficult to store, collect, maintain, analyse and visualize it.

The main focus of BIG DATA analysts is to extract meaningful information from it.

Now let us focus on some of the characteristics of BIG DATA.

According to my research, BIG DATA has the following characteristics:

1: VOLUME - The data volume of BIG DATA is at a very large scale.

2: VELOCITY - This tells us that big data is being generated at a very rapid rate. It grows more and more with each passing day. That's the reason it is so difficult to analyse big data.

3: VARIETY - Not only is big data very large in volume, but it also includes different types of data:

  • Structured data, e.g. MySQL tables. This type of data is in a proper, structured form.
  • Semi-structured data, e.g. XML, JSON. This type of data is a hybrid between structured and unstructured data.
  • Unstructured data, e.g. text, audio, video. Most of the data nowadays is unstructured.

Now let us talk about the sources of BIG DATA:

  • Social media - The information that we share on social media sites contributes to BIG DATA.
  • Banks - Most transactions nowadays are online. All that data also contributes to BIG DATA.
  • Instruments - RFID readers and security cameras, for example, also contribute to big data.
  • Websites - for example, Amazon. A lot of data is stored on these websites, which also contributes to BIG DATA.
  • Stock markets - They also generate a lot of data every day and make up a major portion of the BIG DATA around us.

Now I would like to discuss some of the use cases of BIG DATA.

  • Recommendation engines - The best examples of this are marketing websites like AMAZON. All of us have noticed that when we search for an item on Amazon, similar items are shown as recommendations whenever we log in to Amazon afterwards. This is made possible by proper big data analysis techniques. Another example is the YouTube recommendation engine: we always see videos recommended in accordance with the videos we searched for previously. BIG DATA analysis plays an important role in these kinds of recommendation engines.
  • Analyzing call detail records (CDR) - This type of analysis is done by telecom companies to find out the needs and expectations of their customers. It is very common nowadays.
  • Fraud detection - Big data helps in detecting many kinds of fraud, such as credit card fraud and online banking fraud.
  • Market basket analysis - Companies try to sell their products using this analysis. They try to understand what a customer is about to purchase and accordingly try to sell related items along with it. For example, if they learn that a customer is buying a mobile phone, they will try to sell mobile covers and other mobile accessories along with it.
  • Sentiment analysis - In this analysis, a topic is posted on social media, people's reviews of it are collected, and those views are analysed to produce a result.


PRJ702

This is my first blog regarding my graduate project PRJ702.

In my project I am going to do a comparison of various approaches to large-scale data analysis and also show a practical demonstration of HADOOP on a database.

All my following blogs will be about my research on the new trending tools used to analyse BIG DATA and how they will help in analysing large-scale data in a more efficient way, along with a practical demonstration of HADOOP.

The main focus of my research will be:

  • What is Hadoop? How is it different from traditional “data storage” architectures? (Note: I have purposely not used the term ‘database’.)
  • What do you gain by using Hadoop? What do you lose?
  • What use cases are suited to Hadoop? Conversely, what use cases are not appropriate for Hadoop (or would not benefit from it)?
  • Finally, I will develop a set of guidelines for interested parties to help them understand whether Hadoop is right for them.