Difficulties that I faced during the project

In this blog I am going to write about all the difficulties that I faced during the execution of my project.

To be honest, I had no knowledge of big data and Hadoop before this project, but I had some interest in the topic because last semester I did DAT601, in which our teacher Todd Cochrane talked about big data and how it is becoming an important topic in databases.

The first problem that I faced during this project was deciding the exact topic for my project relating to big data. During this whole time even the objective of my project changed to some extent. The first confusion was whether to choose a research-based topic or a practical implementation topic. I decided on a research-based topic in which I would compare different data analysis tools for analysing big data, but when I told my respected supervisor Lars Dam about it, he said that I should also try to focus on the practical implementation of these data analysis tools. So under his guidance I decided on a topic that was both research-based and practical: analysing big data with Hadoop, a data analysis tool, by building a one-node cluster on my laptop.

The next problem I faced was that when I tried to research the practical implementation of Hadoop, I watched many YouTube videos, which discussed Hadoop cluster implementation in different ways, and this left me confused for a couple of weeks. I was taking my time because if I started my project in the wrong way, it would be difficult to finish it off properly.

In one video I watched a Hadoop one-node implementation, but the virtual machine the presenter used was built for VMware Workstation, and he did not specifically say that in the video. I did not have VMware; I had Oracle VM VirtualBox on my desktop. Because it was not compatible with VirtualBox, that virtual machine did not work.

When I watched some other YouTube videos, I realised that the virtual machine we download from Cloudera (which provides an open-source Hadoop distribution) needs to be compatible with the virtualization software installed on the desktop.

This time I wanted to install the Cloudera (CDH) virtual machine that is compatible with Oracle VM VirtualBox. But while watching the video I noticed that the RAM requirement for this virtual machine is 8 GB, which my laptop's specification could not meet.

I told this to my supervisor Mr Lars Dam, and he advised me to get help from Mr Mark Caukill (a networking specialist) so that I could use the Talos room as a host for the CDH virtual machine.

I even got permission from him to use the Talos server room as a host.

So I started to work on this, but one more hurdle came my way.

I thought that I could directly export the virtual machine from my desktop into the virtual environment provided by Mr Mark Caukill, but I was wrong.

For the virtual machine to be exported into the virtual environment, it first needed to be added to the environment's library, which could only be done by the administrator, Mr Mark Caukill. But at that time he was on vacation, so I could not ask him for help.

So I decided to borrow my flatmate's laptop for the Hadoop cluster demonstration, because his laptop has 16 GB of RAM, and finally I was able to run the CDH virtual machine.


Apache Hadoop Ecosystem

Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware. CDH, Cloudera’s open source platform, is the most popular distribution of Hadoop and related projects in the world (with support available via a Cloudera Enterprise subscription).

I have downloaded CDH.

CDH, Cloudera's distribution, provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise. Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.

Apache Hadoop’s core components, which are integrated parts of CDH and supported via a Cloudera Enterprise subscription, allow me to store and process unlimited amounts of data of any type all within a single platform.

QuickStarts for CDH 5.13

Virtualized clusters for easy installation on your desktop.

Cloudera QuickStart VMs (single-node cluster) make it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes, and include Cloudera Manager for managing your cluster. Cloudera QuickStart VM also includes a tutorial, sample data, and scripts for getting started.

Prerequisites

  • These 64-bit VMs require a 64-bit host OS and a virtualization product that can support a 64-bit guest OS.
  • To use a VMware VM, you must use a player compatible with WorkStation 8.x or higher:
    • Player 4.x or higher
    • Fusion 4.x or higher

    You can use older versions of WorkStation to create a new VM using the same virtual disk (VMDK file), but some features in VMware Tools are not available.

  • The amount of RAM required varies by the runtime option you choose.

Advantages and use cases of Apache Hadoop

Apache Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers.

By the term cluster we mean a set of computers where one computer is the master and the others are slaves.

The advantages of Apache Hadoop are:

  • No single point of failure
  • Faster processing, because data is divided into blocks and processed in parallel.
  • Fault tolerance due to replication of data; by default the replication factor is 3 (see the sketch after this list).
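
To make the replication point concrete, here is a minimal sketch, assuming a reachable HDFS cluster and a hypothetical file /data/sample.txt, that reads and changes a file's replication factor through Hadoop's Java FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath;
            // fs.defaultFS must point at the cluster's NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.txt"); // hypothetical example path

            // Read the current replication factor (3 by default).
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Ask HDFS to keep one extra copy of this file's blocks.
            fs.setReplication(file, (short) 4);
        }
    }

HDFS then re-replicates the file's blocks in the background until the requested number of copies exists.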

Use cases for Apache Hadoop are:

1-Use of Big Data in the Retail Industry: With the growth of the retail industry, and millions of transactions spread across multiple disconnected systems, it is impossible to see the full picture of the data being generated, because retail stores typically do not communicate with each other. Daily updates are fed into each system, and in most cases the systems do not interact. On the other hand, the market is growing day by day, which makes it an impossible task for a marketing analyst to understand the strength and health of a product or campaign. Transaction data in its raw form helps a company understand its sales patterns. Retailers can use BIG DATA, combining web browsing patterns, social media, industry forecasts, existing customer records and many other data sources, to predict trends, prepare for demand, pinpoint customers, optimize pricing and promotions, and monitor real-time analytics and results.

2-Big Data and Hadoop helping in wildlife conservation: There are a lot of wildlife projects in progress nowadays to protect our ecosystem and endangered species. A large amount of BIG DATA is being generated, and Apache Hadoop can help in analysing this data.

3-Credit Card Fraud Detection

As millions of people are using credit cards nowadays, it has become very necessary to protect people from fraud. It has become a challenge for credit card companies to identify whether a requested transaction is fraudulent or not.

 

4-Sentiment Analysis

Sentiment analysis provides substance behind social data. A basic task in sentiment analysis is classifying the polarity of a given text at the document or sentence level: whether the opinion expressed in a document or sentence is positive, negative, or neutral.
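
As an illustration only, here is a toy lexicon-based polarity classifier in Java; the word lists are made up for the example, and a real system would use much larger lexicons or a trained model:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class PolarityDemo {
        // Tiny illustrative word lists, invented for this sketch.
        private static final Set<String> POSITIVE =
                new HashSet<>(Arrays.asList("good", "great", "love", "excellent"));
        private static final Set<String> NEGATIVE =
                new HashSet<>(Arrays.asList("bad", "poor", "hate", "terrible"));

        // Count positive and negative words and report the overall polarity.
        static String classify(String sentence) {
            int score = 0;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (POSITIVE.contains(token)) score++;
                if (NEGATIVE.contains(token)) score--;
            }
            return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
        }

        public static void main(String[] args) {
            System.out.println(classify("I love this phone, the camera is great"));
            System.out.println(classify("The battery is terrible"));
        }
    }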


Apache Pig vs Apache Hive

Difference Between Apache Pig and Apache Hive

Apache Pig came into existence in 2006, when researchers at Yahoo were struggling with MapReduce Java code: it was difficult to reuse and maintain. They also observed that MapReduce users were not comfortable with declarative languages such as SQL, so they started work on a new language that would fit a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce. This resulted in the birth of Pig; the first release came in September 2008, and by the end of 2009 about half of the jobs at Yahoo were Pig jobs.

The Apache Hive story begins in 2007, when non-Java programmers had to struggle while using Hadoop MapReduce. IT professionals from a database background were facing challenges working on a Hadoop cluster. Researchers working at Facebook came up with the Hive language, which was very similar to SQL, so it was called Hive Query Language (HQL); later it became an open-source project of the Apache Community, after which there was major development in Apache Hive. Facebook was the first company to come up with Apache Hive.

Let me explain Apache Pig vs Apache Hive in more detail.

Introducing Apache Pig vs Apache Hive

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig is an open-source project of the Apache Community. Apache Pig provides a simple language called Pig Latin for queries and data manipulation; a small sketch of running it from Java follows the list below.

Pig is used by companies like Yahoo, Google and Microsoft for collecting huge amounts of data in the form of click streams, search logs and web crawls.

  • Apache Pig provides nested data types like maps, tuples, and bags
  • Apache Pig follows a multi-query approach to avoid multiple scans of the datasets
  • Programmers familiar with scripting languages prefer Apache Pig
  • Pig is easy if you are well aware of SQL
  • There is no need to create a schema to work in Apache Pig
  • Pig also provides support for major data operations like ordering, filters, and joins
  • The Apache Pig framework translates Pig Latin into sequences of MapReduce programs
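
Here is a minimal sketch of that last point, using Pig's embedded Java API (PigServer). The input file logs.txt and its two-column layout (user_id, url) are made up for the example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // Local mode keeps the sketch self-contained; on a real
            // cluster you would use ExecType.MAPREDUCE instead.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // logs.txt is a hypothetical tab-separated file: user_id, url.
            pig.registerQuery(
                "logs = LOAD 'logs.txt' AS (user_id:chararray, url:chararray);");
            pig.registerQuery("by_user = GROUP logs BY user_id;");
            pig.registerQuery(
                "counts = FOREACH by_user GENERATE group, COUNT(logs);");

            // Nothing executes until a STORE (or DUMP) is reached.
            pig.store("counts", "counts_out");
        }
    }

Notice the procedural style: each statement names an intermediate result, and in MapReduce mode Pig compiles the chain into a sequence of MapReduce jobs.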

 

Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Apache Hive is an Apache open-source project built on top of Hadoop for querying, summarizing and analyzing large data sets using a SQL-like interface. Apache Hive provides the SQL-like language called HiveQL, which transparently converts queries to MapReduce for execution on large datasets stored in the Hadoop Distributed File System (HDFS); a short HiveQL sketch follows the list below.

  • Apache Hive is a data warehouse infrastructure
  • Apache Hive is an ETL tool (Extraction, Transformation, Loading)
  • Apache Hive's query language is similar to SQL
  • Apache Hive enables customized mappers and reducers
  • Apache Hive increases schema design flexibility using data serialization and deserialization
  • Apache Hive is an analytical tool
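
For comparison, here is the same per-user count as the Pig sketch, expressed declaratively in HiveQL and issued over JDBC. This is only a sketch: the HiveServer2 URL, the credentials, and the page_visits table are assumptions, and the Hive JDBC driver (hive-jdbc) must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint; adjust host/port/database.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Plain HiveQL: Hive translates this into MapReduce jobs
                // over files stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS views "
                    + "FROM page_visits GROUP BY user_id");

                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

Here you only say what result you want; Hive decides how to compute it.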


Key differences between Apache Pig vs Apache Hive:

  • Apache Pig is faster than Apache Hive
  • Apache Pig and Apache Hive both run on top of Hadoop MapReduce
  • Apache Pig is best for structured and semi-structured data, while Apache Hive is best for structured data
  • Apache Pig is a procedural language while Apache Hive is a declarative language
  • Apache Pig supports the cogroup feature for outer joins while Apache Hive does not
  • Apache Pig does not have a pre-defined database to store tables/schemas, while Apache Hive has pre-defined tables/schemas and stores their information in a database
  • Apache Pig is also suited to complex and nested data structures, while Apache Hive is less suited to complex data
  • Researchers and programmers use Apache Pig, while data analysts use Apache Hive
When to use Apache Pig:
  • When you are a programmer and know a scripting language
  • When you don't want to create a schema while loading
  • For ETL requirements
  • When you are working on the client side of the Hadoop cluster
  • When you are working with the Avro Hadoop file format
When to use Apache Hive:
  • For data warehousing requirements
  • For analytical queries over historical data
  • By data analysts who are familiar with SQL
  • While working on structured data
  • To visualize and create reports

Apache Pig vs Apache Hive Comparison Table

Below I discuss the major points of difference between Apache Pig and Apache Hive.

  • Data Processing: Apache Pig is a high-level data-flow language; Apache Hive is used for batch processing, i.e. Online Analytical Processing (OLAP).
  • Processing Speed: Apache Pig has higher latency because it executes MapReduce jobs in the background; Apache Hive also has higher latency for the same reason.
  • Compatibility with Hadoop: Apache Pig runs on top of MapReduce; Apache Hive also runs on top of MapReduce.
  • Definition: Apache Pig is an open-source, high-level data-flow system that gives you a simple language platform known as Pig Latin for manipulating and querying data; Apache Hive is open source and offers a SQL-like language used for analytical queries.
  • Language Used: Apache Pig uses a procedural data-flow language called Pig Latin; Apache Hive uses a declarative language called HiveQL.
  • Schema: Apache Pig doesn't have a concept of schema (you store data in an alias); Apache Hive supports schemas for inserting data into tables.
  • Web Interface: Apache Pig does not support a web interface; Apache Hive does.
  • Operations: Apache Pig is used for structured and semi-structured data; Apache Hive is used for structured data.
  • User Specification: Apache Pig is used by researchers and programmers; Apache Hive is used by data analysts.
  • Operates On: Apache Pig operates on the client side of the cluster; Apache Hive operates on the server side.
  • Partition Methods: There is no concept of partitioning in Apache Pig; Apache Hive supports sharding/partitioning features.
  • File Format: Apache Pig supports the Avro file format; Apache Hive does not support Avro directly but can via "org.apache.hadoop.hive.serde2.avro".
  • JDBC/ODBC: Apache Pig does not support them; Apache Hive supports them, but with limitations.
  • Debugging: Pig scripts are easy to debug; Hive can be debugged, but it is a bit more complex.

Conclusion - Apache Pig vs Apache Hive:

Apache Pig and Apache Hive are both commonly used on Hadoop clusters. Both are powerful tools for data analysis, and both are widely used in production environments. A user needs to select a tool based on data types and the expected output. Both tools provide a unique way of analyzing Big Data on a Hadoop cluster, and based on the discussion above a user can choose between Apache Pig and Apache Hive for their requirements.

HADOOP

INTRODUCTION TO HADOOP - Hadoop is an open-source framework that allows distributed processing of large datasets on clusters of commodity hardware. Apache owns it, so from here onwards, wherever I write Hadoop, it automatically means Apache Hadoop.

In simple words, Hadoop is a data management tool and uses scale-out storage.

DEFINING A HADOOP CLUSTER-

It is a system in which Hadoop is installed on many nodes, and each node is connected to the others.

The size of the data is the most important factor when defining a Hadoop cluster.

There are two versions of Hadoop:

  • HADOOP 1
  • HADOOP 2

Let us discuss the components of both Hadoop versions.

HADOOP 1 Components

-HDFS (Hadoop Distributed File System) - It is used to store data.

-MapReduce - This is the framework for processing. In simple terms, it is the processor of Hadoop.

HADOOP 2 COMPONENTS

-HDFS (Hadoop Distributed File System) - It is used to store data.

-YARN/MRv2 - It is the second version of the processor of Hadoop.
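
To make the storage/processor split concrete, here is the classic word-count job, a minimal sketch of the MapReduce programming model: HDFS holds the input and output files, while the MapReduce framework (MRv1 or YARN/MRv2) runs the computation. Input and output paths are passed on the command line:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for each input line, emit (word, 1) pairs.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map phase runs in parallel on each block of the input file (which ties back to the "faster processing" advantage above), and the reduce phase aggregates the per-word counts.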


BIG DATA

In my last blog I wrote about my objective for the upcoming blogs.

In this blog I am going to write about the base of my objective, and that is BIG DATA.

In order to do my research, the first thing I need to understand is BIG DATA.

I started my research by watching some videos about BIG DATA on YouTube. I watched many videos, but I found a YouTube channel named ‘TECHNOLOGICAL GEEKS’ very helpful.

From the knowledge gained from this YouTube channel, I came to understand that BIG DATA is a term used to express the exponential growth of data around us. I also came to understand that it is very difficult to store, collect, maintain, analyse and visualize it.

The main focus of BIG DATA analysts is to extract meaningful information from it.

Now let us focus on some of the characteristics of BIG DATA.

According to my research, BIG DATA has the following characteristics:

1: VOLUME - The data volume of BIG DATA is at a very large scale.

2: VELOCITY - This tells us that big data is being generated at a very rapid rate. It grows more and more with each passing day. That's the reason it is so difficult to analyse big data.

3: VARIETY - Not only is big data very large in volume, but it also includes different types of data:

  • Structured data, e.g. MySQL tables. This type of data is in a proper, structured form.
  • Semi-structured data, e.g. XML, JSON. This type of data is a hybrid between structured and unstructured data.
  • Unstructured data, e.g. text, audio, video. Most of the data nowadays is unstructured.

Now let us talk about the sources of BIG DATA:

  • Social media - The information that we share on social media sites contributes to BIG DATA.
  • Banks - Most transactions nowadays are online. All that data also contributes to BIG DATA.
  • Instruments - RFID readers and security cameras, for example, also contribute to big data.
  • Websites - for example, Amazon. A lot of data is stored on these websites, which also contributes to BIG DATA.
  • Stock markets - They also generate a lot of data every day and make up a major portion of the BIG DATA around us.

Now I would like to discuss some of the use cases of BIG DATA.

  • Recommendation engines - The best examples of this are marketing websites like AMAZON. All of us have noticed that when we search for an item on Amazon, similar items are shown as recommendations whenever we log in to Amazon afterwards. This is made possible by proper big data analysis techniques. Another example is the YouTube recommendation engine: we always see videos recommended in accordance with the videos we searched for previously. BIG DATA analysis plays an important role in these kinds of recommendation engines.
  • Analyzing call detail records (CDR) - This type of analysis is done by telecom companies to find out the needs and expectations of their customers. It is very common nowadays.
  • Fraud detection - Big data helps in detecting many kinds of fraud, such as credit card fraud and online banking fraud.
  • Market basket analysis - Companies try to sell their products using this analysis. They try to understand what a customer is about to purchase and accordingly try to sell related items along with it. For example, if they learn that a customer is buying a mobile phone, they will try to sell mobile covers and other mobile accessories along with it.
  • Sentiment analysis - In this analysis, a topic is posted on social media, people's reviews of it are collected, and those views are analysed to produce a result.


PRJ702

This is my first blog regarding my graduate project PRJ702.

In my project I am going to do a comparison of various approaches to large-scale data analysis and also show a practical demonstration of HADOOP on a database.

All my following blogs will be about my research on the new trending tools used to analyse BIG DATA and how they will help in analysing large-scale data in a more efficient way, along with a practical demonstration of HADOOP.

The main focus of my research will be:

  • What is Hadoop? How is it different from traditional “data storage” architectures? (Note: I have purposely not used the term ‘database’.)
  • What do you gain by using Hadoop? What do you lose?
  • What use cases are suited to Hadoop? Conversely, what use cases are not appropriate for Hadoop (or would not benefit from it)?
  • Finally, I will develop a set of guidelines for interested parties to help them understand whether Hadoop is right for them.