This is the second tutorial in the Big Data Tutorials for Beginners. It is also the Hadoop Tutorial for Beginners. This Big Data beginner tutorial explains data analytics, big data analytics, Big Data and Hadoop, big data applications, big data analytics tools, big data visualization tools, and big data use cases. Please view the Big Data tutorial 2 or read on...

What is Data Analytics? Data analytics means the analysis of data sets to find patterns and insights. It uses multiple technologies and techniques, and it enables informed business decisions.
Data Analytics can be divided into the following sub-categories:
- Descriptive analytics: analysis of past data to describe the current state
- Predictive analytics: data analysis to find patterns and forecast future outcomes
- Prescriptive analytics: data analysis to recommend actions to exploit an advantage or mitigate a future issue
Big Data and Hadoop: Hadoop is an open-source big data framework created by the Apache Software Foundation. Apache Hadoop uses distributed storage, spread across many computers, to handle big data, and it analyzes that data with the MapReduce technique. Hadoop has two core components: 1) HDFS (Hadoop Distributed File System), which manages big data storage, and 2) MapReduce, which manages data processing. Hadoop divides data into blocks and distributes these blocks across the computers in a cluster. Then, Hadoop sends code to the nodes, which process their local blocks using the Map and Reduce phases (a minimal sketch of this technique follows the tool list below). All the tools an organization uses in its big data architecture form its big data stack. Some tools in the Hadoop ecosystem include:
- Apache Hadoop YARN is the resource manager, job scheduler and job monitor in Hadoop.
- Apache HBase is a distributed, non-relational database that runs on top of HDFS and stores big data as key-value pairs (see the HBase sketch after this list).
- Apache Hive is a tool for data querying and analysis. Hive allows SQL-like queries to fetch data from HDFS and from the databases managed by Hadoop (see the Hive sketch after this list).
- Apache Mahout is a tool for machine learning and data mining tasks.
- Apache Pig is a platform for writing programs that run on Hadoop. Pig uses Pig Latin, a scripting language that makes it easier to write programs using the MapReduce technique.
- Apache Ambari is a tool to provision, manage and monitor Hadoop clusters.
- Apache Spark is a compute engine for massive data sets. Spark offers a programming model for ETL, streaming, machine learning, and graph processing. Spark programs can be written in Java, Python, Scala, R, or SQL (see the Spark sketch after this list).
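To make the Map and Reduce technique concrete, here is a minimal single-machine sketch in Python of the classic word-count job. The sample documents are illustrative assumptions, and the sketch only simulates on one machine the map, shuffle, and reduce phases that Hadoop distributes across the nodes of a cluster.

```python
from collections import defaultdict

# Illustrative input; in Hadoop these would be blocks of a file in HDFS.
documents = ["big data needs big tools", "hadoop stores big data"]

# Map phase: emit a (key, value) pair for every word in every document.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by key
# (Hadoop does this between the map and reduce phases).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```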
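The next sketch shows HBase's key-value data model, assuming the happybase Python client and an HBase Thrift server reachable on localhost; the customers table and info column family are hypothetical and must already exist.

```python
import happybase

# Assumes an HBase Thrift server on localhost and an existing
# 'customers' table with an 'info' column family (both hypothetical).
connection = happybase.Connection('localhost')
table = connection.table('customers')

# HBase stores data as key-value pairs: each cell is addressed by a
# row key plus a 'columnfamily:qualifier' column name.
table.put(b'row-001', {b'info:name': b'Alice', b'info:city': b'Pune'})

# Fetch the whole row back by its key.
row = table.row(b'row-001')
print(row)  # {b'info:name': b'Alice', b'info:city': b'Pune'}

connection.close()
```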
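Below is a sketch of an SQL-like Hive query, assuming the PyHive client library and a HiveServer2 instance on localhost; the sales table and its columns are hypothetical.

```python
from pyhive import hive

# Assumes HiveServer2 on localhost:10000 and a hypothetical 'sales' table.
connection = hive.connect(host='localhost', port=10000)
cursor = connection.cursor()

# Hive turns this SQL-like (HiveQL) query into jobs that read data
# stored in HDFS, so analysts can query big data without writing
# MapReduce code by hand.
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")

for region, total_sales in cursor.fetchall():
    print(region, total_sales)

connection.close()
```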
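Finally, a short PySpark sketch of an ETL job, assuming PySpark is installed; the input file and column names are hypothetical. Spark expresses the same map/reduce style of processing through a higher-level DataFrame API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesETL").getOrCreate()

# Extract: read a CSV file (hypothetical path; could equally be HDFS data).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transform: filter and aggregate, distributed across the cluster.
totals = (df.filter(F.col("amount") > 0)
            .groupBy("region")
            .agg(F.sum("amount").alias("total_sales")))

# Load: write the result out; Spark executes the whole plan lazily here.
totals.write.mode("overwrite").csv("sales_totals")

spark.stop()
```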
Some popular big data use cases that apply across many industry domains are:
- 360-degree view creation of an entity (e.g. the customer, the patient, or the student)
- Customer classification (into several categories) for relevant communication
- Price optimization based on demand, competition and customer profiles (especially useful in eCommerce, airline and hotel industries)
- New product/service development (based on features that contribute to success)
- Distribution optimization (based on forecasted demand, expected traffic conditions and so on)
- Fraud prevention (to flag potentially fraudulent transactions in real-time)
Want to learn more about Big Data tools and see Big Data questions and answers? Please view my Big Data tutorial 2. Thank you.