Sunday, April 26, 2020

Big Data Tutorial 2 | big data Analytics | Hadoop tutorial for beginners

This is the second tutorial in the Big Data Tutorials for Beginners. It is also the Hadoop Tutorial for Beginners. This Big Data beginner tutorial explains data analytics, big data analytics, Big Data and Hadoop, big data applications, big data analytics tools, big data visualization tools and big data use cases. Please view the Big Data tutorial 2 or read on... What is Data Analytics? It means the analysis of data sets to find patterns and insights. Data Analytics uses multiple technologies and techniques. Data Analytics enables informed business decisions.

As shown above, Data Analytics can be divided into the following sub-categories:
  • Descriptive analytics: analysis of past data to describe the current state
  • Predictive analytics: data analysis to find patterns and forecast the future situation
  • Prescriptive analytics: data analysis to recommend actions to exploit an advantage or mitigate a future issue
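
The three sub-categories above can be sketched in a few lines of Python. This is only a minimal illustration, not a method from the tutorial: the sales figures, the stock level and the function names (describe, predict_next, prescribe) are all made up for the example, and the forecast is a simple straight-line fit.

```python
from statistics import mean

# Hypothetical monthly sales figures (made-up data for illustration).
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(sales):
    """Descriptive analytics: summarize past data."""
    return {"average": mean(sales), "latest": sales[-1]}


def predict_next(sales):
    """Predictive analytics: fit a least-squares line and extrapolate one step."""
    n = len(sales)
    xs = range(n)
    x_mean = mean(xs)
    y_mean = mean(sales)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


def prescribe(forecast, stock_on_hand):
    """Prescriptive analytics: recommend an action based on the forecast."""
    return "reorder" if forecast > stock_on_hand else "hold"


if __name__ == "__main__":
    forecast = predict_next(monthly_sales)
    print(describe(monthly_sales))
    print("Forecast for next month:", forecast)
    print("Recommended action:", prescribe(forecast, stock_on_hand=120))
```

Each function answers a different question: what happened, what is likely to happen next, and what to do about it.
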
Next, what is Big Data Analytics? It is the process of examining and analyzing big data to find patterns, correlations and trends. Big Data Analytics allows data analysts to make informed decisions faster. Big Data Analytics includes techniques like natural language processing, statistics, machine learning, predictive analytics and data mining to draw inferences. Big Data Analytics tools include Hadoop and related tools like HBase, Hive and Pig.

Big Data and Hadoop: Hadoop is an open-source big data framework developed as a project of the Apache Software Foundation. Apache Hadoop uses distributed storage (many computers) to handle big data. Hadoop uses the MapReduce data analysis technique. Hadoop has two core components: 1) HDFS (Hadoop Distributed File System), which manages the big data storage, and 2) MapReduce, which manages the data processing. Hadoop divides data into many blocks and distributes these blocks across the computers in a cluster. Then, Hadoop sends code to the nodes for data processing using the Map and Reduce technique. All the tools used by an organization in its big data architecture form the big data stack. Some tools in the Hadoop ecosystem include:
  • Apache Hadoop YARN is the resource manager, job scheduler and job monitor in Hadoop.
  • Apache HBase is the distributed database that runs on top of HDFS. HBase is a non-relational, column-oriented database that stores data as key-value pairs.
  • Apache Hive is a tool for data querying and analysis. Hive allows SQL-like queries to fetch data from HDFS and the databases managed by Hadoop.
  • Apache Mahout is a tool for machine learning and data mining tasks.
  • Apache Pig is a platform to write code to run on Hadoop. Pig uses the Pig Latin language, which makes it easier to write programs using the MapReduce technique.
  • Apache Ambari is a tool to provision, manage and monitor Hadoop clusters.
  • Apache Spark is a compute engine for massive data. Spark offers a programming model for ETL, streaming, machine learning and graph processing. We can write Apache Spark programs in Java, Python, Scala, R or SQL.
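
The Map and Reduce technique described above (divide the data into blocks, process each block, then combine the results) can be sketched as a word count in plain Python. This is a single-process illustration that does not use any Hadoop API. The function names and the block size are assumptions made for the example. In a real cluster the map step would run on the nodes that hold each block.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Divide the input into blocks, as HDFS does with large files."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    pairs = []
    for line in block:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_blocks):
    """Group the values emitted by all map steps by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    mapped = [map_block(block) for block in blocks]  # one map task per block
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data", "big cluster", "data data"]
    print(word_count(sample))
```

Because each block is mapped independently, the map tasks could run in parallel on different computers. Only the grouping and reduce steps need to see results from more than one block.
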
There are many big data applications or big data tools that help organizations create their custom applications. Some examples of big data tools are the Teradata database (to import data to Hadoop, query data and export data from Hadoop) and big data analytics software like Statistica (for predictive analytics), IBM's Watson Analytics and MongoDB (for querying unstructured data). Big data analytics tools, which analyze big data sets to find patterns and extract insights, include Tableau Public, KNIME, Plotly and Elasticsearch. Some of the popular big data visualization tools are Tableau, Google Charts and D3.js. Other tools for big data visualization include Datawrapper, FusionCharts and Plotly.

Some of the popular big data use cases applicable to many industry domains are shown above. These are:
  • 360-degree view creation of an entity (e.g. the customer, the patient or the student)
  • Customer classification (into several categories) for relevant communication
  • Price optimization based on demand, competition and customer profiles (especially useful in eCommerce, airline and hotel industries)
  • New product/service development (based on features that contribute to success)
  • Distribution optimization (based on forecasted demand, expected traffic conditions and so on)
  • Fraud prevention (to flag potentially fraudulent transactions in real-time)
These are just a few big data analytics examples. Big data analytics enables risk assessment in the insurance industry, product recommendation in eCommerce and customer care in every industry.

Want to learn more about Big Data tools, or see Big Data questions and answers? Please view my Big Data tutorial 2. Thank you.

Sunday, April 19, 2020

Big Data Tutorial 1 | What is big data | Characteristics of big data

This is the first tutorial in the Big Data Tutorials for Beginners. This Big Data beginner tutorial explains what Big Data is, Big Data technology, an introduction to Big Data, types of big data and characteristics of Big Data. Please view the Big Data tutorial 1 or read on... First, what is big data? Big data means the massive data sets that we can analyze to find patterns. Such data sets are too big, complex, varied or dynamic to be processed by traditional DBMSs, software applications and procedures. Traditional data, on the other hand, is stored in relational databases and files in various formats. Big data comes from multiple sources including emails, messages, devices (e.g. mobile devices, cameras, RFID readers and wireless sensors), software logs and databases. That is the difference between big data and traditional data. Synonyms for big data include large data, massive data, lots of data and large data volumes.

Next, what is big data technology? Big data technologies include Apache Hadoop, Apache Spark, NoSQL databases and Predictive Analytics. Big data technologies enable organizations to analyze operations and make informed decisions. The advantages of big data for organizations are that they can acquire new customers, retain existing customers, increase revenues and decrease costs. The advantages of big data for consumers are personalized service and more features. The advantages of big data for the public are health benefits and social benefits.

Next, let us see an introduction to big data. Each one of us is creating and using big data daily. For example, if you search for a term on a search engine like Google, the search engine uses big data to provide search results. If you search for a product on an ecommerce website like Amazon, you see recommended products. The recommendation engines use big data.

Next, let us look at the types of big data. Structured data is organized, labelled and exists in a fixed arrangement e.g. tabular data in spreadsheets and relational databases. Semi-structured data may be organized but does not follow a fixed arrangement e.g. XML and JSON files. Unstructured data forms the majority of big data. Unstructured data is not organized and does not follow any fixed arrangement e.g. text files, images, audio files, video files, emails and social media data (tweets and blog posts).
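
To make the three types concrete, here is a small Python sketch that reads one sample of each using only the standard library. The sample values and function names are invented for the illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured_text = "CustomerId,Name\n1,Abe\n2,Ben\n"

# Semi-structured: organized with labels, but fields can nest or vary per record.
semi_structured_text = '{"name": "Abe", "phones": ["555-0100"], "address": {"city": "Springfield"}}'

# Unstructured: free text with no labels or fixed arrangement.
unstructured_text = "Abe called on Monday and asked about the delayed order."


def read_structured(text):
    """Parse tabular data into a list of rows keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Parse JSON into nested dictionaries and lists."""
    return json.loads(text)


def read_unstructured(text):
    """Free text has no schema; the simplest first step is to split it into words."""
    return text.split()


if __name__ == "__main__":
    print(read_structured(structured_text))
    print(read_semi_structured(semi_structured_text)["address"]["city"])
    print(len(read_unstructured(unstructured_text)), "words")
```

The structured and semi-structured samples can be queried by field name straight away. The unstructured sample needs further processing (for example, text analysis) before any pattern can be found in it.
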

Next, let us learn the characteristics of big data, also called the 5 Vs of Big Data. The first V is Volume, which is the size of data or the increase in quantity of data. Variety refers to the different formats of the data. Big data can be in various formats like relational data, text (documents, emails, messages in social media), images, audio and video, and machine-generated data (from mobile devices, wearables, sensors, RFID tags and server logs).

Velocity is the speed of data generation or processing. Typically, the velocity of structured data is less than the velocity of unstructured data. Big data can be processed via batch processing or (typically, costlier) real-time processing. The extreme velocity of big data often calls for scalable technology, such as cloud-based platforms, to process data quickly. Veracity refers to the quality of data i.e. the accuracy, consistency and completeness of data. Big data can include data of different quality. Value refers to converting big data into increased revenue, reduced costs or other benefits like customer satisfaction, employee satisfaction, social benefits or health benefits. An example of Value in ecommerce is recommending products that the customer is likely to buy. In logistics, it is route optimization leading to shorter delivery times and reduced transportation costs. In utilities, it is reduced customer churn. Value is the most important Big Data V.
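
Veracity, described above as accuracy, consistency and completeness, can be checked with simple rules. The Python sketch below is one possible set of checks, not a standard. The sample records, the field names and the plausible age range of 0 to 120 are assumptions made for the example.

```python
# Hypothetical customer records with deliberate quality problems.
records = [
    {"customer_id": 1, "email": "abe@example.com", "age": 34},
    {"customer_id": 2, "email": None, "age": 41},                 # missing email
    {"customer_id": 3, "email": "chris@example.com", "age": -5},  # impossible age
    {"customer_id": 3, "email": "chris@example.com", "age": 29},  # repeated id
]


def completeness(rows, field):
    """Completeness: share of rows where the field has a value."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def invalid_ages(rows, low=0, high=120):
    """Accuracy: ids of rows whose age falls outside a plausible range."""
    return [row["customer_id"] for row in rows if not low <= row["age"] <= high]


def duplicate_ids(rows):
    """Consistency: ids that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        customer_id = row["customer_id"]
        if customer_id in seen:
            duplicates.add(customer_id)
        seen.add(customer_id)
    return sorted(duplicates)


if __name__ == "__main__":
    print("Email completeness:", completeness(records, "email"))
    print("Rows with invalid age:", invalid_ages(records))
    print("Duplicate ids:", duplicate_ids(records))
```
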

There are even more big data characteristics. Valence refers to the connections between the data. Data with more interconnections is more complex than data with fewer interconnections. Variability refers to the consistency of data generation, that is, whether the data is available regularly or intermittently. Virality refers to the speed at which the data spreads.
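
The tutorial does not give a formula for valence. One possible way to put a number on "more interconnections" is to compare the connections that exist with the connections that could exist. That ratio is an assumption made for this sketch, as are the function name and the sample data.

```python
def connection_density(item_count, connections):
    """Ratio of distinct connections to all possible pairs of items (0.0 to 1.0)."""
    # Treat (a, b) and (b, a) as the same connection and ignore self-links.
    unique = {frozenset(pair) for pair in connections if pair[0] != pair[1]}
    possible = item_count * (item_count - 1) // 2
    return len(unique) / possible if possible else 0.0


if __name__ == "__main__":
    # Four hypothetical data items with two distinct connections between them.
    links = [("a", "b"), ("b", "a"), ("b", "c")]
    print(connection_density(4, links))
```

A value near 1.0 means almost every item is connected to every other item, which matches the tutorial's point that highly interconnected data is more complex to analyze.
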

Want to learn more details with examples, or see Big Data interview questions and answers? Please view my Big Data tutorial 1. Thank you.

Sunday, April 12, 2020

VBScript tutorial 8 | Working with SQL

This is the last tutorial in the VBScript Tutorials for Beginners. This VBScript beginner tutorial explains how to work with databases using VBScript. The COM objects that VBScript can use for this are the ADO objects such as the Connection object and the Recordset object. Please view the VBScript tutorial 8 or read on... What is Connection? A Connection object connects the VBScript script to the database. A Connection object may be open or closed. If the Connection is open, we may run SQL commands on the database, for example to get data from it. What is Recordset? A Recordset object is a result set from the database. We may display the Recordset data using our VBScript. Now, let us see a VBScript example with SQL commands.

' VBScript code
' This script is just an example. Write your own script, test it on a non-production database and use it at your own risk.
Option Explicit
' Define the ADO objects for connection and recordset.
Dim objConnection, objRecordSet

Sub ShowCustomers
    With objRecordSet
        ' Display the records, one per line
        Do While Not .EOF
            WScript.Echo .Fields("Name") & " has their office address at " & .Fields("BillAddress")
            ' Move to the next record; without this the loop would never end
            .MoveNext
        Loop
    End With
End Sub
' More VBScript code follows
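Sub AddTempCustomer
    ' Add a dummy record
    ' AddNew creates a new, empty record in the recordset
    objRecordSet.AddNew
    objRecordSet("CustomerId") = 4
    objRecordSet("Name") = "Cust4"
    objRecordSet("BillAddress") = "xxxxxxxxxxxxxxxxxxxxxxx"
    objRecordSet("ShipAddress") = "yyyyyyyyyyyyyyyyyyy"
    ' Update saves the new record to the database
    objRecordSet.Update
End Sub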

Sub UpdateCustomer
    ' Update existing customer
    objConnection.Execute "UPDATE Customers SET Name = 'Customer1' WHERE NAME='Cust1'"
End Sub

Sub DeleteTempCustomer
    ' Delete the dummy record (added using the procedure AddTempCustomer above).
    ' Write the Delete SQL carefully! It must have the Where clause to select only the dummy record.
    objConnection.Execute "DELETE FROM Customers WHERE CustomerId=4"
End Sub

Sub OpenADOObjects
    Set objConnection = CreateObject("ADODB.Connection")
    objConnection.Open ConnectionString()
    Set objRecordSet = CreateObject("ADODB.RecordSet")
    ' Assuming a Customers table exists with CustomerId, Name, BillAddress and ShipAddress columns.
    ' Open the recordset with the SQL query.
    objRecordSet.Open "SELECT * FROM Customers", objConnection, 1, 3 ' the last two arguments are adOpenKeySet and adLockOptimistic
End Sub

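Sub CloseADOObjects
    ' Close the recordset before the connection, then release both objects.
    objRecordSet.Close
    objConnection.Close
    Set objRecordSet = Nothing
    Set objConnection = Nothing
End Sub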

Function ConnectionString()
    ' A connection string has data about server name, database name and login information.
    ' I wrote the connection string using the format from
End Function

Want to learn with a diagram and working VBScript code? Please view my VBScript tutorial 8. Thank you.

Sunday, April 5, 2020

VBScript tutorial 7 | Arrays

This is the next tutorial in the VBScript Tutorials for Beginners. This VBScript beginner tutorial explains the VBScript array variable. Please view the VBScript tutorial 7 or read on... What is an array in VBScript? An array variable in VBScript can store multiple values in it. Such data values may be customer names, phone numbers or email addresses. A VBScript array has indexes to refer to the different values in it. The indexes start at 0. In the VBScript example below, the Sub FixedArray shows an array example, strCustomers(3). The number in brackets is the highest index, so strCustomers(3) holds four values, with indexes 0 to 3. We can run this VBScript in the Command Prompt using the CScript command e.g. CScript Array1.vbs

' VBScript code
Option Explicit
Call FixedArray
Call DynamicArray

Sub FixedArray
    ' Declare a fixed array. The number in brackets is the upper bound, so this array has 4 elements (indexes 0 to 3).
    Dim strCustomers(3)
    strCustomers(0) = "Abe"
    strCustomers(1) = "Ben"
    strCustomers(2) = "Chris"
    strCustomers(3) = "Dustin"
    ' Display the first data value in the command window, instead of a message box.
    WScript.Echo "strCustomers(0) is " &  strCustomers(0)
End Sub

Sub DynamicArray
    ' Declare a dynamic array i.e. an array whose number of elements is unknown at present.
    Dim strCustomersNew()
    ' VBScript Redim statement defines the number of elements in the array.
    Redim strCustomersNew(3)
    strCustomersNew(0) = "Abe"
    strCustomersNew(1) = "Ben"
    strCustomersNew(2) = "Chris"
    strCustomersNew(3) = "Dustin"
    ' Preserve in the Redim statement retains the existing array elements.
    Redim Preserve strCustomersNew(5)
    strCustomersNew(4) = "Eddie"
    strCustomersNew(5) = "Fred"
    Dim i
    ' VBScript UBound function gives the upper bound of the array.
    For i = 0 To UBound(strCustomersNew)
        WScript.Echo "The element " & i & " is " & strCustomersNew(i)
    Next
End Sub

Next, let us see the VBScript code with an array of numbers.