PIG Introduction
- Pig is an Apache open source project.
- Pig runs on top of MapReduce and HDFS, and the scripting language used to read, write, and analyze data is called Pig Latin.
- Pig can read data from the local file system or HDFS, and can write output to HDFS or the local file system.
- Pig Latin is a data flow language that can "eat anything": Pig can work on structured, semi-structured, and unstructured data.
- Pig was developed by researchers at Yahoo.
- Pig can ingest data from files, streams, or other sources using User Defined Functions (UDFs).
- Once it has the data, it can perform selection, iteration, and other transformations on it.
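The load, transform, and write steps described above can be sketched as a short Pig Latin script. This is only an illustrative sketch: the input file NYSE_dividends and its four fields follow the schema example used later in these notes, while the 0.5 threshold and the output path high_dividends are assumptions made up for the example.

```pig
-- Load the input (hypothetical file; default loader expects tab-delimited text)
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);

-- Selection: keep only the rows whose dividend exceeds an assumed threshold
high = filter dividends by dividend > 0.5;

-- Iteration: for each remaining row, keep only two of the fields
pairs = foreach high generate symbol, dividend;

-- Write the result (output directory name is an assumption)
store pairs into 'high_dividends';
```

Each statement names a new relation built from the previous one, which is what makes Pig Latin a data flow language rather than a query language.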
How to run PIG
Grunt is Pig's interactive shell, where users enter Pig Latin statements. It can be started in either of the execution modes below.
local – Type pig -x local to enter the shell in local mode. Local mode executes in a single JVM and is used for development, experimenting, and prototyping. It works on the local file system.
mapreduce – MapReduce mode is also known as Hadoop mode. In this mode Pig translates Pig Latin into MapReduce jobs and executes them on the cluster. It can run against a pseudo-distributed or fully distributed Hadoop installation.
Type pig -x mapreduce, or just pig, to enter the shell in this mode.
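Once the shell is running, a Grunt session might look roughly like the sketch below. The file name NYSE_dividends is an assumption borrowed from the schema example in these notes; the statements are the same in both modes, and only the file system they run against differs.

```pig
-- List files in the current directory of the active file system
fs -ls

-- Define a relation; nothing is executed yet
dividends = load 'NYSE_dividends';

-- dump triggers execution and prints the relation to the console
dump dividends;

-- Leave the shell
quit
```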
Data Types of PIG
1. Scalar data types: int, long, float, double, chararray, bytearray
2. Complex data types: maps, tuples, and bags
Map: A map in Pig is a chararray to data element mapping, where that element can be any
Pig type, including a complex type.
For example, ['name'#'bob', 'age'#55] describes a map constant with two keys.
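As a minimal sketch of how a map is used, the snippet below looks up one key with the # operator. The file people and the field name info are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input: each record holds one map field, e.g. [name#bob,age#55]
people = load 'people' as (info:map[]);

-- Look up the value stored under the key 'name' in each map
names = foreach people generate info#'name' as name;

dump names;
```

If a key is missing from a given map, the lookup yields null for that record rather than failing.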
Tuple: A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided
into fields, with each field containing one data element.
For example, ('bob', 55) describes a tuple constant with two fields.
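Because a tuple is ordered, its fields can be referenced either by name or by position. The following sketch assumes a hypothetical file people_tuples whose single field is a tuple; the file and field names are not from the original notes.

```pig
-- Hypothetical input: each record holds one tuple field, e.g. (bob,55)
people = load 'people_tuples' as (person:tuple(name:chararray, age:int));

-- person.name references a field by name; person.$1 references the second field by position
result = foreach people generate person.name, person.$1;

dump result;
```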
Bag: A bag is an unordered collection of tuples. Because it has no order, it is not possible to
reference tuples in a bag by position.
For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with
three tuples, each with two fields.
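Bags most commonly appear as the result of grouping. The sketch below, which assumes the NYSE_dividends file and schema used in the schema examples of these notes, groups rows by symbol and then aggregates over the resulting bag.

```pig
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);

-- Each output record has two fields: the group key and a bag named
-- dividends holding every input tuple that shares that key
by_symbol = group dividends by symbol;

-- Aggregate functions take a bag as input
totals = foreach by_symbol generate group as symbol, COUNT(dividends) as n, SUM(dividends.dividend) as total;

dump totals;
```

Aggregates such as COUNT and SUM do not care about the order of tuples, which fits the unordered nature of a bag.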
How to declare PIG SCHEMAS
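There are two ways: (1) give each field a name and a type, or (2) give only the field names, in which case every field defaults to bytearray.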
1. dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
2. dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
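To see the difference between the two declarations, the describe command can be used to print the schema Pig has recorded for each relation. The relation names typed, untyped, and converted below are assumptions introduced for this sketch.

```pig
typed = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
untyped = load 'NYSE_dividends' as (exchange, symbol, date, dividend);

-- Reports the declared types for each field
describe typed;

-- Reports every field as bytearray, since no types were given
describe untyped;

-- An untyped field can still be cast explicitly where a type is needed
converted = foreach untyped generate symbol, (float)dividend as dividend;
describe converted;
```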