Friday, August 4, 2017

Pivot columns to rows in Apache Pig

Pivot columns to rows in Pig (Apache Pig)
==============================

Transposing columns to rows in Pig.

Data input:
id1,col1,col2,col3
id2,col1,col2,col3

Expected output:
id1 col1 col3
id1 col2 col3
id2 col1 col3
id2 col2 col3


-- load the comma-separated input; '-noschema' tells PigStorage to ignore any stored .pig_schema file
A = LOAD '/home/hadoop/work/surjan/pivotData.txt' USING PigStorage(',','-noschema') AS (ID: chararray, col2: chararray, col3: chararray, col4: chararray);
-- glue the two columns to be pivoted into a single '#'-delimited field
B = FOREACH A GENERATE ID, CONCAT(CONCAT(col2,'#'),col3), col4;
-- split that field back apart; FLATTEN on the resulting bag emits one row per original column
C = FOREACH B GENERATE ID, FLATTEN(TOKENIZE($1,'#')), col4;
DUMP C;
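
The same pivot can also be written with Pig's builtin TOBAG function, which avoids the CONCAT/TOKENIZE round-trip. This is a minimal sketch assuming the same input file and schema as above:

-- TOBAG packs the pivoted columns into a bag; FLATTEN then emits one row per element
A = LOAD '/home/hadoop/work/surjan/pivotData.txt' USING PigStorage(',') AS (ID: chararray, col2: chararray, col3: chararray, col4: chararray);
B = FOREACH A GENERATE ID, FLATTEN(TOBAG(col2, col3)) AS val:chararray, col4;
DUMP B;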

Thursday, February 23, 2017

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1490267964939 is not in union ["null","long"]

Problem in Pig when storing with AvroStorage():

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1490267964939 is not in union ["null","long"]

Solution:
1. Check the datatypes carefully against the final schema being stored (see the cast sketch below).
2. The most probable cause is a null value being cast to int or long. With Avro, however, the error is often reported against the next record (which may itself be perfectly valid).

In our case the complaint was not about the actual null value; the error was raised for a valid value. This misleading behaviour is what makes Avro problems hard to diagnose. For example:
A = LOAD 'surjan/data1' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
B = FOREACH A GENERATE date, empId;
C = DISTINCT B;
STORE C INTO 'surjan/data2';

If the dataset 'surjan/data1' is not present, Avro will complain that 'date' or 'empId' was not found instead of saying that the data does not exist or matches 0 files.
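
A minimal sketch of the kind of explicit cast that usually resolves the union mismatch, assuming the Avro schema declares date as ["null","long"] (relation and field names follow the example above):

A = LOAD 'surjan/data1' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
-- cast explicitly so the datum handed to AvroStorage is a long, matching ["null","long"]
B = FOREACH A GENERATE (long) date AS date, empId;
STORE B INTO 'surjan/data2' USING org.apache.pig.piggybank.storage.avro.AvroStorage();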


3. Use AvroStorage with the 'index' option when supplying a schema. The index option should be used when more than one dataset is stored with an Avro schema in the same script (see the two-store sketch after the example below).

STORE finalData INTO 'surjan/location' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '0', 'schema', '{"namespace":"com.surjan.schema.myapp.avro","type":"record","name":"MyDailyJob","doc":"Avro storing with schema using Pig.","fields": ...rest of schema
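
For instance, a sketch with two hypothetical relations stored in the same script; each STORE gets its own index so AvroStorage can keep the two schemas apart:

STORE dailyData INTO 'surjan/daily' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '0', 'schema', '{"type":"record","name":"Daily","fields":[{"name":"empId","type":["null","long"]}]}');
STORE weeklyData INTO 'surjan/weekly' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1', 'schema', '{"type":"record","name":"Weekly","fields":[{"name":"empId","type":["null","long"]}]}');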


org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Unsupported type in record: class java.lang.Long
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
    at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
    at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:722)

Solution: This is an issue with storing a single field in Avro. Store one more dummy field and the error goes away, as sketched below.
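
A minimal sketch of the workaround, assuming a relation C with a long empId field (the output path and the dummy value are illustrative):

-- storing only empId triggers the 'Unsupported type in record' error;
-- adding a second (dummy) field works around it
withDummy = FOREACH C GENERATE empId, 'x' AS dummy;
STORE withDummy INTO 'surjan/empIds' USING org.apache.pig.piggybank.storage.avro.AvroStorage();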

Also see: https://issues.apache.org/jira/browse/PIG-3358

Tuesday, January 3, 2017

Oozie Error: E0701 : E0701: XML schema error

Error: E0701 : E0701: XML schema error

Solution: This is caused when there is a comment section toward the beginning of workflow.xml or coordinator.xml. To avoid such errors, it is best to validate workflow.xml and coordinator.xml with the commands below:

oozie validate workflow.xml
oozie validate coordinator.xml

And to check that any XML file is well-formed, use:

xmllint workflow.xml