Friday, August 4, 2017

Pivot columns to rows in Apache Pig

Transposing columns to rows in Pig.

Input data:
id1,col1,col2,col3
id2,col1,col2,col3

Expected output:
id1 col1 col3
id1 col2 col3
id2 col1 col3
id2 col2 col3


-- Load the input; each record is (id, col1, col2, col3)
A = LOAD '/home/hadoop/work/surjan/pivotData.txt' using PigStorage(',','-noschema') as (id:chararray, col1:chararray, col2:chararray, col3:chararray);
-- glue the two columns to be pivoted into one field, separated by '#'
B = foreach A generate id, CONCAT(CONCAT(col1,'#'),col2), col3;
-- TOKENIZE splits on '#' into a bag of tokens; FLATTEN turns each token into its own row
C = foreach B generate id, FLATTEN(TOKENIZE($1,'#')), col3;
dump C;
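An alternative that avoids the CONCAT/TOKENIZE round-trip is the TOBAG built-in (available since Pig 0.8), which puts the two pivoted columns into a bag that FLATTEN then expands into one output row per element. A minimal sketch, assuming the same LOAD statement and field names as above:

-- TOBAG(col1, col2) builds the bag {(col1),(col2)}; FLATTEN emits one row per element
C2 = foreach A generate id, FLATTEN(TOBAG(col1, col2)), col3;
dump C2;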

Thursday, February 23, 2017

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1490267964939 is not in union ["null","long"]

Problem in Pig when using STORE with AvroStorage():

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1490267964939 is not in union ["null","long"]

Solution:
1. Check the datatypes carefully in the final schema being stored.
2. The most likely cause is a null value being type-cast to int or long. With Avro, however, the error is usually reported against the next record (which may itself be perfectly valid).

In our case it was not complaining about the actual null value; the error was raised for a valid value. This misleading reporting is what makes the problem hard to diagnose with Avro.
For example:
A = LOAD 'surjan/data1' using org.apache.pig.piggybank.storage.avro.AvroStorage();
B = foreach A generate date, empId;
C = DISTINCT B;
store C into 'surjan/data2' ;

If the dataset 'surjan/data1' is not present, Avro will complain that no field named date or empId was found, instead of reporting that the input does not exist or matches 0 files.
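One way to rule out the null/typecast problem from point 2 is to make the cast explicit and give nulls a default value before the STORE. A minimal sketch; the field names are illustrative (not from the original script) and it assumes the Avro schema declares empId as the union ["null","long"]:

A = LOAD 'surjan/data1' using org.apache.pig.piggybank.storage.avro.AvroStorage();
-- cast the field to the type declared in the Avro schema and replace nulls explicitly
B = foreach A generate date, (empId is null ? 0L : (long) empId) as empId;
store B into 'surjan/data2' using org.apache.pig.piggybank.storage.avro.AvroStorage();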


3. Use AvroStorage with the 'index' option when storing with a schema. The index option is needed when more than one dataset is stored with an Avro schema in the same script, for example:

Store finalData  into 'surjan/location' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '0','schema','{"namespace":"com.surjan.schema.myapp.avro","type":"record","name":"Mydaily jon","doc":"Avro storing with schema using Pig.","fields" ...rest of schema
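When a single script stores more than one relation with an Avro schema, each STORE gets its own index value. A minimal sketch with hypothetical relation names and the schema strings elided:

STORE finalData1 INTO 'surjan/location1' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '0', 'schema', '{ ...schema for finalData1... }');
STORE finalData2 INTO 'surjan/location2' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1', 'schema', '{ ...schema for finalData2... }');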


org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Unsupported type in record: class java.lang.Long
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
    at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
    at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:722)

Solution: This is an issue with storing a single field in Avro. Store one more dummy field and the error goes away.
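A minimal sketch of the workaround, with illustrative names (not from the original script), assuming the single field being stored is a long:

-- storing a single-field relation fails with "Unsupported type in record: class java.lang.Long"
B = foreach A generate empId;
-- workaround: generate one more dummy field alongside it and store that relation instead
C = foreach A generate empId, 'ignore' as dummy:chararray;
store C into 'surjan/data2' using org.apache.pig.piggybank.storage.avro.AvroStorage();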

Also see: https://issues.apache.org/jira/browse/PIG-3358

Tuesday, January 3, 2017

Oozie Error: E0701 : E0701: XML schema error

Solution: This is caused when there is a comment section towards the beginning of workflow.xml or coordinator.xml. To avoid such errors, it is best to validate both files with the commands below:

oozie validate workflow.xml
oozie validate coordinator.xml

And for validating any XML file, use:

xmllint workflow.xml

Friday, November 4, 2016

Pig UDF Error input.get(0) : The type org.apache.hadoop.io.WritableComparable cannot be resolved. It is indirectly referenced from required .class files

Error: The type org.apache.hadoop.io.WritableComparable cannot be resolved. It is indirectly referenced from required .class files. This compilation error appears in a Pig UDF when trying to get the value of a field from the input Tuple.


public String exec(Tuple input) throws IOException {
    if (null == input || input.size() == 0) {
        return null;
    }
    try {
        // this line gives the compilation error in Eclipse
        epochTime = Long.parseLong((String) input.get(0));
    } catch (Exception ex) {
        throw new IOException("Caught exception processing input " + input, ex);
    }

Solution: For a Maven project, the dependencies below need to be present to resolve the above compilation error.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.0.0-cdh4.7.1</version>
    <!-- <scope>provided</scope> -->
</dependency>

<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.11.0-cdh4.7.1</version>
 <!-- <scope>provided</scope> -->
</dependency> 

Sometimes the same error message is seen for other missing jars, such as commons-logging or other Hadoop-related jars. Find the jar that contains the missing class and add the corresponding Maven dependency for it.

Tuesday, June 21, 2016

Scala Beginners Issues


1. Compiling / running Scala from the command line, just like javac / java:

scalac com/PrintMesgObj.scala 
scala -classpath . com.PrintMesgObj

2. Common Errors:

package com

class PrintMsg {

  def main(args: Array[String]) = {
    println("Hello World !!")
  }
}

If you compile and run the above code expecting Hello World to be printed, it won't be. See the output:
surjanrawat$ scalac com/PrintMsg.scala 
surjanrawat$ scala -classpath . com.PrintMsg
java.lang.NoSuchMethodException: com.PrintMsg.main is not static
at scala.reflect.internal.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:68)

Reason: PrintMsg is a class, so its main method is not static. The JVM entry point must be a static main, which in Scala comes from an object, so we need to define an object that uses this class.

1. Define an object for PrintMsg:
package com

object PrintMesgObj {
   val obj = new PrintMsg
   def main(args:Array[String])= {
     obj.main(args)
   }
}

surjanrawat$ scala -classpath . com.PrintMesgObj surjan
Hello World !!

Common Errors
1. java.lang.NoSuchMethodException: com.PrintMsg.main is not static
at scala.reflect.internal.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:68)
Seen when running a class instead of an object.

2. scala -classpath . com.PrintMesgObj
java.lang.NoSuchMethodException: com.PrintMesgObj.main([Ljava.lang.String;)
Seen when the object does not define main with the exact signature def main(args: Array[String]).

Tuesday, April 26, 2016

Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java

Problem with a custom key/value class that implements the Writable interface. Error: Caused by: java.io.EOFException at java.io.DataInputStream.readInt.

Solution: If the custom Writable class has any Java primitive fields, then the overloaded method for that specific type must be used when writing and reading, e.g.
- writeChars(String s) for String
- writeInt(int i) for int
- writeLong(long l) for long
- writeBoolean(boolean b) for boolean

The plain DataOutput.write(int) writes only the low-order byte of its argument, so a later readInt() (which expects four bytes) runs off the end of the stream and throws EOFException.

Stack-Trace:
java.lang.Exception: java.io.EOFException
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at com.apple.Comments.MinMaxCountTuple.readFields(MinMaxCountTuple.java:48)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:145)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)


Custom class: I am getting this exception with the custom class below.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MinMaxCountTuple implements Writable {
    private int min;
    private int max;
    private int count;

    // Buggy version: DataOutput.write(int) writes only the low-order byte of each int
    public void write(DataOutput out) throws IOException {
        out.write(this.min);
        out.write(this.max);
        out.write(this.count);
    }

    public void readFields(DataInput in) throws IOException {
        this.min = in.readInt();
        this.max = in.readInt();
        this.count = in.readInt();
    }

    // Changed the write method to the version below and it fixed the problem:
    // writeInt writes all four bytes, matching readInt above
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.min);
        out.writeInt(this.max);
        out.writeInt(this.count);
    }
}

Monday, April 11, 2016

Splitting on dot operator in Pig

For input:
101, iOS8.4
102, POS6.7

Expected output:
101, iOS8
102, POS6


A = LOAD '/home/hadoop/work/surjan/token/Test.txt' USING PigStorage(',') AS (id:long, a1:chararray);
-- STRSPLIT takes a regex; '\\u002E' matches a literal dot (the same as escaping it with '\\.')
B = FOREACH A GENERATE $0, FLATTEN(STRSPLIT(a1,'\\u002E')) as (a1:chararray, a1of1:chararray);
-- keep only the id and the part before the dot
C = FOREACH B GENERATE $0, a1;
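If a value can contain more than one dot (for example iOS8.4.1), limiting STRSPLIT to two parts keeps everything after the first dot together. A minimal sketch, assuming the same LOAD statement as above; '\\.' is simply the escaped-dot form of the same regex:

B = FOREACH A GENERATE $0, FLATTEN(STRSPLIT(a1,'\\.',2)) as (prefix:chararray, rest:chararray);
C = FOREACH B GENERATE $0, prefix;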

Wednesday, March 16, 2016

Secondary Sort example in Pig

Problem: Get the month-wise temperatures in descending order.

Input Data: SecodSort.txt
2012, 01, 01, 5
2012, 01, 02, 45
2012, 01, 03, 35
2012, 01, 04, 10
2001, 11, 01, 46
2001, 11, 02, 47
2001, 11, 03, 48
2001, 11, 04, 40
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70


-- Secondary sort: order the temperatures within each (year, month) group
A = LOAD '/Users/surjanrawat/Documents/SecodSort.txt' using PigStorage(',') as (year:long, month:long, date:long, temp:long);
B = foreach A generate year, month, temp;
C = group B by (year, month);
D = foreach C {
    X = ORDER B by temp desc;                       -- sort this group's bag by temperature, descending
    Y = foreach X generate $2;                      -- keep only the temperature column
    generate flatten(group), BagToString(Y, ',');   -- emit year, month and the sorted temperatures
};
dump D;


Output
--------
(2001,11,48,47,46,40)
(2005,8,70,52,50,38)
(2012,1,45,35,10,5)

Tuesday, March 15, 2016

How a Pig job is translated / converted to MapReduce, step by step

Refer to the link below for details about how a Pig script is translated/converted to MapReduce.

Excerpt from the link.
The Pig system takes a Pig Latin program as input, compiles it into one or more Map-Reduce jobs, and then executes those jobs on a given Hadoop cluster. 


Any Pig script/program, whether it runs in local mode or MapReduce mode, goes through a series of transformation steps before being executed.

Steps:

1. The parser checks the syntax of the Pig Latin script and produces a logical plan.
2. The logical optimizer rewrites the logical plan (for example, pushing filters and projections closer to the load).
3. The logical plan is compiled into a physical plan of operators.
4. The physical plan is broken up into one or more MapReduce jobs (the MapReduce plan).
5. The resulting jobs are submitted to the Hadoop cluster and executed in dependency order.
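To see these plans for a concrete script, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans without running the job. A minimal sketch with an illustrative path and fields:

A = LOAD 'surjan/input.txt' USING PigStorage(',') AS (id:chararray, val:long);
B = FILTER A BY val > 10;
C = GROUP B BY id;
D = FOREACH C GENERATE group, SUM(B.val);
EXPLAIN D;   -- prints the logical, physical and MapReduce plans for computing D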
Wednesday, February 24, 2016

java.lang.RuntimeException: readObject can't find class

INFO mapred.JobClient: Task Id : attempt_201512031955_66234_m_00025_0, Status : FAILED
java.lang.RuntimeException: readObject can't find class
at org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit.readClass(TaggedInputSplit.java:135)
at org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit.readFields(TaggedInputSplit.java:121)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:640)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)


Possible Causes: 
1. Check if the line below is present in the driver class:
      job.setJarByClass(MyDriver.class);  // Sets the jar in which each node will look for the Mapper and Reducer classes. If it is missing, you will see lots of FAILED tasks in the job tracker.

2. Check that any other classes being set, such as the combiner or the partitioner, are set correctly:


   job.setPartitionerClass(CustomPartitioner.class);
   job.setCombinerClass(MyCombiner.class);