Introduction
This post explains how to clean duplicate data and store it in JSON format in HBase. Along the way, it introduces Apache Pig and the advantages it brings to Hadoop.
This post contains the steps to do a basic de-duplication of data and convert it to JSON format before storing it in HBase. Pig does ship with built-in libraries for handling JSON, but writing our own UDF lets us manipulate the JSON data in Java code before storing it in HBase.
Apache Pig is a data flow language built on top of Hadoop. It helps to load, extract, cleanse, process, and analyse big data with Map Reduce through a high-level language.
Pig has several advantages in Hadoop:
- Rapid development time.
- A high-level language for working with Map Reduce, and more recently Tez and Spark.
- It is a procedural language, not a declarative one, so it is easy to process data as a pipeline of commands.
- Powerful built-in functions, plus support for writing UDFs in several languages such as Java and Python.
- Lazy evaluation while processing the data.
Environment
- Java: JDK 1.7
- Cloudera version: CDH 5.4.7, please refer to this link: http://www.cloudera.com/downloads/cdh/5-4-7.html
Initial steps
- We need to create the HBase table with a command like the one shown below.
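Since the Pig script later in this post stores into the sampleJson table with the column family cf, a minimal sketch of the create command in the HBase shell would be:

create 'sampleJson', 'cf'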
- We need to prepare some input data to verify our code:
Create a local file named sampleData with the vi tool:
vi sampleData

1;Henrik;Los Angeles
2;Hank;San Jose
3;Alex;San Diego
4;Alex;San Francisco
1;Henrik;Los Angeles
5;Alan;Hanoi
- We need to put the local file into the Hadoop Distributed File System (HDFS) with this command:
hadoop fs -put sampleData /data/dv/sample/
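If the /data/dv/sample/ directory does not exist on HDFS yet, create it first:

hadoop fs -mkdir -p /data/dv/sample/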
Code walkthrough
This is the Pig script that stores the JSON data in HBase. It goes through these steps:
- Register the jar file which contains the UDF and define the UDF name.
- Load the raw data from the /data/dv/sample/ location on HDFS.
- Group the rawData relation by id as the key field.
- Clean the duplicate records on the key field with LIMIT 1 inside a FOREACH over the idGroup relation.
- To store the data in HBase, generate the data in key/value format: the key will be the rowkey in the HBase table, and the value will be the JSON format produced by the UDF.
- Store the data from the previous step into the sampleJson table under the column cf:jsonFormat.
Note that this Pig script compiles to a Map Reduce job, so the data is stored in HBase in parallel.
/* Register the jar file which contains our UDF */
REGISTER 'HelloPigHbase-0.0.1-SNAPSHOT.jar';

/* We create a function named JsonFormatConvertor from our UDF */
DEFINE JsonFormatConvertor com.blog.pigudf.JsonConverter();

/* We load the data from the HDFS location with three fields: id, name and city */
rawData = LOAD '/data/dv/sample/' USING PigStorage(';') AS (id:chararray, name:chararray, city:chararray);

/* To remove the duplicate data by key, we need to group the data by id */
idGroup = GROUP rawData BY id;

/* For each key we only take one record, so all duplicate records by key are ignored */
newData = FOREACH idGroup {
    A = LIMIT rawData 1;
    GENERATE FLATTEN(A.id) AS id, FLATTEN(A.name) AS name, FLATTEN(A.city) AS city;
};

/* We generate the key/value format before storing the data in HBase:
   the key is id, and the value is the data passed to our UDF to convert to a JSON string */
finalId = FOREACH newData GENERATE id, JsonFormatConvertor(name, city);

/* Store the data into the HBase table */
STORE finalId INTO 'hbase://sampleJson' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:jsonFormat');
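To run the script, save it to a file (the name storeJsonToHbase.pig below is only an illustration) and launch it with the pig client in Map Reduce mode. Depending on your setup, you may need to add the HBase jars and configuration to PIG_CLASSPATH so that HBaseStorage can reach the cluster:

pig -x mapreduce storeJsonToHbase.pig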
This is the entity class that defines the JSON format for our data:
public class AccountEntity {

    private String name;
    private String city;

    public AccountEntity(String name, String city) {
        super();
        this.name = name;
        this.city = city;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getCity() {
        return city;
    }

    public void setCity(String city) {
        this.city = city;
    }
}
This is the UDF Java code that converts the text data to JSON data:
- We read the input data from Pig as a Tuple.
- We get the data for the two fields, name and city, from the input Tuple. Any further data manipulation can easily be handled at this step.
- We use Jackson's ObjectMapper to write the entity as a JSON string and return the result.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// Jackson 1.x ObjectMapper, which ships with CDH; for Jackson 2 use com.fasterxml.jackson.databind.ObjectMapper
import org.codehaus.jackson.map.ObjectMapper;

public class JsonConverter extends EvalFunc<String> {

    private ObjectMapper mapper = new ObjectMapper();

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String name = input.get(0) == null ? "" : input.get(0).toString().trim();
            String city = input.get(1) == null ? "" : input.get(1).toString().trim();
            AccountEntity account = new AccountEntity(name, city);
            return mapper.writeValueAsString(account);
        } catch (Exception e) {
            throw new RuntimeException("Unable to convert the input tuple to JSON.", e);
        }
    }
}
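A quick way to sanity-check the UDF outside the cluster is to build a Tuple by hand and call exec() directly. This is only a sketch for illustration; the test class below is not part of the original project:

import java.util.Arrays;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class JsonConverterTest {
    public static void main(String[] args) throws Exception {
        // Build an input tuple with the two fields the UDF expects: name and city
        Tuple input = TupleFactory.getInstance().newTuple(
                Arrays.<Object>asList("Henrik", "Los Angeles"));
        // Prints something like {"name":"Henrik","city":"Los Angeles"} (field order may vary)
        System.out.println(new JsonConverter().exec(input));
    }
}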
Build the Maven project to get the jar file of the UDF:
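The build is a standard Maven package; run it from the project root and the jar appears in the target directory:

mvn clean package

The resulting HelloPigHbase-0.0.1-SNAPSHOT.jar must sit next to the Pig script, or the REGISTER statement must point to its full path.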
Verify the output data
- Check the log output to confirm that the Map Reduce job finished successfully.
- Scan the HBase table to see the output data in JSON format:
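In the HBase shell, the scan command below should return five rows (ids 1 to 5), since the duplicate record with id 1 was removed, each holding a JSON value such as {"name":"Henrik","city":"Los Angeles"} in the cf:jsonFormat column:

scan 'sampleJson'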
- The structure of the project should look like this:
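A sketch of the standard Maven layout, inferred from the jar name, the com.blog.pigudf package, and the class names used above:

HelloPigHbase/
  pom.xml
  src/main/java/com/blog/pigudf/
    AccountEntity.java
    JsonConverter.java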
We hope this post helps you understand how to cleanse data and store it in JSON format in HBase.
Comments on this post are open; feel free to ask if any term or step in this post is unclear.