Published on

Writing UDF To Parse JSON In Hive

Authors

Sometimes we need to perform data transformation in ways too complicated for SQL (even with the Custom UDF’s provided by hive). Let’s take JSON manipulation as an example.

JSON is widely used to store and transfer data. Hive comes with a built-in json_tuple() function that can extract values for multiple keys at once. But if you have a nested JSON the query using json_tuple() can get messy very quickly. Say we have the following JSON:

{
  "customer": {
    "given_name": "Anuvrat",
    "surname": "Singh"
  },
  "order": {
    "id": "123dfe523gd"
  }
}

If we were to use the json_tuple() function of Hive we would write something like:

create table json_table (
  record string
);

select
  v2.given_name,
  v2.surname,
  v3.id
from json_table jt
  lateral view json_tuple(jt.record, 'customer', 'order') v1 as customer, order
  lateral view json_tuple(v1.customer, 'given_name', 'surname') v2 as given_name, surname
  lateral view json_tuple(v1.order, 'id') v3 as id;

Which, you have got to agree, looks really ugly!

Instead, we could quickly write a script (python being my fav.) that does the transformation for us. The input to the script is a single record of json from the table, and the output of the script should be tab separated values. The values can be operated upon or inserted into another table using hive.

Here’s the script. The script calls transform_json() method for each line, which extracts the values we are interested in and prints them with tab as the separation character.

#!/usr/bin/python
import sys
import json

def get_order_details(order_record):
  json_record = json.loads(order_record)

  customer = json_record['customer']
  order = json_record['order']

  values = [customer['given_name'], customer['surname'], order['id']]

  print '\t'.join(map(str, values))

for line in sys.stdin:
    line = line.strip()
    get_order_details(line)

In the hive script we first add the file and then call it using the transform() command. Here’s a neat looking script that does what our previous hive script did using json_tuple().

add file 'get_order_details_mapper.py';

select transform ( record )
using 'python get_order_details_mapper.py'
as given_name, surname, order_id
from json_table;

That’s it! Neat and simple!

In my particular case the keys were variables and I just could not have used json_tuple() to extract info from json.