Query escaped JSON string in Hive

There are times when we want to parse a string that is actually a JSON. Usually that could be done with built in functions of Hive such as get_json_object(). Though get_json_object cannot parse JSON Array from my experience. These array needs to be exploded first using explode() and then use get_json_object on each element (JSON) of the exploded Array.

The one option that was left to me was to convert that JSON object Array to Hive Array.

  • We will be using Hive JSON Serde from – https://github.com/rcongiu/Hive-JSON-Serde
  • You can get help setting it up from here.

Below is a sample JSON for our exercise.

JSON Sample, having Message_json attribute containing a valid JSON Array, and Message_string attribute containing similar array in string notation. We are talking about parsing such values that are “JSON array”, “escaped” and present in “string format”.

{
  "Message_json": [
    {
      "Key": "GetService1", 
      "ResponseTime": "12"
    }, 
    {
      "Key": "ServiceOperation", 
      "ResponseTime": "2"
    }, 
    {
      "Key": "Service", 
      "ResponseTime": "16"
    }
  ], 
  "Message_string": "[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"
}

 

To prep your hive environment, use the following commands –

Create sample JSON file on local file system that will be loaded in a Hive managed table.

[]$ echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json

In hive/beeline shell create the sample table

create table sample (
Message_json array<struct<key:string, ResponseTime:string>>,
Message_string string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

Load JSON data in sample table

LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE sample;

 

Lets check with a basic select if everything is in order

0: jdbc:hive2://www.robin.eu.org:10000> select * from sample;
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
| sample.message_json | sample.message_string |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
| [{"key":"GetService1","responsetime":"12"},{"key":"ServiceOperation","responsetime":"2"},{"key":"Service","responsetime":"16"}] | [{"Key":"GetService2","ResponseTime":"12"},{"Key":"ServiceOperation","ResponseTime":"2"},{"Key":"Service","ResponseTime":"16"}] |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.147 seconds)

 

Following is how we would process a JSON array in normal circumstances. Message_json in our case –

0: jdbc:hive2://www.robin.eu.org:10000>
0: jdbc:hive2://www.robin.eu.org:10000> select val_json.Key, val_json.ResponseTime from sample
0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(Message_json) tbl_json as val_json;
+-------------------+---------------+--+
|        key        | responsetime  |
+-------------------+---------------+--+
| GetService1       | 12            |
| ServiceOperation  | 2             |
| Service           | 16            |
+-------------------+---------------+--+
3 rows selected (0.151 seconds)

Lets try that with the other attribute, Message_string –

0: jdbc:hive2://www.robin.eu.org:10000> select val_json.Key, val_json.ResponseTime from sample
0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(Message_string) tbl_json as val_json;
Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)

We get the error as we expected.

So lets take a look at the solution, that involves converting JSON Array represented in string to real Hive array.

0: jdbc:hive2://www.robin.eu.org:10000> select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime from sample
0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json;
+-------------------+---------------+--+
|        key        | responsetime  |
+-------------------+---------------+--+
| GetService2       | 12            |
| ServiceOperation  | 2             |
| Service           | 16            |
+-------------------+---------------+--+
3 rows selected (0.156 seconds)

 

All commands are below for easy executing –

-- sample JSON: {"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}
-- prep sample: echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json

drop table sample;
create table sample (
Message_json array<struct<key:string, ResponseTime:string>>,
Message_string string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE  sample;


--OKAY: Message_json
select val_json.Key, val_json.ResponseTime
from sample
lateral view explode(Message_json) tbl_json as val_json;

--NOT OKAY: Message_string - with previous approach
select val_json.Key, val_json.ResponseTime
from sample
lateral view explode(Message_string) tbl_json as val_json;

--SOLUTION: Message_string - conversion of string to array
select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime
from sample
lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json;

 

Additional set of replacements patterns for unescaping escaped JSON

\"             ->    "
:([^"{}[]+),   ->    :"\1",
="([^"{}]+)"   ->    =\\"\1\\"

Line 2, Adds missing ” (quotes) to values

Line 3, takes care of href URLs in data

 

 

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *