Query escaped JSON string in Hive
There are times when we want to parse a string that is actually a JSON. Usually that could be done with built in functions of Hive such as get_json_object(). Though get_json_object cannot parse JSON Array from my experience. These array needs to be exploded first using explode() and then use get_json_object on each element (JSON) of the exploded Array.
The one option that was left to me was to convert that JSON object Array to Hive Array.
- We will be using Hive JSON Serde from – https://github.com/rcongiu/Hive-JSON-Serde
- You can get help setting it up from here.
Below is a sample JSON for our exercise.
JSON Sample, having Message_json attribute containing a valid JSON Array, and Message_string attribute containing similar array in string notation. We are talking about parsing such values that are “JSON array”, “escaped” and present in “string format”.
{ "Message_json": [ { "Key": "GetService1", "ResponseTime": "12" }, { "Key": "ServiceOperation", "ResponseTime": "2" }, { "Key": "Service", "ResponseTime": "16" } ], "Message_string": "[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]" }
To prep your hive environment, use the following commands –
Create sample JSON file on local file system that will be loaded in a Hive managed table.
[]$ echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json
In hive/beeline shell create the sample table
create table sample ( Message_json array<struct<key:string, ResponseTime:string>>, Message_string string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Load JSON data in sample table
LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE sample;
Lets check with a basic select if everything is in order
0: jdbc:hive2://www.robin.eu.org:10000> select * from sample; +----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+ | sample.message_json | sample.message_string | +----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+ | [{"key":"GetService1","responsetime":"12"},{"key":"ServiceOperation","responsetime":"2"},{"key":"Service","responsetime":"16"}] | [{"Key":"GetService2","ResponseTime":"12"},{"Key":"ServiceOperation","ResponseTime":"2"},{"Key":"Service","ResponseTime":"16"}] | +----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+ 1 row selected (0.147 seconds)
Following is how we would process a JSON array in normal circumstances. Message_json in our case –
0: jdbc:hive2://www.robin.eu.org:10000> 0: jdbc:hive2://www.robin.eu.org:10000> select val_json.Key, val_json.ResponseTime from sample 0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(Message_json) tbl_json as val_json; +-------------------+---------------+--+ | key | responsetime | +-------------------+---------------+--+ | GetService1 | 12 | | ServiceOperation | 2 | | Service | 16 | +-------------------+---------------+--+ 3 rows selected (0.151 seconds)
Lets try that with the other attribute, Message_string –
0: jdbc:hive2://www.robin.eu.org:10000> select val_json.Key, val_json.ResponseTime from sample 0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(Message_string) tbl_json as val_json; Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)
We get the error as we expected.
So lets take a look at the solution, that involves converting JSON Array represented in string to real Hive array.
0: jdbc:hive2://www.robin.eu.org:10000> select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime from sample 0: jdbc:hive2://www.robin.eu.org:10000> lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json; +-------------------+---------------+--+ | key | responsetime | +-------------------+---------------+--+ | GetService2 | 12 | | ServiceOperation | 2 | | Service | 16 | +-------------------+---------------+--+ 3 rows selected (0.156 seconds)
All commands are below for easy executing –
-- sample JSON: {"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"} -- prep sample: echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json drop table sample; create table sample ( Message_json array<struct<key:string, ResponseTime:string>>, Message_string string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE sample; --OKAY: Message_json select val_json.Key, val_json.ResponseTime from sample lateral view explode(Message_json) tbl_json as val_json; --NOT OKAY: Message_string - with previous approach select val_json.Key, val_json.ResponseTime from sample lateral view explode(Message_string) tbl_json as val_json; --SOLUTION: Message_string - conversion of string to array select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime from sample lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json;
Additional set of replacements patterns for unescaping escaped JSON
\" -> " :([^"{}[]+), -> :"\1", ="([^"{}]+)" -> =\\"\1\\"
Line 2, Adds missing ” (quotes) to values
Line 3, takes care of href URLs in data