ottomata

« Puppet and CDH4 Kafka at Wikimedia »

Too many Hive JSON SerDes

| Comments

One of the big advantages of Hive is the ability to query almost any data format. All one has to do is to provide a ’SerDe’ class so that Hive knows how to serialize and deserialize data. Hive ships with a few built in SerDes (Avro, ORC, RegEx, Thrift), but not JSON! I found a few third party JSON SerDes out there, but most were either incomplete or threw exceptions for simple use cases. Hive’s SerDe documentation references Amazon’s JSON SerDe, but I can’t seem to find the source code, and I’d rather not use this in production if I don’t know what it is doing.

I had mostly been using Cloudera’s JSONSerde from their cdh-twitter-example. My table defines a bigint sequence number field. The data in this field is a contiguous increasing sequence number starting at 0. This JSONSerde uses Jackson to parse the JSON into Java objects, and infers the types of the Java objects from the values in the JSON. If a value looks like a an integer, it will be a Java Integer. If a value looks like a Long, it will be a Java Long. The same applies for Java Floats and Doubles as well. Since sequence numbers start small but eventually get very large, this SerDe will end up throwing ClassCastExceptions for my sequence field. Others have had similar issues.

After reading the source of this and the other JSON SerDes, I realized that this could be fixed by casting the Jackson parsed value to whatever the Hive table expects. I then noticed that HCatalog has a JSON SerDe that does exactly this. This code is specific to HCatalog, so it is possible that writing data using this SerDe may result in something I’m not expecting. But, this is the most complete JSON SerDe code I’ve seen so far, and I am able to read my JSON data with no problems using it. Also, an HCatalog jar ships with CDH4, which makes it easier to use in our production systems.

If you are using CDH4, you add the hcatalog-core to your Hive auxpath and create a Hive table that uses this SerDe like this:

1
2
3
4
5
6
7
8
9
10
$ cdh_version=4.3.1
$ hive --auxpath \
/usr/lib/hcatalog/share/hcatalog/hcatalog-core-0.5.0-cdh${cdh_version}.jar

hive (default)>
create table my_table(...)
ROW FORMAT SERDE
  'org.apache.hcatalog.data.JsonSerDe'
...
;

JSON SerDes Overview

See also: