ottomata

Too many Hive JSON SerDes


One of the big advantages of Hive is the ability to query almost any data format. All one has to do is provide a ‘SerDe’ class so that Hive knows how to serialize and deserialize the data. Hive ships with a few built-in SerDes (Avro, ORC, RegEx, Thrift), but not JSON! I found a few third-party JSON SerDes out there, but most were either incomplete or threw exceptions for simple use cases. Hive’s SerDe documentation references Amazon’s JSON SerDe, but I can’t seem to find the source code, and I’d rather not use something in production if I don’t know what it is doing.

I had mostly been using Cloudera’s JSONSerde from their cdh-twitter-example. My table defines a bigint sequence number field. The data in this field is a contiguous, increasing sequence number starting at 0. This JSONSerde uses Jackson to parse the JSON into Java objects, and infers the Java types from the values in the JSON. If a value looks like an integer, it will be a Java Integer; if a value looks like a long, it will be a Java Long. The same applies to Java Floats and Doubles. Since sequence numbers start small but eventually get very large, this SerDe ends up throwing ClassCastExceptions for my sequence field. Others have had similar issues.
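
To see why, here is a minimal sketch (my own illustration, not code from any of these SerDes) of how Jackson picks a boxed Java type when deserializing JSON into plain objects; the seq field name is just an example:

import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;  // Jackson 1.x API

public class JacksonTypeInference {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // A small sequence number fits in an int, so Jackson hands back an Integer.
        Map<?, ?> early = mapper.readValue("{\"seq\": 42}", Map.class);
        System.out.println(early.get("seq").getClass());  // class java.lang.Integer

        // Once the value exceeds Integer.MAX_VALUE, the same field becomes a Long.
        Map<?, ?> late = mapper.readValue("{\"seq\": 3000000000}", Map.class);
        System.out.println(late.get("seq").getClass());   // class java.lang.Long

        // A SerDe that casts early.get("seq") straight to Long for a bigint column
        // will throw ClassCastException on the small-valued rows.
    }
}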

After reading the source of this and the other JSON SerDes, I realized that this could be fixed by casting the Jackson parsed value to whatever the Hive table expects. I then noticed that HCatalog has a JSON SerDe that does exactly this. This code is specific to HCatalog, so it is possible that writing data using this SerDe may result in something I’m not expecting. But, this is the most complete JSON SerDe code I’ve seen so far, and I am able to read my JSON data with no problems using it. Also, an HCatalog jar ships with CDH4, which makes it easier to use in our production systems.
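
The idea behind that coercion is simple enough to sketch. The following is not the actual org.apache.hcatalog.data.JsonSerDe source, just an illustration I wrote of the approach: look up the type the Hive column declares and convert the Jackson-parsed number to it, rather than trusting whatever boxed type Jackson inferred:

public class JsonValueCoercion {
    // Illustrative helper only: coerce a Jackson-parsed value to the primitive
    // type the Hive column declares, so an Integer 42 still satisfies a bigint column.
    static Object coerceToHiveType(Object parsed, String hiveType) {
        if (!(parsed instanceof Number)) {
            return parsed;
        }
        Number n = (Number) parsed;
        if ("tinyint".equals(hiveType))  return n.byteValue();
        if ("smallint".equals(hiveType)) return n.shortValue();
        if ("int".equals(hiveType))      return n.intValue();
        if ("bigint".equals(hiveType))   return n.longValue();  // e.g. Integer 42 -> Long 42L
        if ("float".equals(hiveType))    return n.floatValue();
        if ("double".equals(hiveType))   return n.doubleValue();
        return parsed;
    }
}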

If you are using CDH4, you can add the hcatalog-core jar to your Hive auxpath and create a Hive table that uses this SerDe like this:

$ cdh_version=4.3.1
$ hive --auxpath \
/usr/lib/hcatalog/share/hcatalog/hcatalog-core-0.5.0-cdh${cdh_version}.jar

hive (default)>
create table my_table(...)
ROW FORMAT SERDE
  'org.apache.hcatalog.data.JsonSerDe'
...
;

See also: JSON SerDes Overview

Puppet and CDH4


I just pushed the first bits of a complete Puppet module for Cloudera’s CDH4 distribution for Hadoop. Currently it only supports YARN (not MapReduce v1), and assumes that your NameNode also runs your ResourceManager.

puppet-cdh4

There is another CDH4 puppet module out there, but it uses MapReduce v1 and hardcodes the HDFS data directories as facter variables. puppet-cdh4 allows you to specify Hadoop directories using a config class. As long as your JBOD mounts have been partitioned and mounted, you should be able to configure a DataNode like this:

include cdh4

class { "cdh4::hadoop::config":
    namenode_hostname => "namenode.hostname.org",
    mounts            => [
        "/var/lib/hadoop/data/a",
        "/var/lib/hadoop/data/b",
        "/var/lib/hadoop/data/c"
    ],
    dfs_name_dir      => ["/var/lib/hadoop/name", "/mnt/hadoop_name"],
}

# Installs and starts the DataNode and NodeManager services.
include cdh4::hadoop::worker

This is a work in progress, so leave me a comment on github if you try to use it and have troubles.

What’s up Scribe?


I first used Scribe at CouchSurfing back in 2009. We used it for application logs and eventually for collecting webserver access logs as well. It worked for years with zero hiccups. CouchSurfing’s scale is nothing near Facebook’s, but we weren’t no dinky little website either. Scribe just worked. However, we had to compile it and build our own RPMs. Fine fine, Scribe was only recently open sourced at the time, so that’s fair.

Now it is 2012, and I’m on the Wikimedia Foundation’s new Analytics team. We’ve been considering using Scribe as a way to get data into the analytics cluster we’re building. By now, you’d think it would be as easy as apt-get install scribe. But, nopers, you’ve still got to build it all yourself. This means choosing the correct Thrift and fb303 versions. I also needed to build Java bindings for Thrift and Scribe, which meant modifying Makefiles. On top of this, I can’t just make install! (Does anyone actually do this in production?) I needed to build .deb packages.

Welp, it’s done! simplegeo had already done the hard work of writing the debian/ directory. I forked each of their thrift, fb303 and scribe repositories, modified the debian/ configuration, and made any required Makefile changes. I wanted this process to be reproducible for folks in the future, so I scripted it up over at the scribe-debian repository. The .debs I built for Ubuntu Lucid amd64 are available there as well.

The forks I used build Scribe 2.2 against Thrift 0.2.0. Thrift is currently at 0.8.0. I struggled for a while building Scribe against newer Thrift versions, but was never quite able to make it work. Someone else has figured this out, and I wish I had seen this before I had settled on Thrift 0.2.0. I need to go back and try again. But, before I (or you) do, consider the following.

As far as I can tell, Facebook is no longer supporting or maintaining Scribe. From this Quora article:

We actually don’t use “scribe”.

Instead, Facebook has written a new series of tools for streaming log processing on top of HDFS. Watch Sam Rash’s presentation from the 2011 Hadoop Summit, linked from that Quora article. In summary, they have ‘rewritten’ Scribe in Java as Calligraphus. Calligraphus is a pub/sub distributed logging service / message queue, with a persistence layer provided by HDFS. They then use ptail to stream logs out of HDFS. You can think of the whole thing as a distributed version of Scribe’s buffered network store: instead of Scribe writing buffered or file logs to disk on a single node, Calligraphus writes logs to HDFS, and a consumer (e.g. ptail) reads them off at will. Unfortunately, Calligraphus has not yet been open sourced, although they say they plan to make it so soon.

Anyway, the fact that Facebook seems to be phasing out Scribe doesn’t bode well for Scribe’s future. fluentd might be a contender for those who don’t mind (or who want) a distributed logger written in Ruby, and for whom extremely high throughput isn’t a big concern. If you are going for big big data, you might want to check out Flume or Kafka.
