Niraj Bhatt – Architect's Blog

Ruminations on .NET, Architecture & Design

Passing Parameters to Hadoop Streaming

This weekend I will be presenting at the BDOTNET UG meet on the topic “Big Talk: Hadoop on Azure”. If you are around and plan to attend, here’s the Facebook event link. In this talk I will help you get started with Hadoop and show how you can leverage it with Windows Azure. With that, let’s focus on the subject of this blog post.

Hadoop Streaming allows you to write and run MapReduce jobs in the language of your choice. In the Azure and Microsoft world this would mostly be C#. You can create your programs / executables in C#, reading input from the console and writing output to the console. The mapper task feeds input lines to your executable via standard input and collects its output via standard output, converting each output line into a key/value pair. The reducer task, on the other hand, converts key/value pairs into input lines, feeds them to your executable via standard input, and collects the output via standard output, converting it back to key/value pairs. For scenarios where you need only a mapper, you can omit the reducer and set ‘numReduceTasks’ to zero as shown below:

call hadoop.cmd jar hadoop-streaming.jar -files "hdfs://" -mapper "Mapper.exe" -input "asv://account/inputdata/" -output "/example/data/StreamingOutput/mywc" -numReduceTasks=0

The IP address in the above case is that of the Namenode (you can get it by executing the following command from the JavaScript console: #cat file:///apps/dist/conf/core-site.xml).
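To make the stdin / stdout contract concrete, here is a minimal sketch of what a map-only Mapper.exe could look like. The word-count logic and the class name WordCountMapper are my own illustration, not something the job above requires. It relies on the default Streaming convention that everything up to the first tab in an output line is the key and the rest is the value.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountMapper
{
    // Turns one input line into "word<TAB>1" records (illustrative logic only).
    public static IEnumerable<string> Map(string line)
    {
        var separators = new[] { ' ', '\t' };
        foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Key and value are separated by a tab, the Streaming default.
            yield return word + "\t1";
        }
    }

    public static void Main()
    {
        string line;
        // The mapper task feeds input lines on standard input until end of stream.
        while ((line = Console.ReadLine()) != null)
        {
            foreach (var record in Map(line))
            {
                // Each line written to standard output becomes one key/value pair.
                Console.WriteLine(record);
            }
        }
    }
}
```

Keeping the per-line logic in a separate Map method makes it easy to try out locally, without a cluster, by piping a text file into the executable.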

Now at times, your mapper program will need additional parameters to carry out its operations, e.g. say you want the mapper to filter data on a few attributes. So, how can we pass these attributes to the Mapper executable? Simple: pass them as command line parameters and read them from args in your program.

call hadoop.cmd jar hadoop-streaming.jar -files "hdfs://" -mapper "Mapper.exe param1 param2 param3" -input "asv://account/inputdata/" -output "/example/data/StreamingOutput/mywc" -numReduceTasks=0

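static void Main(string[] args)
{
    // Values come from the -mapper string: "Mapper.exe param1 param2 param3"
    string parameterOne = args[0];
    string parameterTwo = args[1];
    string parameterThree = args[2];

    string line;
    // Read each input line from standard input and apply the parameters to it
    while ((line = Console.ReadLine()) != null) { /* mapper logic goes here */ }
}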
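As a fuller sketch of the filtering scenario, the mapper below treats every command line parameter as a value to keep. The filtering rule is purely an assumption for illustration: input lines are taken to be comma-separated, and a line is kept when its first field matches one of the parameters. The names FilterMapper and ShouldKeep are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class FilterMapper
{
    // Hypothetical rule: keep a line when its first comma-separated field
    // equals one of the wanted values passed on the command line.
    public static bool ShouldKeep(string line, ISet<string> wanted)
    {
        int comma = line.IndexOf(',');
        string firstField = comma >= 0 ? line.Substring(0, comma) : line;
        return wanted.Contains(firstField.Trim());
    }

    public static int Main(string[] args)
    {
        if (args.Length == 0)
        {
            // Standard error is not part of the mapper output.
            Console.Error.WriteLine("Usage: Mapper.exe value1 [value2 ...]");
            return 1;
        }

        // Case-insensitive matching is an arbitrary choice for this sketch.
        var wanted = new HashSet<string>(args, StringComparer.OrdinalIgnoreCase);

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            if (ShouldKeep(line, wanted))
            {
                Console.WriteLine(line);
            }
        }
        return 0;
    }
}
```

Checking args.Length up front avoids an IndexOutOfRangeException inside the task when the -mapper string is missing its parameters.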

Hope this helps!

