May 25, 2012
This weekend I will be presenting at the BDOTNET UG meet on the topic “Big Talk: Hadoop on Azure”. If you are around and plan to attend, here’s the Facebook event link. In this talk I will help you get started with Hadoop and show how you can leverage it with Windows Azure. With that, let’s focus on the subject of this blog post.
Hadoop Streaming allows you to write and run MapReduce jobs in the language of your choice. In the Azure and Microsoft world this will mostly be C#. You write your program as a C# executable that reads its input from the console and writes its output to the console. The mapper task feeds input lines to your executable via standard input and collects its output via standard output, converting each output line into a key/value pair. The reducer task, on the other hand, converts key/value pairs into input lines, feeds them to your executable via standard input, and collects the output via standard output, converting it back into key/value pairs. For scenarios where you need only a mapper, you can omit the reducer and set ‘numReduceTasks’ to zero as shown below:
call hadoop.cmd jar hadoop-streaming.jar -files "hdfs://10.186.36.85:9000/example/apps/Mapper.exe" -mapper "Mapper.exe" -input "asv://account/inputdata/account.data" -output "/example/data/StreamingOutput/mywc" -numReduceTasks=0
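To make the mapper side of this contract concrete, here is a minimal sketch of what a Mapper.exe could look like. It is only an illustration: the class name and the word-count style logic are my assumptions, not the program used in the command above. It reads lines from standard input and writes one tab-separated key/value line per word to standard output, relying on Hadoop Streaming's default of treating the text before the first tab as the key.

```csharp
using System;

// Hypothetical word-count style mapper; the class name and the
// tokenizing rule are illustrative assumptions, not from the post.
public static class Mapper
{
    public static void Main(string[] args)
    {
        string line;
        // Hadoop Streaming feeds the input split one line at a time on standard input.
        while ((line = Console.ReadLine()) != null)
        {
            string[] words = line.Split(
                new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

            foreach (string word in words)
            {
                // Emit "key<TAB>value"; by default the text before the
                // first tab is the key and the rest of the line is the value.
                Console.WriteLine(word + "\t1");
            }
        }
    }
}
```

You can try a program like this locally before submitting a job by piping a text file into the executable and inspecting what it prints.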
Now at times, your mapper program will need additional parameters to carry out its work; say you want the mapper to filter data on a few attributes. So how can we pass these attributes to the Mapper executable? Simple: pass them as command-line parameters inside the -mapper string, and read them from args in your program.
call hadoop.cmd jar hadoop-streaming.jar -files "hdfs://10.186.36.85:9000/example/apps/Mapper.exe" -mapper "Mapper.exe param1 param2 param3" -input "asv://account/inputdata/account.data" -output "/example/data/StreamingOutput/mywc" -numReduceTasks=0
static void Main(string[] args)
{
    // Parameters arrive in the order they appear in the -mapper string.
    string parameterOne = args[0];
    string parameterTwo = args[1];
    string parameterThree = args[2];
}
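Putting the two pieces together, a filtering mapper could look roughly like the sketch below. The filtering rule (keep a line if it contains any of the supplied terms, ignoring case), the class name, and the usage message are assumptions made for illustration; your own attributes and matching logic will differ.

```csharp
using System;

// Hypothetical filtering mapper: the matching rule and names are
// illustrative assumptions, not the program from the post.
public static class FilterMapper
{
    public static int Main(string[] args)
    {
        if (args.Length == 0)
        {
            // Standard error is kept out of the job output.
            Console.Error.WriteLine("usage: Mapper.exe <term> [<term> ...]");
            // By default Hadoop Streaming treats a non-zero exit code as a task failure.
            return 1;
        }

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            foreach (string term in args)
            {
                // Keep the line if it contains any of the terms, ignoring case.
                if (line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    Console.WriteLine(line);
                    break; // emit each matching line only once
                }
            }
        }
        return 0;
    }
}
```

Checking args.Length up front is worth the extra lines: if the parameters are accidentally left out of the -mapper string, the task fails with a readable message instead of an IndexOutOfRangeException.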
Hope this helps!