Big Data, C#, Hadoop

Getting started with Big Data in .NET. How to write a MapReduce job in C#

In this blog post we are going to discuss what Big Data is and why it is important. Once we understand that, we will look at the tools we need to develop Big Data applications on a Windows machine. We will end part 1 by writing a small MapReduce job in C# to understand the entire flow.

What is Big Data?

In simpler terms, Big Data refers to data sets which are so large that it is challenging to process them using traditional data processing tools.

In certain scenarios, even when it is the velocity of the data that is high, the problem is referred to as a Big Data problem. Data which is not relational, i.e. has no fixed schema, and which requires deep analytics is usually also a suitable candidate for leveraging Big Data technology.

How to solve the Big Data Problem?

A community-proposed architecture was presented to counter some of the challenges faced when processing Big Data.

The architecture is known as the Lambda Architecture and you can read about it at http://lambda-architecture.net/

In a nutshell, the Lambda Architecture has the following features:

  • It is fault-tolerant against both hardware failures and human mistakes
  • It allows low-latency reads and writes
  • The resulting system should scale out rather than scale up

Hadoop is one implementation of such a platform; it allows scaling out to a degree that cannot be accomplished on a traditional relational database platform.

Core components

  • MapReduce
  • HDFS

MapReduce

The MapReduce component handles the job execution in Hadoop. It is responsible for processing large datasets using a distributed, parallel algorithm.

  • The Map method maps a given key/value pair to 0 or more key/value pairs. In the example which we will focus on in a while, the map function reads a number from each input line and outputs one key/value pair per line, with the word “Prime” or “Composite” as the key and the number itself as the value.
  • The Reduce method iterates through the values that are associated with a given key and produces zero or more outputs. In our example it counts how many numbers fell under each key, as the small trace below shows.
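
To make the flow concrete, here is a hypothetical trace for a three-line input (my own illustration, not from the original post):

Input lines:     4, 5, 6
Map output:      (Composite, 4), (Prime, 5), (Composite, 6)
After shuffle:   Composite -> [4, 6], Prime -> [5]
Reduce output:   (Composite, 2), (Prime, 1)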

HDFS

HDFS, an acronym for Hadoop Distributed File System, is the storage layer. On Windows, where we will be using HDInsight, Windows Azure Blob Storage provides the implementation of the HDFS interface.

Windows Azure HDInsight

As I mentioned before, this tutorial is about how we can write applications targeting Big Data problems on a Windows machine.

HDInsight makes the Hadoop framework available as a service on the Windows Azure platform.

Note that in this demo we will be using the HDInsight Emulator, which is a single-node Hadoop cluster you can run locally; in the coming posts we will also move to an actual Azure cluster.

Installation

You would need the following tools to get started:

  • Visual Studio 2010, 2012 or 2013
  • HDInsight Emulator, which is a single-node Hadoop cluster you can run locally for testing/getting started.

Prerequisites:

  • The Emulator requires a 64-bit Windows machine
  • It works on Windows 8 or Windows 7 with SP1

Once you have everything installed, you should see three new icons on your desktop.

Hadoop Installation

Now you need to go to the directory where Hadoop got installed and browse to the following location:

C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf

Open hadoop-env.cmd and set the JAVA_HOME environment variable as follows:

set JAVA_HOME=C:\java\jdk1.7.0_67

Open hadoop-env.sh and export the JAVA_HOME environment variable as follows:

export JAVA_HOME=C:\\java\\jdk1.7.0_67

NOTE: You must have the JDK installed on your machine, not just the JRE. Another issue I faced: when you install the JDK, by default it gets installed at C:\Program Files\Java, but because Hadoop doesn’t support spaces in paths you need to install the JDK to a path like C:\java\ instead.

Now that we are ready, run the Hadoop Command Prompt and make sure it isn’t showing any errors.
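
A quick sanity check (these are standard Hadoop commands) is to print the version and list the root of HDFS:

hadoop version
hadoop fs -ls /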

Let’s start coding!

Now that we have Hadoop installed, we can start writing our MapReduce program.

The job of our MapReduce program is to detect whether a number is prime or not, and give back the counts of prime and composite numbers in a given list of numbers.

For input I am using a text file which has the numbers 1 to 10,000, one per line.
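
If you want to reproduce the input, here is a minimal sketch to generate it (the file name numbers.txt and the helper class are my own, not from the original post):

using System.IO;
using System.Linq;
 
namespace BigData
{
    public class GenerateInput
    {
        public static void Generate()
        {
            // Write the numbers 1 to 10000, one per line.
            File.WriteAllLines("numbers.txt",
                Enumerable.Range(1, 10000).Select(n => n.ToString()));
        }
    }
}

The file then needs to be copied into HDFS at the path the job will read from (we use /prime/in below). From the Hadoop Command Prompt:

hadoop fs -mkdir /prime/in
hadoop fs -put numbers.txt /prime/in/numbers.txt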

We will start with a Console Application, and in order to work with Hadoop in .NET we need the .NET SDK for Hadoop.

To install the Microsoft .NET Map Reduce API for Hadoop, run the following command in the Package Manager Console:

PM> Install-Package Microsoft.Hadoop.MapReduce
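
This package brings in the types used below: MapperBase and ReducerCombinerBase for the map and reduce implementations, and the classes for configuring and submitting jobs.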

Let us first see the Map functionality:

using Microsoft.Hadoop.MapReduce;
 
namespace BigData
{
    public class PrimeMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            int value = int.Parse(inputLine);
            string key = IsPrime(value) ? "Prime" : "Composite";
 
            // Emit the classification ("Prime"/"Composite") as the key
            // and the number itself as the value.
            context.EmitKeyValue(key, value.ToString());
        }
 
        private bool IsPrime(int input)
        {
            // 1 is neither prime nor composite; this demo counts it
            // with the composites.
            if (input == 1) return false;
            if (input == 2) return true;
            // Trial division: test every candidate divisor up to input/2.
            for (int i = 2; i <= input/2; i++)
            {
                if (input%i == 0)
                {
                    return false;
                }
            }
            return true;
        }
    }
} 
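
As a side note, trial division only needs to check divisors up to the square root of the input, so a slightly faster variant of IsPrime (my tweak, not part of the original post) would be:

private bool IsPrime(int input)
{
    if (input == 1) return false;
    if (input == 2) return true;
    if (input % 2 == 0) return false;
    // Odd divisors up to sqrt(input) are sufficient.
    for (int i = 3; i * i <= input; i += 2)
    {
        if (input % i == 0) return false;
    }
    return true;
}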

Reduce functionality:

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;
 
namespace BigData
{
    public class PrimeReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            // All values for a key arrive together, so the number of values
            // is the count of numbers classified under that key.
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }
} 
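
Note that PrimeReducer derives from ReducerCombinerBase, so the same class could in principle also be registered as a combiner. Be careful, though: if partial counts were combined on the map side, the reduce step would have to sum those counts instead of counting values, so this particular implementation is only correct as a plain reducer.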

Main method:

using System;
using Microsoft.Hadoop.MapReduce;
 
namespace BigData
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            const string userName = "hadoop";
 
            // Input and output locations in HDFS: the job reads /prime/in and writes to /prime/out.
            var config = new HadoopJobConfiguration {InputPath = "/prime/in", OutputFolder = "/prime/out"};
 
            var uri = new Uri("http://localhost");
 
            // Connect to the Hadoop cluster running locally.
            IHadoop cluster = Hadoop.Connect(uri, userName, null);
 
            MapReduceResult result = cluster.MapReduceJob.Execute<PrimeMapper, PrimeReducer>(config);
 
            int exitCode = result.Info.ExitCode;
 
            Console.WriteLine();
 
            Console.Write("Exit Code (0 == success): " + exitCode);
 
            Console.Read();
        }
    }
}
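
Note that Execute takes the mapper and reducer types as generic parameters and submits the job to the cluster; as far as I know the SDK runs .NET mappers and reducers through Hadoop Streaming under the hood.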

Now let us run the job. From Visual Studio, click the Start button and a console window should appear. Once the job has completed, the console window should look like the following:

We got exit code 0: success. Let us now look at the output. Launch the Hadoop Command Prompt and inspect the contents of the /prime/out folder.
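
For example (the part-00000 file name assumes the default output naming of a single-reducer job):

hadoop fs -ls /prime/out
hadoop fs -cat /prime/out/part-00000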

[Screenshot: the completed Hadoop job in the console]

So we got 1,229 numbers which were prime and the remaining 8,771 composite (1,229 is indeed the number of primes up to 10,000).

[Screenshot: the job output in /prime/out]

Final words

In upcoming posts we will dive deeper into more complex MapReduce jobs and also look at technologies such as Hive, which can abstract away the complexities involved in writing MapReduce functionality.

Also, we will see how to set up an Azure cluster, deploy our MapReduce job to it, and configure it.

Finally, we will look into the different tools we can use to analyse the data at hand after the computation done by the MapReduce job.
