Monday, October 31, 2016

Finding bigrams using MapReduce

Natural language processing and computational linguistics applications often use n-grams to analyze textual data. An n-gram is a contiguous sequence of n items from a given text. If n = 2, the sequence is called a bigram.
Suppose the text you have is

It is raining outside.

The bigrams from this text are (It, is), (is, raining), (raining, outside). Usually the sequence is read from left to right, not in the backward direction, so (is, It) would not be regarded as a valid bigram.
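
Before bringing in MapReduce, here is what the extraction itself looks like in plain Java - a minimal sketch, with the class name BigramDemo chosen purely for illustration:

import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    public static void main(String[] args) {
        String text = "It is raining outside";
        String[] tokens = text.split("\\s+");

        // Pair each token with its predecessor, moving left to right
        List<String> bigrams = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            bigrams.add("(" + tokens[i - 1] + ", " + tokens[i] + ")");
        }

        // Prints: [(It, is), (is, raining), (raining, outside)]
        System.out.println(bigrams);
    }
}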

Suppose now that you have a lot of text from which bigrams have to be extracted. Why would you want to do this? There could be several reasons - you want to understand the semantics of a text or conversation; you are interested in studying the co-occurrence statistics of words (i.e., which words tend to occur together frequently); or you want to use bigrams as a primitive for more involved natural language processing tasks, among other uses.


Now if the text is large (say, a whole book, or newspaper articles spanning a month or a year), the simple task of extracting bigrams becomes compute intensive.



A Distributed File System (DFS) is typically used to store such a large volume of data. The Hadoop Distributed File System (HDFS) is open source and has a relatively gentle learning curve, which makes it a popular choice for storage. Manipulating this data is primarily done with MapReduce programming, which has been used effectively at Google, Yahoo!, Facebook, and many other companies that deal with large data volumes.

So how do we extract our bigrams using MapReduce?

Here is a possibility - the Map task takes in (key, value) pairs; in our example, it reads the text and outputs a series of intermediate pairs of the form (bigram, count). The Reduce task then adds up all the counts associated with each bigram. For the example sentence above, Map would emit ((It, is), 1), ((is, raining), 1) and ((raining, outside), 1), and Reduce would sum the counts of each distinct bigram across the whole corpus.


Straightforward enough?


Here is the Mapper code to do the job. 
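
A minimal sketch of such a Mapper, assuming the Hadoop 2.x (org.apache.hadoop.mapreduce) API; the name BigramMapper is illustrative, and in the full program it sits as a static nested class inside MyBigramCount:

// Needs these imports at the top of MyBigramCount.java:
// java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper

public static class BigramMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private Text bigram = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and pair each token with the previous one
        StringTokenizer itr = new StringTokenizer(value.toString());
        String prev = null;
        while (itr.hasMoreTokens()) {
            String cur = itr.nextToken();
            if (prev != null) {
                bigram.set(prev + " " + cur);
                context.write(bigram, ONE);   // emit (bigram, 1)
            }
            prev = cur;
        }
    }
}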

So what does the code do? It simply tokenizes the string and, as long as the previous token is not null, puts the previous token together with the current one. For each such pair it outputs (bigram, 1).

Now the Reduce code can work as follows.
For each bigram, it sums up all the counts received and emits the total.
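
A matching Reducer, again as a sketch under the same assumptions (BigramReducer is an illustrative name, and it too sits as a static nested class inside MyBigramCount):

// Additionally requires org.apache.hadoop.mapreduce.Reducer
// among the imports at the top of MyBigramCount.java

public static class BigramReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all the counts received for this bigram
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit (bigram, total count)
    }
}

Because addition is associative, the same class can also serve as a Combiner, pre-aggregating counts on the map side before they are shuffled across the network.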

Putting it all together - we have the main method as follows.
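
Here is a sketch of the driver class under the same assumptions, wiring together the illustrative BigramMapper and BigramReducer shown above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyBigramCount {

    // BigramMapper and BigramReducer (sketched above) are declared
    // here as static nested classes.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Configure the job and name it MyBigramCount, which is how
        // it shows up in the cluster's job listings
        Job job = Job.getInstance(conf, "MyBigramCount");
        job.setJarByClass(MyBigramCount.class);

        // Mapper, Combiner and Reducer implementations; the Reducer
        // doubles as the Combiner since summing counts is associative
        job.setMapperClass(BigramMapper.class);
        job.setCombinerClass(BigramReducer.class);
        job.setReducerClass(BigramReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0]: HDFS input path, args[1]: HDFS output path,
        // args[2]: number of reducers
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(Integer.parseInt(args[2]));

        // Submit the job (to YARN on Hadoop 2.x) and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}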


The main method is slightly more involved. First, we pass in three arguments - the input path where the text is stored; the output path where the generated bigrams will be written (both paths are on the HDFS); and the number of reducers to use. The opening lines simply state that you are running the MyBigramCount program.

Job refers to the MapReduce job configuration. It is used to submit the MapReduce job to the YARN execution framework, which comes built into Apache Hadoop 2.0 and later versions.
FileInputFormat indicates where the input files are available for the Map task, and FileOutputFormat gives the location of the output. Job is used to set the Mapper, Reducer and (optional) Combiner implementations. Additionally, it can be used to specify which Comparator to use, whether files should be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), the maximum number of attempts allowed per task, and other details.


To compile the program, execute
hadoop com.sun.tools.javac.Main <path-to-MyBigramCount.java>
(for this to work, HADOOP_CLASSPATH may need to include $JAVA_HOME/lib/tools.jar).
Assuming there are no compile errors, create a jar file that contains all the class files for MyBigramCount using the command below:
jar cf MyBigramCount.jar MyBigramCount*.class
Execute using the command
hadoop jar MyBigramCount.jar MyBigramCount Input-Dir Output-Dir Num-Reducers
where the last argument, Num-Reducers, is the number of reducers, matching the three arguments read by the main method.
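
If the job completes successfully, each reducer writes its share of the results to a part-r-* file under Output-Dir; the counts can then be inspected with, for example:
hadoop fs -cat Output-Dir/part-r-00000
With the default TextOutputFormat, each line holds one bigram and its count, separated by a tab.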
