Natural language processing and computational linguistics applications often use n-grams to analyze textual data. An n-gram is a contiguous sequence of n items from a given text. If n=2, the sequence is called a bigram.
Suppose the text you have is
It is raining outside.
The bigrams from this text are (It is), (is raining), (raining outside). Usually the sequence is considered from left to right, not in the backward direction, so (is, It) is not counted as a bigram.
Now if the text is large (such as a whole book, or newspaper articles for a whole month or year), the simple task of extracting bigrams becomes compute-intensive.
A Distributed File System (DFS) is typically used to store such large volumes of data. The Hadoop DFS (HDFS) is open source and has a relatively easy learning curve, which makes it a popular choice for storage. This data is primarily manipulated with MapReduce programs, an approach that has been used effectively at Google, Inc., Yahoo!, Facebook, and many other companies that deal with large data volumes.
Here is a possibility: the Map task will take in (key, value) pairs. In our example, the Map task could take in the byte offset of a line as the key and the line of text itself as the value, and emit each bigram in that line paired with a count of 1.
Straightforward enough?
Here is the Mapper code to do the job.
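What follows is a minimal sketch of what such a Mapper could look like, assuming the standard Hadoop 2.x org.apache.hadoop.mapreduce API. The class name BigramMapper is illustrative, and it is written as a static nested class of MyBigramCount so that the jar packaging command shown later picks up all the class files.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output key: a bigram such as "is raining"; output value: the count 1.
public static class BigramMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text bigram = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split on runs of non-word characters, so "outside." becomes "outside".
        String[] tokens = value.toString().split("\\W+");
        // Emit every adjacent left-to-right pair of tokens.
        for (int i = 0; i < tokens.length - 1; i++) {
            if (tokens[i].isEmpty()) continue; // guard against a leading empty token
            bigram.set(tokens[i] + " " + tokens[i + 1]);
            context.write(bigram, ONE);
        }
    }
}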
Now the Reduce code can work as follows.
For each bigram key, it sums up the counts emitted by the Map tasks, giving the total number of times that bigram was seen.
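A minimal sketch of such a Reducer, under the same assumptions as the Mapper above (BigramReducer is again an illustrative name for a static nested class of MyBigramCount):

import org.apache.hadoop.mapreduce.Reducer; // plus the imports shown with the Mapper

// For each bigram key, sum the partial counts emitted by the Mappers.
public static class BigramReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text bigram, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(bigram, total);
    }
}

Because addition is associative and commutative, the same class can also serve as the Combiner, cutting down the intermediate data shuffled between the Map and Reduce stages.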
Putting it all together - we have the main method as follows.
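A sketch of such a driver, under the same Hadoop 2.x API assumptions; the exact configuration calls in a real listing may differ, and the number of reducers is treated as an optional third argument here so that the run command shown at the end, which omits it, still works.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyBigramCount {

    // ... BigramMapper and BigramReducer from above go here ...

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MyBigramCount");
        job.setJarByClass(MyBigramCount.class);

        job.setMapperClass(BigramMapper.class);
        job.setCombinerClass(BigramReducer.class); // the Reducer doubles as the Combiner
        job.setReducerClass(BigramReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0]: input path on HDFS; args[1]: output path on HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // args[2] (optional): the number of reducers to use.
        if (args.length > 2) {
            job.setNumReduceTasks(Integer.parseInt(args[2]));
        }

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}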
The main method is slightly more involved. First, we input three arguments: the input path where the text is stored; the output path where the generated bigrams will be stored (these paths are on the HDFS); and the number of reducers to use. The opening configuration calls simply state that you are running the MyBigramCount program.
Job refers to the MapReduce job configuration. It is used to present the MapReduce job to the YARN execution framework, which comes built into Apache Hadoop 2.0 and later versions.
FileInputFormat indicates where the input files for the Map tasks are located, and FileOutputFormat gives the location of the output. Job is also used to set the Mapper, Reducer, and Combiner implementations. Additionally, it can be used to specify which Comparator to use, whether files should be placed in the DistributedCache, whether intermediate and/or job outputs should be compressed (and how), the maximum number of attempts per task, and other details.
To compile the program, execute
hadoop com.sun.tools.javac.Main (path to MyBigramCount.java)
Assuming there are no compile errors, create a jar file that contains all the class files for MyBigramCount using the command below:
jar cf MyBigramCount.jar MyBigramCount*.class
Execute using the command
hadoop jar MyBigramCount.jar MyBigramCount Input-Dir Output-Dir
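Here Input-Dir and Output-Dir stand for the HDFS input and output paths (with an optional third argument for the number of reducers, as discussed above). For the one-sentence example at the start, the part file in Output-Dir would contain lines like the following, with the default TextOutputFormat separating each bigram from its count by a tab:

It is	1
is raining	1
raining outside	1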