Natural language processing and computational linguistics applications often use n-grams for analyzing textual data. An n-gram is a contiguous sequence of n items from a given text. If n = 2, the sequence is called a bigram.
Suppose the text you have is
It is raining outside.
The bigrams from this text are (It is), (is raining), (raining outside). Usually the sequence is considered from left to right and not in the backward direction, so (is It) is not counted as a bigram here.
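To make this concrete, here is a minimal plain-Java sketch (just an illustration, not part of the Hadoop program) that extracts the bigrams of a single line; essentially the same loop reappears inside the Mapper below.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class BigramDemo
{
    // Collect left-to-right bigrams from a single line of text.
    static List<String> bigrams(String line)
    {
        List<String> result = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        String prev = null;
        while (itr.hasMoreTokens())
        {
            String cur = itr.nextToken();
            if (prev != null)
            {
                result.add(prev + " " + cur);
            }
            prev = cur;
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Prints: [It is, is raining, raining outside.]
        System.out.println(bigrams("It is raining outside."));
    }
}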
Now if the text is large (such as a whole book, or newspaper articles for a whole month or year), the simple task of extracting bigrams becomes compute-intensive.
A Distributed File System (DFS) is typically used to store such large volumes of data. The Hadoop Distributed File System (HDFS) is open source and has a relatively easy learning curve, which makes it a popular choice for storage. Manipulating this data is primarily done with MapReduce programming, which has been used efficiently at Google, Yahoo!, Facebook, and many other companies that deal with large data volumes.
Here is a possibility - the Map task will take in (key, value) pairs. In our example, the Map task could take in the byte offset of a line as the key and the line of text itself as the value, and emit each bigram in that line paired with a count of 1.
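For the example sentence, the Map task would consume and emit something like this (the byte-offset key is ignored by our mapper, and the trailing period stays attached because StringTokenizer splits only on whitespace):

(0, "It is raining outside.")  ->  ("It is", 1), ("is raining", 1), ("raining outside.", 1)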
Straightforward enough?
Here is the Mapper code to do the job.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private static final Text BIGRAM = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        String prev = null;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens())
        {
            String cur = itr.nextToken();
            // Emit only if we have an actual bigram.
            if (prev != null)
            {
                BIGRAM.set(prev + " " + cur);
                context.write(BIGRAM, ONE);
            }
            prev = cur;
        }
    }
}
Now the Reduce code can work as follows.
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable SUM = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        Iterator<IntWritable> iter = values.iterator();
        while (iter.hasNext()) {
            sum += iter.next().get();
        }
        SUM.set(sum);
        context.write(key, SUM);
    }
}
For each bigram (the key), the reducer adds up all the 1s emitted by the mappers and writes out the bigram together with its total count.
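For example, if "It is" was emitted 120 times across all the Map tasks, the reducer for that key receives the bigram with a list of 120 ones and writes out the total:

("It is", [1, 1, ..., 1])  ->  ("It is", 120)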
Putting it all together - we have the main method as follows.
public static void main(String[] args) throws Exception
{
    Configuration conf = new Configuration();

    // Validate the arguments before using them.
    if (args.length < 3) {
        System.err.println("BigramCount usage: [input-path] [output-path] [num-reducers]");
        System.exit(1);
    }
    String inputPath = args[0];
    String outputPath = args[1];
    int reduceTasks = Integer.parseInt(args[2]);

    Job job = Job.getInstance(conf, "bigram count");
    job.setJobName(MyBigramCount.class.getSimpleName());
    job.setJarByClass(MyBigramCount.class);
    job.setNumReduceTasks(reduceTasks);

    FileInputFormat.setInputPaths(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The main method is slightly more involved. First, we take three arguments - the input path where the text is stored, the output path where the generated bigram counts will be written (both paths are on HDFS), and the number of reducers to use. The Job.getInstance, setJobName and setJarByClass calls simply state that you are running the MyBigramCount program.
Job refers to the MapReduce job configuration. It is used to submit the MapReduce job to the YARN execution framework, which comes built into Apache Hadoop 2.0 and later versions.
FileInputFormat indicates where the input files are available for the Map task, and FileOutputFormat gives the location of the output. Job is also used to set the Mapper, Reducer and Combiner implementations. Additionally, it can be used to specify which Comparator to use, whether files should be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), the maximum number of attempts per task, and other details.
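As a rough illustration of those extra knobs (not part of MyBigramCount itself), the snippet below shows how some of them might be set on the same conf and job objects from main. MyComparator and the cache-file path are hypothetical placeholders, and the compression lines assume imports of java.net.URI and org.apache.hadoop.io.compress.GzipCodec.

// Illustration only: optional job configuration, applied before job.waitForCompletion().
job.getConfiguration().setInt("mapreduce.map.maxattempts", 4);    // maximum attempts per map task
job.setSortComparatorClass(MyComparator.class);                   // custom sort Comparator (hypothetical class)
job.addCacheFile(new URI("hdfs:///shared/stopwords.txt"));        // ship a file via the DistributedCache (hypothetical path)
FileOutputFormat.setCompressOutput(job, true);                    // compress the final output...
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // ...with gzip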
To compile the program, execute
hadoop com.sun.tools.javac.Main <path to MyBigramCount.java>
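Note that compiling this way assumes the JDK's tools.jar is visible to Hadoop, for example:

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar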
Assuming there are no compile errors, create a jar file that contains all the class files for MyBigramCount using the command below:
jar cf MyBigramCount.jar MyBigramCount*.class
Execute using the command
hadoop jar MyBigramCount.jar MyBigramCount Input-Dir Output-Dir Num-Reducers
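For instance, a complete run might look like the following, where Input-Dir, Output-Dir and mybook.txt are only placeholders and two reducers are requested:

hdfs dfs -mkdir -p Input-Dir
hdfs dfs -put mybook.txt Input-Dir            # stage the text on HDFS
hadoop jar MyBigramCount.jar MyBigramCount Input-Dir Output-Dir 2
hdfs dfs -cat Output-Dir/part-r-00000 | head  # peek at the bigram counts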