Sunday, February 12, 2017

Effect of Different MapReduce Configurations on Identifying Trigrams From Text - Guest Post by Aviral Kumar Gupta

Trigrams - what are they?
An n-gram is a contiguous sequence of n items, typically words, extracted from a given text or speech. In text mining and analytics, n-grams are used to build models for a variety of natural language processing tasks such as spelling correction, search and text summarization.

Trigrams are a special case of n-grams where n=3, which means we need to pick three consecutive words from a sentence to form a trigram. Suppose the text you are given is:

The quick brown fox jumps over the lazy dog.

Some of the trigrams from this sentence are (The quick brown), (quick brown fox), (the lazy dog), etc. Trigrams are read only from left to right and not the other way round. A sentence with N words has N-2 trigrams.
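As a quick illustration outside of Hadoop, a few lines of plain Java are enough to enumerate these trigrams (the class name TrigramDemo below is just for illustration):

// TrigramDemo: prints every trigram in the example sentence.
public class TrigramDemo {
    public static void main(String[] args) {
        String sentence = "The quick brown fox jumps over the lazy dog.";
        String[] words = sentence.split("\\s+");
        // A sentence of N words yields N - 2 trigrams (here 9 words, so 7 trigrams).
        for (int i = 0; i + 2 < words.length; i++) {
            System.out.println("(" + words[i] + " " + words[i + 1] + " " + words[i + 2] + ")");
        }
    }
}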

How do we extract the trigrams?
In our class project, we used MapReduce programming to extract trigrams from text stored on the Hadoop Distributed File System (HDFS). MapReduce works on (key, value) pairs: each trigram acts as the ‘key’ and its count in the data set is the ‘value’ we are estimating. The mapper breaks the text strings into trigrams and assigns a count of ‘1’ to each trigram. The framework then shuffles and sorts the intermediate (key, value) pairs and passes them to the reducer, which counts the number of occurrences of each trigram in the entire data set.

Here is the ‘Mapper code’ to extract the trigrams:
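A minimal sketch of what such a mapper might look like, assuming the standard org.apache.hadoop.mapreduce API (the class and field names here are illustrative, not the original project code):

// TrigramMapper: emits (trigram, 1) for every three consecutive words in a line of text.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text trigram = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words on whitespace.
        String[] words = value.toString().trim().split("\\s+");
        // Slide a window of three words across the line and emit each trigram with a count of 1.
        for (int i = 0; i + 2 < words.length; i++) {
            trigram.set(words[i] + " " + words[i + 1] + " " + words[i + 2]);
            context.write(trigram, ONE);
        }
    }
}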

Given below is the ‘Reducer code’ for counting the number of occurrences for each trigram:
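A matching sketch of the reducer, again with illustrative names, simply sums the 1s emitted for each trigram key:

// TrigramReducer: sums the counts emitted by the mappers for each trigram.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TrigramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the counts for this trigram across the entire data set.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}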


Now that we have extracted and counted the trigrams, we can experiment with different mapper and reducer configurations and study their effect on execution time.

Aims
In this example, we are interested in analyzing the execution time of the MapReduce program with different numbers of mappers and reducers while running the tasks on the University at Buffalo’s high-performance computing cluster – Lake Effect.
Lake Effect is a subscription-based Infrastructure as a Service (IaaS) cloud provided by the Center for Computational Research (CCR) for UB’s research purposes. Researchers at UB can access this cloud and use it for high-performance computing needs such as Big Data analytics.

Hadoop was pre-configured on the cluster - in particular, there was one master and two data nodes. Jobs could be submitted remotely to the JobTracker. There were 120 students who were running experiments on this cluster during the course of the semester. This allowed testing the impact of other jobs on the task at hand and experimenting with scalability of the system, albeit on a small scale.

Data set

Our data set comprises a sample of 50 different articles from the Chronicling America website (http://chroniclingamerica.loc.gov/). The articles were extracted using the Web Scraper tool, which merged the 50 extracted articles into a single file. To ensure that the reducers are well utilized, we replicated 20 of these articles using the same tool so that the frequency of trigrams increases. We then merged both files (the 50 original articles and the 20 replicated ones) into a single file containing the data of all 70 articles.


In order to run our code with multiple mappers and multiple reducers, we need to configure these settings in the ‘main method’ of the program.


Below is the ‘Main method’ of the program:
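A sketch of such a driver, assuming the argument order used in the example command further below (input path, output path, number of reducers, number of input splits) and the standard Hadoop job API; the split-size calculation is one possible way to force a given number of mappers, not necessarily the exact approach used in the original code:

// Trigram driver: wires the mapper and reducer together and takes the number of
// reducers and the number of input splits (and hence mappers) from the command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Trigram {

    public static void main(String[] args) throws Exception {
        // args: <input path> <output path> <number of reducers> <number of input splits>
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        int numReducers = Integer.parseInt(args[2]);
        int numSplits = Integer.parseInt(args[3]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "trigram count");
        job.setJarByClass(Trigram.class);
        job.setMapperClass(TrigramMapper.class);
        job.setReducerClass(TrigramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The number of reducers can be set directly on the job.
        job.setNumReduceTasks(numReducers);

        // The number of mappers follows the number of input splits, so cap the split
        // size at roughly (file size / requested splits) to get about that many splits.
        long fileSize = FileSystem.get(conf).getContentSummary(input).getLength();
        long splitSize = (long) Math.ceil((double) fileSize / numSplits);
        FileInputFormat.setMaxInputSplitSize(job, Math.max(1L, splitSize));

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}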


The important thing to notice about the code is that it offers the flexibility to set the number of reducers and mappers used to execute the tasks on the CCR cluster. The user can pass these values as command-line arguments when launching the job. For example:

hadoop jar Trigram.jar trigram /input /output 2 3

Here, ‘2’ is the number of reducers and ‘3’ is the number of mappers.
The number of mappers spawned is based on the number of input splits in the data set. Therefore, the number ‘3’ actually specifies how many input splits the code should create from the input file, and hence how many mappers are launched. By default, splits follow the HDFS block size, which is 128 MB on CCR, but as our input data set is not that big, we have parameterized the number of input splits so that the user can run the code with multiple mappers.

Empirical Results
We carried out an experiment in which the same input file was processed with multiple combinations of mappers and reducers. We calculated the execution time of each run by adding the ‘Total time spent by all maps’, the ‘Total time spent by all reducers’ and the ‘CPU time’ (refer to Table 1).
Table 1: Total and Average execution times for all the mapper-reducer combinations
Below is a graph plotting each mapper-reducer pair against the execution time of the run with that configuration.





Discussions
Here are our observations:
a.     The best execution time was observed when only 1 mapper and 1 reducer were used; the worst was with 4 mappers and 4 reducers.
b.     With 1 mapper, the total execution time increases as the number of reducers increases.
c.     Similarly, as the number of mappers increases, we see a steady increase in execution time.
d.     The slope and height of the reducer curves are almost the same across all mapper combinations.

From the above observations, we can conclude that the execution time of a program increases with an increase in the number of mappers or reducers, when the data set is not very large.

As our data set was very small, adding more than one reducer only increases the overhead of the program. Each extra reducer adds the cost of instantiating another reduce task, plus the network transfer and parsing time spent moving data to the different reducers, while there is not enough data for each reducer to work on. In addition, each reducer creates its own output file, so using multiple reducers on a small data set produces multiple output files, which adds further overhead through the extra I/O operations needed to create them.

In another scenario, if the input data set were very large, running the program with a single reducer would hurt performance, since the entire load would fall on that one reducer. In such cases, using multiple reducers to divide the work would be beneficial.

Hence, the number of reducers used for a MapReduce program is a significant factor. Having too many or too few reducers hampers the performance of the program.
While using the CCR cluster to execute our jobs, we can expect long waits at times when the cluster is busy with other tasks and our jobs sit in the queue. Although the actual execution time of the programs is on the order of seconds, the elapsed time for a job can be much higher, almost an hour in some cases. If the cluster fails to respond due to configuration problems or a power failure, the queued jobs could be delayed indefinitely.

About the author: Aviral Kumar Gupta is a graduate student in the MIS Department, University at Buffalo, NY.
