Trigrams - what are they?
An n-gram is a contiguous sequence of n items, typically words, extracted from a given text or speech. In text mining and analytics, n-grams are used to build analytical models for a variety of natural language processing tasks such as spelling correction, search, and text summarization.
Trigrams are a special case of n-grams where n=3, which means we pick three consecutive words from a sentence to form a trigram. Suppose the text you are given is:
The quick brown fox jumps over the lazy dog.
Some of the trigrams from this sentence are (The quick brown), (quick brown fox), (the lazy dog), etc. Trigrams are read only from left to right, not the other way round. A sentence with N words contains N-2 trigrams.
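As a minimal standalone illustration (independent of the MapReduce job described below), the following Java sketch extracts the trigrams of the example sentence; the class name TrigramDemo is purely illustrative:

// Illustrative sketch only: extract trigrams from the example sentence.
import java.util.ArrayList;
import java.util.List;

public class TrigramDemo {
    public static void main(String[] args) {
        String sentence = "The quick brown fox jumps over the lazy dog";
        String[] words = sentence.split("\\s+");
        List<String> trigrams = new ArrayList<>();
        // A sentence with N words yields N - 2 trigrams, read left to right.
        for (int i = 0; i + 2 < words.length; i++) {
            trigrams.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        // Prints the 7 trigrams of this 9-word sentence, starting with
        // "The quick brown" and "quick brown fox".
        System.out.println(trigrams);
    }
}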
How do we extract the trigrams?
In our class project, we used MapReduce programming to extract trigrams from text stored on a Hadoop Distributed File System (HDFS). MapReduce works on (key, value) pairs: each trigram acts as the ‘key’ and its count in the data set is the ‘value’ we are computing. The mapper breaks the text strings into trigrams and assigns a count of ‘1’ to each occurrence. The framework then shuffles and sorts the intermediate keys and passes them to the reducer, which counts the number of occurrences of each trigram in the entire data set.
Here is the ‘Mapper code’ to extract the trigrams:
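A minimal sketch of such a mapper, assuming Java and the standard Hadoop MapReduce API (class and variable names here are illustrative):

// Sketch of a trigram mapper (illustrative names, standard Hadoop Java API).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text trigram = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words on whitespace.
        String[] words = value.toString().trim().split("\\s+");
        // Emit every run of three consecutive words as (trigram, 1).
        for (int i = 0; i + 2 < words.length; i++) {
            trigram.set(words[i] + " " + words[i + 1] + " " + words[i + 2]);
            context.write(trigram, ONE);
        }
    }
}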
Given below is the ‘Reducer code’ for counting the number of occurrences of each trigram:
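A minimal sketch of such a reducer, again assuming the standard Hadoop Java API with illustrative names:

// Sketch of a trigram reducer that sums the counts emitted by the mapper.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TrigramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted for this trigram across the data set.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}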
Now that we have extracted and counted the trigrams, we can experiment with different mapper and reducer configurations and study their effect on execution time.
Aims
In this example, we are interested in analyzing the execution time of the MapReduce program with different numbers of mappers and reducers while running the tasks on the University at Buffalo’s high-performance computing cluster, Lake Effect.
Lake Effect is a subscription-based Infrastructure as a Service (IaaS) cloud provided by the Center for Computational Research (CCR) for UB’s research purposes. Researchers at UB can access this cloud and use it for high-performance computing requirements such as Big Data analytics.
Hadoop was pre-configured on the cluster - in particular, there was one master and two data nodes. Jobs could be submitted remotely to the JobTracker. There were 120 students who were running experiments on this cluster during the course of the semester. This allowed testing the impact of other jobs on the task at hand and experimenting with scalability of the system, albeit on a small scale.
Data set
Our data set comprises a sample of 50 different articles from the Chronicling America website (http://chroniclingamerica.loc.gov/). The articles were extracted using the Web Scraper tool, which merged the 50 extracted articles into a single file. To ensure that the reducers are well utilized, we duplicated 20 of these articles using the same tool so that trigram frequencies increase. We then merged the two files (the 50 articles and the 20 duplicates) into a single file containing the data of all 70 articles.
In order to run our code with multiple mappers and multiple reducers, we need to configure these settings in the ‘main method’ of the code.
Below is the ‘Main method’ of the program:
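A minimal sketch of such a driver, assuming the standard Hadoop Java API. The class name Trigram and the use of the split-size setting mapreduce.input.fileinputformat.split.maxsize to control the number of input splits are assumptions for illustration:

// Sketch of a driver that takes the number of reducers and the desired number
// of input splits from the command line. How the original project derived the
// split size is an assumption here: we cap the split size at fileSize / numSplits.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Trigram {
    public static void main(String[] args) throws Exception {
        // args: <input path> <output path> <num reducers> <num input splits>
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        int numReducers = Integer.parseInt(args[2]);
        int numSplits = Integer.parseInt(args[3]);

        // Force roughly numSplits input splits (and hence map tasks) by
        // capping the split size at fileSize / numSplits.
        long fileSize = FileSystem.get(conf).getFileStatus(input).getLen();
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                Math.max(1, fileSize / numSplits));

        Job job = Job.getInstance(conf, "trigram count");
        job.setJarByClass(Trigram.class);
        job.setMapperClass(TrigramMapper.class);
        job.setReducerClass(TrigramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(numReducers);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}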
The important thing to notice about the code is that it offers the flexibility to set the number of reducers and mappers on which the tasks are executed on the CCR cluster. The user can pass these as command line arguments at execution time. For example:
hadoop jar Trigram.jar trigram /input /output 2 3
Here, ‘2’ is the number of reducers and ‘3’ is the number of mappers.
The number of mappers spawned is based on the number of input splits in the data set. Therefore, the number ‘3’ actually specifies the number of input splits the code should create from the input file, in order to spawn as many mappers. By default, splits are created according to the HDFS block size, which is 128 MB on CCR, but as our input data set is not that large, we have parametrized the number of input splits so that the user can run the code with multiple mappers.
Empirical Results
We carried out an experiment in which the same input file is processed with multiple combinations of mappers and reducers. We calculated the execution time of each run by adding the ‘Total time spent by all maps’, ‘Total time spent by all reducers’ and the ‘CPU time’ (refer Table 1).
Table 1: Total and average execution times for all the mapper-reducer combinations
Below is the graph plotting each mapper-reducer pair against the execution time taken by the run with that configuration.
Discussions
Here are our observations:
a. The best execution time was observed when only 1 mapper and 1 reducer were used; the worst case was with 4 mappers and 4 reducers.
b. For 1 mapper, as the number of reducers increases, the total execution time also increases.
c. Similarly, as the number of mappers increases, we see a steady increase in execution time.
d. The slope and height of the reducer curves are almost the same across all mapper combinations.
From the above observations, we can conclude that the execution time of a program increases with an increase in the number of mappers or reducers when the data set is not very large.
As our data set was very small, adding more than one reducer only increased the overhead of the program. By increasing the number of reducers, we add the extra cost of instantiating each reduce task, as well as the network transfer and parsing time spent moving data to the different reducers, while there is not enough data for each reducer to work on. In addition, each reducer creates its own output file, so running multiple reducers on a small data set produces multiple output files, which adds further overhead in the form of extra I/O operations.
In another scenario, if the input data set to be processed were very large, running the program with a single reducer would have affected performance adversely, since the entire load would have gone to a single reducer. In such cases, using multiple reducers to divide the work would have been beneficial.
Hence, the number of reducers used in a MapReduce job is a significant factor. Having too many or too few reducers hampers the performance of the program.
While using the CCR cluster to execute our jobs, we can expect long waits at times when the cluster is busy executing other tasks and our jobs are queued. Although the actual execution time of the programs is on the order of seconds, the elapsed time for a job might be much higher, almost an hour in some cases. If the cluster fails to respond due to configuration issues or a power failure, the queued jobs could wait indefinitely.
About the author: Aviral Kumar Gupta is a graduate student in the MIS Department, University at Buffalo, NY.