Monday, October 31, 2016

Finding bigrams using MapReduce

Natural language processing and computational linguistics applications often use n-grams for analyzing textual data. An n-gram is a contiguous sequence of n items from a given text. If n=2, the sequence is called a bigram.
Suppose the text you have is

It is raining outside.

The bigrams from this text are (It is), (is raining), (raining outside). Usually the sequence is considered from left to right and not in the backward direction, so (is It) would not be regarded as a valid bigram.
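To make the idea concrete, here is a small stand-alone Java sketch (the class name BigramDemo is made up for illustration) that extracts the bigrams from the sentence above:

public class BigramDemo {
    public static void main(String[] args) {
        // Split the sentence on whitespace to get the tokens.
        String[] tokens = "It is raining outside".split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // Pair each token with the one before it, left to right.
            System.out.println("(" + tokens[i - 1] + " " + tokens[i] + ")");
        }
        // Prints (It is), (is raining), (raining outside), one per line.
    }
}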

Suppose now that you have a lot of text from which bigrams have to be extracted. Why would you need to do this? There could be several reasons: you want to understand the semantics of text or conversation; you are interested in the co-occurrence statistics of words (that is, which words tend to occur together frequently); or you want to use bigrams as a primitive for more involved natural language processing tasks, among other related problems.


Now if the text is large (such as a whole book, or newspaper articles spanning a month or a year), the simple task of extracting bigrams becomes compute-intensive.



A Distributed File System (DFS) is typically used to store such a large volume of data. Hadoop DFS is open source and has a relatively easy learning curve, which makes it a popular choice for storage. Manipulating this data is primarily done with MapReduce programming, which has been used effectively at Google, Yahoo!, Facebook, and many other companies that deal with large data volumes.

So how do we extract our bigrams using MapReduce?

Here is a possibility - the Map task takes in (key, value) pairs. In our example, the Map task could take in a chunk of the text and output a series of intermediate pairs of the form (bigram, count). The Reduce task then adds up all the counts associated with each bigram.


Straightforward enough?


Here is the Mapper code to do the job. 
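What follows is a minimal sketch rather than a definitive implementation: it assumes the Hadoop 2.x org.apache.hadoop.mapreduce API, and names such as BigramMapper are illustrative. The class is meant to live as a static nested class inside MyBigramCount, with the imports at the top of MyBigramCount.java.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class BigramMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text bigram = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the incoming chunk of text.
        StringTokenizer itr = new StringTokenizer(value.toString());
        String previous = null;
        while (itr.hasMoreTokens()) {
            String current = itr.nextToken();
            if (previous != null) {
                // Pair the previous token with the current one
                // and emit the bigram with a count of 1.
                bigram.set(previous + " " + current);
                context.write(bigram, one);
            }
            previous = current;
        }
    }
}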

So what does the code do? It simply tokenizes the string and, as long as the previous token is not null, pairs it with the current token, emitting a (bigram, 1) pair for each.

Now the Reduce code can work as follows. 
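A matching Reducer sketch, under the same assumptions (the name BigramCountReducer is again illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class BigramCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the counts emitted for this bigram.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}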
It sums up all the counts received for each bigram and writes out the total.

Putting it all together - we have the main method as follows.
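Again, this is a sketch under the same assumptions; here the third argument, the number of reducers, is treated as optional:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Name the job and point it at the jar holding this class.
    Job job = Job.getInstance(conf, "MyBigramCount");
    job.setJarByClass(MyBigramCount.class);

    // Wire up the Mapper, Combiner and Reducer implementations.
    job.setMapperClass(BigramMapper.class);
    job.setCombinerClass(BigramCountReducer.class);
    job.setReducerClass(BigramCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations on the HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Optional: set the number of reducers if a third argument is given.
    if (args.length > 2) {
        job.setNumReduceTasks(Integer.parseInt(args[2]));
    }

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}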


The main method is slightly more involved. First, we pass in up to three arguments - the input path where the text is stored; the output path where the generated bigrams will be written (both paths are on the HDFS); and, optionally, the number of reducers to use. The first few statements simply state that you are running the MyBigramCount program.

Job refers to the MapReduce job configuration. It is used to submit the MapReduce job to the YARN execution framework, which comes built in to Apache Hadoop 2.0 and later versions.
FileInputFormat indicates where the input files for the Map tasks are located, and FileOutputFormat gives the location of the output. Job is also used to set the Mapper, Reducer and Combiner implementations. Additionally, it can be used to specify which Comparator to use, whether files should be placed in the DistributedCache, whether intermediate and/or job outputs should be compressed (and how), the maximum number of attempts allowed per task, and other details.


To compile the program, execute
hadoop com.sun.tools.javac.Main (path to MyBigramCount.java)
Assuming there are no compile errors, create a jar file that contains all the class files for MyBigramCount using the command below:
jar cf MyBigramCount.jar MyBigramCount*.class
Execute using the command
hadoop jar MyBigramCount.jar MyBigramCount Input-Dir Output-Dir
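Once the job completes, the bigram counts land in part files under Output-Dir on the HDFS; assuming default settings, the first reducer's output can be inspected with
hadoop fs -cat Output-Dir/part-r-00000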

Saturday, October 22, 2016

A little indulgence on a Saturday afternoon.

Had been craving crab curry for quite a while. Experimented with Alaskan king crab and a Goan recipe that is well reviewed by crab lovers. Loved the taste. Come home if you want to try some. Warning: it is disappearing fast!

Crab curry

Friday, October 21, 2016

Big Data and the Mortuary

Sounds gross, right?

But here's the story. I get on the "big data" bandwagon.
I develop these nice scalable algorithms for learning from the big data. I test them on data that does not fit in the memory of my personal laptop (otherwise the reviewers for my journal papers do not agree that I am doing GOOD research). I use a cluster for all of this work - which I have spent hours configuring and playing around with.

But then I decide to relocate. With all my tangible possessions. How do I move this large cluster?

Needless to say, I am emotionally attached to it, by now. It was not easy getting it up and running in the first place. Then I did all the dirty data cleaning work on it. That was another 100+ hours of my existence. And now I am facing this scenario -- I do not have the physical space to store the machines; the administrator who spent many sleepless nights with me has decided to take a break.

The research organizations and foundations that helped set up the cluster have not given the problem any thought. They are only concerned with still getting useful things out of it. Once you build an empire, you usually do not want to relocate.

The environmentalists roll their eyes at me. Large-scale computing infrastructure in the electronic-waste recycling stream? You serious? Well, go find a place to host them.

Yes, true.

Meanwhile, does anyone care about what would happen to these large-scale repos once no one rides the bandwagon anymore?

I feel like the Lorax in Dr. Seuss's book (http://www.seussville.com/books/book_detail.php?isbn=9780385372022).

The Onceler and the Truffula trees had their way. The Onceler is unrepentant. He defiantly tells the Lorax that he will keep on "biggering" his business, but at that moment one of his machines fells the very last of the Truffula trees. Without raw materials, the factory shuts down. The Lorax says nothing; with just one sad backward glance, he disappears behind the smoggy clouds. Where he last stood is a small monument engraved with a single word: "UNLESS". The Onceler ponders the message for years, in solitude.