Saturday, October 10, 2015

Measuring influence in prosopographical research

Suppose you wanted to know the life story of the Australian operatic soprano Dame Nellie Melba, who was famous in the late Victorian era and the early 20th century. Where would you begin your search? You could start by reading about her life on the internet (perhaps on Wikipedia) and follow up on the references there. Very soon, however, you could find yourself digging through stacks of old newspapers such as "The Musical Times" or "The Sun", published in New York, to learn more about her debut at the Metropolitan Opera. Information is sparse and often incomplete.

Drawing on secondary biographical information, such as family archives and photographs, publicly available archives including newspaper articles, financial accounts from cities, economic and fiscal sources such as sales of deeds and tax lists, and other surviving documents of the era, helps weave together the story of the subject's life. It also helps verify facts against multiple sources.

A team of researchers from the School of Management, SUNY Buffalo (Dr. Haimonti Dutta), the Department of Computer Science at IIIT-Delhi (Aayushee Gupta), IBM Research India (Dr. Srikanta Bedathur) and TCS Research India (Dr. Lipika Dey) has been involved in this prosopographical research. The project was funded in its initial phases by the National Endowment for the Humanities.

Article-level data was obtained from the historical newspaper archive of the New York Public Library after curation by the New York Public Library Labs. Using techniques from natural language processing and large-scale machine learning, the team built a system that could identify influential people. The text of the old newspapers, extracted by Optical Character Recognition (OCR), was noisy, so the article text was spell corrected. The corrected text was then used to form a people gazetteer, from which people with influence were identified. It was particularly interesting to find local people who held sway in government offices, the arts and sciences, and the armed forces at that time.
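To make the pipeline a little more concrete, here is a toy sketch in Python of the steps described above: spell correct noisy article text, collect candidate person names into a gazetteer, and rank the names. This is not the team's system. The tiny vocabulary, the capitalized-word pattern used to spot names, the ranking by number of articles mentioning a name, and the sample sentences are all assumptions invented for illustration; a real system would rely on a proper named entity recognizer and a much richer notion of influence.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical mini vocabulary; a real system would use a full dictionary.
VOCABULARY = ["opera", "soprano", "debut", "metropolitan", "sang",
              "mayor", "spoke", "the", "at", "her", "made"]

# Crude assumption: two or more consecutive capitalized words form a name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Replace a lowercase word with its closest vocabulary entry, if any."""
    if not token.isalpha() or not token.islower():
        # Leave capitalized words (possible names), spaces and punctuation alone.
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Spell correct a text piece by piece, keeping spacing and punctuation."""
    pieces = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    return "".join(correct_token(piece) for piece in pieces)


def build_gazetteer(articles):
    """Map each candidate name to the set of article ids that mention it."""
    gazetteer = defaultdict(set)
    for article_id, text in articles.items():
        for name in NAME_PATTERN.findall(correct_text(text)):
            gazetteer[name].add(article_id)
    return gazetteer


def rank_by_article_count(gazetteer, top_n=3):
    """Rank names by how many articles mention them (ties broken by name)."""
    ranked = sorted(gazetteer.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(ids)) for name, ids in ranked[:top_n]]


# Invented toy sentences with simulated OCR errors, not text from the archive.
articles = {
    "a1": "Last night Nellie Melba made her dcbut at the opcra.",
    "a2": "The mayor spoke after Nellie Melba sang.",
    "a3": "The opera welcomed John Smith.",
}

gazetteer = build_gazetteer(articles)
print(rank_by_article_count(gazetteer))
```

Counting the articles that mention a name is only a stand-in for influence. The capitalized-word pattern would also pick up place names and newspaper titles, which is one reason a real gazetteer needs more careful entity recognition.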


[Figure: A small set of influential people detected by the team's algorithm from two months of articles published in "The Sun" newspaper.]

The team is now building a historical timeline, very similar to the jazz timeline hosted by pbskids.org, to help kids learn about influential people local to a geographic region.

Related Publications
1. Aayushee Gupta, Finding influential people from a historical news repository, 2014. [Master's thesis] 
https://repository.iiitd.edu.in/jspui/handle/123456789/166
2. Aayushee Gupta and Haimonti Dutta, Evaluation of Spell Correction on Noisy OCR Data. INFORMS Workshop on Data Mining and Analytics at INFORMS Annual Meeting, Philadelphia, October 2015.
3. Aayushee Gupta, Haimonti Dutta, Srikanta Bedathur and Lipika Dey. A Machine Learning Framework for Prosopographical Research. In preparation.
