My class on "Distributed Computing and Big Data Technologies" (MGS 655, Fall 2020) was delivered virtually because of the coronavirus pandemic. Students enrolled in the course had the opportunity to study why processing big data is useful for the Humanities.
Digital Humanities is the academic field concerned with applying computational tools and methods to problems in traditional Humanities disciplines, including literature, art, history, philosophy, anthropology and others. The students participated in a project entitled Chitra Anusandhan, which involves digitization, transcription and art recommendation from painted narrative scrolls of West Bengal, India.
The project studies an endangered group of singing painters from rural West Bengal, India, who walk from village to village unfurling their scrolls and singing about them. The performances of these painters, also called Patuas, celebrate ancient folklore, endangered languages and stories that have come down to them from earlier generations. The songs are primarily written in Bengali, but often incorporate regional dialects including Santhali, Ho and Bhumij.
The goal of Chitra Anusandhan is to digitize the paintings frame by frame, transcribe the songs associated with them, transliterate the songs, and then translate them into English for use in an art recommendation system. The multi-modal data (audio, image, and text) generated during the digitization process, along with annotations from users, provide a rich medium for exploring big data technologies.
The students in the class were given a repository of folk songs containing both English and non-English words, and they had to design a Hadoop-based system to filter out the non-English words from the text.
Here is an example of a folk song: Srimanta Bhasan
This project showcase features two students -- Anthony Guarnieri, a student from the MS MIS program and Ryan Young, an MBA student in the School of Management.
=======
Anthony wrote a Python script to check for non-English words in song lyrics. The program has three main functions. The first, split_files, is an optional step that splits longer files. It takes two parameters: the file paths and a word-count threshold, words (a file is split only if it contains more words than this number). Files are split by paragraph, or by verse in the case of song lyrics.
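The splitting step might look roughly like the sketch below. It is not Anthony's code: the name split_text is hypothetical, it works on a string rather than on file paths to keep the example short, and it assumes that paragraphs or verses are separated by blank lines.

```python
import re


def split_text(text, words):
    """Split `text` by paragraph or verse, but only if it is long enough.

    `words` plays the role of the threshold described above: text with
    at most this many words is returned unchanged as a one-item list.
    """
    if len(text.split()) <= words:
        return [text]
    # Assumption: paragraphs/verses are separated by one or more blank lines.
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

A file-based version would read each path, call this function, and write every returned chunk to its own output file.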
Next is the freq function, which takes one parameter, the file paths. Its output is the term frequency and the document frequency for each file. Each word is stripped of all formatting, special characters and punctuation before it is counted.
Third is the check_english function, which takes the term and document frequencies and filters them down to only the non-English words. It takes one parameter -- the file paths.
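As a rough illustration of the counting step, here is a minimal sketch. It assumes the documents are passed in as a mapping of file name to text (the original function takes file paths), and it treats "term frequency" as a per-file word count and "document frequency" as the number of files a word appears in. The helper normalize is a hypothetical name.

```python
from collections import Counter


def normalize(word):
    """Keep only letters and lowercase them, dropping punctuation and digits."""
    return "".join(ch for ch in word if ch.isalpha()).lower()


def freq(documents):
    """Return (term_freq, doc_freq) for a {name: text} mapping.

    term_freq maps each document name to a Counter of its words;
    doc_freq counts how many documents each word appears in.
    """
    term_freq = {}
    doc_freq = Counter()
    for name, text in documents.items():
        cleaned = (normalize(word) for word in text.split())
        counts = Counter(word for word in cleaned if word)
        term_freq[name] = counts
        # Each distinct word contributes once per document.
        doc_freq.update(counts.keys())
    return term_freq, doc_freq
```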
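The filtering step can be sketched as a simple set lookup. The post does not say which English dictionary the script uses, so the english_words argument below is a hypothetical stand-in for whatever word list is available, and this version takes word counts directly instead of file paths.

```python
def check_english(term_counts, english_words):
    """Return only the entries of `term_counts` whose word is not English.

    `english_words` is an assumed set of lowercase English words; the
    real script may consult a dictionary in some other way.
    """
    vocabulary = {word.lower() for word in english_words}
    return {
        word: count
        for word, count in term_counts.items()
        if word.lower() not in vocabulary
    }
```

For example, with a vocabulary of {"the", "river"}, the counts {"the": 4, "river": 1, "nouka": 2} reduce to {"nouka": 2}.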
======
Ryan Young adopted the following strategy:
1. Search for all the English words using regular expressions (regex)
2. Delete the English words from the string
3. Count the remaining non-English words
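The three steps above can be sketched as follows. This is not Ryan's code: ENGLISH_WORDS is a tiny hypothetical stand-in for a real English word list, and the function name count_non_english is an assumption made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a full English dictionary.
ENGLISH_WORDS = {"the", "river", "and", "boat"}


def count_non_english(text, english_words=ENGLISH_WORDS):
    """Delete English words from `text` and count what is left."""
    # Step 1: one regex that matches any English word as a whole word.
    alternatives = "|".join(re.escape(word) for word in sorted(english_words))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    # Step 2: delete the English words from the string.
    remaining = pattern.sub(" ", text)
    # Step 3: count the remaining words (runs of letters, lowercased).
    return Counter(re.findall(r"[^\W\d_]+", remaining.lower()))
```

The `\b` word boundaries matter here: without them, deleting "the" would also cut a hole in a longer word that merely contains those letters.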
Flexibility is important in any distributed system, so he designed his script with three flexibility components.
Flexibility Component #1: Input Folder -- End users don't have to move individual text files out of any subfolders. The script just searches and finds all text files.
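One way to get this behavior is a recursive walk of the input folder, sketched below. The function name find_text_files is hypothetical, and the sketch assumes that "text files" means files ending in .txt.

```python
import os


def find_text_files(input_folder):
    """Return every .txt file under `input_folder`, including subfolders."""
    found = []
    # os.walk visits the folder itself and every nested subfolder.
    for root, _dirs, files in os.walk(input_folder):
        for name in files:
            if name.lower().endswith(".txt"):
                found.append(os.path.join(root, name))
    return sorted(found)
```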
Flexibility Component #2: Python 3 Version Agnostic -- He wanted to make sure the script would run on any version of Python 3 (i.e., no f-strings, etc.). To make the script work with Python 2, the parentheses in the print calls should be replaced with a space, or the print lines can simply be commented out.
Flexibility Component #3: User Controls -- Since there wasn't a graphical user interface, he wanted to make the script as easy as possible for an end user to make changes to its output. This is why NON_ENGLISH_WORDS_TO_IGNORE_LIST and EXPORT_TO_EXCEL_BOOLEAN are prominently displayed at the top of the script.
Any words in NON_ENGLISH_WORDS_TO_IGNORE_LIST will be removed as well. This allows the end user to specify words not in the English dictionary which should also be removed.
When EXPORT_TO_EXCEL_BOOLEAN = True, the script requires Microsoft Excel to be installed so that the report can be generated. For end users who don't have Excel installed, he provided the option to generate the output as a text file instead.
Possible improvements to the script include creating a user interface with wxPython or Tkinter, and using multiprocessing to run the "for song_text_file_path in original_text_file_path_list:" loop in parallel so that multiple input text files are processed at the same time.
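A minimal sketch of how such top-of-script controls might be wired up is shown below. Only the two setting names come from the script; the helper functions and the sample ignore-list entries are hypothetical, and the Excel branch is left out because the post does not describe how the spreadsheet is written.

```python
# User controls, kept at the top of the script so they are easy to find.
NON_ENGLISH_WORDS_TO_IGNORE_LIST = ["la", "na"]  # sample entries, assumed
EXPORT_TO_EXCEL_BOOLEAN = False  # False -> plain text report


def apply_ignore_list(word_counts, ignore_list=NON_ENGLISH_WORDS_TO_IGNORE_LIST):
    """Drop any word the end user has asked to ignore."""
    ignored = {word.lower() for word in ignore_list}
    return {
        word: count
        for word, count in word_counts.items()
        if word.lower() not in ignored
    }


def format_text_report(word_counts):
    """Build the text-file fallback: one 'word<TAB>count' line per word."""
    # Most frequent words first; ties are broken alphabetically.
    ordered = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    # str.format keeps this compatible with older Python 3 releases.
    return "\n".join("{}\t{}".format(word, count) for word, count in ordered)
```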
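The parallel version of that loop could look something like this sketch, which uses the standard library's multiprocessing.Pool. The worker count_words_in_file is a hypothetical placeholder for whatever the real loop body does with each song file.

```python
import sys
from multiprocessing import Pool


def count_words_in_file(song_text_file_path):
    """Hypothetical per-file job: return (path, number of words)."""
    with open(song_text_file_path, encoding="utf-8") as handle:
        return song_text_file_path, len(handle.read().split())


def process_all(original_text_file_path_list, workers=None):
    """Run the per-file job on several files at once."""
    # Pool.map sends each path to a worker process and returns the
    # results in the same order as the input list.
    with Pool(processes=workers) as pool:
        return pool.map(count_words_in_file, original_text_file_path_list)


if __name__ == "__main__":
    paths = sys.argv[1:]
    if paths:
        for path, total in process_all(paths):
            print("{}\t{}".format(path, total))
```

The worker has to be a top-level function so that it can be sent to the worker processes, and the `if __name__ == "__main__":` guard keeps those processes from re-running the script's entry point on platforms that start them fresh.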
Here is the code generated from his project.
===========
Good job, everyone! Please keep honing your coding skills.