Lab #1: The Most Significant Words in the Declaration of Independence

The Declaration of Independence from http://www.monticello.org

In this exercise we will design an algorithm and implement each step in Python code, with the aim of answering the question: what are the most significant words in the Declaration of Independence?

First let us look at the words in the Declaration of Independence as adopted by Congress on July 4, 1776 beginning with the powerful phrases:

When in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the laws of nature and of nature’s God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are life, liberty and the pursuit of happiness….

The full transcript is available at:
http://www.archives.gov/exhibits/charters/declaration_transcript.html

If we rank words only by how often they occur, the top of the list fills up with common words that carry little meaning. Such words are termed ‘stop’ words. For example, the most frequently appearing words in the Declaration are

[‘the’, ‘of’, ‘to’, ‘and’, ‘for’, ‘our’, ‘their’, ‘in’, ‘has’, ‘he’, ‘them’, ‘a’, ‘that’, ‘these’, ‘by’, ‘have’],

and these are all considered stop words. So in our design we remove such stop words before we determine frequencies.
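
The claim above can be checked on a small scale. The following is a minimal sketch, not part of the lab solution: it counts raw frequencies in the first clause of the opening sentence quoted earlier, using collections.Counter from the standard library. The variable names (excerpt, raw_counts, top_five) are made up for this illustration, and print is called with a single argument so the sketch should run under Python 3 as well as Python 2.7.

```python
import re
from collections import Counter

# First clause of the Declaration's opening sentence, as quoted above.
excerpt = ("When in the course of human events it becomes necessary for one "
           "people to dissolve the political bands which have connected them "
           "with another, and to assume among the powers of the earth, the "
           "separate and equal station to which the laws of nature and of "
           "nature's God entitle them")

# Lowercase, replace everything except letters, spaces, and apostrophes, then split.
words = re.sub(r"[^a-z ']+", " ", excerpt.lower()).split()

raw_counts = Counter(words)            # maps each word to its number of occurrences
top_five = raw_counts.most_common(5)   # the five most frequent (word, count) pairs
print(top_five)
```

Even in this short excerpt the most frequent word is 'the' (6 occurrences), followed by 'of', 'to' and 'and', which is why the stop words have to go before counting.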


Lab #1 Algorithm:

0. Create a list of all the words in the transcript.
1. Remove ‘Stop’ Words.
2. Determine Frequency of remaining words.
3. Find the word with the largest frequency.
4. Print it – and its frequency – and remove it.
5. Prompt the user to continue or not.
6. If yes then go to step 3. Else end the program gracefully.

Let us look at developing code for each step.

Step 0. Create a list of all the words.

We can create a text file of words copied from the transcript. We open that file in a Python session and use the string split method to create a list of all the words. We will use two coding tricks found in an answer posted on stackoverflow at:
stackoverflow..

  1. Using the built-in string method ‘.lower()’: after reading the file, this method returns a copy of the text with all upper case characters replaced by their lower case equivalents.
  2. Using Python’s regular expression library, called re: we apply a pattern that matches every run of characters that are not a lower case letter, a space, or an apostrophe, and substitute a single space for each match.
    import re
    infile = open('/Users/fred/Desktop/declaration.txt', 'r')
    # .read() returns the file contents as a string; .lower() returns a copy
    # with all upper case characters replaced by lower case characters.
    text = infile.read().lower()
    infile.close()
    # replace anything that is not a lowercase letter, a space, or an apostrophe with a space:
    text = re.sub(r"[^a-z ']+", ' ', text)
    # create a list by splitting the text string on whitespace
    words = text.split()
    print words
    
    %run
    ['when', 'in', 'the', 'course', 'of', 'human', 'events', 'it', 'becomes', 'necessary', 'for', 'one', 'people', 'to', 'dissolve', 'the', 'political', 'bands', 'which', 'have', 'connected', 'them', 'with', 'another', 'and', 'to', 'assume', 'among', 'the', 'powers', 'of', 'the', 'earth', 'the', 'separate', 'and', 'equal', 'station', 'to', 'which', 'the', 'laws', 'of', 'nature', 'and', 'of', "nature's", 'god', 'entitle', 'them', 'a', 'decent', 'respect', 'to', 'the', 'opinions', 'of', 'mankind', 'requires', 'that', 'they', 'should', 'declare', 'the', 'causes', 'which', 'impel', 'them', 'to', 'the', 'separation', 'we', 'hold', 'these', 'truths', 'to', 'be', 'self', 'evident', 'that', 'all', 'men', 'are', 'created', 'equal', 'that', 'they', 'are', 'endowed', 'by', 'their', 'creator', 'with', 'certain', 'unalienable', 'rights', 'that', 'among', 'these', 'are', 'life', 'liberty', 'and', 'the', 'pursuit', 'of', 'happiness', ... ]
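
The two tricks from Step 0 can also be wrapped in small reusable functions. This is only a sketch under assumptions: the function names tokenize and words_from_file are invented here, and the with statement is swapped in for the explicit close() call so the file is closed even if reading fails. The demonstration runs on a string, so no file is needed to try it.

```python
import re

def tokenize(text):
    """Lowercase text and split it into words made of letters and apostrophes."""
    lowered = text.lower()
    # Every run of other characters (punctuation, digits, newlines) becomes one space.
    cleaned = re.sub(r"[^a-z ']+", " ", lowered)
    return cleaned.split()

def words_from_file(path):
    """Read a text file and return its words; the file is closed automatically."""
    with open(path, "r") as infile:
        return tokenize(infile.read())

sample = "We hold these truths to be self-evident, that all men are created equal"
print(tokenize(sample))
```

Note that the hyphen in "self-evident" is replaced by a space, so it becomes the two words 'self' and 'evident', just as in the output above.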
    

Step 1. Remove ‘Stop’ Words.

Although there is no universal agreement about which words count as stop words, Wikipedia has an article on the topic (http://en.wikipedia.org/wiki/Stop_words) that links to a list at http://www.textfixer.com/resources/common-english-words.txt

Running our listwords.py program with this file yields:

    %run
    ['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'but', 'by', 'can', 'cannot', 'could', 'dear', 'did', 'do', 'does', 'either', 'else', 'ever', 'every', 'for', 'from', 'get', 'got', 'had', 'has', 'have', 'he', 'her', 'hers', 'him', 'his', 'how', 'however', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'least', 'let', 'like', 'likely', 'may', 'me', 'might', 'most', 'must', 'my', 'neither', 'no', 'nor', 'not', 'of', 'off', 'often', 'on', 'only', 'or', 'other', 'our', 'own', 'rather', 'said', 'say', 'says', 'she', 'should', 'since', 'so', 'some', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'this', 'tis', 'to', 'too', 'twas', 'us', 'wants', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'yet', 'you', 'your']
    

Note that the first 3 words in decwordlist, the list of Declaration words, are also in the stop word list. To test a word we check whether it is a stop word by applying the “in” operator to stopwordlist, and if it is, we call the list’s remove method to edit it out of decwordlist. Each call to remove deletes only the first matching occurrence, so we loop over a separate copy of the words and call remove once per occurrence.

Here is all the code we have so far: we open two files, create two lists, and then use a for loop with an in-test to filter out all stop words.

    import re
    dfile = open('/Users/fred/Desktop/d.txt', 'r')
    dtext = dfile.read().lower()
    dfile.close()
    # replace anything that is not a lowercase letter, a space, or an apostrophe with a space:
    dtext = re.sub(r"[^a-z ']+", ' ', dtext)
    decwordlist = dtext.split()
    # print decwordlist
    
    sfile = open('/Users/fred/Desktop/stopwords.txt', 'r')
    stext = sfile.read().lower()
    sfile.close()
    stext = re.sub(r"[^a-z ']+", ' ', stext)
    stopwordlist = stext.split()
    
    # loop over a fresh copy of the words so that removing items
    # from decwordlist does not disturb the iteration
    for dword in dtext.split():
        if dword in stopwordlist:
            decwordlist.remove(dword)
    
    print decwordlist
    
    %run /Users/fred/Desktop/listwords.py
    
    ['course', 'human', 'events', 'becomes', 'necessary', 'one', 'people', 'dissolve', 'political', 'bands', 'connected', 'another', 'assume', 'powers', 'earth', 'separate', 'equal', 'station', 'laws', 'nature', "nature's", 'god', 'entitle', 'decent', 'respect', 'opinions', 'mankind', 'requires', 'declare', 'causes', 'impel', 'separation', 'hold', 'truths', 'self', 'evident', 'men', 'created', 'equal', 'endowed', 'creator', 'certain', 'unalienable', 'rights', 'life', 'liberty', 'pursuit', 'happiness', 'secure', .... ]
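
As an alternative to calling remove inside a loop, the same filtering can be sketched with a list comprehension. This is a different technique from the one used in the listing above, shown only for comparison: it builds a new list instead of editing the old one, and it converts the stop words to a set so that each membership test is fast. The function name remove_stop_words and the short sample lists are made up for this example.

```python
def remove_stop_words(words, stop_words):
    """Return a new list with every word found in stop_words left out."""
    stop_set = set(stop_words)   # set membership tests do not scan the whole collection
    return [word for word in words if word not in stop_set]

# Short sample lists taken from the beginning of the outputs shown above.
decwords = ['when', 'in', 'the', 'course', 'of', 'human', 'events']
stopwords = ['a', 'in', 'of', 'the', 'when']

print(remove_stop_words(decwords, stopwords))
# prints ['course', 'human', 'events']
```

The original order of the remaining words is preserved, and the input list is left unchanged.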
    
    

Step 2. Determine Frequency of remaining words.

Here we introduce a mapping type object called a dictionary, or dict for short. A dict allows us to map keys to values. Here the keys are the non-stop words and the values are their frequencies. Each time we read a word that has already been seen we add 1 to its frequency; otherwise we add the word to the dict with the value 1.

    wordfreq = {}
    for dword in decwordlist:
        if dword in wordfreq:
            wordfreq[dword] += 1
        else:
            wordfreq[dword] = 1
    
    # or equivalently use the shorter get method
    # for dword in decwordlist:
    #     wordfreq[dword] = wordfreq.get(dword, 0) + 1
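
The standard library also offers collections.Counter, a dict subclass that does this counting for us. The sketch below is an aside rather than the lab's method; the word list named filtered is a toy example, not the real Declaration data. It checks that Counter produces the same mapping as the get-based loop.

```python
from collections import Counter

# Toy list standing in for decwordlist (made-up sample, not the real data).
filtered = ['equal', 'laws', 'nature', 'equal', 'rights', 'laws', 'equal']

# Counter counts every item in one step.
wordfreq = Counter(filtered)

# The get-based loop from the lab builds the same mapping by hand.
manual = {}
for dword in filtered:
    manual[dword] = manual.get(dword, 0) + 1

print(wordfreq['equal'])          # prints 3
print(dict(wordfreq) == manual)   # prints True
print(wordfreq['liberty'])        # prints 0: a Counter reports zero for unseen words
```

Unlike a plain dict, a Counter returns 0 for a key it has never seen instead of raising an error.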
    

Step 3. Find the word with the largest frequency.

    maxval = 0
    for dword in wordfreq:
        if wordfreq[dword] > maxval:
            maxval = wordfreq[dword]
            maxword = dword
    print maxword, "has frequency", maxval
    
    # Alternatives
    # print max(wordfreq.values())                      # the largest frequency only
    # print max(wordfreq.items(), key=lambda b: b[1])   # the (word, frequency) pair
    # print max(wordfreq, key=wordfreq.get)             # the word only
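
The three alternatives listed in the comments return different things, which is easy to miss. Here is a small sketch on a toy dict; the keys and counts are invented for illustration and have nothing to do with the Declaration.

```python
# Toy frequencies (made-up values, for illustration only).
wordfreq = {'alpha': 3, 'beta': 7, 'gamma': 5}

largest_count = max(wordfreq.values())                  # the biggest value only
best_pair = max(wordfreq.items(), key=lambda b: b[1])   # the (key, value) pair with the biggest value
best_word = max(wordfreq, key=wordfreq.get)             # the key whose value is biggest

print(largest_count)   # prints 7
print(best_pair)       # prints ('beta', 7)
print(best_word)       # prints beta
```

All three forms raise a ValueError when the dict is empty, so a program that keeps removing words needs to stop before that happens.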
    
    

The most significant word occurs 10 times. Did you determine what that significant word is? Here are the remaining steps to complete this lab.

Step 4. Print it – and its frequency – and remove it.

The del statement can be used to remove a key-value pair from a dictionary. For example, here we delete a key from a dict.

    phoneNums = {'jack': 64098, 'jill': 64139, 'guido': 64127, 'grace': 60124}
    print phoneNums['guido'] # =>  64127
    del phoneNums['guido']
    print phoneNums # =>{'grace': 60124, 'jack': 64098, 'jill': 64139}
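
Besides del, a dict also has a pop method, which removes a key and hands back its value in one step. The following sketch reuses the phoneNums example; the key 'ada' is a made-up name that is deliberately absent from the dict.

```python
phoneNums = {'jack': 64098, 'jill': 64139, 'guido': 64127, 'grace': 60124}

# pop removes the key and returns the value that was stored under it.
removed = phoneNums.pop('guido')
print(removed)                 # prints 64127
print('guido' in phoneNums)    # prints False

# With a default as second argument, pop does not fail on a missing key.
missing = phoneNums.pop('ada', None)
print(missing)                 # prints None
```

This can be convenient in Step 4, where both the word's frequency and its removal are needed.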
    
    

Step 5. Prompt the user to continue or not.

Step 6. If yes then go to step 3. Else end the program gracefully.

To handle this common looping condition, define a variable called userDone and initialize it to False. Test this variable at the top of a while loop. Then, after you prompt the user to continue, re-evaluate the truth value of userDone, as follows:

    
    userDone = False
    while (not userDone):
      # user is not done yet
      #..... your code for steps 3 and 4 goes here
    
      userResponse = raw_input("Would you like to continue (y/n): ")
      userDone = (userResponse == 'n')
    
    print "Thank you for using this program. Goodbye!"
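
One way to try out this loop without typing answers by hand is to pass the prompting function in as a parameter. The sketch below is an assumption-laden variation, not the required lab structure: the helper names is_done and run_loop are invented, the check is loosened to accept 'n' or 'no' in any letter case, and a scripted list of replies stands in for the user. In the real program the argument would be raw_input (or input under Python 3).

```python
def is_done(response):
    """Return True when the reply means stop: 'n' or 'no', in any letter case."""
    return response.strip().lower() in ('n', 'no')

def run_loop(ask):
    """Run the continue/stop loop; ask takes a prompt string and returns the reply."""
    rounds = 0
    userDone = False
    while not userDone:
        rounds += 1   # placeholder for the code of steps 3 and 4
        userDone = is_done(ask("Would you like to continue (y/n): "))
    return rounds

# Scripted replies stand in for a real user: continue twice, then stop.
scripted = iter(['y', 'Y', ' No '])
rounds = run_loop(lambda prompt: next(scripted))
print(rounds)   # prints 3
```

Any reply other than 'n' or 'no' counts as "continue" here, which matches the behavior of the loop above for the reply 'n'.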
    
    

Finish your completed program and submit a single file lab1.py to BB. :)

Notes:

1. You need to take care using mapping types like the dict we used above. You will get an error (a KeyError) if you try to access the dict using a key that does not exist in it. For example, if dword is a stop word, then dword will not be a key in the dict wordfreq, so evaluating wordfreq[dword] raises an ERROR. Try it.
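
The note above can be illustrated with three common ways of guarding against a missing key. This is a minimal sketch; the contents of wordfreq are toy values, not counts from the Declaration.

```python
# Toy dict (made-up counts); 'the' is a stop word, so it was never added as a key.
wordfreq = {'equal': 2, 'laws': 1}
dword = 'the'

# 1. Test for the key before using it.
print(dword in wordfreq)        # prints False

# 2. Use get with a default value instead of square brackets.
print(wordfreq.get(dword, 0))   # prints 0

# 3. Catch the KeyError that square-bracket access raises.
try:
    count = wordfreq[dword]
except KeyError:
    count = 0
print(count)                    # prints 0
```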
