October 15, 2009 · homework · 1 comment

It seems that several people are still a little confused about what I would like you to do for homework 7. Your program should function much like any other UNIX program. For example, consider the wc program. I can use wc to count the lines, words, and characters in several files (run from the resources/py directory):

 wc args.py opts.py auto_histling.py

I specified three files on the command line, separated by spaces.

The output should be something like:

 37     116     895 args.py
 34     105     847 opts.py
 77     335    3167 auto_histling.py
148     556    4909 total

Note that the columns are nicely aligned. Your program should work in a similar way, except that it will be printing out mean word and sentence length.

The getopt function returns two lists: a list of (option, value) pairs, and a list of the remaining arguments. The arguments should be the filenames you specified on the command line. You will want to loop over the arguments and process each file one at a time.
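
As a sketch of that pattern (the -w and -s options here match homework 7, but the body of the loop is only a placeholder for your real processing):

    import getopt
    import sys

    def main():
        # getopt returns two lists: (option, value) pairs, and the
        # leftover arguments (here, the filenames)
        opts, args = getopt.getopt(sys.argv[1:], 'ws', ['word', 'sent'])
        for filename in args:
            # process one file at a time
            text = open(filename).read()
            print filename, len(text.split())

    if __name__ == '__main__':
        main()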

Rob

October 15, 2009 · slides · (No comments)

Here are the slides from today’s class covering normalization and tokenization using regular expressions and the NLTK. We did not get to the last section on tokenization, so we will postpone that until Tuesday. Have a nice weekend.

ling5200-nltk3-3-slides.pdf

October 15, 2009 · notes · (No comments)

Here are the notes for today covering the use of regular expressions for text normalization and tokenization.

ling5200-nltk3-3-notes.pdf

October 13, 2009 · slides · (No comments)

Here are the slides from today’s class covering Unicode and regular expressions in Python. I corrected the problem with the codecs.open example.

ling5200-nltk3-2-slides.pdf

October 13, 2009 · notes · (No comments)

Here are the notes for today’s class covering Unicode and regular expressions in Python.

ling5200-nltk3-2-notes.pdf

October 12, 2009 · homework · (No comments)

Most students did well on this homework. I made an error in the description for question 3: I had the wrong values for mean word and sentence length for Moby Dick, because I had not converted words to lowercase before comparing them with the stopword corpus. I have corrected that in the solution here, and I did not take off any points if you did not convert to lowercase. My solution Python file is in the repository under resources/homework.

I would also like to remind you of a couple of things:

  1. Make sure your Python files are executable.
  2. Have a proper shebang line as the very first line of your file (see the example below).
  3. Use 4 spaces for indenting, not tabs, and please do not mix tabs and spaces.
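
For reference, the shebang line for a Python script typically looks like this:

    #!/usr/bin/env python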

Starting with homework 7, I will begin taking off 5 points each for any of the above mistakes.

Class statistics for Homework 5:

    mean: 49.71
    standard deviation: 14.14

  1. Create a function called mean_word_len, which accepts a list of words (e.g. text1 — Moby Dick), and returns the mean characters per word. You should remove punctuation and stopwords from the calculation. (10 points)

    from pprint import pprint
    import nltk
    from nltk.corpus import stopwords
    import string

    def mean_word_len(words):
        eng_stopwords = stopwords.words('english')
        # drop punctuation tokens and (lowercased) stopwords
        words_no_punc = [w for w in words
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_chars = sum(len(w) for w in words_no_punc)
        # float() so Python 2 integer division does not truncate the mean
        return num_chars / float(num_words)
  2. Create a function called mean_sent_len, which accepts a list of sentences, and returns the mean words per sentence. You should remove punctuation and stopwords from the calculation. Note that the NLTK .sents() method returns a list of lists. That is, each item in the list represents a sentence, which itself is composed of a list of words. (15 points)

    import string
    from nltk.corpus import stopwords

    def mean_sent_len(sents):
        eng_stopwords = stopwords.words('english')
        # flatten the sentences into one list of words, dropping
        # punctuation tokens and stopwords
        words_no_punc = [w for s in sents for w in s
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_sents = len(sents)
        # float() again, to avoid integer division in Python 2
        return num_words / float(num_sents)
  3. Now use your two new functions to print out the mean word length and the mean sentence length for all of the texts from Project Gutenberg included in the NLTK. You should print these statistics with one file per line: the fileid first, then the mean word length and sentence length. One example would be:
    melville-moby_dick.txt 5.94330208809 8.86877613075
    (10 points)

    from nltk.corpus import gutenberg
    for fileid in gutenberg.fileids():
        words = gutenberg.words(fileid)
        sents = gutenberg.sents(fileid)
        print fileid, mean_word_len(words), mean_sent_len(sents)
  4. Using the CMU Pronouncing Dictionary, create a list of all words that have 3 letters and 2 syllables. Your final list should include just the spelling of the words. To calculate the number of syllables, count the vowel phonemes in the pronunciation (every vowel phoneme includes the digit 1, 2, or 0, marking primary, secondary, or no stress). (15 points)

    entries = nltk.corpus.cmudict.entries()
    stress_markers = ['0', '1', '2']
    three_letter_two_syl_words = []
    for word, pron in entries:
        if len(word) == 3:
            # each vowel phoneme carries exactly one stress digit,
            # so counting the digits counts the syllables
            syllables = 0
            for phoneme in pron:
                for marker in stress_markers:
                    if marker in phoneme:
                        syllables += 1
            if syllables == 2:
                # the question asks for just the spelling of the word
                three_letter_two_syl_words.append(word)
    pprint(three_letter_two_syl_words)
  5. Imagine you are writing a play, and you are thinking of interesting places to stage a scene. You would like it to be somewhere like a house, but not exactly. Use the WordNet corpus to help you brainstorm possible locations. First, find the hypernyms of the first definition of the word house. Then find all the hyponyms of those hypernyms, and print out the names of the words. Your output should contain one synset per line, with the synset name first and then all of the lemma_names for that synset, e.g.:
    lodge.n.05 - lodge, indian_lodge
    (10 points)

    from nltk.corpus import wordnet

    # the first synset returned is the first definition of 'house'
    house = wordnet.synsets('house')[0]
    house_hypernyms = house.hypernyms()
    for hypernym in house_hypernyms:
        print "-------", hypernym.name, "---------"
        for hyponym in hypernym.hyponyms():
            print hyponym.name, " - ", ", ".join(hyponym.lemma_names)
October 11, 2009 · homework · (No comments)

In this homework you will expand upon some of the code you wrote for homework 6, using the functions you wrote to calculate mean word and sentence length. Now, however, your script will accept command line arguments and options, apply these functions, and print the output in a nice-looking format. Make sure to read all of the questions before starting the assignment. It is due Oct. 16th and covers material up to Oct. 8th.

  1. From BASH, use svn to copy your hmwk6.py file to hmwk7.py (see the sketch after this list). This will preserve all of the history from homework 6, so you can see how you have improved your code from homework 6 to homework 7. (3 points)
  2. Create a function called usage, which prints out information about how the script should be used, including what arguments should be specified, and what options are possible. It should take one argument, which is the name of the script file. (7 points)
  3. Write your script to process the following options. Look at opts.py under resources/py for an example. If both -s and -w are specified, it should print out both options. (14 points)
    -w --word print only mean word length
    -s --sent print only mean sentence length
    -h --help print this help information and exit
    
  4. Instead of specifying which texts to process in your code, change your code so that it accepts filenames from the command line. Look at the args.py file under resources/py for an example of how to do this. Your code should print out the name of each file specified on the command line (you can use the os.path.basename function to print only the name of the file), along with the mean word length and sentence length, with a width of 13 and a precision of 2 (see the formatting sketch after this list). Note that it should only print word length or sentence length if that option has been specified. If no files are specified, it should print the usage information and exit. Also note that after reading in a text you will have to first convert it to a list of words or sentences using the tokenize functions in the NLTK, before calculating the mean word length and sentence length with the functions you defined in homework 6. See chapter 13 in the notes for examples of how to tokenize text, and homework 5 for how to do this. The first line of output should be a line of headers describing the columns. (28 points) Here is some example output:

    filename        mean_word_len mean_sent_len
    fooey                    3.45         13.47
    bar                      3.15          9.29
    
  5. Use your script to print out mean word length and sentence length for huckFinn, tomSawyeer, Candide, and devilsDictionary (in resources/texts). Save the output to a file called hmwk7_stats.txt in your personal directory, and commit it to the svn repository. Show the command you use in BASH. Make your paths relative to the root of your working copy of the repository. Then run the same command again with the -s and -w options, printing the output to the screen. (8 points)
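
For anyone unsure where to start, here are two sketches rather than required solutions. The copy in question 1 can be done from the root of your working copy like this (the path is a placeholder for wherever your homework file actually lives):

    svn copy yourname/hmwk6.py yourname/hmwk7.py

For the output format in question 4, a field width of 13 and a precision of 2 can be produced with Python's % string formatting; the left-aligned width of 15 for the filename column is just a guess to make the header line up:

    # header row, then one formatted row per file
    print '%-15s %13s %13s' % ('filename', 'mean_word_len', 'mean_sent_len')
    print '%-15s %13.2f %13.2f' % ('fooey', 3.45, 13.47)
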
October 8, 2009 · slides · (No comments)

Here are the slides from today’s class covering string basics and methods in Python.

ling5200-nltk-3-1-slides.pdf

October 8, 2009 · notes · (No comments)

Here are the notes for today’s class, including a lengthy discussion of string manipulation in Python.

ling5200-nltk3-1-notes.pdf

October 6, 2009 · slides · (No comments)

Here are the slides from today’s class covering file input and output, reading from stdin, and command line arguments and options. Please also look at the args.py and opts.py files under resources/py, which have some examples. Two other things to note:

  1. The combined course notes file now has an appendix with solutions to practice problems.
  2. I updated the celex.txt file in resources/texts. Please run svn update to get the latest version.
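
If you would like a reminder of the overall pattern before looking at those files, here is a minimal sketch (not the actual contents of args.py or opts.py) that reads from each filename given on the command line, or from stdin when no arguments are given:

    import sys

    def count_lines(stream):
        # placeholder processing: just count the lines
        print sum(1 for line in stream)

    if len(sys.argv) > 1:
        # each command line argument is treated as a filename
        for filename in sys.argv[1:]:
            f = open(filename)
            count_lines(f)
            f.close()
    else:
        # no arguments: fall back to standard input
        count_lines(sys.stdin)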

ling5200-nltk3-slides.pdf