October 19, 2009 · homework

This homework proved challenging for students. We will go over some of the common problems in class on Tuesday. Please look closely at my solution in resources/hmwk.

Class statistics for Homework 7
mean 46
standard deviation 8.98
  1. From BASH, use svn to copy your hmwk6.py file to hmwk7.py. This will preserve all of the history from hmwk6, so you can see how you have improved your code from homework 6 to homework 7. (3 points)

    # in bash:
    svn cp hmwk6.py hmwk7.py
    # Now in python
    # we keep our functions from hmwk6
    import sys
    import os
    import getopt
    import string
    from pprint import pprint
    import nltk
    from nltk.corpus import stopwords
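    def mean_sent_len(sents):
        # sents is a list of sentences, each a list of word tokens
        eng_stopwords = stopwords.words('english')
        # drop punctuation tokens and stopwords before counting
        words_no_punc = [w for s in sents for w in s
                    if w not in string.punctuation and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_sents = len(sents)
        # float() keeps Python 2 from truncating the result to an integer
        return float(num_words) / num_sents

    def mean_word_len(words):
        # words is a flat list of word tokens
        eng_stopwords = stopwords.words('english')
        words_no_punc = [w for w in words
                  if w not in string.punctuation and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_chars = sum([len(w) for w in words_no_punc])
        # float() keeps Python 2 from truncating the result to an integer
        return float(num_chars) / num_words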
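The two functions can be tried out without NLTK by passing in text that is already tokenized. The following is only a simplified sketch: the small STOPWORDS set is a made-up stand-in for stopwords.words('english'), the helper content_words is hypothetical, and the `__future__` imports are my addition so that the same lines run under Python 2 and Python 3 with true division.

```python
from __future__ import division, print_function
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOPWORDS = {'the', 'a', 'on', 'it'}

def content_words(words):
    # keep tokens that are neither punctuation nor stopwords
    return [w for w in words
            if w not in string.punctuation and w.lower() not in STOPWORDS]

def mean_word_len(words):
    kept = content_words(words)
    return sum(len(w) for w in kept) / len(kept)

def mean_sent_len(sents):
    # sents is a list of sentences, each a list of word tokens
    kept = [w for s in sents for w in content_words(s)]
    return len(kept) / len(sents)

sents = [['The', 'cat', 'sat', 'on', 'the', 'mat', '.'], ['It', 'purred', '.']]
words = [w for s in sents for w in s]
print(mean_word_len(words))  # 3.75: cat, sat, mat, purred -> 15 characters / 4 words
print(mean_sent_len(sents))  # 2.0: 4 remaining words / 2 sentences
```

Without the division import, Python 2 would report 3 and 2 here, because dividing two integers truncates the result.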
  2. Create a function called usage, which prints out information about how the script should be used, including what arguments should be specified, and what options are possible. It should take one argument, which is the name of the script file. (7 points)
    def usage(script):
        print 'Usage: ' + script + ' <options> file(s)'
        print '''
        Possible options:
            -w --word print only mean word length
            -s --sent print only mean sentence length
            -h --help print this help information and exit
        '''
  3. Write your script to process the following options. Look at opts.py under resources/py for an example. If both -s and -w are specified, it should print out both statistics. (14 points)
    -w --word print only mean word length
    -s --sent print only mean sentence length
    -h --help print this help information and exit
    
    try:
        opts, args = getopt.gnu_getopt(sys.argv[1:], "hws",
                     ["help", "word", "sent"])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        usage(sys.argv[0])
        sys.exit(2)
    sent = False
    word = False
    if len(opts) == 0:
        sent = True
        word = True
    for o, a in opts:
        if o in ("-h", "--help"):
            usage(sys.argv[0])
            sys.exit()
        if o in ("-s", "--sent"):
            sent = True
        if o in ("-w", "--word"):
            word = True
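To see what gnu_getopt hands back, it can be called on a hand-written argument list instead of sys.argv[1:]. This is a minimal sketch; the file names a.txt and b.txt and the bad option -x are made up for illustration.

```python
from __future__ import print_function
import getopt

# gnu_getopt lets options and file names be mixed in any order
argv = ['-w', 'a.txt', '--sent', 'b.txt']
opts, args = getopt.gnu_getopt(argv, "hws", ["help", "word", "sent"])
print(opts)  # [('-w', ''), ('--sent', '')]
print(args)  # ['a.txt', 'b.txt']

# an option that is not in the list raises GetoptError
try:
    getopt.gnu_getopt(['-x'], "hws", ["help", "word", "sent"])
    message = None
except getopt.GetoptError as err:
    message = str(err)
print(message)  # option -x not recognized
```

Each entry in opts is an (option, value) pair; the value is an empty string here because none of the three options takes an argument.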
  4. Instead of specifying which texts to process in your code, change your code so
    that it accepts filenames from the command line. Look at the args.py file
    under resources/py for an example of how to do this. Your code should print out
    the name of each file specified on the command line (you can use the
    os.path.basename function to print only the name of the file), followed by the
    mean word length and mean sentence length, each with a width of 13 and a
    precision of 2. Note that it should only print word length or sentence length
    if that option has been specified. If no files are specified, it should print
    the usage information and exit. Also note that after reading in a text you
    will first have to convert it to a list of words or sentences using the
    tokenize functions in the nltk, before calculating the mean word length and
    sentence length with the functions you defined in homework 6. See chapter 13
    in the notes, or homework 5, for examples of how to tokenize text. The first
    line of output should be a line of headers describing the columns. (28 points)
    Here is some example output:

    filename        mean_word_len mean_sent_len
    fooey                    3.45         13.47
    bar                      3.15          9.29
    
    if len(args) > 0:
        if word and sent:
            print '%-17s %s %s' % ('filename', 'mean_word_len', 'mean_sent_len')
        elif word:
            print '%-17s %s' % ('filename', 'mean_word_len')
        elif sent:
            print '%-17s %s' % ('filename', 'mean_sent_len')
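        for path in args:
            # read the whole text, then release the file handle
            f = open(path)
            raw = f.read()
            f.close()
            words = nltk.word_tokenize(raw)
            # mean_sent_len expects each sentence as a list of words
            sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
            # splitext removes only the extension; rstrip('.txt') would also
            # strip trailing '.', 't' and 'x' characters from the name itself
            filename = os.path.splitext(os.path.basename(path))[0]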
            if sent:
                mean_sent_length = mean_sent_len(sents)
            if word:
                mean_word_length = mean_word_len(words)
            if word and sent:
                print '%-17s %13.2f %13.2f' % (filename, mean_word_length, mean_sent_length)
            elif word:
                print '%-17s %13.2f' % (filename, mean_word_length)
            elif sent:
                print '%-17s %13.2f' % (filename, mean_sent_length)
    else:
        usage(sys.argv[0])
        sys.exit(2)
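The format strings and the file name handling can be checked on their own, without reading any texts. This is a small sketch using the numbers from the example output; the path resources/texts/text.txt is hypothetical and chosen only to show how rstrip differs from os.path.splitext.

```python
from __future__ import print_function
import os

# %-17s left-justifies the name in 17 columns;
# %13.2f right-justifies a float in 13 columns with 2 decimal places
header = '%-17s %s %s' % ('filename', 'mean_word_len', 'mean_sent_len')
row = '%-17s %13.2f %13.2f' % ('fooey', 3.45, 13.47)
print(header)
print(row)

path = 'resources/texts/text.txt'  # hypothetical file name
# splitext splits off the extension only
name = os.path.splitext(os.path.basename(path))[0]
# rstrip removes any trailing '.', 't' or 'x' characters, not the suffix '.txt'
stripped = os.path.basename(path).rstrip('.txt')
print(name)      # text
print(stripped)  # te
```

Because both header names are exactly 13 characters long, the headers line up with the 13-wide number columns.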
  5. Use your script to print out mean word length and sentence length for huckFinn, tomSawyer, Candide, and devilsDictionary (in resources/texts). Save the output to a file called hmwk7_stats.txt in your personal directory, and commit it to the svn repository. Show the command you use in BASH. Make your paths relative to the root of your working copy of the repository. Then run the same command again with the -s option and with the -w option, printing to the screen instead. (8 points)

    # In bash:
    students/robfelty/hmwk7.py resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt > students/robfelty/hmwk7_stats.txt
    students/robfelty/hmwk7.py -w resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt
    students/robfelty/hmwk7.py -s resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt
Written by Robert Felty

