November 23, 2009 · homework · (No comments)

Most students did very well on this assignment. The only consistent shortcoming was unnecessary loops in the tag_errors function, which increased execution time by about 10%.

Class statistics for Homework 11
mean 56.71
standard deviation 8.58

    In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 19th.

    1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)

      import nltk
      from nltk.corpus import brown

      t0 = nltk.DefaultTagger('NN')

      adv_tagged_sents = brown.tagged_sents(categories='adventure')
      adv_size = int(len(adv_tagged_sents) * 0.9)
      adv_train_sents = adv_tagged_sents[:adv_size]
      adv_test_sents = adv_tagged_sents[adv_size:]
      adv_tagger = nltk.UnigramTagger(adv_train_sents, backoff=t0)
      print adv_tagger.evaluate(adv_test_sents)

      rom_tagged_sents = brown.tagged_sents(categories='romance')
      rom_size = int(len(rom_tagged_sents) * 0.9)
      rom_train_sents = rom_tagged_sents[:rom_size]
      rom_test_sents = rom_tagged_sents[rom_size:]
      rom_tagger = nltk.UnigramTagger(rom_train_sents, backoff=t0)
      print rom_tagger.evaluate(rom_test_sents)
    2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of (incorrect, correct) tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)

      def tag_errors(test, gold):
          '''returns a list of (wrong, correct) tag tuples, given automatically
          tagged data and the gold standard for that data'''
          errors = []
          for testsent, goldsent in zip(test, gold):
              for testpair, goldpair in zip(testsent, goldsent):
                  if testpair[1] != goldpair[1]:
                      errors.append((testpair[1], goldpair[1]))
          return errors
    3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)

      adv_sents = brown.sents(categories='adventure')
      adv_unknown = adv_sents[adv_size:]
      adv_test = adv_tagger.batch_tag(adv_unknown)

      rom_sents = brown.sents(categories='romance')
      rom_unknown = rom_sents[rom_size:]
      rom_test = rom_tagger.batch_tag(rom_unknown)
    4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)

      adv_errors = tag_errors(adv_test, adv_test_sents)
      rom_errors = tag_errors(rom_test, rom_test_sents)
    5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
      adv_error_fd = nltk.FreqDist(adv_errors)
      rom_error_fd = nltk.FreqDist(rom_errors)
    6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question, but see the sketch after this list.) (5 points)
       
    7. How might we improve our tagging performance? (No code required for this question; one common approach is sketched after this list.) (5 points)
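
For questions 6 and 7 no code is required, but a short sketch may help you get started. The snippet below is only an illustration, not the expected answer; it reuses the variables defined in questions 1-5, prints the most frequent (incorrect, correct) pairs from each genre's error distribution, and then tries one common improvement, backing off from a bigram tagger to the unigram tagger.

      # Question 6: print the ten most frequent error types in each genre.
      # The pairs are sorted explicitly so this does not depend on any
      # particular NLTK version's FreqDist ordering.
      for name, fd in [('adventure', adv_error_fd), ('romance', rom_error_fd)]:
          print name
          for (wrong, right), count in sorted(fd.items(),
                                              key=lambda p: p[1], reverse=True)[:10]:
              print '  tagged %s, should be %s: %d times' % (wrong, right, count)

      # Question 7: one standard improvement is a longer backoff chain,
      # e.g. bigram -> unigram -> default tagger.
      adv_bigram_tagger = nltk.BigramTagger(adv_train_sents, backoff=adv_tagger)
      print adv_bigram_tagger.evaluate(adv_test_sents)
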
November 19, 2009 · notes · (No comments)

Here are today’s notes on classifier evaluation and decision trees

ling5200-nltk6-notes.pdf

November 17, 2009 · notes · (No comments)

Here are today’s notes on supervised classification

ling5200-nltk6-notes.pdf

November 16, 2009 · homework · (No comments)

Most students did well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk10.py

Class statistics for Homework 10
mean 51.67
standard deviation 7.28
  1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)

    svn cp resources/py/hmwk8.py students/robfelty/hmwk10.py
  2. Modify the mean_word_len and mean_sent_len functions to accept two optional
    arguments, ignore_stop and use_set. The default for each of
    these should be True. If use_set is True, you should convert the
    stopword corpus to a set. If ignore_stop is True, you should ignore stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)

    # these functions rely on the string module and the NLTK stopword corpus
    import string
    from nltk.corpus import stopwords

    def mean_sent_len(sents, ignore_stop=True, use_set=True):
        ''' returns the average number of words per sentence

        Input should be a list of lists, with each item in the list being a
        sentence, composed of a list of words. We ignore any punctuation and,
        optionally, stopwords.
        '''
        if use_set:
            eng_stopwords = set(stopwords.words('english'))
        else:
            eng_stopwords = stopwords.words('english')
        if ignore_stop:
            words_no_punc = [w for s in sents for w in s
                            if w not in string.punctuation
                            and w.lower() not in eng_stopwords]
        else:
            words_no_punc = [w for s in sents for w in s
                            if w not in string.punctuation]
        num_words = len(words_no_punc)
        num_sents = len(sents)
        return float(num_words) / num_sents

    def mean_word_len(words, ignore_stop=True, use_set=True):
        ''' returns the average number of letters per word

        Input should be a list of words.
        We ignore any punctuation and, optionally, stopwords.
        '''
        if use_set:
            eng_stopwords = set(stopwords.words('english'))
        else:
            eng_stopwords = stopwords.words('english')
        if ignore_stop:
            words_no_punc = [w for w in words
                  if w not in string.punctuation and w.lower() not in eng_stopwords]
        else:
            words_no_punc = [w for w in words
                  if w not in string.punctuation]
        num_words = len(words_no_punc)
        num_chars = sum([len(w) for w in words_no_punc])
        return float(num_chars) / num_words
  3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)

    # means_timing.py: my module is called text_means.py; substitute your own
    # hmwk10.py if that is what you named it
    from timeit import Timer

    setup = '''import nltk
    import text_means
    f = open('../texts/Candide.txt')
    raw = f.read()
    sents = text_means.sent_tokenize(raw)
    words = nltk.word_tokenize(raw)
    '''
    test1 = 'text_means.mean_word_len(words)'
    print Timer(test1, setup).timeit(100)

    test2 = 'text_means.mean_word_len(words, use_set=False)'
    print Timer(test2, setup).timeit(100)

    test3 = 'text_means.mean_word_len(words, use_set=False, ignore_stop=False)'
    print Timer(test3, setup).timeit(100)

    test4 = 'text_means.mean_word_len(words, use_set=True, ignore_stop=False)'
    print Timer(test4, setup).timeit(100)
  4. Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
    opts, args = getopt.gnu_getopt(sys.argv[1:], "hwsajni",
                 ["help", "word", "sent", 'ari', 'adj', 'noheader', 'include-stop'])
    include_stop = False
    for o, a in opts:
        if o in ("-i", "--include-stop"):
            include_stop = True

    # in calc_text_stats
    mean_sent_length = mean_sent_len(sents, ignore_stop=not include_stop)
  5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words; a sketch appears after this list. (10 points)
  6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil’s Dictionary. Pipe the output to sort to sort by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that below. (10 points)
    filename          mean_word_len mean_sent_len per_stop_words
    tomSawyer                  5.51          7.46            42.2
    Candide                    6.07          9.04            43.5
    huckFinn                   4.93          9.32            45.0
    devilsDictionary           6.30         10.08            40.2
    
     # -i / --include-stop: include stop words in the calculation
     ./text_means.py -wsi ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
     # default: ignore stop words
     ./text_means.py -ws ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
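
No code is shown above for question 5, so here is a minimal sketch of the stop-word percentage calculation. The helper name per_stop_words and its placement inside calc_text_stats are assumptions here, not the graded solution in resources/hmwk/hmwk10.py.

    def per_stop_words(words):
        ''' returns the percentage of words (punctuation excluded) that are stopwords

        A sketch for question 5; calc_text_stats would call this alongside
        mean_word_len and mean_sent_len. It relies on the same string and
        stopwords imports as the functions above.
        '''
        eng_stopwords = set(stopwords.words('english'))
        real_words = [w for w in words if w not in string.punctuation]
        num_stop = len([w for w in real_words if w.lower() in eng_stopwords])
        return 100.0 * num_stop / len(real_words)
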
November 14, 2009 · homework · (No comments)

In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 20th.

  1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)
  2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of (incorrect, correct) tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)
  3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)
  4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)
  5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
  6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question) (5 points)
  7. How might we improve our tagging performance? (No code required for this question) (5 points)
November 12, 2009 · News · 2 comments

Sam found a nice program to automatically identify the language of a text using trigrams. You might find it of interest.

November 12, 2009 · notes · (No comments)

Here are today’s notes covering details of part of speech tagging

ling5200-nltk-5-1-notes.pdf

November 12, 2009 · homework · (No comments)

Several people have asked some questions about homework 10 which I would like to address

On the named parameters to the mean_sent_len and mean_word_len functions: we had previously defined these functions to ignore stop words. That is, when computing the mean number of words per sentence, we throw out stop words before calculating the mean. We might not want to do this all the time, though, so we now make it an option to the functions. Like all other named arguments, these have a default value; in this case, we want the default to be True.
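
As a quick illustration (a toy sketch, not the homework code), the signature from question 2 gives both options a default of True, so callers only name an argument when they want something other than the default:

    def mean_sent_len(sents, ignore_stop=True, use_set=True):
        pass  # body as in the homework solution

    sents = [['Just', 'an', 'example', 'sentence', '.']]
    mean_sent_len(sents)                     # both defaults apply
    mean_sent_len(sents, ignore_stop=False)  # keep stopwords for this call only
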

For question 3, remember that when using the timeit module, you have to import all necessary modules in your setup statement. If you like, this can be a multiline string (it’s easier to read that way). Also note that question 3 has nothing to do with question 4.
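
For instance, here is a self-contained toy example of a multiline setup string (not the homework code; it just times building a frequency distribution):

    from timeit import Timer

    # everything the timed statement needs must be created in the setup string,
    # because timeit runs the statement in its own namespace
    setup = '''
    import nltk
    words = 'just a small example sentence just an example'.split()
    '''
    print Timer('nltk.FreqDist(words)', setup).timeit(100)
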

Note that for question 4, I am asking you to add a global option, i.e. one that you could specify when calling your script from the command line. This has nothing to do with question 3 at all.

Note that my sample output had an error. I accidentally output the percentage of non-stopwords, as opposed to the percentage of stopwords. Sorry about that, and thanks to Steve for pointing it out.

Finally, as to the seemingly strange naming of include-stopwords: consider trying it the other way around, using ignore_stopwords. If that is true by default (which is what we want), then how do you make it false from the command line? You could make the option take an argument, so you would say

./hmwk10.py --ignore_stopwords=false 

but I don’t like that. I would rather specify

./hmwk10.py --include_stopwords

and have the default for --include_stopwords be false.
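
To see why, here is a small getopt sketch of the rejected design (not the homework script). With getopt, a flag like --include_stopwords is declared without a trailing '=' and takes no argument, so its mere presence flips the default; an option like --ignore_stopwords, by contrast, has to be declared with a trailing '=' and its value then arrives as a string that must be interpreted:

    import sys
    import getopt

    # --ignore_stopwords must now be given a value, e.g. --ignore_stopwords=false
    opts, args = getopt.gnu_getopt(sys.argv[1:], '', ['ignore_stopwords='])
    ignore_stopwords = True
    for o, a in opts:
        if o == '--ignore_stopwords':
            ignore_stopwords = a.lower() not in ('false', 'no', '0')
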

November 10, 2009 · notes · (No comments)

Here are today’s notes on part of speech tagging

ling5200-nltk-5-notes.pdf

November 5, 2009 · homework · 2 comments

In this homework you will apply some of the more advanced function features we have discussed, including named arguments and default values. It covers material up to November 5th, and is due November 13th.

  1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)
  2. Modify the mean_word_len and mean_sent_len functions to accept two optional
    arguments, ignore_stop and use_set. The default for each of
    these should be True. If use_set is True, you should convert the
    stopword corpus to a set. If ignore_stop is True, you should ignore stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)
  3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)
  4. Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
  5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words. 10 points
  6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil’s Dictionary. Pipe the output to sort to sort by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that below. (10 points)
    filename          mean_word_len mean_sent_len  per_stop_words
    tomSawyer                  5.51          7.46            42.2
    Candide                    6.07          9.04            43.5
    huckFinn                   4.93          9.32            45.0
    devilsDictionary           6.30         10.08            40.2