September 29, 2009 · slides · (No comments)

Here are the slides for today’s class covering nltk lexical resources, python functions and modules.

We did not get to the last part about automatically doing historical linguistics. We will cover that on Thursday.

ling5200-nltk2-1-slides.pdf

September 29, 2009 · notes · (No comments)

Here are the notes for today’s class, covering python functions and lexical resources in the nltk.

ling5200-nltk2-1-notes.pdf

September 28, 2009 · homework · (No comments)

Overall, students did quite well on this assignment. Comments are in your files in the subversion repository. Run svn update from your personal directory to get the latest version. My python solution file is also in the repository under resources/hmwk.

Class statistics for Homework 4
mean 54.3
standard deviation 4.8

Regular expressions

  1. Use grep to print out all words from the celex.txt file which have a
    frequency of 100 or more, and either start with a-l, or end with m-z. Note that
    the orthography column comes right before the frequency column, so this is
    possible to do using a single regular expression. You should use one grouping
    and several character classes. Hint 1: use negative character classes to avoid
    getting stuff that is in more than one column. Hint 2: Consider what is
    different between the numbers 87 and 99, vs. the numbers 100 and 154.

    Here are some example words which should be included

    • at (starts with a-l)
    • yellow (ends with m-z)

    Here are some example words which should be excluded

    • omega (does not start with a-l, does not end with m-z)
    • abacus (starts with a-l, but has a frequency less than 100)

    (10 points)

    grep -E '^[^\\]*\\([a-lA-L][^\\]*|[^\\]*[m-zM-Z])\\[0-9]{3,}' celex.txt
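
    For reference, here is a hypothetical Python equivalent of that grep
    command (the backslash-delimited column layout is taken from the
    question; the translation into re is mine):

    import re

    # skip the first column, then match an orthography field that starts
    # with a-l or ends with m-z, then require a frequency field of three
    # or more digits (i.e. 100 or more)
    pattern = re.compile(r'^[^\\]*\\([a-lA-L][^\\]*|[^\\]*[m-zM-Z])\\[0-9]{3,}')

    for line in open('celex.txt'):
        if pattern.match(line):
            print(line.rstrip())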

Python

Create a python script called hmwk4.py, and put all your answers in this file.
You can use comments in the file if you like. Before each answer, print out the
question number, and a brief description. For example:

print('#2 - first 10 words in the holy grail')

  1. Print the first 10 words from monty python and the holy grail (text6). (3 points)
    print(text6[:10])
  2. Print the last 20 words from Moby Dick (text1). (4 points)
    print(text1[-20:])
  3. Create a frequency distribution of the holy grail. Store it in the variable
    called moby_dist. (4 points)

    moby_dist = FreqDist(text6)
  4. Print the number of times the word “Grail” occurs in this text (4 points)
    print(moby_dist['Grail'])
  5. Print the most frequent word in the Holy Grail. (Hint: note that punctuation is counted as words by the NLTK, so the answer might be a punctuation mark). (4 points)
    print(moby_dist.max())
  6. Create a list which contains the word lengths of each word in the Holy
    Grail(text6). Store it in a variable called holy_lengths. Do the same for Moby
    Dick (text1), and store it in a variable called moby_lengths. (6 points)

    moby_lengths = [len(w) for w in text1]
    holy_lengths = [len(w) for w in text6]
  7. Create a frequency distribution of word lengths for Moby Dick and The Holy Grail. Store the distributions in variables called moby_len_dist and holy_len_dist respectively. (6 points)
    moby_len_dist = FreqDist(moby_lengths)
    holy_len_dist = FreqDist(holy_lengths)
  8. Print the most commonly occurring word length for Moby Dick and for The Holy Grail. (Use one command for each) (5 points)
    print(moby_len_dist.max())
    print(holy_len_dist.max())
  9. Calculate the mean word length for Moby Dick and The Holy Grail. You can use the sum() function to calculate the total number of characters in the text. For example, sum([22, 24, 3]) returns 49. Store the results in variables holy_mean_len and moby_mean_len respectively. (6 points)
    holy_mean_len = sum(holy_lengths)/len(holy_lengths)
    moby_mean_len = sum(moby_lengths)/len(moby_lengths)
  10. Create a list of words from Moby Dick which have more than 3 letters, and
    less than 7 letters. Store it in a variable called four_to_six. (8 points)

    four_to_six = [w for w in text1 if len(w) > 3 and len(w) < 7]
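
As a quick sanity check (my own addition, not part of the posted solution), you can verify that everything kept by that list expression really is four to six letters long:

    # every word in four_to_six should have between 4 and 6 characters
    assert all(4 <= len(w) <= 6 for w in four_to_six)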
September 25, 2009 · homework · 1 comment

In this homework you will practice loading and extracting information from various corpora, and calculating word frequency and conditional word frequency. There will be questions about conditionals, loops, and list expressions. Please put your answers in an executable python script named hmwk5.py, and commit it to the subversion repository.
It is due Oct. 2nd and covers material up to Sep. 24th.

  1. Create a list called my_ints, with the following values: (10, 15, 24, 67, 1098, 500, 700) (2 points)

  2. Print the maximum value in my_ints (3 points)
  3. Use a for loop and a conditional to print out whether each value in my_ints is odd or even. For example, for 10, your program should print “10 is even”. (5 points)
  4. Now create a new list called new_ints and fill it with values from my_ints which are divisible by 3. In addition, double each of the new values. For example, the new list should contain 30 (15*2). Use a for loop and a conditional to accomplish this task. (5 points)
  5. Now do the same thing as in the last question, but use a list expression to accomplish the task. (5 points)
  6. Import the Reuters corpus from the NLTK. How many documents contain stories about coffee? (4 points)
  7. Print the number of words in the Reuters corpus which belong to the barley category. (5 points)
  8. Create a conditional frequency distribution of word lengths from the Reuters corpus for the categories barley, corn, and rye. (8 points) A sketch of the general pattern appears after this list.
  9. Using the cfd you just created, print out a table which lists cumulative counts of word lengths (up to nine letters long) for each category. (5 points)
  10. Load the devilsDictionary.txt file from the ling5200 svn repository in resources/texts into the NLTK as a plaintext corpus (3 points)
  11. Store a list of all the words from the Devil’s Dictionary into a variable called devil_words (4 points)
  12. Now create a list of words which does not include punctuation, and store it in devil_words_nopunc. Import the string module to get a handy list of punctuation marks, stored in string.punctuation. (5 points)
  13. Create a frequency distribution for each of the two lists of words from the Devil’s Dictionary, one which includes punctuation, and one which doesn’t. Find the most frequently occurring word in each list. (6 points)
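
Questions 8 and 9 use the ConditionalFreqDist pattern from chapter 2 of the NLTK book. Here is a minimal sketch of that pattern; the variable names are mine, and the categories are the ones named in question 8:

    import nltk
    from nltk.corpus import reuters

    categories = ['barley', 'corn', 'rye']

    # pair each category with the length of every word in that category
    cfd = nltk.ConditionalFreqDist(
        (category, len(word))
        for category in categories
        for word in reuters.words(categories=category))

    # cumulative counts of word lengths up to nine letters, one row per category
    cfd.tabulate(conditions=categories, samples=range(1, 10), cumulative=True)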
September 24, 2009 · slides · (No comments)

Here are the slides from today’s class, covering an introduction to various corpora in the NLTK, and a discussion of calculating conditional frequency distributions.

ling5200-nltk2-slides.pdf

September 22, 2009 · slides · (No comments)

Here are the slides from today’s class covering control structures (conditionals and loops) in python.

ling5200-nltk1-2-slides.pdf

September 21, 2009 · homework · (No comments)

Overall, students did quite well with this assignment. I have made comments in the files you submitted, which have now been updated in the subversion repository. To see the changes, open your terminal, change the directory to <myling5200>/students/<yourname>, and type svn update. Replace <myling5200> and <yourname> with the path to your working copy and your username respectively.

Class statistics for Homework 3
mean 53.1
standard deviation 8.0

UNIX

  1. Using the celex.txt file, calculate the ratio of heterosyllabic vs. tautosyllabic st clusters. That is, how frequently words contain an st cluster that spans two syllables, vs. one that is contained within a single syllable. Note that each word contains a syllabified transcription where syllables are surrounded by brackets []. For example, abacus has three syllables, [&][b@][k@s]. You should use grep and bc to calculate the ratio (compare also the question from hmwk 2 about computing the average number of letters per word for each entry in the Devil’s Dictionary). (10 points) A Python version of this pipeline is sketched after this list.
    echo "`grep -Ec 's\]\[t' celex.txt` / `grep -Ec '(\[st|st\])' celex.txt`" | bc -l
  2. How many entries in the devils dictionary have more than 6 letters? Use grep to find out (5 points)
    grep -Ec '^[A-Z-]{7,}.*, [a-z]{1,3}\.' devilsDictionary.txt
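
For comparison, here is a hypothetical Python version of the same ratio calculation (my own sketch; like grep -c, it counts lines containing a match rather than total matches):

    from __future__ import division  # so the ratio is a float under Python 2

    hetero = tauto = 0
    for line in open('celex.txt'):
        if 's][t' in line:                   # st spans a syllable boundary
            hetero += 1
        if '[st' in line or 'st]' in line:   # st sits inside a single syllable
            tauto += 1
    print(hetero / tauto)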

Subversion

For this homework, submit it via subversion by adding it into your own directory. Make at least 2 separate commits. When you are finished, make sure to say so in your log message.

  1. Create a new file called hmwk3_<yourname>.txt, and add it to the svn repository. Show the commands you used (5 points)
    pwd #myling5200/students/<myname>
    touch hmwk3_<myname>.txt
    svn add hmwk3_<myname>.txt
    svn commit -m 'adding homework 3 file'
  2. Show the log of all changes you made to your homework 3 file. Show the commands you used (5 points)
    pwd #myling5200/students/<myname>
    svn log hmwk3_<myname>.txt
  3. Find all log messages pertaining to the slides which contain grep. You will need to use a pipe. Your command should print out not only the line which contains grep, but also the 2 preceding lines. Search the grep manual for “context” to find the appropriate option. (7 points)
    pwd #myling5200
    svn log slides | grep -Ei -B 2 'grep'
  4. Show the changes to your homework 3 file between the final version and the version before that. Show the commands you used (5 points)
    pwd #myling5200/students/<myname>
    svn diff -r <number>:<number> hmwk3_<myname>.txt

Python

  1. Calculate the percentage of indefinite articles in Moby Dick using the NLTK. You can use the percentage function defined in chapter 1.1 (8 points)

    percentage(text1.count('a') + text1.count('an'), len(text1))
  2. Using the dispersion_plot function in the nltk, find 1 word which has been used with increasing frequency in the inaugural address, and one which has been used with decreasing frequency. You can base your decision of increasing vs. decreasing simply on visual inspection of the graphs. (5 points) An example call is sketched below this list.
  3. Use the random module to generate 2 random integers between 10 and 100, and then calculate the quotient of the first number divided by the second. Make sure to use normal division, not integer division. Look at the help for the random module to find the appropriate function (10 points)
    from __future__ import division
    import random
    print(random.randint(10,100) / random.randint(10,100))
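
For question 2, a call along these lines works; text4 is the inaugural address corpus in the NLTK book, and the words here are only an illustration (any words whose plots show a clear trend will do):

    from nltk.book import text4

    # plot where each word occurs across two centuries of addresses
    text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'America'])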

Extra credit:

Use perl and regular expressions to strip the answers from the solutions to homework one. First, download the solution. You might also want to look at a blog entry on perl slurping for hints. (3 extra points)

perl -e '$string = do { local ( $/ ); <> }; $string =~ s/<(code|pre)>.*?<\/\1>//gs; print $string;' < hmwk2.solution > hmwk2.question
September 19, 2009 · News · (No comments)

Several people have still been having problems installing all the python packages required for doing the exercises in the NLTK book, especially Windows users. After digging around a bit more, I have discovered another possible solution. Try installing the python distribution from Enthought. It comes pre-packaged with a bunch of different packages like matplotlib, numpy, and scipy. It is a big download, but it seemed to work ok for me. After installing that, I re-installed the nltk using the Windows .msi installer from the nltk website. Once you have these installed, you can run python using the IDLE program, which should be in your start menu.

If you want to use this version of python through cygwin, you need to add it to your path.
From cygwin, type:

nano ~/.bash_profile

And edit the file to include:

export PATH="/cygdrive/c/Python25:${PATH}"

Then quit and restart cygwin.

I hope this works for those people who have still been having problems.

September 18, 2009 · homework · 2 comments

This homework involves some more practice with regular expressions, as well as using lists with python, and making some word frequency measurements with the NLTK. It covers material up to Sep. 17th.

Regular expressions

  1. Use grep to print out all words from the celex.txt file which have a
    frequency of 100 or more, and either start with a-l, or end with m-z. Note that
    the orthography column comes right before the frequency column, so this is
    possible to do using a single regular expression. You should use one grouping
    and several character classes. Hint 1: use negative character classes to avoid
    getting stuff that is in more than one column. Hint 2: Consider what is
    different between the numbers 87 and 99, vs. the numbers 100 and 154.

    Here are some example words which should be included

    • at (starts with a-l)
    • yellow (ends with m-z)

    Here are some example words which should be excluded

    • omega (does not start with a-l, does not end with m-z)
    • abacus (starts with a-l, but has a frequency less than 100)

    (10 points)

Python

Create a python script called hmwk4.py, and put all your answers in this file.
You can use comments in the file if you like. Before each answer, print out the
question number, and a brief description. For example:

print('#2 - first 10 words in the holy grail')

  1. Print the first 10 words from monty python and the holy grail (text6). (3 points)
  2. Print the last 20 words from Moby Dick (text1). (4 points)
  3. Create a frequency distribution of the holy grail. Store it in the variable
    called moby_dist. (4 points)

  4. Print the number of times the word “Grail” occurs in this text (4 points)
  5. Print the most frequent word in the Holy Grail. (Hint: note that punctuation is counted as words by the NLTK, so it is possible that the most frequent word might actually be a punctuation mark). (4 points)
  6. Create a list which contains the word lengths of each word in the Holy
    Grail(text6). Store it in a variable called holy_lengths. Do the same for Moby
    Dick (text1), and store it in a variable called moby_lengths. (6 points)

  7. Create a frequency distribution of word lengths for Moby Dick and The Holy Grail. Store the distributions in variables called moby_len_dist and holy_len_dist respectively. (6 points)
  8. Print the most commonly occurring word length for Moby Dick and for The Holy Grail. (Use one command for each) (5 points)
  9. Calculate the mean word length for Moby Dick and The Holy Grail. You can use the sum() function to calculate the total number of characters in the text. For example, sum([22, 24, 3]) returns 49. Store the results in variables holy_mean_len and moby_mean_len respectively. (6 points)
  10. Create a list of words from Moby Dick which have more than 3 letters, and
    less than 7 letters. Store it in a variable called four_to_six. (8 points)

September 17, 2009 · slides · (No comments)

Here are the slides from today, which give examples of how to use lists in python, and how to calculate word frequency using the NLTK.

ling5200-nltk1-1-slides.pdf