Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 2 solution – More UNIX basics and regular expressions

September 14, 2009 · homework

Overall, most students did quite well. Comments are in your svn directories.

mean	54.4
standard deviation	6.24

Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)

grep GOOGLE celex.txt | cut -f2 -d '\'
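
To see how the pipe hands the output of grep to cut, here is a minimal sketch on a made-up sample. The column layout (id\orthography\frequency\source) and the three entries are assumptions for illustration only; the real celex.txt may be laid out differently.

```shell
# Hypothetical backslash-delimited sample; not the real celex.txt.
sample=$(mktemp)
printf '%s\n' '1\apple\120\GOOGLE' '2\pear\45\CELEX' '3\plum\300\GOOGLE' > "$sample"

# grep keeps the lines that mention GOOGLE; cut then prints field 2 only.
words=$(grep GOOGLE "$sample" | cut -f2 -d '\')
printf '%s\n' "$words"
# prints:
# apple
# plum

rm -f "$sample"
```
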
Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)

grep GOOGLE celex.txt | cut -f2,3 -d '\' | sort -t '\' -k 2,2rn | head -n 50
OR
grep GOOGLE celex.txt | sort -t '\' -k 3,3rn | head -n 50 | cut -f 2,3 -d '\'
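
The n in -k 2,2rn matters: without it sort compares the frequencies as strings. A small sketch on made-up word\frequency pairs (the data are hypothetical) shows the difference.

```shell
# Hypothetical word\frequency pairs, not taken from celex.txt.
sample=$(mktemp)
printf '%s\n' 'apple\120' 'pear\45' 'plum\300' 'fig\9' > "$sample"

# Numeric, reversed: the largest frequency comes first.
top2=$(sort -t '\' -k 2,2rn "$sample" | head -n 2)
printf '%s\n' "$top2"
# prints:
# plum\300
# apple\120

# Without n the comparison is character by character, so 9 outranks 300.
lexical_top=$(sort -t '\' -k 2,2r "$sample" | head -n 1)
printf '%s\n' "$lexical_top"
# prints:
# fig\9

rm -f "$sample"
```
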
Use UNIX commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
grep -Ec '^[AEIOU][A-Z-]*,' devilsDictionary.txt
227
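
The hyphen inside the bracket expression presumably lets the pattern match hyphenated headwords. A quick sketch on made-up entries (the headwords and definitions are invented, not quoted from the dictionary) shows what changes when it is left out.

```shell
# Hypothetical entries in a HEADWORD, pos. definition layout.
sample=$(mktemp)
printf '%s\n' \
  'ABLE, adj. Made-up definition.' \
  'EGG, n. Made-up definition.' \
  'BOX, n. Made-up definition.' \
  'ICE-CREAM, n. Made-up definition.' > "$sample"

# A trailing - in a bracket expression is a literal hyphen.
with_hyphen=$(grep -Ec '^[AEIOU][A-Z-]*,' "$sample")
# Without it, ICE-CREAM stops matching at the hyphen.
without_hyphen=$(grep -Ec '^[AEIOU][A-Z]*,' "$sample")

echo "$with_hyphen $without_hyphen"
# prints: 3 2

rm -f "$sample"
```
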
Use UNIX commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. HINT: You will need to use subshells and bc. (10 points)
entries=`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|wc -l`
# wc -c counts newlines too, so strip them before counting the letters
letters=`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|tr -d '\n'|wc -c`
echo "$letters/$entries"|bc -l

OR, in one fell swoop

echo "`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|tr -d '\n'|wc -c`/`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|wc -l`"|bc -l
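
One detail worth checking in this kind of calculation is that wc -c counts every character, including the newline that ends each line. A minimal sketch on three made-up headwords (hypothetical data, and scale=2 is used here instead of bc -l only to shorten the output):

```shell
# Three made-up headwords, one per line: 2 + 4 + 3 = 9 letters.
words=$(printf '%s\n' 'AB' 'ABCD' 'XYZ')

entries=$(printf '%s\n' "$words" | wc -l)
# Counts the 3 newlines as well as the 9 letters.
with_newlines=$(printf '%s\n' "$words" | wc -c)
# Deleting the newlines first leaves only the letters.
letters=$(printf '%s\n' "$words" | tr -d '\n' | wc -c)

echo "$entries $with_newlines $letters"
# prints: 3 12 9 (some wc implementations pad the numbers with spaces)

# 9 letters over 3 entries.
echo "scale=2; $letters/$entries" | bc
# prints: 3.00
```
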
Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
noun=`grep -cE '^[A-Z]+, n\.' devilsDictionary.txt`
verb=`grep -cE '^[A-Z]+, v\.' devilsDictionary.txt`
adj=`grep -cE '^[A-Z]+, adj\.' devilsDictionary.txt`
echo "nouns: $noun, verbs: $verb, adjectives: $adj"
Print out all the entries (not the definitions) that are not adjectives, nouns, or verbs. HINT: use grep more than once. (10 points)
grep -E '^[A-Z]+, ' devilsDictionary.txt |grep -vE '^[A-Z]+, (v|n|adj)\.' | cut -f1 -d '.'
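
The first grep selects the entry lines and the second, with -v, throws away the ones tagged v., n., or adj. A sketch on made-up entries (headwords, tags, and definitions are invented for illustration):

```shell
# Hypothetical entries in a HEADWORD, pos. definition layout.
sample=$(mktemp)
printf '%s\n' \
  'APPLE, n. Made-up definition.' \
  'AMBLE, v. Made-up definition.' \
  'AZURE, adj. Made-up definition.' \
  'ALAS, interj. Made-up definition.' \
  'AMID, prep. Made-up definition.' > "$sample"

# Keep entry lines, drop nouns, verbs, and adjectives, then cut at the first period.
others=$(grep -E '^[A-Z]+, ' "$sample" | grep -vE '^[A-Z]+, (v|n|adj)\.' | cut -f1 -d '.')
printf '%s\n' "$others"
# prints:
# ALAS, interj
# AMID, prep

rm -f "$sample"
```
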
Write a UNIX pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
cut -f2 -d '\' celex.txt |grep -Eic 'q[^u]'
EVEN BETTER (this also counts words that end in q, where no character follows it)
cut -f2 -d '\' celex.txt |grep -Eic 'q([^u]|$)'
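
The two patterns differ only for a q at the very end of a word: [^u] needs some character after the q, while ([^u]|$) also accepts the end of the line. A sketch on a made-up word list (hypothetical, one orthography per line as cut would deliver it):

```shell
# Hypothetical word list, not taken from celex.txt.
words=$(printf '%s\n' 'queen' 'Iraq' 'qat' 'Qatar' 'quiz')

# Matches qat and Qatar only; the final q in Iraq has nothing after it.
without_anchor=$(printf '%s\n' "$words" | grep -Eic 'q[^u]')
# Also matches Iraq, because $ accepts the end of the line.
with_anchor=$(printf '%s\n' "$words" | grep -Eic 'q([^u]|$)')

echo "$without_anchor $with_anchor"
# prints: 2 3
```
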
Extra credit

Write a UNIX pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit (3 extra points). (Hint: use dc)

echo "`grep -oE '[0-9]+ points' hmwk2.solution |cut -d ' ' -f1` ++++++ p"|dc

Written by Robert Felty
