

By May 15, 2020 | Study Abroad Consulting

CS1210 Computer Science I: Fundamentals
Homework 2: Handling Text
Due Friday, October 25 at 11:59PM

Introduction

In this assignment, we’re going to develop some code that relies heavily on the string datatype as well as all sorts of iteration. First, a few general points.

(1) This is a challenging project, and you have been given two weeks to work on it. If you wait to begin, you will almost surely fail to complete it. The best strategy for success is to work on the project a little bit every day.

(2) The work you hand in should be only your own; you are not to work with or discuss your work with any other student. Sharing your code or referring to code produced by others is a violation of the student honor code and will be dealt with accordingly.

(3) Help is always available from the TAs or the instructor during their posted office hours. You may also post general questions on the Piazza discussion board (although you should never post your Python code).

Background

In this assignment we will be processing text. With this handout, you will find a file containing the short story entitled The Cat by Mary E. Wilkins Freeman, written in 1901, to test your code. At some point during the course of this assignment, I will provide additional texts for you to test your code on; updated versions of this handout may also be distributed as needed. You should think of this project as building tools to read in, manipulate, and analyze these texts.

The rest of these instructions outline the functions that you should implement, describing their input/output behaviors. As usual, you should start by completing the hawkid() function so that we may properly credit you for your work. Test hawkid() to ensure it in fact returns your own hawkid as the only element in a single-element tuple. As you work on each function, test your work on the document provided to make sure your code functions as expected.
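The single-element tuple that hawkid() is expected to return is easy to get wrong, because in Python it is the trailing comma, not the parentheses, that makes a tuple. A minimal sketch, where 'jdoe' is a made-up placeholder rather than a real hawkid:

```python
def hawkid():
    """Return the author's hawkid as the only element of a one-element tuple."""
    # 'jdoe' is a hypothetical placeholder; substitute your own hawkid.
    return ("jdoe",)  # the trailing comma is what makes this a tuple


# Without the comma, the parentheses are just grouping: this is a plain string.
not_a_tuple = ("jdoe")
```

Checking type(hawkid()) and len(hawkid()) in the interpreter is a quick way to confirm the return value has the required shape.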
Feel free to upload versions of your code as you go; we only grade the last version uploaded, so this practice allows you to “lock in” working partial solutions prior to the deadline.

Finally, some general guidance.

(1) You will be graded on both the correctness and the quality of your code, including the quality of your comments!

(2) As usual, respect the function signatures provided.

(3) The template file has been pared down; you should take responsibility for providing the appropriate level of documentation.

(4) Be careful with iteration; choose the most appropriate form of iteration (comprehension, while, or for) as the function mandates. Poorly selected iterative forms may be graded down, even if they work!

Finally, you may feel free to add new functions that you feel are necessary to complete the functionality described herein. Comment appropriately!

def getText(file):

This function should open the file named file, and return the contents of the file formatted as a single string. During processing, you should (i) remove any blank lines, (ii) remove any lines consisting entirely of CAPITALIZED WORDS[1], and (iii) replace any explicit '\n' (newline) characters with spaces unless directly preceded by a '-' (hyphen), in which case you should simply remove both the hyphen and the newline, restoring the original word.

def flushMarks(text):

This function should take as input a string such as what might be returned by getText() and return a new string with the following modifications to the input:

  • Remove possessives, i.e., "'s" at the end of a word;
  • Remove ')', '(', ',', ':', '-', '_', '…', '"' and "'"; and
  • Replace '!', '?' and ';' with '.'

A condition of this function is that it should be easy to change or extend the substitutions made.
In other words, a function that steps through each of these substitutions in an open-coded fashion will not get full credit; write your function so that the substitutions can be modified or extended without having to significantly alter the code. Here’s a hint: if your code for this function is more than a few lines long, you’re probably not doing it right.

def extractWords(text, i=0, k=None):

This function should take as input a string such as might be returned by flushMarks() and return an ordered list of words extracted from the input string. The words returned should all be lowercase, and should contain only characters, no punctuation. You can think of the i and k arguments as having a similar, but not identical, function to the arguments to range(). If left unspecified, the default behavior is to return all the words in the input text. Otherwise, return a list of words starting with the ith word through, but not including, the (i + k)th word of the input (if k is None, the default, then return all of the words in the input starting with the ith word through the end).

def extractSentences(text, i=0, k=None):

This function returns a list of sentences, where each sentence is defined as a string terminated by a '.', although the defining '.' itself is removed in the course of processing. The significance of i and k is the same as for extractWords(); that is, they restrict the text to a passage from the ith word up to, but not including, the (i + k)th word. Note that, depending on i and k, the first and/or last sentences returned may actually be sentence fragments.

def countSyllables(word):

This function, which is provided (i.e., you don’t need to write it), takes as input a string representing a word (such as one of the words in the output from extractWords()), and returns an integer representing the number of syllables in that word. One problem is that the definition of syllable is unclear.
As it turns out, syllables are amazingly difficult to define in English (this may well be the topic of a future assignment). The code provided here defines a syllable as follows. First, we strip any trailing 's' or 'e' from the word (the final 'e' in English is often, but not always, silent). Next, we scan the word from beginning to end, counting each transition between a consonant and a vowel, where vowels are defined as the letters 'a', 'e', 'i', 'o' and 'u'. So, for example, if the word is “creeps,” we strip the trailing 's' to get “creep” and count one leading vowel (the 'e' following the 'r'), or a single syllable. Thus:

    >>> countSyllables('creeps')
    1
    >>> countSyllables('devotion')
    3
    >>> countSyllables('cry')
    1

The last example hints at the special status of the letter 'y', which is considered a vowel when it follows a non-vowel, but considered a non-vowel when it follows a vowel. So, for example:

    >>> countSyllables('coyote')
    2

Here, the 'y' is a non-vowel, so the two 'o's correspond to 2 transitions, or 2 syllables (don’t forget we stripped the trailing 'e'). And while that’s not really right ('coyote' has 3 syllables, because the final 'e' is not silent here), it does properly recognize that the 'y' is acting as a consonant.

[1] To understand why we remove lines consisting entirely of CAPITALIZED WORDS, inspect the wind.txt sample file provided. Notice that the frontispiece (title, index and so on) consists of ALL CAPS, and each CHAPTER TITLE also appears on a line in ALL CAPS. Removing these lines leaves just the text of the story.

You will find this definition of syllable works pretty well for simple words, but fails for more complex words; English is a complex language with many orthographic bloodlines, so it may be unreasonable to expect a simple definition of syllable!
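The provided countSyllables() is not reproduced in this handout, so the following is only a sketch of the rule as described in prose. It rests on two assumptions that the text does not state outright: at most one trailing 's' or 'e' is stripped, and a vowel at the very start of a word counts as a transition. The name count_syllables_sketch is hypothetical.

```python
VOWELS = "aeiou"


def count_syllables_sketch(word):
    """Approximate syllable count following the rule described in the handout."""
    word = word.lower()
    # Assumption: strip at most one trailing 's' or 'e'.
    if word and word[-1] in "se":
        word = word[:-1]
    count = 0
    prev_is_vowel = False  # nothing precedes the first letter
    for ch in word:
        if ch == "y":
            # 'y' acts as a vowel only when it follows a non-vowel.
            is_vowel = not prev_is_vowel
        else:
            is_vowel = ch in VOWELS
        # Count each move from a non-vowel into a vowel.
        if is_vowel and not prev_is_vowel:
            count += 1
        prev_is_vowel = is_vowel
    return count
```

Under these assumptions the sketch agrees with the results quoted for 'creeps' (1), 'devotion' (3), 'cry' (1) and 'coyote' (2); the provided function may still differ on other words.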
Consider, for example:

    >>> countSyllables('consumes')
    3
    >>> countSyllables('splashes')
    2

Here, it is tempting to treat the trailing -es as something else to strip, but that would cause 'splashes' to have only a single syllable. Clearly, our solution fails under some conditions; but I would argue it is close enough for our intended use.

Readability Formulae

Next, we turn our attention to computing a variety of readability indexes. Readability indexes have been used since the early 1900s to determine if the language used in a book or manual is too hard for a particular audience. At that time, of course, most of the population didn’t have a high school degree, so employers and the military were concerned that their instructions or manuals might be too difficult to read. Note that the versions of these formulae used in this assignment may deviate slightly from what you may find on the web.

def lix(text, i=0, k=None):

The Läsbarhetsindex Swedish Readability Formula, or LIX, like all the indexes here, is based on a sample of the text. By default, we’ll compute the LIX test over the whole input, but if you want to run it only over a subset of the text, use i and k to restrict which section of the text (in words) is considered. The LIX formula is:

$$\mathrm{lix} = \frac{\mathrm{wrd}}{\mathrm{snt}} + \frac{100 \times \mathrm{lng}}{\mathrm{wrd}}$$

where wrd is the number of words in the sample, snt is the number of sentences in the sample, and lng is the number of words in the sample that exceed 6 characters. You should have no problem computing the LIX formula for a text consisting of only complete sentences; any sentence fragments at the beginning and/or the end of the text sample should be counted as one additional sentence (particularly important if we use i and k to restrict the range for the LIX).

def fog(text, i=0, k=None, csyl=countSyllables):

Gunning’s Fog Index, or FOG, is defined as:

$$\mathrm{fog} = 0.4\,(\mathrm{asl} + \mathrm{phw})$$

where asl is the average sentence length (in words) in the sample, and phw is the percentage of words in the sample that are 3 or more syllables long. Thanks to the complicated history of the English language, counting syllables is extremely complicated. For this assignment, I’m providing you with a syllable counting function countSyllables().
In future assignments, I might ask you to create a new function that uses a different definition of what a syllable is; for this reason, your version of fog() should take (as an optional argument) the particular syllable counting function you wish to use.

def srs(text, i=0, k=None, csyl=countSyllables):

The Smog Readability Score, like Gunning’s Fog Index, also relies on the notion of “hard” words, but combines it with a sentence count:

$$\mathrm{srs} = 1.043 \times \sqrt{30 \times \frac{\mathrm{hrd}}{\mathrm{snt}}} + 3.1291$$

where hrd is the number of hard words in the sample and snt is the number of sentences in the sample. As with LIX, some care is required when handling sentence fragments in the sample.

Testing Your Code

I have provided a function, evalText(), that you can use to manage the process of evaluating a piece of text.

    def evalText(file='wind.txt', i=0, k=None, csyl=countSyllables):
        text = flushMarks(getText(file))
        print("Evaluating {}:".format(file.upper()))
        print("{:5.2f} Lix Readability Formula".format(lix(text, i, k)))
        print("{:5.2f} Gunning's Fog Index".format(fog(text, i, k, csyl)))
        print("{:5.2f} Smog Readability Score".format(srs(text, i, k, csyl)))

Feel free to comment out readability indexes you haven’t yet tried to use.
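As a sanity check on the three formulae in this handout, the arithmetic can be separated from the text processing. The sketch below assumes the counts (wrd, snt, lng, hrd) and the statistics (asl, phw) have already been computed by other code. The helper names lix_from_counts, fog_from_stats and srs_from_counts are hypothetical and not part of the assignment template, and the sample numbers are made up. Treating "hard" words as words of 3 or more syllables is also an assumption.

```python
import math


def lix_from_counts(wrd, snt, lng):
    """LIX from word count, sentence count, and count of words over 6 characters."""
    return wrd / snt + (100 * lng) / wrd


def fog_from_stats(asl, phw):
    """Gunning's Fog Index from average sentence length and percent hard words."""
    return 0.4 * (asl + phw)


def srs_from_counts(hrd, snt):
    """Smog Readability Score from hard-word count and sentence count."""
    return 1.043 * math.sqrt(30 * hrd / snt) + 3.1291


# Made-up sample: 100 words, 5 sentences, 20 long words, 10 hard words.
wrd, snt, lng, hrd = 100, 5, 20, 10
asl = wrd / snt        # average sentence length in words
phw = 100 * hrd / wrd  # percentage of hard words (assumed: 3+ syllables)

print("{:5.2f} Lix".format(lix_from_counts(wrd, snt, lng)))
print("{:5.2f} Fog".format(fog_from_stats(asl, phw)))
print("{:5.2f} Smog".format(srs_from_counts(hrd, snt)))
```

Keeping the formulae in small pure functions like these makes it possible to check the arithmetic by hand before worrying about how sentence fragments affect snt.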


Author admin
