The following is a list of course-final assignments. Choose one and send your answers, together with the code you used to produce them, to Albert at aventayolboada@ucsb.edu by 24 March 2023, 15:00 PST.
Jackendoff (1997) investigates what he calls the ‘time’-away construction, which is exemplified by utterances such as Bill slept the afternoon away, We’re twisting the night away, or The two of them would drink the week away. Try to write a regular expression that can retrieve such instances from the SGML-annotated version of the BNC. To simplify things, look for sequences of a verb, followed by the (with its tag), followed by a singular or plural form of any of the nouns morning, noon, afternoon, evening, night, day, week, month, or year, followed by away tagged as an adverb.
How many instances are there? What is the distribution of the time nouns, what time scale do most of them refer to, and what is the semantic implication of this construction?
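As a starting point, the search described above could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not stated in the task: that the SGML files mark words as `<w TAG>form ` (as in the BNC World Edition), that verbs carry CLAWS5 tags beginning with V, that the is tagged AT0, nouns NN1/NN2, and away AV0. The names TIME_AWAY and count_time_away are made up for illustration; check the tags against your own files before relying on the counts.

```python
import re
from collections import Counter

# Time nouns listed in the task; a following optional "s" covers the plural forms.
TIME_NOUNS = "morning|noon|afternoon|evening|night|day|week|month|year"

# Assumed annotation format: <w TAG>form (BNC World Edition style, CLAWS5 tags).
TIME_AWAY = re.compile(
    r"<w V[A-Z0-9-]+>[^<]+"                  # any verb tag plus the verb form
    r"<w AT0>the\s+"                         # the article "the" with its tag
    r"<w NN[12]>(" + TIME_NOUNS + r")s?\s+"  # singular or plural time noun
    r"<w AV0>away\b",                        # "away" tagged as a general adverb
    re.IGNORECASE,
)

def count_time_away(text):
    """Return a Counter of the time nouns found in the construction."""
    return Counter(m.group(1).lower() for m in TIME_AWAY.finditer(text))

# Invented sample line in the assumed format, not taken from the corpus.
sample = (
    "<w NP0>Bill <w VVD>slept <w AT0>the <w NN1>afternoon <w AV0>away<c PUN>. "
    "<w PNP>We <w VBB>'re <w VVG>twisting <w AT0>the <w NN1>night <w AV0>away<c PUN>. "
    "<w PNP>They <w VVD>drank <w AT0>the <w NN2>weeks <w AV0>away<c PUN>."
)
print(count_time_away(sample))
```

Because the pattern only accepts the plain tags AT0, NN1, NN2, and AV0, tokens with ambiguity tags or with away tagged as a particle would be missed; whether to widen the pattern is a decision to justify in your answer.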
Another partially lexically-filled argument structure construction in English is the so-called into-causative:
How many occurrences of the into-causative can you find in the SGML files representing the British National Corpus? Use the part-of-speech tags; I recommend using them to look for into (tagged as a preposition) followed by a word that ends in ing (tagged as a verb). Also, answer the following questions:
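The recommended tag-based search for into plus an ing-form could be sketched in Python as below. The sketch assumes the `<w TAG>form ` annotation format and the CLAWS5 tag PRP for prepositions, with verb tags beginning with V; INTO_ING and find_into_ing are hypothetical names, and the sample line is invented.

```python
import re

# Assumed format: <w TAG>form ; "into" as preposition (PRP), then a verb-tagged ing-form.
INTO_ING = re.compile(
    r"<w PRP>into\s+"                    # "into" tagged as a preposition
    r"<w V[A-Z0-9-]+>([^<\s]*ing)\b",    # verb-tagged word ending in "ing"
    re.IGNORECASE,
)

def find_into_ing(text):
    """Return the ing-forms that directly follow prepositional 'into'."""
    return [m.group(1).lower() for m in INTO_ING.finditer(text)]

# Invented sample in the assumed format.
sample = (
    "<w PNP>He <w VVD>tricked <w PNP>her <w PRP>into <w VVG>buying <w AT0>the <w NN1>car<c PUN>. "
    "<w PNP>She <w VVD>walked <w PRP>into <w AT0>the <w NN1>building<c PUN>."
)
print(find_into_ing(sample))
```

A pattern like this retrieves candidates only; hits still need to be checked by hand, since not every into followed by an ing-form is a causative.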
What is the semantic prosody of the (verb and/or noun) lemma CAUSE? Specifically, does CAUSE go with largely positive, neutral, or negative collocates, and with which ones in particular? Answer this question by exploring and discussing only the noun and verb collocates of CAUSE (excluding forms of BE, DO, HAVE, and modal verbs) that occur within 4 words to the left or right of CAUSE in the 4 BNC SGML files.
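One way to collect the collocates for this task is a simple window count, sketched here in Python. The sketch assumes the `<w TAG>form ` format without lemma annotation, so CAUSE is matched via its four word forms, and it assumes CLAWS5 tags: keeping only tags that start with NN or VV retains nouns and lexical verbs while leaving out BE (VB), DO (VD), HAVE (VH), and modals (VM0). The function name cause_collocates and the sample line are invented for illustration.

```python
import re
from collections import Counter

# Assumed format: <w TAG>form ; punctuation (<c ...>) is ignored, so the span counts words only.
TOKEN = re.compile(r"<w ([A-Z0-9-]+)>([^<]+)")
CAUSE_FORMS = {"cause", "causes", "caused", "causing"}
KEEP = ("NN", "VV")  # nouns and lexical verbs only

def cause_collocates(text, span=4):
    """Count noun and lexical-verb forms within `span` words of CAUSE."""
    tokens = [(tag, form.strip().lower()) for tag, form in TOKEN.findall(text)]
    counts = Counter()
    for i, (tag, form) in enumerate(tokens):
        # The node word must be a noun or lexical verb form of CAUSE.
        if form not in CAUSE_FORMS or not tag.startswith(KEEP):
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        for ctag, cform in window:
            if ctag.startswith(KEEP):
                counts[cform] += 1
    return counts

# Invented sample in the assumed format.
sample = (
    "<w AT0>The <w NN1>storm <w VVD>caused <w AJ0>serious <w NN1>damage "
    "<w CJC>and <w VM0>may <w VVI>cause <w NN2>delays<c PUN>."
)
print(cause_collocates(sample).most_common())
```

The sketch counts word forms rather than lemmas and lets the window cross sentence boundaries; both simplifications are worth revisiting before interpreting the results.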
It has been proposed that the ‘thematic concentration’ of a corpus of files, or of just one file, can be quantified using the logic underlying the h-index, which is often used to quantify academics’ scholarly ‘productivity and impact’. The ‘thematic concentration’ of a file can be computed by
(By analogy, for academics’ scholarly ‘productivity and impact’, Google Scholar defines the h-index as the largest number h such that h publications have at least h citations.)
Your task is to look at the ICE-GB corpus and find the spoken corpus file and the written corpus file with the highest thematic concentration, i.e. the file with the largest proportion of word types that have a frequency greater than their ranks.
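The two measures mentioned here can be sketched in Python as follows. The first function follows the Google Scholar definition of the h-index quoted above; the second follows the wording of the task (the proportion of word types whose frequency is greater than their rank when types are sorted by descending frequency). Reading the measure this way is an assumption based on that wording, and the function names and toy data are hypothetical. Tokenizing the ICE-GB files is left out.

```python
from collections import Counter

def h_index(values):
    """Largest h such that h items have a value of at least h."""
    h = 0
    for rank, value in enumerate(sorted(values, reverse=True), start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

def thematic_concentration(tokens):
    """Proportion of word types whose frequency exceeds their frequency rank."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if not freqs:
        return 0.0
    above = sum(1 for rank, freq in enumerate(freqs, start=1) if freq > rank)
    return above / len(freqs)

# Toy data, not from any corpus.
citations = [10, 8, 5, 4, 3]
words = "a a a a b b b c c d".split()
print(h_index(citations))
print(thematic_concentration(words))
```

To answer the task, a function like thematic_concentration would be applied to each file's word tokens, and the maximum taken separately over the spoken and the written files.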