WHY FORENSIC LINGUISTICS NEEDS CORPUS LINGUISTICS

While corpus linguistics has existed since the 1960s, Forensic Linguistics is a relatively new discipline, involving both linguistic evidence in court and wider applications of linguistics to legal texts and discourses. Computer corpora of natural language may be marked up in various ways, grammatically tagged, parsed, lemmatised and analysed with concordance, collocation and other specialist software. In the relatively short history of forensic linguistics, its exponents have often employed corpus linguistics techniques in order to throw light on questions like disputed authorship. However, the corpora employed have been general ones such as the Cobuild “Bank of English”, rather than purpose-built databases of language used in legal contexts, with the result that such research sometimes raises more questions than it answers. Conversely, corpus linguists have from time to time incorporated data from legal settings into their collections; but they have tended to use these resources as the basis for sociolinguistic or historical linguistic research rather than as a means of exploring topics in language and law. This paper makes a plea for these two fields, which are both already cross-disciplinary, to join forces and create a purpose-built corpus for forensic linguistics. It illustrates how corpus techniques may be successfully applied to questions of disputed authorship, citing both hypothetical and actual examples. It ends with an outline of the kinds of texts which a proposed new corpus for Forensic Linguistics should contain and the tools required to exploit it effectively.

and the journal Forensic Linguistics: the Journal of Speech, Language and the Law 2. The term can be said to have both a narrow and a broad definition. The former covers the use of linguistic evidence in court, concerning for example disputed confessions (Coulthard 1994), trademark disputes (Okawara, 2006 and forthcoming), threats and attempts at extortion (Shuy 1993), taped conversations allegedly offering bribes (Shuy 2005), suicide notes (Shapero, forthcoming), disputed authorship and alleged plagiarism (Kniffka 2000). The broader definition covers all areas of overlap between language and law, including courtroom interpreting (Berk-Seligson 1990), courtroom discourse (Solan 1993; Tiersma 1999), linguistic minorities in the legal process (Eades 1994) and children in the legal process (Walker 1999). These lists are by no means exhaustive but serve to give a flavour of the wide range of research currently being undertaken in Forensic Linguistics (henceforth FL).

Corpus Linguistics
The most concise definition available is probably that of Renouf: "The term 'corpus' will be used to refer to a collection of texts, of the written or spoken word, which is stored and processed on computer for the purposes of linguistic research." (Renouf, 1987:1) The first corpora in this sense were the Brown corpus (Kucera and Francis 1967) and the Lancaster-Oslo-Bergen (LOB) corpus (Garside, Leech and Sampson eds., 1987). Each consisted of 1 million words, of US and British English respectively, from published sources in the year 1961. Since then a number of general corpora have been built including the COBUILD Corpus, now known as the "Bank of English" (Sinclair ed., 1987); the British National Corpus (Burnard, L. ed., 1995) and the International Corpus of English (Greenbaum ed., 1996). As well as serving general linguistic research aimed at achieving a more accurate description of natural language, larger corpora have often been used for lexicographic purposes, most famously for the COBUILD range of dictionaries and grammar books such as the Collins Cobuild English Dictionary for Advanced Learners (2001).
2 The journal was founded in 1994 as Forensic Linguistics but the title changed in 2003 to The International Journal of Speech, Language and the Law to reflect a broadening of academic coverage and readership.
The field of corpus linguistics has also spawned a plethora of specialised corpora, including the International Corpus of Learner English (Granger, 1994); the CHILDES database (really a collection of sub-corpora) of child language (MacWhinney 1995); the Bergen Corpus of London Teenage Language (COLT) (Stenström et al. 1998); the Leeds Corpus of English Dialects (Klemola and Jones 1999) and the Helsinki corpus of English texts (Kytö 1994). Specialist corpora may be used to study a language at a particular period in time or in a particular region, or to examine the linguistic patterns in a particular author or text type.
Corpora rarely consist of plain text, although Sinclair's concept of a "monitor corpus" (Sinclair 1982) envisaged almost-raw text flowing through a series of software filters to extract information from it and then being discarded. More commonly, corpora are marked up with various kinds of information such as the sex of the speaker or the date of the text; less trivially they may incorporate part-of-speech tagging or higher constituent tagging (syntactic parsing). It may be considered desirable to lemmatise the text, in order to enable the linguist or lexicographer to retrieve all forms of a particular word in a single search expression; indeed, in highly-inflected languages such as Hungarian lemmatisation and consequent morphological mark-up are virtually essential (Pajzs 1991). In the case of parallel corpora, "hooks" into the translation equivalents in another language are embedded into the text (Botley et al. 2000). Markup may be carried out automatically or manually: usually some combination of the two is employed. Figure 1 illustrates the various forms related to the lemma "steal" while Figure 2 shows a fairly basic form of mark-up, "COCOA" tagging for the now-superseded Oxford Concordance Program (Hockey and Martin 1987), applied to the official court transcript of an English trial to label the speaker of each utterance and to mark certain text as "comment" to be excluded from any processing. Once a corpus has been created and marked up ready for exploitation, specialist software can be used to analyse it to produce wordlists, concordances, collocation sets and more: the CLAWS suite of programs for the LOB corpus and the CLAN suite for CHILDES are two well-known examples. Kirk (1994) gives a good overview of the various types of corpora, annotation and processing software.
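The benefit of lemmatisation described above, retrieving all forms of a word in a single search, can be sketched in a few lines of Python. This is a minimal illustration only: the lemma table LEMMA_FORMS and the function find_lemma are hypothetical names introduced here, and a working corpus would rely on a full lemmatiser rather than a hand-built mapping.

```python
import re

# Hypothetical, hand-built lemma table for illustration only; a real corpus
# would use a full lemmatiser or machine-readable dictionary.
LEMMA_FORMS = {
    "steal": {"steal", "steals", "stealing", "stole", "stolen"},
}


def find_lemma(text, lemma, lemma_forms=LEMMA_FORMS):
    """Return (token position, word form) for every token belonging to `lemma`."""
    forms = lemma_forms[lemma]
    # Very simple tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() in forms]


# Invented sample sentence, not taken from any corpus.
sample = "He stole the car. Stealing is wrong, and stolen goods were found."
hits = find_lemma(sample, "steal")
for position, form in hits:
    print(position, form)
```

A single query on the lemma "steal" thus retrieves "stole", "Stealing" and "stolen" without the researcher having to list each inflected form.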

Hypothetical scenario: The case of the disputed confession
Disputed confessions are a fairly frequent phenomenon in FL as narrowly defined. A typical problem involves a suspect denying that part or all of an incriminating statement consists of his/her own words. A forensic linguist tasked with evaluating the plausibility of such a claim will usually seek to obtain an undisputed sample from the suspect for comparison, along with an undisputed sample from anyone suspected of being the real source of the text, such as a police officer.
Let us examine a fictitious example of this "genre", in which the text in question contains 4 instances of a relatively rare lexical item, such as "vehicle", which does not appear at all in the defendant's undisputed statement. However, it appears 5 times in the police officer's witness statement, as shown in Table 1. The linguist's native-speaker knowledge of English tells her that "vehicle" is a rarer word than "car" and moreover belongs to a more formal register, likely to be favoured by officers of the law. It is tempting, faced with the evidence in Table 1, to conclude that the confession is the work of the police officer. However, such a conclusion would be unwarranted without taking into account the relative text size, as shown in Table 2. It now appears that the absence of the word "vehicle" in the undisputed statement may be due to its shorter length and not to the rarity of the word. The fact that "car" occurs here with a higher frequency than in the longer, disputed statement would seem to reinforce our original suspicions, but nonetheless it is hard to be sure. This is the kind of situation where a reliable corpus can be a godsend. The data in Table 3 are actual and not hypothetical. They indicate that for both British and US English the lexical item "car" is 4-5 times as frequent as "vehicle" in radio broadcasts, while in general spoken discourse it is 10 times as frequent. The forensic linguist can be confident, after all, that there is something distinctly odd about a text in which "vehicle" appears more frequently than "car", at least if its origins are supposed to be in speech rather than writing.
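The adjustment for text size discussed above amounts to converting raw counts into a rate per fixed number of words. The Python sketch below illustrates this. The counts of "vehicle" follow the fictitious scenario, but the text sizes and the counts of "car" are invented for illustration and are not the values of Tables 1 and 2; the function name per_thousand is likewise an assumption.

```python
def per_thousand(count, text_size):
    """Normalise a raw count to occurrences per 1,000 words of running text."""
    if text_size <= 0:
        raise ValueError("text_size must be a positive number of words")
    return 1000 * count / text_size


# Illustrative figures only: sizes and "car" counts are invented.
texts = {
    "disputed confession": {"size": 1000, "vehicle": 4, "car": 2},
    "undisputed statement": {"size": 250, "vehicle": 0, "car": 3},
    "police witness statement": {"size": 1200, "vehicle": 5, "car": 1},
}

rates = {
    name: {
        word: per_thousand(data[word], data["size"])
        for word in ("vehicle", "car")
    }
    for name, data in texts.items()
}

for name, by_word in rates.items():
    print(name, by_word)
```

With these invented figures, a zero count of "vehicle" in a 250-word statement is much weaker evidence than the raw counts suggest, which is why a comparison against a large reference corpus is needed before drawing conclusions.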

The Google "corpus" - a quick and dirty solution?
Some linguists now use search engines such as Google as a tool for checking the relative frequency of contrasting words in modern English, as a rough-and-ready general corpus; or as a means of demonstrating that phrases one might think were common are in fact unique to specific texts or speakers/writers. Subjecting "car" and "vehicle" to a Google enquiry for domain names ending in ".co.uk" and ".com", on the assumption that these will yield British and US data respectively, is likely to produce statistics like those in Table 4. It is gratifying to find that the general proportions of the lexical items under scrutiny are confirmed by a trawl of the World-Wide Web. However, we have no way of knowing the total "text size" of the pages from which these figures were returned. There are many questions which one can ask of a corpus but not of a search engine, such as "Do men and women use this word equally?" (requiring mark-up for the sex of the speaker/writer); "Is this feature more common in speech or writing?" (requiring control of the corpus collection), and "What words appear most frequently two places to the left of the key word?" (requiring collocation software). With part-of-speech tagging it is even possible to interrogate a corpus about particular usages of a syntactically ambiguous word, requesting all occurrences of, for instance, "judge" used as a verb but not a noun. None of this is possible with an Internet search engine.
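Two of the corpus queries mentioned above, collocates at a fixed position relative to the key word and retrieval restricted by part of speech, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of any existing concordancer: the function names, the sample sentence and the tag labels ("NOUN", "VERB" and so on) are invented here and do not reproduce the tagset of CLAWS or any other tagger.

```python
from collections import Counter


def collocates_at(tokens, keyword, offset):
    """Count the words found `offset` positions away from each occurrence
    of `keyword` (negative offsets look to the left)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok.lower() == keyword and 0 <= j < len(tokens):
            counts[tokens[j].lower()] += 1
    return counts


def occurrences_with_tag(tagged_tokens, word, tag):
    """Return positions where `word` carries the part-of-speech `tag`."""
    return [
        i
        for i, (w, t) in enumerate(tagged_tokens)
        if w.lower() == word and t == tag
    ]


# Invented sample data for illustration.
tokens = "he got into the car and drove the car away from the vehicle".split()
two_left_of_car = collocates_at(tokens, "car", -2)

# Hypothetical tag labels, not those of a specific tagger.
tagged = [
    ("the", "DET"), ("judge", "NOUN"), ("will", "AUX"),
    ("judge", "VERB"), ("the", "DET"), ("case", "NOUN"),
]
judge_as_verb = occurrences_with_tag(tagged, "judge", "VERB")

print(two_left_of_car.most_common())
print(judge_as_verb)
```

Both queries depend on having the running text, and in the second case its annotation, under the researcher's control, which is exactly what a search engine does not provide.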

Daniel Raphaie
Mr. Raphaie came from Iran to Britain in 1978 as a student. His first language is Farsi (Persian). He remained in the country, married, had a child, took various jobs and fell into bad company. In 1988 the police raided the flat of his former wife, where he was staying at the time. He was charged with dealing in drugs and stolen goods although no hard drugs were found in the flat: he was convicted solely on the statements of the police that at the time of the search he had admitted having just flushed a quantity of heroin down the toilet.
In a linguistic examination of the alleged incriminating statements by Raphaie, it was noted that these included the following: "Look I didn't want to get caught holding it." "Look I might get six or seven grammes, maybe more, every two days." There is no evidence from much longer, undisputed samples of Raphaie's speech that he ever uses "look" as a discourse marker; yet in this supposedly contemporaneous transcript of the search, which contains just 232 words attributed to him, he is supposed to have used it twice.
The Cobuild Bank of English was searched for instances of the word "look" used as a discourse marker: the results are shown in Table 5. It was found that the use of discourse marker "look" could be divided into primary and secondary usages (Blackwell, 2000). Primary occurrences were instances of the speaker using "look" directly in addressing the hearer, as in "Look, you've got to be here on Sunday." Secondary occurrences, by contrast, were examples of quoted speech in which use of the feature was being attributed to someone else, or to the speaker at some previous time: "Yeah but the ANC are saying look equality for blacks." Table 5 shows that secondary usage is twice as frequent as primary usage: in other words, "look" is twice as likely to appear in a reconstruction of someone's purported speech as in their actual original words.
This in itself might not be sufficient to support Mr. Raphaie's allegations that the supposed contemporaneous transcription of his speech at the time of the police raid was nothing of the kind. However, there is other evidence that the introduction of discourse markers is one way of giving a veneer of authenticity to texts which are not contemporaneous (Coulthard 1996). One may note, moreover, that "look" is a confrontational item, unlikely to be used by a suspect to a police officer. Finally, Lindsay and O'Connell (1995) have observed that transcribers tend to omit all discourse markers due to pressures of real-time writing and the lack of psychological saliency of such items for the hearer. The sum total of this linguistic and metalinguistic evidence was considered sufficient to discredit the police claims that the interview with Raphaie had been transcribed contemporaneously at the time of the search. This does not mean, of course, that the content was a total fabrication: it may have been based on a real speech event but written up some time afterwards. In that case, however, one is justified in asking why the police wrote up the alleged interaction in the Exhibits Book, claiming that this was the only book available to write in at the time the transcription was made.
This analysis of discourse markers was made available to Mr. Raphaie's legal team and submitted as part of the evidence to the Court of Appeal. It is believed to be the first occasion when an appeal was heard in the English courts on the grounds of linguistic evidence. In the event, Mr. Raphaie's appeal was allowed on legal grounds without the linguistic evidence being put before the court.

Eddie Gilfoyle
Eddie and Paula Gilfoyle were a married couple living in Upton, Wirral, England. On 4th June 1992 Paula Gilfoyle's body was found hanging in the garage of her home. She was eight and a half months pregnant. Despite the fact that a suicide note was found in Paula's handwriting, Eddie was prosecuted for her murder. The prosecution claimed that Eddie, a hospital nurse, had tricked Paula into writing the note and then murdered her, in effect using the note as his alibi. The jury believed this and convicted him. He and his family and friends are still protesting his innocence.
Goutsos (1995) compared the language of the problematic suicide note with samples of Eddie's writing and found a number of apparently incriminating phrases which were common to both, including "rebuild your life", "turn back the clock" and "if I could, I would". There was also a tendency in both texts to use couplets such as "cheated and lied", "family and friends", "pain and suffering" / "suffering and pain", "hurt and suffering" and "pain and heartache". It is tempting to conclude from this that Eddie was indeed the originator of the "suicide" note and had justly been convicted. However, as Table 6 (from Goutsos, 1995) shows, the Bank of English reveals that some of these phrases are common collocations in general use.
Worse was to come. Goutsos recollects the investigations of the Birmingham University Forensic Linguistics group: "We found that the surprising phrase Goodnight and God bless which appeared in the closing off section of the disputed suicide texts is in fact a common feature of death announcements in the press of the area where the texts originated." (Goutsos 1995:108)
Further problems emerged when the nature of the texts being compared with each other was taken into account: "One major problem is that our corpora were significantly skewed. The texts involved were not alike with regard to almost any parameter among the components of speech events as formulated by Hymes (1974) … To achieve register objectivity, we would have to refer to comparable corpora with different variables such as a corpus of letters written by other people or a corpus of suicide notes." (Goutsos, 1995:107)
Thus, although at first sight there had appeared to be a number of incriminating similarities between Eddie's language and that of the suicide note, on closer investigation this conclusion was not justified. In the first place, the phrases are not so unusual in colloquial English; secondly, it has to be borne in mind that people who live together intimately probably tend to converge in their language use; and thirdly, the linguists were not comparing like with like and did not have corpora which would enable them to do so.
The outcome for Mr. Gilfoyle was a less happy one than for Mr. Raphaie: his two appeals against conviction were unsuccessful and he remained in jail, vehemently protesting his innocence.

General vs. Specialised Corpora for Forensic Linguistics
While general corpora such as the British National Corpus or Bank of English may be adequate for some FL purposes, as in the Raphaie case, it is clear from Goutsos' remarks cited above that such corpora cannot answer questions such as whether or not "Goodnight and God bless" is a likely way to end a suicide note. There is a need for a specialised database of texts which can be used to research issues in language and law. A wide range of issues could be investigated with such a resource, such as differences between the language of prosecution and defence lawyers, or between expert witnesses and eye-witnesses. The language of judges, which has already been the focus of examination (Solan 1993), could be studied more effectively if a machine-readable, marked-up corpus were available to researchers.
Admittedly some legal language is already available in corpus form, most notably the proceedings of the Old Bailey from 1674-1834 which have recently been placed online. However, to date this material has been used mainly for historical linguistic and sociolinguistic research, as a rich source of information on variation and change at the interface of early modern and modern English in London. Dr Magnus Huber of the University of Giessen, for instance, is exploiting the Old Bailey data to analyse differences correlating with the social parameters of age, gender, place of origin and social status (Huber 2007). Similarly, the International Corpus of English (ICE) contains ten 2,000-word texts of "legal presentations" in each category (Nelson 1996), but the main purpose of this is to compare the language of such presentations across various parts of the world in which English is spoken rather than to investigate the language of the courtroom per se.
It is to be hoped that some of the existing collections of legal texts can be incorporated into the proposed corpus for forensic linguistic research.
Table 6 offers a tentative list of text types which might usefully be included in a specialist FL corpus. Table 7 indicates some of the features which it would be desirable to include in the mark-up of such a corpus.
sex of text originator
role of text originator (e.g. suspect, eye witness, police officer)
first language of text originator
other languages spoken/used by text originator
how long text originator has been resident in the country concerned
text type (statement, court proceedings etc.)
whether text was written by text originator or transcribed by someone else
if transcribed: whether from speech or tape recording
age, sex and role of transcriber(s)
if written: details of writing (e.g. handwritten; typewritten; word-processed)
if word-processed, were spell- or style-checkers available?
purpose of the text (e.g. evidence in court, background info. for solicitor)
whether other texts from same text originator exist in the corpus
3 I have used the term "text originator" to indicate the person responsible for the language of the text in question. This is not necessarily the same as the person who writes it, as can be seen from the Gilfoyle case where a suicide note was allegedly dictated by Eddie, the text originator and alleged perpetrator, to Paula, the scribe and alleged victim.
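The annotation features listed above could be held as a structured header for each text, so that sub-corpora can be selected by any combination of features. The Python sketch below is one possible arrangement and not a proposed standard: the class name TextRecord, its field names, the helper select and the two sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextRecord:
    """Hypothetical metadata header for one text in an FL corpus."""
    text_id: str
    text_type: str                      # e.g. "statement", "court proceedings"
    purpose: str                        # e.g. "evidence in court"
    originator_role: str                # e.g. "suspect", "police officer"
    originator_age: Optional[int] = None
    originator_sex: Optional[str] = None
    first_language: Optional[str] = None
    other_languages: List[str] = field(default_factory=list)
    years_resident: Optional[float] = None
    transcribed: bool = False           # False: written by the originator
    transcribed_from: Optional[str] = None   # "speech" or "tape recording"
    transcriber_details: Optional[str] = None
    writing_mode: Optional[str] = None  # "handwritten", "typewritten", ...
    checkers_available: Optional[bool] = None
    related_text_ids: List[str] = field(default_factory=list)


def select(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Invented sample records for illustration.
records = [
    TextRecord("T001", "statement", "evidence in court", "suspect",
               first_language="Farsi", transcribed=True,
               transcribed_from="speech"),
    TextRecord("T002", "statement", "evidence in court", "police officer",
               first_language="English", writing_mode="handwritten"),
]

suspect_statements = select(records, text_type="statement",
                            originator_role="suspect")
print([r.text_id for r in suspect_statements])
```

Keeping the originator and the transcriber as separate fields reflects the distinction drawn above between the person responsible for the language of a text and the person who writes it down.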

Conclusion
This paper has attempted to provide an overview of the kinds of problems which face the forensic linguist. While some questions may be resolved satisfactorily by reference to language data readily available on the Internet or in a general machine-readable corpus, there remain thorny issues which can only be discussed properly when a specialist corpus for language and law becomes available. We cannot state with any degree of confidence that a disputed suicide note is a forgery until we have an idea of what a "normal" suicide note looks like. We may be sure that a particular utterance, such as "I then proceeded to exit the vehicle", is so formal that it can only have been produced by a police officer; but a cross-examining lawyer is likely to put it to any expert witness stating this that perhaps in the formal setting of a police interview, suspects (especially seasoned ones with previous experience of such speech events) are likely to accommodate their language to that of the interviewing officer. When a person's liberty and reputation are at stake, mere linguistic intuition is not good enough: there needs to be a solid basis on which to draw reasonable conclusions, preferably supported by quantitative data which can be subjected to statistical tests. The construction of a purpose-built corpus for research in language and law will not only nurture the academic curiosity of linguists, but should serve the wider interests of justice.

Bibliography
Berk-Seligson, S., 1990, The Bilingual Courtroom: Court Interpreters in the Judicial Process. Chicago: University of Chicago Press.

Figure 2: Corpus data with mark-up

Table 3: Frequencies of "car" and "vehicle" in the Bank of English

Table 4: Results of a search for "car" and "vehicle" using Google

Table 5: Discourse marker "Look" in the Bank of English

Table 6: Phrases from the "suicide note" in the Bank of English

Table 6: A corpus for Forensic Linguistics: Text types

Table 7: A corpus for Forensic Linguistics: annotation
age of text originator