Superdiversity and corpus linguistics

Caroline Tagg, with Christian Mair and Rachelle Vessey

In June 2015, two events were held to explore the possible synergies between superdiversity – used initially by Steven Vertovec to describe a new ‘diversification of diversity’ in contemporary European cities – and corpus linguistics, an approach to language that starts with large databases of texts and explores the frequency, distribution and co-occurrence of linguistic features within them. The two speakers, Professor Christian Mair from the University of Freiburg and Dr Rachelle Vessey from Newcastle University, are linguists researching different language corpora who both came to important conclusions regarding the potential relevance of superdiversity for their work. The questions that the seminars sought to explore are twofold. To what extent should linguists building English-language corpora, for example, take into account the superdiverse contexts in which English is being used? And to what extent can corpus linguistics be used to understand language use in superdiverse contexts?

Dr Rachelle Vessey

Rachelle Vessey’s work on Canadian French and English language Tweets highlights the role that corpus linguistics can play in identifying the recurring linguistic and communicative choices revealed through analysis of dominant patterns in discourse, but also indicates the risk that a corpus approach can obscure the complex sociolinguistic realities underlying the discourse. Much of her research into Canadian media has been sparked by her realisation that French and English media outputs are not always equivalent. Collecting different corpora of Tweets allowed Rachelle to empirically explore whether her personal impression was reflected more widely across social media, and thus whether language choices were likely to be having an impact on the Canadian sociolinguistic landscape. Focusing on Canadian politicians’ use of hashtags – which can function not only as topic markers but also as user-generated indexing systems for Tweets – Rachelle found that the English language hashtag #CDNPOLI (“Canadian politics”) and the French language hashtag #POLCAN (“la politique canadienne”) signalled not only the topics under discussion (i.e. Canadian politics), but also the language of use. The conclusion that Rachelle drew from this and other datasets was that, in the case of Canadian political discussions on Twitter, monolingualism and language ideologies more generally were being reproduced in and through social media (a finding which challenges assumptions that social media is in any way inherently superdiverse).

Taken from The Canadian Press, February 12th 2015
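
The hashtag comparison described above can be illustrated with a minimal sketch. It is not Rachelle's actual procedure or data: the tweets, the language labels and the function name `hashtag_language_counts` are all invented for illustration. The sketch simply counts how many tweets in each language carry each hashtag, which is the kind of tally that would show whether #CDNPOLI and #POLCAN pattern with English and French respectively.

```python
import re
from collections import Counter

# Invented example tweets; each has a manually assigned language label.
# None of this is real data from the study discussed above.
tweets = [
    ("en", "Budget debate continues in Ottawa #cdnpoli"),
    ("fr", "Le débat sur le budget se poursuit #polcan"),
    ("en", "New poll numbers out today #CDNPOLI #budget"),
    ("fr", "Nouveau sondage publié aujourd'hui #PolCan"),
    ("en", "Watching question period #cdnpoli #polcan"),
]

HASHTAG = re.compile(r"#\w+")


def hashtag_language_counts(tweets):
    """Count tweets per (hashtag, language) pair, ignoring hashtag case."""
    counts = Counter()
    for lang, text in tweets:
        # A set ensures a hashtag repeated within one tweet is counted once.
        for tag in {t.upper() for t in HASHTAG.findall(text)}:
            counts[(tag, lang)] += 1
    return counts


counts = hashtag_language_counts(tweets)
for (tag, lang), n in sorted(counts.items()):
    print(tag, lang, n)
```

A tally like this shows the dominant pattern, but it also flattens the one invented tweet that uses both hashtags into two separate counts, which is a small instance of the loss of complexity discussed next.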

However, as Rachelle has noted elsewhere, ‘corpora by default reduce real-world complexity … to simplicity and homogeneity’ and ‘[m]onolingual corpora may contain texts produced by multicultural and multilingual populations in the medium of the dominant group’ (Vessey 2013, p. 7). She has also argued that cross-linguistic and cross-cultural comparisons can obscure the fact that a language like English is often used by diverse minority groups (Vessey 2013). If, as Rachelle puts it, ‘[t]he analysis of discourses is at least to some extent tied to the objective of understanding differences between the groups in question’ (Vessey 2013, p. 3), then the tendency of corpus linguistics to identify frequently-occurring words and phrases as central to a discourse community is challenged by the diverse social composition of groups using ‘English’ as an international lingua franca. Despite the largely monolingual nature of the Canadian social media data that Rachelle collected and analysed, there were still instances of language mixing and non-standard language use. As a result, corpus linguists still need to take superdiversity into account in order to adapt existing tools to explore evolving uses of language in online spaces.

Professor Christian Mair

For Christian Mair, the notion of superdiversity became relevant when he began investigating ‘cyber-diasporas’ from West Africa and the Caribbean, using corpora which he compiled between 2000 and 2008 from online forums. Earlier corpora collected in these regions as part of the International Corpus of English (ICE) had followed typical corpus-building conventions in aiming for monolingual, generally fairly standard, and often written data; where lengthy shifts into other varieties occurred in spontaneous conversations, these were included only as ‘extra-corpus materials’ (Mair 2013). The non-standard language used in the informal, officially-unregulated online forums made it difficult to ignore the fact that English in these diaspora communities is only part of speakers’ varied repertoires, encompassing pidgins, creoles and other languages, as well as non-standard spellings, as illustrated in the following post in Igbo, with English and Pidgin resources represented in bold:

Achoro m ka emerie CIV maka ha meriri anyi and again ka ighara inwete £500/ Nke a i na rove my signature ama n oburo TT ka ichoro igba m/ one goal to umu ishmel (egypt)

[I would like CIV [Côte d’Ivoire] to be defeated because they’ve beaten us before and again I don’t want to bring 500 pounds. Now that you love my signature, I don’t know if you want to score me one goal to the Ishmaelite (Egypt).]

(Mair 2013)

As Christian pointed out, while traditional corpus methods allow for a description of the dominant and recurring uses of Jamaican, Nigerian and Cameroonian English (including the distribution of racial and ethnic labels, as explored by Heyd 2014), corpus tools have not been designed to enable an understanding of how particular combinations of signs come to be accepted as sociolinguistically authenticated in particular contexts. Nor is a corpus linguistics approach which assumes localisable communities of speakers always able to capture the varied backgrounds of the individuals interacting in an online space. Using a tool called N-CAT (Net Corpora Administration Tool), Christian mapped the varied locations of the forum users across the globe and explored which features of Nigerian Pidgin, Jamaican Creole, Cameroon Pidgin and Camfranglais (a hybrid urban vernacular) were being used and where. However, what such tools do not capture are the complex trajectories of the forum members, as illustrated in one post where the writer claims ‘I’m originally from Nigeria and the Bahamas but I’m currently living in Belgium’.

A map showing the distribution of the discourse particle ‘abi’ (Nigerian Pidgin)
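
As a rough illustration of the kind of tally that sits behind such a map, the sketch below counts how often a feature such as ‘abi’ occurs per 1,000 tokens in posts grouped by self-reported location. It is not N-CAT and makes no claim about how that tool works: the posts, the locations and the function name `feature_rate_by_location` are invented for illustration.

```python
import re
from collections import defaultdict

# Invented forum posts paired with a self-reported location.
# These are illustrative only and are not taken from Mair's corpora.
posts = [
    ("Lagos", "You are coming today, abi?"),
    ("London", "That is how it is, abi?"),
    ("London", "I will call you tomorrow"),
    ("Houston", "The match was good"),
]

TOKEN = re.compile(r"[a-z']+")


def feature_rate_by_location(posts, feature):
    """Return occurrences of `feature` per 1,000 tokens for each location."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for location, text in posts:
        tokens = TOKEN.findall(text.lower())
        totals[location] += len(tokens)
        hits[location] += tokens.count(feature)
    # Normalising by token count keeps busy locations from dominating.
    return {loc: 1000 * hits[loc] / totals[loc] for loc in totals if totals[loc]}


rates = feature_rate_by_location(posts, "abi")
for location, rate in sorted(rates.items()):
    print(f"{location}: {rate:.1f} per 1,000 tokens")
```

Each post is reduced here to a single location label, so a writer who is ‘originally from Nigeria and the Bahamas’ but living in Belgium would be counted under one place only; the sketch shares the limitation noted above.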

To the extent that corpus linguists want to go beyond identifying and describing standard varieties to explore informal spoken or online practices in multilingual, migratory or diasporic contexts (and they may justifiably not want to), they may have to move away from a focus on language description (e.g. English) in favour of a focus on speaker repertoires and the complex ways in which resources associated with various languages, registers and styles are distributed across corpora.

The sociolinguistic study of language and superdiversity has predominantly been ethnographic (Blommaert and Rampton 2011). For sociolinguists, understanding multilingual contexts requires a focus on the individual and on how locally-relevant resources are deployed as meaningful signs in particular contexts of use (Blackledge and Creese 2010). Corpus linguistics also focuses on language as it is used by individuals in real contexts, but its tools have tended to be aimed at describing the language varieties used in these contexts (such as standard Jamaican English), however complex that variety may be (as in the case of ‘global English’). In other words, corpus linguistics starts with the language as the baseline level of analysis, an assumption that the aforementioned superdiversity researchers would challenge. Yet the studies discussed above show how corpus linguistics methods could be used as part of a broader set of linguistic tools, both to pin down the wider sociodemographics of certain language features (as seen in the maps of cyber-Nigerian) and to extend an ethnographic approach by exploring the typicality, and thus the likely local meaning and wider social significance, of linguistic resources used in any one instance (as with the use of French and English by Canadian politicians). By moving away from the description of language and towards the investigation of speaker repertoires, corpus linguistics can enrich ethnographic work with tools that enable an understanding of the wider usage patterns of linguistic features identified as locally meaningful.


Blackledge, A. and Creese, A. (2010) Multilingualism: A Critical Perspective. London: Continuum.

Blommaert, J. and Rampton, B. (2011) ‘Language and superdiversity’ Diversities 13/2: 1-21.

Heyd, T. (2014) ‘Doing race and ethnicity in a digital community: lexical labels and narratives of belonging in a Nigerian web forum’ Discourse, Context and Media 4-5: 38-47.

Mair, C. (2013) ‘World Englishes and corpora’ Oxford Handbooks Online. Oxford: Oxford University Press.

Vessey, R. (2013) ‘Challenges in cross-linguistic corpus-assisted discourse studies’ Corpora 8/1: 1-26.

