The 'body' and the 'web'. The web as corpus ten years on
GATTO, MARISTELLA
2011-01-01
Abstract
This article explores the web’s controversial nature as a corpus on both theoretical and applied grounds. More specifically, it shows how, over the past decade, the notion of the web as corpus has changed the way we conceive of a corpus, from the somewhat reassuring standards subsumed under the corpus-as-body metaphor to a new, more flexible and challenging corpus-as-web image. On the one hand, the traditional notion of a linguistic corpus as a body of texts rests on related notions such as finite size, balance, part-whole relationship, and permanence; on the other hand, the very idea of a web of texts brings with it notions of non-finiteness, flexibility, de-centering and re-centering, and provisionality. In terms of methodology, this calls into question issues that could be taken for granted when working with traditional corpora, such as the stability of the data, the reproducibility of the research, and the reliability of the results, but it has also created the conditions for the development of specific tools that try to make the ‘webscape’ a more hospitable space for corpus research. Some of these tools simply rework the output format of ordinary search engines to make it suitable for linguistic analysis (e.g. WebCorp, KWiCFinder); others allow the quick creation of small, flexible, specialized and customized multilingual corpora from the web (e.g. BootCaT); others crawl more ‘controlled’ parts of the web to create large web corpora (e.g. the WaCky project, Google Books Ngram Viewer). Together, these recently developed tools and resources are decidedly redirecting the way we conceive of corpus work in the new millennium along the lines envisaged by Martin Wynne as characterizing linguistic resources in the 21st century: multilinguality, dynamic content, distributed architecture, virtual corpora, and connection with web search (Wynne 2002: 1204).