2nd Web as Corpus Workshop

In conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics

Trento, Italy

April 3, 2006

Co-chairs: Adam Kilgarriff and Marco Baroni

Previous WaC Workshop:

http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

Information available on this site:

Topics
Program
Program Committee
Further Information

Topics

Although a growing body of work has shown that the World Wide Web is a mine of language data of unprecedented richness and ease of access (see, e.g., the papers collected in Kilgarriff and Grefenstette, 2003), many fundamental issues about the viability and exploitation of the Web as a linguistic corpus are only starting to be tackled. These range from Web frequency distributions and registers, to efficient handling of massive data sets, to copyright. Research on the Web as corpus is currently at a very exciting stage: increasing evidence points to the enormous potential of the Internet as a source of linguistic data, but we are still far from a working, fully-fledged linguists' search engine.

We invite submissions which:

describe Web corpus collection projects, or modules for one part of the process (crawling, filtering, language-id, tokenising, lemmatising, POS-tagging, indexing, ...)
explore characteristics of Web data, from a linguistics/NLP perspective
use crawled Web data for NLP purposes.

Preference will be given to projects where Web data are downloaded and processed directly, rather than via search engine interfaces.

Preliminary Program

9:00-9:30 Adam Kilgarriff and Marco Baroni - Introduction

9:30-10:00 Arno Scharl and Albert Weichselbraun - Web coverage of the 2004 US presidential election

10:00-10:30 Rüdiger Gleim, Alexander Mehler and Matthias Dehmer - Web corpus mining by instance of Wikipedia

10:30-11:00 break

11:00-11:30 Masatsugu Tonoike, Mitsuhiro Kida, Toshihiro Takagi, Yasuhiro Sasaki, Takehito Utsuro and Satoshi Sato - A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the web

11:30-12:00 Gemma Boleda, Stefan Bott, Rodrigo Meza, Carlos Castillo, Toni Badia and Vicente López - CUCWeb: a Catalan corpus built from the web

12:00-12:30 Paul Rayson, James Walkerdine, William H. Fletcher and Adam Kilgarriff - Annotated web as corpus

12:30-2:30 lunch

2:30-3:00 András Kornai, Péter Halácsy, Viktor Nagy, Csaba Oravecz, Viktor Trón and Dániel Varga - Web-based frequency dictionaries for medium density languages

3:00-3:30 Cédrick Fairon - Corporator: A tool for creating RSS-based specialized corpora

3:30-4:00 Demos, part 1

4:00-4:30 break

4:30-4:50 Demos, part 2

4:50-5:20 Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel Cruz, Huiyong Xiao and Rajen Subba - The problem of ontology alignment on the web: a first report

5:20-5:50 Kie Zuraw - Using the web as a phonological corpus: a case study from Tagalog

5:50-6:00 Organization, next meeting, closing

Program Committee

Toni Badia
Marco Baroni (co-chair)
Silvia Bernardini
Massimiliano Ciaramita
Barbara Di Eugenio
Roger Evans
Stefan Evert
William Fletcher
Rüdiger Gleim
Gregory Grefenstette
Péter Halácsy
Frank Keller
Adam Kilgarriff (co-chair)
Rob Koeling
Mirella Lapata
Anke Lüdeling
Alexander Mehler
Drago Radev
Philip Resnik
German Rigau
Serge Sharoff
David Weir

Further Information

Information on registration and registration fees will be provided on the conference web page.

Main Conference Web Page

EACL 2006 Workshops site

Note in particular the related workshop on New Text: Wikis and blogs and other dynamic text sources.