2nd Web as Corpus Workshop
In conjunction with the 11th Conference of the European Chapter of
the Association for Computational Linguistics
Trento, Italy
April 3, 2006
Co-chairs: Adam Kilgarriff and Marco Baroni
Previous WaC Workshop:
http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html
Topics
Despite the fact that a growing body of work has shown that the
World Wide Web is a mine of language data of unprecedented
richness and ease of access (see, e.g., the papers collected in
Kilgarriff and Grefenstette, 2003), many fundamental issues about
the viability and exploitation of the Web as a linguistic corpus
are just starting to be tackled, ranging from Web frequency
distributions and registers, to efficient handling of massive
data sets, to copyright. Research on the Web as corpus is
currently at a very exciting stage: increasing evidence points to
the enormous potential of the Internet as a source of linguistic
data, but we are still far from a working, fully-fledged
linguists' search engine.
We invite submissions which:
- describe Web corpus collection projects, or modules for one part
of the process (crawling, filtering, language-id, tokenising,
lemmatising, POS-tagging, indexing, ...)
- explore characteristics of Web data, from a linguistics/NLP
perspective
- use crawled Web data for NLP purposes.
Preference will be given to projects where Web data are
downloaded and processed directly, rather than via search engine
interfaces.
Preliminary Program
9:00-9:30 Adam Kilgarriff and Marco Baroni - Introduction
9:30-10:00 Arno Scharl and Albert Weichselbraun - Web coverage of the 2004
US presidential election
10:00-10:30 Rüdiger Gleim, Alexander Mehler and Matthias Dehmer - Web
corpus mining by instance of Wikipedia
10:30-11:00 break
11:00-11:30 Masatsugu Tonoike, Mitsuhiro Kida, Toshihiro Takagi, Yasuhiro
Sasaki, Takehito Utsuro and Satoshi Sato - A comparative study on
compositional translation estimation using a domain/topic-specific
corpus collected from the web
11:30-12:00 Gemma Boleda, Stefan Bott, Rodrigo Meza, Carlos Castillo,
Toni Badia and Vicente López - CUCWeb: a Catalan corpus built from the
web
12:00-12:30 Paul Rayson, James Walkerdine, William H. Fletcher and Adam
Kilgarriff - Annotated web as corpus
12:30-2:30 lunch
2:30-3:00 András Kornai, Péter Halácsy, Viktor Nagy, Csaba Oravecz,
Viktor Trón and Dániel Varga - Web-based frequency dictionaries for
medium density languages
3:00-3:30 Cédrick Fairon - Corporator: A tool for creating RSS-based
specialized corpora
3:30-4:00 Demos, part 1
4:00-4:30 break
4:30-4:50 Demos, part 2
4:50-5:20 Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel
Cruz, Huiyong Xiao and Rajen Subba - The problem of ontology alignment
on the web: a first report
5:20-5:50 Kie Zuraw - Using the web as a phonological corpus: a case
study from Tagalog
5:50-6:00 Organization, next meeting, closing
Program Committee
Toni Badia
Marco Baroni (co-chair)
Silvia Bernardini
Massimiliano Ciaramita
Barbara Di Eugenio
Roger Evans
Stefan Evert
William Fletcher
Rüdiger Gleim
Gregory Grefenstette
Péter Halácsy
Frank Keller
Adam Kilgarriff (co-chair)
Rob Koeling
Mirella Lapata
Anke Lüdeling
Alexander Mehler
Drago Radev
Philip Resnik
German Rigau
Serge Sharoff
David Weir
Further Information
Information on registration and registration fees will be
provided on the conference web page.
Main Conference Web Page
EACL 2006 Workshops site
Note in particular the related workshop on New Text: Wikis and blogs and
other dynamic text sources.