regexp_tokenizer.pl: a simple tokenizer based on regular expressions specified in a parameter file.
regexp_tokenizer.pl -h
regexp_tokenizer.pl -v
regexp_tokenizer.pl -t token_regexps input > tokenized_output
regexp_tokenizer.pl -t token_regexps -r replacement_patterns input > tokenized_output
regexp_tokenizer.pl -i -t token_regexps input > tokenized_output
regexp_tokenizer.pl -t token_regexps -s "^[\.\?\!]+$" input > tokenized_output
regexp_tokenizer.pl -i -t token_regexps -r replacement_patterns -s "^[\.\?\!]+$" input > tokenized_output
Installation is trivial.
Go to http://sslmit.unibo.it/~baroni/regexp_tokenizer.html and
download the regexp_tokenizer-VERSION_NUMBER.tar.gz
archive.
Unpack it:
$ tar xvzf regexp_tokenizer-VERSION_NUMBER.tar.gz
Make the script executable:
$ chmod +x regexp_tokenizer-VERSION_NUMBER/regexp_tokenizer.pl
If you want, add the relevant directory to your PATH variable, so that you can call the tokenizer from anywhere without having to specify the path to the script.
If you use tcsh, add something like the following line to the .tcshrc file:
setenv PATH "${PATH}:/home/marco/sw/regexp_tokenizer-VERSION_NUMBER"
If you use the bash shell, add something like the following line to .bashrc:
PATH=$PATH:/home/marco/sw/regexp_tokenizer-VERSION_NUMBER
That's it!
This is a tokenizer that splits a text into tokens on the basis of a set of regular expressions that are specified by the user in a parameter file.
In this way, the tokenizer can be customized for different languages and/or tokenization purposes.
Moreover, the user can provide a list of regular expression + replacement pattern pairs, specifying strings that must be modified before applying tokenization.
Also, all upper case characters in the ascii/latin1 range can be converted to lower case before tokenization.
Finally, the user can specify a regular expression describing end-of-sentence tokens, in which case the tokenizer will split the input into sentences, and the sentences into tokens.
When no end-of-sentence pattern is specified, output is in one-token-per-line format; when an end-of-sentence pattern is given, output is in one-sentence-per-line format, with tokens separated by single spaces.
The main weakness of this tokenizer (besides the fact that it is virtually untested...) is that it cannot do context-sensitive tokenization. For example, there is no way to formulate rules such as: treat this period as part of the previous token if the word that follows begins with a lowercase letter. I hope that in future versions I will be able to support this feature, at least in some limited, hacky way.
The basic algorithm I implement is the same as the one implemented in the count.pl script of the NSP toolkit (http://www.d.umn.edu/~tpederse/nsp.html), written by Ted Pedersen, Satanjeev Banerjee and Amruta Purandare: many thanks to them for the idea!
I decided to write a different program because I needed to do tokenization for tasks other than ngram counting. Moreover, I wanted to be able to transform strings before applying tokenization, and to spot end-of-sentence markers.
Two important caveats:
1) This is version 0.01 of the script, and I mean it! This is very very preliminary and testing has been minimal: do not be surprised if even some of the basic functionalities described in the documentation do not work.
2) The script was not designed with efficiency in mind and it could easily turn out to be too slow for your task. All the expected factors will affect efficiency: more data, longer parameter files, less computing power...
This is the basic tokenization algorithm (the only difference with respect to the count.pl algorithm is that I take a single line at a time, rather than the whole input at once, as my processing buffer):
- For each line of input:
  - Copy line to current_string;
  - While current_string is not empty:
    - For each regexp in regexp_list:
      - If regexp is matched by a substring at the left edge of current_string:
        - Treat the substring that matched the regexp as a token (printing it on its own line);
        - Remove the matching substring from current_string;
        - Quit this for loop;
    - If no regexp matched:
      - Remove the first character from current_string;
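In Perl terms, the loop above could be sketched roughly as follows (this is only an illustration of the algorithm, not the code of the script itself; the two regexps are just placeholders):

use strict;
use warnings;

my @regexps = ("[a-zA-Z']+", '[\.,;:\?!]');   # placeholder token regexps, in file order

while (my $line = <STDIN>) {
    chomp $line;
    my $current_string = $line;               # the buffer is a single line at a time
  TOKEN:
    while (length $current_string) {
        foreach my $re (@regexps) {
            if ($current_string =~ /^($re)/) {    # regexp matches at the left edge
                print "$1\n";                     # the matching substring is a token
                $current_string = substr($current_string, length $1);
                next TOKEN;                       # restart from the first regexp
            }
        }
        # no regexp matched: drop the first character and try again
        $current_string = substr($current_string, 1);
    }
}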
There are two important things to notice, regarding this algorithm.
First, the inner loop goes through the regexp list and stops as soon as it finds a matching regexp. Thus, the order in which the regexps are listed in the token regexp file matters: if two or more regexps match the input at its left edge, only the first one is applied.
Second, when no regexp matches, the first character is dropped and the search re-starts. This means that characters that do not fit into some regular expression will simply be ignored. For example, if no regular expression matches whitespace, whitespace will be entirely discarded. If no regular expression matches non-alphabetic characters, then only alphabetic characters will be kept.
The file listing the regular expressions to be used for tokenization must be in one-regular-expression-per-line format. Empty lines and lines beginning with # are ignored (so that one can add comments).
Most valid perl regular expressions should work (unfortunately, \1 and the like will not work).
See the file ital_reg_exps.txt (which is part of the tar.gz archive) for a realistic example.
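Purely to illustrate the format, a file in this shape could be read with something like the following sketch (not the script's own parsing code):

use strict;
use warnings;

sub read_token_regexps {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    my @regexps;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;   # skip empty lines
        next if $line =~ /^#/;      # skip comment lines
        push @regexps, $line;       # preserve file order, since order matters
    }
    close($fh);
    return @regexps;
}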
A simple English regexp file could contain these two lines:
[a-zA-Z']+
[\.,;:\?!]
Given this regexp file and the following input:
John's friends are: Frank, Donna and me.
output will be:
John's
friends
are
:
Frank
,
Donna
and
me
.
Order matters. Consider this input:
John's friends are: Frank, Donna and Mr. Magoo.
Suppose that we add a Mr\. regexp at the beginning of the regexp file:
Mr\.
[a-zA-Z']+
[\.,;:\?!]
Then, output will be:
John's
friends
are
:
Frank
,
Donna
and
Mr.
Magoo
.
The regular expression Mr\. matches at the left edge of the string Mr. Magoo, thus the token Mr. is constructed, and the other two regular expressions are not applied to this substring.
If, instead, we list Mr\. AFTER the other two regexps, or even between the two, the output will be:
John's
friends
are
:
Frank
,
Donna
and
Mr
.
Magoo
.
This happens because Mr matches the [a-zA-Z']+ regular expression, so a token Mr is constructed and this substring is removed from the input. At this point, the leftover substring is ``. Magoo.'' and the regexp Mr\. no longer matches anything.
Notice that tokens specified by regexp patterns can contain whitespace. If our tokenization file contained the regexp Mr\. Magoo at the beginning of the list, the output would be:
John's
friends
are
:
Frank
,
Donna
and
Mr. Magoo
.
See below on how this interacts with one-sentence-per-line output format.
A file containing replacement patterns can be passed to the tokenizer via the -r option.
This file can also contain comment lines beginning with # and, again, empty lines will be ignored.
All the other lines of the file must contain two tab-delimited fields: a regular expression target and a replacement string.
Each pair is interpreted as the left and right side of a perl global replacement command. These global replacements are applied, in order, to each input line before the basic tokenization algorithm.
For example, suppose that we use a replacement file containing the following lines:
[0-9]+	NUM
[A-Z][a-z]+	CAP
Then, for each line, the script will perform the following global replacements before tokenizing:
s/[0-9]+/NUM/g;
s/[A-Z][a-z]+/CAP/g;
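In Perl terms, the effect can be sketched as follows (an illustration, not the script's code; the input line is made up). Note how the order of application shows up in the result:

use strict;
use warnings;

# target regexp / replacement string pairs, in file order
my @replacements = (
    [ '[0-9]+',      'NUM' ],
    [ '[A-Z][a-z]+', 'CAP' ],
);

my $line = "On 12 March John met 3 friends";
foreach my $pair (@replacements) {
    my ($target, $repl) = @$pair;
    $line =~ s/$target/$repl/g;    # global replacement, applied before tokenization
}
print "$line\n";    # prints: CAP NUM CAP CAP met NUM friends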
Some things to keep in mind:
1) The replacements apply BEFORE tokenization, so tokenization patterns have to be designed with the output of the replacement phase in mind (e.g., given the replacement patterns above, it would make no sense to have regexps referring to digits in the token regexp file).
2) Order matters: If the first replacement pattern in the file gets rid of all digits, then a later replacement pattern targeting digits will never match anything.
3) All replacements are applied, one after the other, to each input string. Thus, the output of a replacement constitutes the input of the next replacement. If a replacement pattern with target NUM followed the first rule in the example above, then this replacement would also be applied to all instances of NUM created by the first rule (notice the difference from the application of regexps during tokenization, where substrings matching a regexp are immediately removed from the input buffer, and thus are not compared to the following regexps).
4) You can also use the replacement file to specify target strings to be deleted. Simply use the empty string as the replacement string. In other words, a line containing a regexp followed by a tab (and then nothing) is interpreted as an instruction to remove all strings matching the regexp from input. E.g., use ``<[^>]+>'' followed by tab to remove all XML/HTML tags from input before applying tokenization. It is important to remember to add the tab, even if it is not followed by anything.
5) Unfortunately, at the moment the replacement string cannot contain ``matched variables'' ($1, $2, ...). This would be a very powerful feature, and I hope to find out how to implement it in the future.
In my experience, tokenization tasks fall into two broad classes: those where we only care about identifying tokens (e.g., various unigram frequency collection tasks), and those where we also want to identify sentence boundaries (e.g., preparing data for POS tagging).
Thus, I provide two output formats for the tokenizer: one-token-per-line, which is typically the handiest format when sentence boundaries do not matter, and one-sentence-per-line-with-space-delimited-tokens, for cases where sentence boundaries do matter.
If you are interested in sentence boundaries, you will have to use the -s option followed by a regular expression identifying end-of-sentence markers.
This regexp will be applied to tokens once they are identified.
Thus, -s "[\.\?\!]" will treat any token containing a period, a question mark or an exclamation mark as an end-of-sentence marker. On the other hand, -s "^[\.\?\!]+$" will match only tokens that are entirely made of punctuation marks (typically a better choice).
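To make the mechanism concrete, here is a rough sketch (again an illustration rather than the script's code) of how a sentence marker regexp could be applied to the tokens of one input line, once they have been identified; the token list is made up:

use strict;
use warnings;

my $sent_regexp = '^[\.\?\!]+$';                          # the -s pattern
my @tokens = ('Mr.', 'Magoo', 'left', '.', 'Blah', '.');  # tokens of one input line

my @sentence;
foreach my $token (@tokens) {
    push @sentence, $token;
    if ($token =~ /$sent_regexp/) {          # token is an end-of-sentence marker
        print join(' ', @sentence), "\n";    # one sentence per line
        @sentence = ();
    }
}
# the end of the input line closes the last sentence in any case
print join(' ', @sentence), "\n" if @sentence;

This prints "Mr. Magoo left ." and "Blah ." on two separate lines.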
Sentence marker detection takes place after a token is identified. Thus, good sentence boundary detection will depend on a good integration between what you put in the token regexp file and the sentence marker regexp.
For example, suppose that we have this input:
Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah.
Let us consider a few alternative token regexp files.
We start with tok1:
[a-zA-Z]+
With this token regexp file, periods will be discarded, and thus the following reference to the period as a sentence marker is useless:
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
regexp_tokenizer.pl -t tok1 -s "\." -
Mr Magoo went to U C L A for his Ph D degree Blah
Let's try tok2:
[a-zA-Z]+
\.
This time, the periods are preserved, but any period is identified as an end-of-sentence marker:
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
regexp_tokenizer.pl -t tok2 -s "\." -
Mr .
Magoo went to U .
C .
L .
A .
for his Ph .
D .
degree .
Blah .
Now we add Mr., U.C.L.A. and Ph.D. as tokens to tok3:
Mr\.
U\.C\.L\.A\.
Ph\.D\.
[a-zA-Z]+
\.
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
regexp_tokenizer.pl -t tok3 -s "\." -
Mr.
Magoo went to U.C.L.A.
for his Ph.D.
degree .
Blah .
Better. However, since our sentence marker regexp specified that it is sufficient for a token to contain a period in order to be considered a sentence marker, Mr., U.C.L.A. and Ph.D. were treated as sentence markers.
One more try, this time specifying that the sentence marker is a period (as opposed to: contains a period):
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
regexp_tokenizer.pl -t tok3 -s "^\.$" -
Mr. Magoo went to U.C.L.A. for his Ph.D. degree .
Blah .
Good, that's what we wanted!
Notice that if you have token regexps containing white spaces and you are in sentence detection mode, the tokens with white spaces will become indistinguishable from regular tokens. I.e., suppose that we use the regexp file tok4:
Mr\. Magoo
U\.C\.L\.A\.
Ph\.D\.
[a-zA-Z]+
\.
Without -s option we get:
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
regexp_tokenizer.pl -t tok4 -
Mr. Magoo
went
to
U.C.L.A.
for
his
Ph.D.
degree
.
However, in one-sentence-per-line format we get:
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
regexp_tokenizer.pl -t tok4 -s "^\.$" -
Mr. Magoo went to U.C.L.A. for his Ph.D. degree .
Here, the fact that Mr. Magoo is a single token, whereas, say, went to is not, is no longer visible in the output.
One way out of this problem is to identify multi-word tokens in the replacement stage and to replace the inner spaces with a special symbol, like in the following example.
Replacement file rep1:
Mr\. Magoo	Mr._Magoo
Token regexp file tok5:
Mr\._Magoo
U\.C\.L\.A\.
Ph\.D\.
[a-zA-Z_]+
\.
Now, we get:
$ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
regexp_tokenizer.pl -t tok5 -r rep1 -s "^\.$" -
Mr._Magoo went to U.C.L.A. for his Ph.D. degree .
where the fact that Mr. Magoo is a single token is signaled by the underscore connecting its two elements.
In general, replacing the inner white spaces of multi-word tokens with other symbols is probably a good idea anyway.
This last example also shows that sentence marker detection takes place after the replacements from the replacement file are applied. One must keep this in mind when designing both the replacements and the sentence marker expression.
Finally, notice that the tokenizer assumes that sentences cannot cross newlines -- in other words, end-of-line is always treated as end-of-sentence, even if the last token on the line was not a sentence marker.
Sometimes, it is a good idea to turn all words to lower case -- for example, if you are collecting frequencies you probably want to treat The and the as two instances of the same token.
Thus, I provide the -i (for case Insensitive) option. If you use this option, all alphabetic characters in the latin1 range will be turned to lower case before anything else is done.
Thus, for example, using regexp file tok6:
[a-z]+
and the -i option we get:
$ echo "I Used To Make HEAVY Use of CAPITALIZATION" |\
regexp_tokenizer.pl -i -t tok6 -
i
used
to
make
heavy
use
of
capitalization
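One way latin1 lower-casing can be done in Perl is with a tr mapping like the one below (a sketch of the general idea, not necessarily the exact mapping the script uses); \xC0-\xD6 and \xD8-\xDE are the latin1 upper-case letters, skipping the multiplication sign at \xD7:

use strict;
use warnings;

my $line = "I Used To Make HEAVY Use of CAPITALIZATION";
# ASCII A-Z plus the accented latin1 upper-case letters
$line =~ tr/A-Z\xC0-\xD6\xD8-\xDE/a-z\xE0-\xF6\xF8-\xFE/;
print "$line\n";    # prints: i used to make heavy use of capitalization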
Lower-casing happens before anything else -- keep this in mind when preparing the replacement and token regexp files. For example, a Mr\. Magoo token regexp will be of no use if lower-casing has transformed the relevant string into mr. magoo.
This also means that, by using replacements, one can re-insert upper case words after lower-casing. For example, one could turn all letters to lower case via the -i option, but specify in the replacement file that all digit sequences are to be replaced with NUM. Since replacements are applied after lower-casing, the NUM string would not be affected by lower-casing.
As I said, the -i switch assumes that input is in latin1. If you are working with a different encoding, you will have to change the relevant part of the code, or to do lower-casing via replacement patterns.
In short, the processes take place in the following order:
- lower-casing (optional, use -i)
- replacements (optional, use -r replacement_file)
- tokenization (mandatory, use -t token_regexp_file)
- sentence boundary detection (optional, use -s "SENT_MARK_EXP")
Marco Baroni, marco baroni AT unitn it
Probably many: if you find one, please let me know: marco baroni AT unitn it
Copyright 2004, Marco Baroni
This program is free software. You may copy or redistribute it under the same terms as Perl itself.
NSP Toolkit: http://www.d.umn.edu/~tpederse/nsp.html