Lucy::Analysis::RegexTokenizer - Split a string into tokens.
    my $whitespace_tokenizer
        = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );

    # or...
    my $word_char_tokenizer
        = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );

    # or...
    my $apostrophising_tokenizer = Lucy::Analysis::RegexTokenizer->new;

    # Then... once you have a tokenizer, put it into a PolyAnalyzer:
    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $word_char_tokenizer, $normalizer, $stemmer ],
    );
Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.
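In plain Perl, the simplest version of this idea is a split on whitespace; this is only a sketch of the concept, not how Lucy tokenizes:

    my @tokens = split /\s+/, 'three blind mice';
    # @tokens is ( 'three', 'blind', 'mice' )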
Lucy::Analysis::RegexTokenizer decides where it should break up the text based on a regular expression compiled from a supplied pattern matching one token.
If our source string is…

    "Eats, Shoots and Leaves."

… then a “whitespace tokenizer” with a pattern of "\\S+" produces…

    Eats,
    Shoots
    and
    Leaves.

… while a “word character tokenizer” with a pattern of "\\w+" produces…

    Eats
    Shoots
    and
    Leaves
… the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries.
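The difference is easy to reproduce with the same two patterns in plain Perl; this sketch only mimics the matching behavior and is not Lucy's internal implementation:

    my $text = "Eats, Shoots and Leaves.";
    my @ws_tokens   = $text =~ /\S+/g;   # ( 'Eats,', 'Shoots', 'and', 'Leaves.' )
    my @word_tokens = $text =~ /\w+/g;   # ( 'Eats', 'Shoots', 'and', 'Leaves' )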
    my $word_char_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\w+',    # required
    );
Create a new RegexTokenizer.
pattern - A string specifying a Perl regular expression which should match one token. The default value is

    \w+(?:[\x{2019}']\w+)*

which matches “it’s” as well as “it” and “O’Henry’s” as well as “Henry”.

    my $inversion = $regex_tokenizer->transform($inversion);
Takes a single Inversion as input and returns an Inversion, either the same one (presumably transformed in some way), or a new one.
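In typical usage you do not call transform() yourself; Lucy invokes it during analysis at index and search time. A common way to put a RegexTokenizer to work is to attach it to a full-text field type when defining a schema; the field name below is just an example:

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::RegexTokenizer;

    # Tokenize on word characters, then use that analyzer for a field.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );
    my $type      = Lucy::Plan::FullTextType->new( analyzer => $tokenizer );
    my $schema    = Lucy::Plan::Schema->new;
    $schema->spec_field( name => 'content', type => $type );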
Lucy::Analysis::RegexTokenizer isa Lucy::Analysis::Analyzer isa Clownfish::Obj.
Copyright © 2010-2015 The Apache Software Foundation. Licensed under the Apache License, Version 2.0.

Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.