Lucy::Analysis::RegexTokenizer - Split a string into tokens.
    my $whitespace_tokenizer
        = Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' );

    # or...
    my $word_char_tokenizer
        = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );

    # or...
    my $apostrophising_tokenizer = Lucy::Analysis::RegexTokenizer->new;

    # Then... once you have a tokenizer, put it into a PolyAnalyzer:
    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $word_char_tokenizer, $normalizer, $stemmer ],
    );
Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.
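In plain Perl, the simplest version of this idea is a split on whitespace; this is only a sketch of the concept, not how Lucy tokenizes:

    my @tokens = split /\s+/, 'three blind mice';
    # @tokens is ( 'three', 'blind', 'mice' )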
Lucy::Analysis::RegexTokenizer decides where it should break up the text based on a regular expression compiled from a supplied pattern matching one token.
If our source string is…

    "Eats, Shoots and Leaves."

… then a “whitespace tokenizer” with a pattern of "\\S+" produces…

    Eats,
    Shoots
    and
    Leaves.

… while a “word character tokenizer” with a pattern of "\\w+" produces…

    Eats
    Shoots
    and
    Leaves
… the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries.
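The difference is easy to reproduce with the same two patterns in plain Perl; this sketch only mimics the matching behavior and is not Lucy's internal implementation:

    my $text = "Eats, Shoots and Leaves.";
    my @ws_tokens   = $text =~ /\S+/g;   # ( 'Eats,', 'Shoots', 'and', 'Leaves.' )
    my @word_tokens = $text =~ /\w+/g;   # ( 'Eats', 'Shoots', 'and', 'Leaves' )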
    my $word_char_tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '\w+',    # required
    );
Create a new RegexTokenizer.
pattern - A string specifying a Perl regular expression which should match one token. The default value is

    \w+(?:[\x{2019}']\w+)*

which matches “it’s” as well as “it” and “O’Henry’s” as well as “Henry”.

    my $inversion = $regex_tokenizer->transform($inversion);
Takes a single Inversion as input and returns an Inversion, either the same one (presumably transformed in some way), or a new one.
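In typical usage you do not call transform() yourself; Lucy invokes it during analysis at index and search time. A common way to put a RegexTokenizer to work is to attach it to a full-text field type when defining a schema; the field name below is just an example:

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::RegexTokenizer;

    # Tokenize on word characters, then use that analyzer for a field.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new( pattern => '\w+' );
    my $type      = Lucy::Plan::FullTextType->new( analyzer => $tokenizer );
    my $schema    = Lucy::Plan::Schema->new;
    $schema->spec_field( name => 'content', type => $type );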
Lucy::Analysis::RegexTokenizer isa Lucy::Analysis::Analyzer isa Clownfish::Obj.
Copyright © 2010-2015 The Apache Software Foundation. Licensed under the Apache License, Version 2.0.

Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.