Lucy::Analysis::StandardTokenizer – Apache Lucy Documentation

NAME

Lucy::Analysis::StandardTokenizer - Split a string into tokens.

SYNOPSIS

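my $tokenizer = Lucy::Analysis::StandardTokenizer->new;

# Then... once you have a tokenizer, put it into a PolyAnalyzer
# alongside a normalizer and a stemmer:
my $normalizer   = Lucy::Analysis::Normalizer->new;
my $stemmer      = Lucy::Analysis::SnowballStemmer->new( language => 'en' );
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $tokenizer, $normalizer, $stemmer ],
);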

DESCRIPTION

Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.

Lucy::Analysis::StandardTokenizer breaks up the text at the word boundaries defined in Unicode Standard Annex #29. It then returns those words that contain alphabetic or numeric characters.
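The two steps described above (split at UAX #29 word boundaries, then keep only segments with alphabetic or numeric characters) can be approximated in plain Perl. This is a rough sketch and not Lucy's implementation: approx_tokenize is a hypothetical helper, and it relies on Perl's own \b{wb} assertion (Perl 5.22 or later), whose tailoring of the Unicode rules may differ from Lucy's in edge cases.

```perl
use v5.22;          # \b{wb} requires Perl 5.22 or later
use warnings;

# Hypothetical helper, not part of Lucy: approximates the tokenizer's
# behaviour using Perl's built-in UAX #29 word-boundary assertion.
sub approx_tokenize {
    my ($text) = @_;

    # Step 1: break the text at every word boundary (as tailored by Perl).
    my @segments = split /\b{wb}/, $text;

    # Step 2: keep only segments containing at least one alphabetic
    # or numeric character; whitespace and punctuation are dropped.
    return grep { /[\p{Alphabetic}\p{N}]/ } @segments;
}

say join '|', approx_tokenize('three blind mice');   # prints three|blind|mice
```

The filtering step is what drops the whitespace and punctuation segments that word-boundary splitting leaves between words.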

METHODS

transform

my $inversion = $standard_tokenizer->transform($inversion);

Takes a single Inversion as input and returns an Inversion, either the same one (presumably transformed in some way) or a new one.

inversion - An inversion.
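transform is normally called by Lucy itself during indexing and searching. To inspect what the tokenizer produces for a given string, it may be more convenient to use the split method inherited from Lucy::Analysis::Analyzer, which returns an array reference of token texts. The sketch below assumes Lucy is installed. lucy_token_texts is a hypothetical helper, and the eval guard is there only so the example degrades gracefully where Lucy is missing.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of token texts,
# or undef when Lucy is not installed.
sub lucy_token_texts {
    my ($text) = @_;

    # Load Lucy at run time; bail out quietly if it is unavailable.
    return unless eval { require Lucy; 1 };

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new;

    # split() is inherited from Lucy::Analysis::Analyzer.
    return $tokenizer->split($text);
}

my $texts = lucy_token_texts('three blind mice');
print defined $texts
    ? join( '|', @$texts ) . "\n"
    : "Lucy is not installed\n";
```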

INHERITANCE

Lucy::Analysis::StandardTokenizer isa Lucy::Analysis::Analyzer isa Clownfish::Obj.

Copyright © 2010-2015 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.