Lucy::Analysis::StandardTokenizer - Split a string into tokens.
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;

# Then... once you have a tokenizer, put it into a PolyAnalyzer:
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $tokenizer, $normalizer, $stemmer ],
);
Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.
Lucy::Analysis::StandardTokenizer breaks up the text at the word boundaries defined in Unicode Standard Annex #29. It then returns those words that contain alphabetic or numeric characters.
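The two steps described above (split at UAX #29 word boundaries, then keep only the pieces that contain alphabetic or numeric characters) can be illustrated roughly in core Perl. This is a minimal sketch, not Lucy's implementation: it relies on Perl's built-in `\b{wb}` assertion (available from Perl 5.22, and slightly tailored relative to the standard), and `rough_tokenize` is a hypothetical helper name introduced here for illustration. Its output may differ from StandardTokenizer's on edge cases.

```perl
use v5.22;       # \b{wb} requires Perl 5.22 or later
use warnings;
use utf8;

# Hypothetical helper, not part of Lucy: approximates the behaviour
# described above using Perl's own Unicode word-boundary assertion.
sub rough_tokenize {
    my ($text) = @_;

    # Split at every word boundary; whitespace and punctuation
    # come back as separate pieces.
    my @pieces = split /\b{wb}/, $text;

    # Keep only pieces with at least one alphabetic or numeric character.
    return grep { /[\p{Alphabetic}\p{N}]/ } @pieces;
}

my @tokens = rough_tokenize("three blind mice");
say join "|", @tokens;    # prints "three|blind|mice"
```

For real indexing, use Lucy::Analysis::StandardTokenizer itself; the sketch only shows why whitespace and punctuation never appear as tokens.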
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
Constructor. Takes no arguments.
my $inversion = $standard_tokenizer->transform($inversion);
Takes a single Inversion as input and returns an Inversion, either the same one (presumably transformed in some way) or a new one.
Lucy::Analysis::StandardTokenizer isa Lucy::Analysis::Analyzer isa Clownfish::Obj.
Copyright © 2010-2015 The Apache Software Foundation, Licensed under the
Apache License, Version 2.0.