Project Lucy has retired. For details please refer to its Attic page.
Lucy::Analysis::StandardTokenizer – C API Documentation

Lucy::Analysis::StandardTokenizer

parcel: Lucy
class variable: LUCY_STANDARDTOKENIZER
struct symbol: lucy_StandardTokenizer
class nickname: lucy_StandardTokenizer
header file: Lucy/Analysis/StandardTokenizer.h

Name

Lucy::Analysis::StandardTokenizer – Split a string into tokens.

Description

Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.

Lucy::Analysis::StandardTokenizer breaks up the text at the word boundaries defined in Unicode Standard Annex #29. It then returns those words that contain alphabetic or numeric characters.
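
To make the idea of tokenizing concrete, here is a minimal, self-contained sketch. It is not Lucy's implementation and does not apply the Unicode Standard Annex #29 rules: it assumes ASCII input and treats every run of letters and digits as one token. The names sketch_tokenize and MAX_TOKEN_LEN are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 63

/* Hypothetical helper, not part of Lucy. Copies up to max_tokens runs of
 * ASCII letters and digits from text into tokens and returns how many
 * tokens were stored. Real UAX #29 segmentation handles far more cases. */
static size_t
sketch_tokenize(const char *text, char tokens[][MAX_TOKEN_LEN + 1],
                size_t max_tokens) {
    assert(text != NULL && tokens != NULL);
    size_t count = 0;
    const char *p = text;
    while (*p != '\0' && count < max_tokens) {
        /* Skip characters that cannot be part of a token. */
        while (*p != '\0' && !isalnum((unsigned char)*p)) { p++; }
        if (*p == '\0') { break; }
        /* Collect one run of alphanumeric characters. */
        const char *start = p;
        while (*p != '\0' && isalnum((unsigned char)*p)) { p++; }
        size_t len = (size_t)(p - start);
        if (len > MAX_TOKEN_LEN) { len = MAX_TOKEN_LEN; } /* truncate long runs */
        memcpy(tokens[count], start, len);
        tokens[count][len] = '\0';
        count++;
    }
    return count;
}
```

Called on “three blind mice” with room for at least three tokens, the helper stores “three”, “blind”, and “mice”. Unlike the UAX #29 rules, this simplification splits a word such as “can't” at the apostrophe.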

Functions

new
lucy_StandardTokenizer* // incremented
lucy_StandardTokenizer_new(void);

Constructor. Takes no arguments.

init
lucy_StandardTokenizer*
lucy_StandardTokenizer_init(
    lucy_StandardTokenizer *self
);

Initialize a StandardTokenizer.

Methods

Transform
lucy_Inversion* // incremented
lucy_StandardTokenizer_Transform(
    lucy_StandardTokenizer *self,
    lucy_Inversion *inversion
);

Takes a single Inversion as input and returns an Inversion: either the same one (presumably transformed in some way) or a new one.

inversion

An inversion.

Transform_Text
lucy_Inversion* // incremented
lucy_StandardTokenizer_Transform_Text(
    lucy_StandardTokenizer *self,
    cfish_String *text
);

Kick off an analysis chain, creating an Inversion from string input. The default implementation simply creates an initial Inversion with a single Token, then calls Transform(), but occasionally subclasses will provide an optimized implementation which minimizes string copies.

text

A string.
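
The default behaviour described for Transform_Text (wrap the input in an Inversion holding a single Token, then call Transform()) can be modelled in plain C. The sketch below is an illustration only and does not use Lucy's types: SketchToken, SketchInversion, sketch_transform_fn, and every function name are hypothetical stand-ins, and the example transform simply uppercases ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Token and Inversion; the real classes are
 * assumed to carry more state than a bare string. */
typedef struct { char *text; } SketchToken;
typedef struct { SketchToken *tokens; size_t size; } SketchInversion;

/* Shape of a Transform()-like step: takes an inversion, returns one. */
typedef SketchInversion *(*sketch_transform_fn)(SketchInversion *inversion);

/* Default Transform_Text() pattern: copy the whole input into one token,
 * then hand the inversion to the transform step. */
static SketchInversion *
sketch_transform_text(const char *text, sketch_transform_fn transform) {
    assert(text != NULL && transform != NULL);
    SketchInversion *inversion = malloc(sizeof(*inversion));
    assert(inversion != NULL);
    inversion->tokens = malloc(sizeof(SketchToken));
    assert(inversion->tokens != NULL);
    size_t len = strlen(text) + 1;
    inversion->tokens[0].text = malloc(len);
    assert(inversion->tokens[0].text != NULL);
    /* This is the string copy an optimized override could avoid. */
    memcpy(inversion->tokens[0].text, text, len);
    inversion->size = 1;
    return transform(inversion);
}

/* Example transform: returns the same inversion, modified in place. */
static SketchInversion *
sketch_upcase(SketchInversion *inversion) {
    for (size_t i = 0; i < inversion->size; i++) {
        for (char *c = inversion->tokens[i].text; *c != '\0'; c++) {
            *c = (char)toupper((unsigned char)*c);
        }
    }
    return inversion;
}

/* Release an inversion and every token text it owns. */
static void
sketch_inversion_free(SketchInversion *inversion) {
    for (size_t i = 0; i < inversion->size; i++) {
        free(inversion->tokens[i].text);
    }
    free(inversion->tokens);
    free(inversion);
}
```

Passing sketch_upcase as the transform yields an inversion that still holds one token. A tokenizing transform would instead return an inversion with one token per word, and an optimized Transform_Text could build those tokens directly from the input and skip the intermediate copy.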

Equals
bool
lucy_StandardTokenizer_Equals(
    lucy_StandardTokenizer *self,
    cfish_Obj *other
);

Indicate whether two objects are the same. By default, compares the objects' memory addresses.

other

Another Obj.
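
The difference between the default address comparison and a value comparison that a class may supply instead can be shown with a small sketch. SketchObj and both functions are hypothetical and unrelated to Clownfish's actual object layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical object with a single field, for illustration only. */
typedef struct { char name[16]; } SketchObj;

/* Identity comparison: true only when both pointers refer to the same
 * object in memory. */
static bool
sketch_equals_identity(const SketchObj *self, const void *other) {
    assert(self != NULL);
    return (const void *)self == other;
}

/* Value comparison of the kind an override might provide: true when the
 * contents match, even if the objects are distinct. */
static bool
sketch_equals_value(const SketchObj *self, const SketchObj *other) {
    assert(self != NULL);
    return other != NULL && strcmp(self->name, other->name) == 0;
}
```

Two separately allocated objects with identical contents compare unequal under the identity check and equal under the value check.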

Methods inherited from Lucy::Analysis::Analyzer

Split
cfish_Vector* // incremented
lucy_StandardTokenizer_Split(
    lucy_StandardTokenizer *self,
    cfish_String *text
);

Analyze text and return an array of token texts.

text

A string.

Dump
cfish_Obj* // incremented
lucy_StandardTokenizer_Dump(
    lucy_StandardTokenizer *self
);

Dump the analyzer as a hash.

Subclasses should call Dump() on the superclass. The returned object is a hash which should be populated with parameters of the analyzer.

Returns: A hash containing a description of the analyzer.

Load
cfish_Obj* // incremented
lucy_StandardTokenizer_Load(
    lucy_StandardTokenizer *self,
    cfish_Obj *dump
);

Reconstruct an analyzer from a dump.

Subclasses should first call Load() on the superclass. The returned object is an analyzer which should be reconstructed by setting the dumped parameters from the hash contained in dump.

Note that the invocant analyzer is unused.

dump

A hash.

Returns: An analyzer.
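
The Dump()/Load() convention (call the superclass first, then handle the subclass's own parameters) can be modelled without Lucy. The sketch below is an assumption-laden illustration: SketchHash, SketchAnalyzer, the "_class" key, and the "language" parameter are all hypothetical, and the hash is a fixed-size array of string pairs, not a Clownfish hash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_PAIRS 8

/* Hypothetical stand-ins for a hash and an analyzer with one parameter. */
typedef struct { const char *key; const char *value; } SketchPair;
typedef struct { SketchPair pairs[SKETCH_MAX_PAIRS]; size_t size; } SketchHash;
typedef struct { const char *class_name; const char *language; } SketchAnalyzer;

static void
sketch_hash_store(SketchHash *hash, const char *key, const char *value) {
    assert(hash->size < SKETCH_MAX_PAIRS);
    hash->pairs[hash->size].key = key;
    hash->pairs[hash->size].value = value;
    hash->size++;
}

/* Returns the value stored under key, or NULL if the key is absent. */
static const char *
sketch_hash_fetch(const SketchHash *hash, const char *key) {
    for (size_t i = 0; i < hash->size; i++) {
        if (strcmp(hash->pairs[i].key, key) == 0) { return hash->pairs[i].value; }
    }
    return NULL;
}

/* "Superclass" dump: records what every analyzer shares. */
static SketchHash
sketch_base_dump(const SketchAnalyzer *self) {
    SketchHash dump = { .size = 0 };
    sketch_hash_store(&dump, "_class", self->class_name);
    return dump;
}

/* "Subclass" dump: calls the superclass first, then adds its parameter. */
static SketchHash
sketch_sub_dump(const SketchAnalyzer *self) {
    SketchHash dump = sketch_base_dump(self);
    sketch_hash_store(&dump, "language", self->language);
    return dump;
}

/* "Superclass" load: rebuilds the shared part from the dump alone. */
static SketchAnalyzer
sketch_base_load(const SketchHash *dump) {
    SketchAnalyzer loaded = { sketch_hash_fetch(dump, "_class"), NULL };
    return loaded;
}

/* "Subclass" load: calls the superclass first, then restores its parameter. */
static SketchAnalyzer
sketch_sub_load(const SketchHash *dump) {
    SketchAnalyzer loaded = sketch_base_load(dump);
    loaded.language = sketch_hash_fetch(dump, "language");
    return loaded;
}
```

The load functions take only the dump, which mirrors the note that the invocant analyzer is unused. A class without parameters of its own would need nothing beyond the superclass step.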

Inheritance

Lucy::Analysis::StandardTokenizer is a Lucy::Analysis::Analyzer is a Clownfish::Obj.