Lucy::Analysis::RegexTokenizer

parcel	Lucy
class variable	`LUCY_REGEXTOKENIZER`
struct symbol	`lucy_RegexTokenizer`
class nickname	`lucy_RegexTokenizer`
header file	`Lucy/Analysis/RegexTokenizer.h`

Name

Lucy::Analysis::RegexTokenizer – Split a string into tokens.

Description

Generically, “tokenizing” is a process of breaking up a string into an array of “tokens”. For instance, the string “three blind mice” might be tokenized into “three”, “blind”, “mice”.

Lucy::Analysis::RegexTokenizer decides where to break up the text using a regular expression compiled from a supplied pattern. The pattern should match one token, so each successive match in the text becomes a token. If our source string is…

"Eats, Shoots and Leaves."

… then a “whitespace tokenizer” with a pattern of "\\S+" produces…

Eats,
Shoots
and
Leaves.

… while a “word character tokenizer” with a pattern of "\\w+" produces…

Eats
Shoots
and
Leaves

… the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries.
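The contrast between the two patterns can be reproduced outside Lucy with POSIX regular expressions. The sketch below is a standalone illustration, not Lucy's implementation: `tokenize` and `MAX_TOKEN_LEN` are hypothetical names, and because POSIX extended syntax has no portable `\S` or `\w`, the bracket expressions `[^[:space:]]+` and `[[:alnum:]_]+` stand in for them (ASCII only in the C locale).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 64   /* hypothetical cap; longer tokens are truncated */

/* Hypothetical helper, not a Lucy function: copy each successive match of
 * `pattern` (POSIX extended syntax) in `text` into `tokens`, storing at most
 * `max_tokens` entries. Returns the number of tokens stored, or 0 if the
 * pattern does not compile. */
size_t
tokenize(const char *pattern, const char *text,
         char tokens[][MAX_TOKEN_LEN], size_t max_tokens) {
    regex_t re;
    regmatch_t match;
    size_t count = 0;
    const char *cursor = text;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return 0;
    }

    while (count < max_tokens) {
        /* After the first match, `cursor` no longer points at line start. */
        int flags = (cursor == text) ? 0 : REG_NOTBOL;
        if (regexec(&re, cursor, 1, &match, flags) != 0) {
            break;  /* no further matches */
        }
        size_t len = (size_t)(match.rm_eo - match.rm_so);
        if (len == 0) {
            /* Empty match: step past it so the loop always makes progress. */
            if (cursor[match.rm_eo] == '\0') { break; }
            cursor += match.rm_eo + 1;
            continue;
        }
        if (len >= MAX_TOKEN_LEN) { len = MAX_TOKEN_LEN - 1; }
        memcpy(tokens[count], cursor + match.rm_so, len);
        tokens[count][len] = '\0';
        count++;
        cursor += match.rm_eo;  /* resume scanning after this token */
    }

    regfree(&re);
    return count;
}
```

Under these assumptions, calling `tokenize("[^[:space:]]+", "Eats, Shoots and Leaves.", tokens, 8)` stores four tokens beginning with "Eats,", while the `[[:alnum:]_]+` pattern stores four tokens beginning with "Eats".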

Functions

new

lucy_RegexTokenizer* // incremented
lucy_RegexTokenizer_new(
    cfish_String *pattern
);

Create a new RegexTokenizer.

pattern: A string specifying a Perl-syntax regular expression which should match one token. The default value is `\w+(?:[\x{2019}']\w+)*`, which matches “it’s” as well as “it” and “O’Henry’s” as well as “Henry”.
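To see why the default pattern keeps “it’s” together, it can help to expand it by hand. The scanner below is a rough sketch under stated assumptions, not Lucy's code: `is_word_byte` and `default_token_len` are hypothetical names, `\w` is approximated by ASCII letters, digits, and underscore, and only the ASCII apostrophe is handled (U+2019 is left out).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII stand-in for \w: letters, digits, underscore. */
static int
is_word_byte(unsigned char c) {
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: length of the token that starts at `s` under an
 * ASCII-only reading of \w+(?:'\w+)*, or 0 if no token starts there. */
size_t
default_token_len(const char *s) {
    size_t i = 0;

    assert(s != NULL);

    /* \w+ : one or more word bytes. */
    while (is_word_byte((unsigned char)s[i])) { i++; }
    if (i == 0) { return 0; }

    /* (?:'\w+)* : an apostrophe joins the token only when at least one
     * word byte follows it, so a trailing apostrophe is left out. */
    while (s[i] == '\'' && is_word_byte((unsigned char)s[i + 1])) {
        i++;  /* consume the apostrophe */
        while (is_word_byte((unsigned char)s[i])) { i++; }
    }
    return i;
}
```

With this reading, `default_token_len("it's a")` is 4 and `default_token_len("O'Henry's")` is 9, so both contractions survive as single tokens, whereas a plain word-character pattern would stop at the first apostrophe.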

init

lucy_RegexTokenizer*
lucy_RegexTokenizer_init(
    lucy_RegexTokenizer *self,
    cfish_String *pattern
);

Initialize a RegexTokenizer.

pattern: A string specifying a Perl-syntax regular expression which should match one token. The default value is `\w+(?:[\x{2019}']\w+)*`, which matches “it’s” as well as “it” and “O’Henry’s” as well as “Henry”.

Methods

Transform

lucy_Inversion* // incremented
lucy_RegexTokenizer_Transform(
    lucy_RegexTokenizer *self,
    lucy_Inversion *inversion
);

Take a single Inversion as input and return an Inversion, either the same one (presumably transformed in some way), or a new one.

inversion: An inversion.

Transform_Text

lucy_Inversion* // incremented
lucy_RegexTokenizer_Transform_Text(
    lucy_RegexTokenizer *self,
    cfish_String *text
);

Kick off an analysis chain, creating an Inversion from string input. The default implementation simply creates an initial Inversion with a single Token, then calls Transform(), but occasionally subclasses will provide an optimized implementation which minimizes string copies.
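The default flow described above (wrap the input in one token, then transform) can be pictured with plain offsets. This is only a conceptual sketch: `Span`, `SpanList`, `single_span_list`, `split_on_whitespace`, and `transform_text` are hypothetical stand-ins, not Lucy's Token or Inversion types, and splitting on whitespace is chosen merely as a simple transform. Tokens are recorded as byte offsets into the borrowed source string, which is one way to avoid copying text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_SPANS 32   /* hypothetical cap; extra spans are dropped */

/* Hypothetical stand-ins for illustration; NOT Lucy's Token or Inversion. */
typedef struct {
    size_t start;   /* byte offset of the first byte of the token */
    size_t end;     /* byte offset one past the last byte */
} Span;

typedef struct {
    const char *text;              /* borrowed source string, never copied */
    Span        spans[MAX_SPANS];
    size_t      count;
} SpanList;

/* Step 1: wrap the whole input in a list that holds exactly one span. */
SpanList
single_span_list(const char *text) {
    SpanList list = { .text = text };
    assert(text != NULL);
    list.spans[0].start = 0;
    list.spans[0].end = strlen(text);
    list.count = 1;
    return list;
}

/* Step 2: a transform that splits every incoming span on whitespace and
 * returns a new list. */
SpanList
split_on_whitespace(const SpanList *in) {
    SpanList out = { .text = in->text };
    for (size_t i = 0; i < in->count; i++) {
        size_t pos = in->spans[i].start;
        const size_t limit = in->spans[i].end;
        while (pos < limit && out.count < MAX_SPANS) {
            /* Skip the whitespace that separates tokens. */
            while (pos < limit && isspace((unsigned char)in->text[pos])) { pos++; }
            size_t start = pos;
            /* Extend the span to the next whitespace byte or the limit. */
            while (pos < limit && !isspace((unsigned char)in->text[pos])) { pos++; }
            if (pos > start) {
                out.spans[out.count].start = start;
                out.spans[out.count].end = pos;
                out.count++;
            }
        }
    }
    return out;
}

/* The two steps chained: build the single-span list, then transform it. */
SpanList
transform_text(const char *text) {
    SpanList initial = single_span_list(text);
    return split_on_whitespace(&initial);
}
```

For the input "three blind mice", `transform_text` yields three spans covering bytes 0 to 5, 6 to 11, and 12 to 16. An optimized implementation of the kind the description mentions could skip the intermediate single-span list and scan the text directly.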

text: A string.

Dump

cfish_Obj* // incremented
lucy_RegexTokenizer_Dump(
    lucy_RegexTokenizer *self
);

Dump the analyzer as a hash.

Subclasses should call Dump() on the superclass. The returned object is a hash which should be populated with parameters of the analyzer.

Returns: A hash containing a description of the analyzer.

Load

lucy_RegexTokenizer* // incremented
lucy_RegexTokenizer_Load(
    lucy_RegexTokenizer *self,
    cfish_Obj *dump
);

Reconstruct an analyzer from a dump.

Subclasses should first call Load() on the superclass. The returned object is an analyzer which should be reconstructed by setting the dumped parameters from the hash contained in dump.

Note that the invocant analyzer is unused.

dump: A hash.

Returns: An analyzer.

Equals

bool
lucy_RegexTokenizer_Equals(
    lucy_RegexTokenizer *self,
    cfish_Obj *other
);

Indicate whether two objects are the same. By default, compares the memory address.

other: Another Obj.

Methods inherited from Lucy::Analysis::Analyzer

Split

cfish_Vector* // incremented
lucy_RegexTokenizer_Split(
    lucy_RegexTokenizer *self,
    cfish_String *text
);

Analyze text and return an array of token texts.

text: A string.

Inheritance

Lucy::Analysis::RegexTokenizer is a Lucy::Analysis::Analyzer is a Clownfish::Obj.

Copyright © 2010-2015 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.