This project has retired. For details please refer to its Attic page.
Lucy::Analysis::Token – C API Documentation

Lucy::Analysis::Token

parcel Lucy
class variable LUCY_TOKEN
struct symbol lucy_Token
class nickname lucy_Token
header file Lucy/Analysis/Token.h

Name

Lucy::Analysis::Token – Unit of text.

Description

Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses. Each Token has 5 attributes: text, start_offset, end_offset, boost, and pos_inc.

The text attribute is a Unicode string encoded as UTF-8.

start_offset is the start point of the token text, measured in Unicode code points from the top of the stored field; end_offset delimits the corresponding closing boundary. start_offset and end_offset locate the Token within a larger context, even if the Token’s text attribute gets modified – by stemming, for instance. The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is “beat”, but the start_offset is still 0 and the end_offset is still 7. This allows “beating” to be highlighted correctly after a search matches “beat”.
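Because the offsets count Unicode code points rather than bytes, code that scans UTF-8 by byte index has to convert before filling in start_offset and end_offset. The following is a minimal sketch of that conversion, assuming the input is valid UTF-8. The names demo_count_code_points, demo_offsets, and demo_offsets_from_bytes are hypothetical helpers invented for this illustration and are not part of the Lucy API.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the Unicode code points in the first `len` bytes of a UTF-8
 * string. Continuation bytes match the bit pattern 10xxxxxx, so every
 * byte that does not match starts a new code point.
 * Assumption: `utf8` holds valid UTF-8. */
uint32_t
demo_count_code_points(const char *utf8, size_t len) {
    uint32_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)utf8[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Hypothetical stand-in for the offset pair that a Token carries. */
typedef struct {
    uint32_t start_offset;
    uint32_t end_offset;
} demo_offsets;

/* Convert the byte range [byte_start, byte_end) within `field` into
 * code-point offsets measured from the start of the field. */
demo_offsets
demo_offsets_from_bytes(const char *field, size_t byte_start,
                        size_t byte_end) {
    demo_offsets result;
    result.start_offset = demo_count_code_points(field, byte_start);
    result.end_offset   = result.start_offset
                        + demo_count_code_points(field + byte_start,
                                                 byte_end - byte_start);
    return result;
}
```

For pure ASCII text such as “beating a dead horse”, byte and code-point offsets coincide, so bytes 0 to 7 map to offsets 0 and 7. They diverge as soon as a multi-byte character precedes the token.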

boost is a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document, for example. (Note: The field this token belongs to must be spec’d to use a posting of type RichPosting.)

pos_inc is the POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for “blind” to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 – and will no longer produce a phrase match for the query "three blind mice".
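The position arithmetic can be sketched as follows. This is an illustration only: it assumes each token receives the running position and that the running position then advances by that token's own pos_inc, a rule chosen because it reproduces the positions quoted above. The function demo_assign_positions is a hypothetical name, not part of the Lucy API, and the sketch does not guard against integer overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Assign a position to each token from its position increment.
 * Assumed rule: a token takes the current running position, and the
 * running position then advances by that token's pos_inc. */
void
demo_assign_positions(const int32_t *pos_incs, int32_t *positions,
                      size_t num_tokens) {
    int32_t next_pos = 0;
    for (size_t i = 0; i < num_tokens; i++) {
        positions[i] = next_pos;
        next_pos += pos_incs[i];
    }
}
```

With increments of 1, 1, 1 the tokens land on positions 0, 1, 2. With increments of 1, 1000, 1 they land on 0, 1, 1001, so “blind” and “mice” are no longer adjacent and the phrase query fails.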

Functions

new
lucy_Token* // incremented
lucy_Token_new(
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);

Create a new Token.

text

A UTF-8 string.

len

Size of the string in bytes.

start_offset

Start offset into the original document in Unicode code points.

end_offset

End offset into the original document in Unicode code points.

boost

Per-token weight.

pos_inc

Position increment for phrase matching.

init
lucy_Token*
lucy_Token_init(
    lucy_Token *self,
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);

Initialize a Token.

text

A UTF-8 string.

len

Size of the string in bytes.

start_offset

Start offset into the original document in Unicode code points.

end_offset

End offset into the original document in Unicode code points.

boost

Per-token weight.

pos_inc

Position increment for phrase matching.

Methods

Get_Start_Offset
uint32_t
lucy_Token_Get_Start_Offset(
    lucy_Token *self
);
Get_End_Offset
uint32_t
lucy_Token_Get_End_Offset(
    lucy_Token *self
);
Get_Boost
float
lucy_Token_Get_Boost(
    lucy_Token *self
);
Get_Pos_Inc
int32_t
lucy_Token_Get_Pos_Inc(
    lucy_Token *self
);
Get_Text
char*
lucy_Token_Get_Text(
    lucy_Token *self
);
Get_Len
size_t
lucy_Token_Get_Len(
    lucy_Token *self
);
Set_Text
void
lucy_Token_Set_Text(
    lucy_Token *self,
    char *text,
    size_t len
);

Inheritance

Lucy::Analysis::Token is a Clownfish::Obj.