Lucy::Analysis::Token

parcel	Lucy
class variable	`LUCY_TOKEN`
struct symbol	`lucy_Token`
class nickname	`lucy_Token`
header file	`Lucy/Analysis/Token.h`

Name

Lucy::Analysis::Token – Unit of text.

Description

Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses. Each Token has five attributes: text, start_offset, end_offset, boost, and pos_inc.

The text attribute is a Unicode string encoded as UTF-8.

start_offset is the start point of the token text, measured in Unicode code points from the top of the stored field; end_offset delimits the corresponding closing boundary. start_offset and end_offset locate the Token within a larger context, even if the Token’s text attribute gets modified – by stemming, for instance. The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is “beat”, but the start_offset is still 0 and the end_offset is still 7. This allows “beating” to be highlighted correctly after a search matches “beat”.
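The “beating” example can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: `DemoToken`, `demo_token_init`, and `demo_strip_ing` are hypothetical names invented here, the struct is not the real `lucy_Token` layout, and the "stemmer" is a toy that strips a trailing "ing".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a token; NOT the real lucy_Token layout. */
typedef struct {
    char     text[32];      /* UTF-8 text, NUL-terminated here for convenience */
    size_t   len;           /* length of text in bytes */
    uint32_t start_offset;  /* code points from the start of the field */
    uint32_t end_offset;    /* code points from the start of the field */
} DemoToken;

/* Fill a DemoToken. Text that does not fit is truncated byte-wise,
 * which is acceptable for this ASCII-only demonstration. */
void
demo_token_init(DemoToken *tok, const char *text,
                uint32_t start_offset, uint32_t end_offset) {
    assert(tok != NULL && text != NULL);
    size_t len = strlen(text);
    if (len >= sizeof(tok->text)) {
        len = sizeof(tok->text) - 1;
    }
    memcpy(tok->text, text, len);
    tok->text[len]    = '\0';
    tok->len          = len;
    tok->start_offset = start_offset;
    tok->end_offset   = end_offset;
}

/* Toy "stemmer": drop a trailing "ing". Only text and len change;
 * the offsets keep pointing at the original span in the field. */
void
demo_strip_ing(DemoToken *tok) {
    assert(tok != NULL);
    if (tok->len > 3 && memcmp(tok->text + tok->len - 3, "ing", 3) == 0) {
        tok->len -= 3;
        tok->text[tok->len] = '\0';
    }
}
```

Initializing a `DemoToken` with "beating", 0, and 7 and then calling `demo_strip_ing` leaves the text as "beat" while the offsets remain 0 and 7, so a highlighter can still mark the original seven code points.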

boost is a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document, for example. (Note: The field this token belongs to must be spec’d to use a posting of type RichPosting.)

pos_inc is the POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for “blind” to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 – and will no longer produce a phrase match for the query "three blind mice".
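The position arithmetic can be illustrated with a small standalone sketch. It assumes that a token's increment sets the gap to the token that follows it, which is the reading that reproduces the 0, 1, 1001 example; `demo_assign_positions` and `demo_is_consecutive` are hypothetical helpers written for this illustration, not Lucy functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assign a position to each token from its position increment.
 * Assumption: the first token sits at position 0, and each token's
 * pos_inc is added AFTER that token has received its position.
 * No overflow checking is done in this sketch. */
void
demo_assign_positions(const int32_t *pos_incs, size_t count,
                      int32_t *positions) {
    assert(count == 0 || (pos_incs != NULL && positions != NULL));
    int32_t next_pos = 0;
    for (size_t i = 0; i < count; i++) {
        positions[i] = next_pos;
        next_pos += pos_incs[i];
    }
}

/* Return 1 if every position is exactly one past its predecessor,
 * which is what an exact phrase match needs in this simplified model. */
int
demo_is_consecutive(const int32_t *positions, size_t count) {
    for (size_t i = 1; i < count; i++) {
        if (positions[i] != positions[i - 1] + 1) {
            return 0;
        }
    }
    return 1;
}
```

With increments {1, 1, 1} the positions come out consecutive; with {1, 1000, 1} the last token lands far away from its neighbor, so the consecutive-position check fails and the phrase would not match.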

Functions

new

lucy_Token* // incremented
lucy_Token_new(
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);

Create a new Token.

text: A UTF-8 string.
len: Size of the string in bytes.
start_offset: Start offset into the original document in Unicode code points.
end_offset: End offset into the original document in Unicode code points.
boost: Per-token weight.
pos_inc: Position increment for phrase matching.
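Note that len is counted in bytes while the offsets are counted in Unicode code points, so the two differ for non-ASCII text. The following standalone sketch shows one way a caller might derive a code point count from a UTF-8 buffer; `demo_utf8_code_points` is a hypothetical helper, not part of Lucy, and it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count Unicode code points in a UTF-8 buffer of `len` bytes.
 * Every code point contributes exactly one byte that is NOT a
 * continuation byte (bit pattern 10xxxxxx), so counting those bytes
 * counts code points. Assumes valid UTF-8; no validation is done. */
uint32_t
demo_utf8_code_points(const char *text, size_t len) {
    assert(text != NULL || len == 0);
    uint32_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)text[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the six-byte string "na\xC3\xAFve" ("naïve"), the helper counts five code points. A token covering that word at the start of a field would therefore take a len of 6 but offsets of 0 and 5.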

init

lucy_Token*
lucy_Token_init(
    lucy_Token *self,
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);

Initialize a Token.

text: A UTF-8 string.
len: Size of the string in bytes.
start_offset: Start offset into the original document in Unicode code points.
end_offset: End offset into the original document in Unicode code points.
boost: Per-token weight.
pos_inc: Position increment for phrase matching.

Methods

Get_Start_Offset

uint32_t
lucy_Token_Get_Start_Offset(
    lucy_Token *self
);

Get_End_Offset

uint32_t
lucy_Token_Get_End_Offset(
    lucy_Token *self
);

Get_Boost

float
lucy_Token_Get_Boost(
    lucy_Token *self
);

Get_Pos_Inc

int32_t
lucy_Token_Get_Pos_Inc(
    lucy_Token *self
);

Get_Text

char*
lucy_Token_Get_Text(
    lucy_Token *self
);

Get_Len

size_t
lucy_Token_Get_Len(
    lucy_Token *self
);
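This page does not state whether the buffer returned by Get_Text is NUL-terminated, so a cautious caller may prefer to treat the text as a (pointer, length) pair. The standalone sketch below copies such a pair into a NUL-terminated buffer; `demo_copy_text` is a hypothetical helper, not a Lucy function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `len` bytes from `text` into `out` and NUL-terminate the result.
 * If the text does not fit, it is truncated byte-wise to out_size - 1
 * bytes (which may split a multi-byte UTF-8 sequence).
 * Returns the number of bytes copied, excluding the terminator. */
size_t
demo_copy_text(const char *text, size_t len, char *out, size_t out_size) {
    assert(out != NULL && out_size > 0);
    assert(text != NULL || len == 0);
    size_t n = len < out_size - 1 ? len : out_size - 1;
    if (n > 0) {
        memcpy(out, text, n);
    }
    out[n] = '\0';
    return n;
}
```

In real use, the `text` and `len` arguments would presumably come from `lucy_Token_Get_Text` and `lucy_Token_Get_Len` on the same Token.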

Set_Text

void
lucy_Token_Set_Text(
    lucy_Token *self,
    char *text,
    size_t len
);

Inheritance

Lucy::Analysis::Token is a Clownfish::Obj.

Copyright © 2010-2015 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.