Commit c54159d4 authored by Tom Lane

Make locale-dependent regex character classes work for large char codes.

Previously, we failed to recognize Unicode characters above U+7FF as
being members of locale-dependent character classes such as [[:alpha:]].
(Actually, the same problem occurs for large pg_wchar values in any
multibyte encoding, but UTF8 is the only case people have actually
complained about.)  It's impractical to get Spencer's original code to
handle character classes or ranges containing many thousands of characters,
because it insists on considering each member character individually at
regex compile time, whether or not the character will ever be of interest
at run time.  To fix, choose a cutoff point MAX_SIMPLE_CHR below which
we process characters individually as before, and deal with entire ranges
or classes as single entities above that.  We can actually make things
cheaper than before for chars below the cutoff, because the color map can
now be a simple linear array for those chars, rather than the multilevel
tree structure Spencer designed.  It's more expensive than before for
chars above the cutoff, because we must do a binary search in a list of
high chars and char ranges used in the regex pattern, plus call iswalpha()
and friends for each locale-dependent character class used in the pattern.
However, multibyte encodings are normally designed to give smaller codes
to popular characters, so that we can expect that the slow path will be
taken relatively infrequently.  In any case, the speed penalty appears
minor except when we have to apply iswalpha() etc. to high character codes
at runtime --- and the previous coding gave wrong answers for those cases,
so whether it was faster is moot.

Tom Lane, reviewed by Heikki Linnakangas

Discussion: <15563.1471913698@sss.pgh.pa.us>
parent f80049f7
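The two-tier lookup described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: every `sketch_` name, the struct layout, and the cutoff value are assumptions made for the example, and the locale-dependent character-class handling is left out entirely. Codes at or below the cutoff index a plain array; higher codes go through a binary search over the sorted ranges mentioned in the pattern, falling back to a shared "everything else" color.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types; the real definitions live in regguts.h. */
typedef long sketch_chr;
typedef short sketch_color;

/* Assumed cutoff for illustration only; the real value is a build-time choice. */
#define SKETCH_MAX_SIMPLE_CHR 0x7FF

typedef struct
{
    sketch_chr   cmin;      /* first code in the range */
    sketch_chr   cmax;      /* last code in the range, inclusive */
    sketch_color color;     /* single color shared by the whole range */
} sketch_range;

typedef struct
{
    sketch_color        lo[SKETCH_MAX_SIMPLE_CHR + 1];  /* one slot per low code */
    const sketch_range *ranges;     /* sorted by cmin, non-overlapping */
    size_t              nranges;
    sketch_color        other;      /* color of high codes in no range */
} sketch_colormap;

static sketch_color
sketch_getcolor(const sketch_colormap *cm, sketch_chr c)
{
    size_t      low = 0;
    size_t      high = cm->nranges;

    assert(c >= 0);

    /* Fast path: popular characters are a direct array lookup. */
    if (c <= SKETCH_MAX_SIMPLE_CHR)
        return cm->lo[c];

    /* Slow path: binary search among the explicitly mentioned ranges. */
    while (low < high)
    {
        size_t      middle = low + (high - low) / 2;
        const sketch_range *r = &cm->ranges[middle];

        if (c < r->cmin)
            high = middle;
        else if (c > r->cmax)
            low = middle + 1;
        else
            return r->color;    /* found the containing range */
    }
    return cm->other;           /* not mentioned in the pattern */
}
```

The point of the split is that the array costs one index operation no matter how many characters a class contains, while the search cost grows only with the number of high characters or ranges written in the pattern.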
@@ -27,13 +27,14 @@ and similarly additional source files rege_*.c that are #include'd in
 regexec. This was done to avoid exposing internal symbols globally;
 all functions not meant to be part of the library API are static.
 
-(Actually the above is a lie in one respect: there is one more global
-symbol, pg_set_regex_collation in regcomp. It is not meant to be part of
-the API, but it has to be global because both regcomp and regexec call it.
-It'd be better to get rid of that, as well as the static variables it
-sets, in favor of keeping the needed locale state in the regex structs.
-We have not done this yet for lack of a design for how to add
-application-specific state to the structs.)
+(Actually the above is a lie in one respect: there are two more global
+symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are
+not meant to be part of the API, but they have to be global because both
+regcomp and regexec call them. It'd be better to get rid of
+pg_set_regex_collation, as well as the static variables it sets, in favor of
+keeping the needed locale state in the regex structs. We have not done this
+yet for lack of a design for how to add application-specific state to the
+structs.)
 
 What's where in src/backend/regex/:
@@ -274,28 +275,65 @@ colors:
 an existing color has to be subdivided.
 
 The last two of these are handled with the "struct colordesc" array and
-the "colorchain" links in NFA arc structs. The color map proper (that
-is, the per-character lookup array) is handled as a multi-level tree,
-with each tree level indexed by one byte of a character's value. The
-code arranges to not have more than one copy of bottom-level tree pages
-that are all-the-same-color.
-
-Unfortunately, this design does not seem terribly efficient for common
-cases such as a tree in which all Unicode letters are colored the same,
-because there aren't that many places where we get a whole page all the
-same color, except at the end of the map. (It also strikes me that given
-PG's current restrictions on the range of Unicode values, we could use a
-3-level rather than 4-level tree; but there's not provision for that in
-regguts.h at the moment.)
-
-A bigger problem is that it just doesn't seem very reasonable to have to
-consider each Unicode letter separately at regex parse time for a regex
-such as "\w"; more than likely, a huge percentage of those codes will
-never be seen at runtime. We need to fix things so that locale-based
-character classes are somehow processed "symbolically" without making a
-full expansion of their contents at parse time. This would mean that we'd
-have to be ready to call iswalpha() at runtime, but if that only happens
-for high-code-value characters, it shouldn't be a big performance hit.
+the "colorchain" links in NFA arc structs.
+
+Ideally, we'd do the first two operations using a simple linear array
+storing the current color assignment for each character code.
+Unfortunately, that's not terribly workable for large charsets such as
+Unicode. Our solution is to divide the color map into two parts. A simple
+linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
+chosen large enough to include all popular characters (so that the
+significantly-slower code paths about to be described are seldom invoked).
+Characters above that need be considered at compile time only if they
+appear explicitly in the regex pattern. We store each such mentioned
+character or character range as an entry in the "colormaprange" array in
+the colormap. (Overlapping ranges are split into unique subranges, so that
+each range in the finished list needs only a single color that describes
+all its characters.) When mapping a character above MAX_SIMPLE_CHR to a
+color at runtime, we search this list of ranges explicitly.
+
+That's still not quite enough, though, because of locale-dependent
+character classes such as [[:alpha:]]. In Unicode locales these classes
+may have thousands of entries that are above MAX_SIMPLE_CHR, and we
+certainly don't want to be searching large colormaprange arrays at runtime.
+Nor do we even want to spend the time to initialize cvec structures that
+exhaustively describe all of those characters. Our solution is to compute
+exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
+For characters above that, we apply the <ctype.h> or <wctype.h> lookup
+functions at runtime for each locale-dependent character class used in the
+regex pattern, constructing a bitmap that describes which classes the
+runtime character belongs to. The per-character-range data structure
+mentioned above actually holds, for each range, a separate color entry
+for each possible combination of character class properties. That is,
+the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
+whose rows correspond to high characters or character ranges that are
+explicitly mentioned in the regex pattern, and whose columns correspond
+to sets of the locale-dependent character classes that are used in the
+regex.
+
+As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
+(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
+a high color map with three rows. One row is for the single character
+U+1234 (represented as a single-element range), one is for the range
+U+1D100..U+1D1FF, and the other row represents all remaining high
+characters. The color map has two columns, one for characters that
+satisfy iswalnum() and one for those that don't.
+
+We build this color map in parallel with scanning the regex. Each time
+we detect a new explicit high character (or range) or a locale-dependent
+character class, we split existing entry(s) in the high color map so that
+characters we need to be able to distinguish will have distinct entries
+that can be given separate colors. Often, though, single entries in the
+high color map will represent very large sets of characters.
+
+If there are both explicit high characters/ranges and locale-dependent
+character classes, we may have entries in the high color map array that
+have non-WHITE colors but don't actually represent any real characters.
+(For example, in a row representing a singleton range, only one of the
+columns could possibly be a live entry; it's the one matching the actual
+locale properties for that single character.) We don't currently make
+any effort to reclaim such colors. In principle it could be done, but
+it's not clear that it's worth the trouble.
 
 
 Detailed semantics of an NFA
......
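The three-row, two-column example from the README text can be sketched as a small lookup table. This is only an illustration under stated assumptions: the `sketch_` names, the fixed table, and the color numbers are hypothetical, the row is found with plain comparisons rather than a search of a range array, and the code assumes a platform where `wint_t` can hold code points above 0xFFFF. The column index is a bitmap with one bit per locale-dependent class used in the pattern, so a single class (`iswalnum()` for `\w`) gives two columns.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

typedef short sketch_color;     /* hypothetical stand-in for "color" */

enum
{
    SKETCH_NROWS = 3,
    SKETCH_NCOLS = 2            /* 2^1 columns: one class bit in use */
};

#define SKETCH_ALNUM_BIT 1      /* assumed bit assignment for the alnum class */

/*
 * Row 0: all remaining high characters
 * Row 1: the singleton range U+1234
 * Row 2: the range U+1D100..U+1D1FF
 * Column 0: not alnum; column 1: alnum.  Color values are arbitrary.
 */
static const sketch_color sketch_himap[SKETCH_NROWS * SKETCH_NCOLS] = {
    0, 1,
    2, 3,
    4, 5,
};

/* Pick the row for a high character (hard-coded here for the example). */
static int
sketch_row(long c)
{
    if (c == 0x1234)
        return 1;
    if (c >= 0x1D100 && c <= 0x1D1FF)
        return 2;
    return 0;
}

/* Build the class bitmap at run time; the result depends on the locale. */
static int
sketch_column(wint_t c)
{
    int         col = 0;

    if (iswalnum(c))
        col |= SKETCH_ALNUM_BIT;
    return col;
}

/* 2-D lookup flattened into a 1-D array, row-major. */
static sketch_color
sketch_hicolor(long c)
{
    return sketch_himap[sketch_row(c) * SKETCH_NCOLS + sketch_column((wint_t) c)];
}
```

Each additional class used in a pattern would add another bit and double the column count, which is why a row for a singleton character can hold entries that no real character ever reaches.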
@@ -49,10 +49,6 @@ static void
 initcm(struct vars * v,
 	   struct colormap * cm)
 {
-	int			i;
-	int			j;
-	union tree *t;
-	union tree *nextt;
 	struct colordesc *cd;
 
 	cm->magic = CMMAGIC;
@@ -64,24 +60,40 @@ initcm(struct vars * v,
 	cm->free = 0;
 
 	cd = cm->cd;				/* cm->cd[WHITE] */
+	cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1;
+	cd->nuchrs = 1;
 	cd->sub = NOSUB;
 	cd->arcs = NULL;
 	cd->firstchr = CHR_MIN;
-	cd->nchrs = CHR_MAX - CHR_MIN + 1;
 	cd->flags = 0;
 
-	/* upper levels of tree */
-	for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
-	{
-		nextt = t + 1;
-		for (i = BYTTAB - 1; i >= 0; i--)
-			t->tptr[i] = nextt;
-	}
-	/* bottom level is solid white */
-	t = &cm->tree[NBYTS - 1];
-	for (i = BYTTAB - 1; i >= 0; i--)
-		t->tcolor[i] = WHITE;
-	cd->block = t;
+	cm->locolormap = (color *)
+		MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+	if (cm->locolormap == NULL)
+	{
+		CERR(REG_ESPACE);
+		cm->cmranges = NULL;	/* prevent failure during freecm */
+		cm->hicolormap = NULL;
+		return;
+	}
+	/* this memset relies on WHITE being zero: */
+	memset(cm->locolormap, WHITE,
+		   (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+
+	memset(cm->classbits, 0, sizeof(cm->classbits));
+	cm->numcmranges = 0;
+	cm->cmranges = NULL;
+	cm->maxarrayrows = 4;		/* arbitrary initial allocation */
+	cm->hiarrayrows = 1;		/* but we have only one row/col initially */
+	cm->hiarraycols = 1;
+	cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
+	if (cm->hicolormap == NULL)
+	{
+		CERR(REG_ESPACE);
+		return;
+	}
+	/* initialize the "all other characters" row to WHITE */
+	cm->hicolormap[0] = WHITE;
 }
 
 /*
@@ -90,117 +102,67 @@ initcm(struct vars * v,
 static void
 freecm(struct colormap * cm)
 {
-	size_t		i;
-	union tree *cb;
-
 	cm->magic = 0;
-	if (NBYTS > 1)
-		cmtreefree(cm, cm->tree, 0);
-	for (i = 1; i <= cm->max; i++)		/* skip WHITE */
-		if (!UNUSEDCOLOR(&cm->cd[i]))
-		{
-			cb = cm->cd[i].block;
-			if (cb != NULL)
-				FREE(cb);
-		}
 	if (cm->cd != cm->cdspace)
 		FREE(cm->cd);
+	if (cm->locolormap != NULL)
+		FREE(cm->locolormap);
+	if (cm->cmranges != NULL)
+		FREE(cm->cmranges);
+	if (cm->hicolormap != NULL)
+		FREE(cm->hicolormap);
 }
 
 /*
- * cmtreefree - free a non-terminal part of a colormap tree
+ * pg_reg_getcolor - slow case of GETCOLOR()
  */
-static void
-cmtreefree(struct colormap * cm,
-		   union tree * tree,
-		   int level)			/* level number (top == 0) of this block */
+color
+pg_reg_getcolor(struct colormap * cm, chr c)
 {
-	int			i;
-	union tree *t;
-	union tree *fillt = &cm->tree[level + 1];
-	union tree *cb;
+	int			rownum,
+				colnum,
+				low,
+				high;
 
-	assert(level < NBYTS - 1);	/* this level has pointers */
-	for (i = BYTTAB - 1; i >= 0; i--)
-	{
-		t = tree->tptr[i];
-		assert(t != NULL);
-		if (t != fillt)
-		{
-			if (level < NBYTS - 2)
-			{					/* more pointer blocks below */
-				cmtreefree(cm, t, level + 1);
-				FREE(t);
-			}
-			else
-			{					/* color block below */
-				cb = cm->cd[t->tcolor[0]].block;
-				if (t != cb)	/* not a solid block */
-					FREE(t);
-			}
+	/* Should not be used for chrs in the locolormap */
+	assert(c > MAX_SIMPLE_CHR);
+
+	/*
+	 * Find which row it's in. The colormapranges are in order, so we can use
+	 * binary search.
+	 */
+	rownum = 0;					/* if no match, use array row zero */
+	low = 0;
+	high = cm->numcmranges;
+	while (low < high)
+	{
+		int			middle = low + (high - low) / 2;
+		const colormaprange *cmr = &cm->cmranges[middle];
+
+		if (c < cmr->cmin)
+			high = middle;
+		else if (c > cmr->cmax)
+			low = middle + 1;
+		else
+		{
+			rownum = cmr->rownum;		/* found a match */
+			break;
 		}
 	}
-}
 
-/*
- * setcolor - set the color of a character in a colormap
- */
-static color					/* previous color */
-setcolor(struct colormap * cm,
-		 chr c,
-		 color co)
-{
-	uchr		uc = c;
-	int			shift;
-	int			level;
-	int			b;
-	int			bottom;
-	union tree *t;
-	union tree *newt;
-	union tree *fillt;
-	union tree *lastt;
-	union tree *cb;
-	color		prev;
-
-	assert(cm->magic == CMMAGIC);
-	if (CISERR() || co == COLORLESS)
-		return COLORLESS;
-
-	t = cm->tree;
-	for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
-		 level++, shift -= BYTBITS)
-	{
-		b = (uc >> shift) & BYTMASK;
-		lastt = t;
-		t = lastt->tptr[b];
-		assert(t != NULL);
-		fillt = &cm->tree[level + 1];
-		bottom = (shift <= BYTBITS) ? 1 : 0;
-		cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt;
-		if (t == fillt || t == cb)
-		{						/* must allocate a new block */
-			newt = (union tree *) MALLOC((bottom) ?
-								sizeof(struct colors) : sizeof(struct ptrs));
-			if (newt == NULL)
-			{
-				CERR(REG_ESPACE);
-				return COLORLESS;
-			}
-			if (bottom)
-				memcpy(VS(newt->tcolor), VS(t->tcolor),
-					   BYTTAB * sizeof(color));
-			else
-				memcpy(VS(newt->tptr), VS(t->tptr),
-					   BYTTAB * sizeof(union tree *));
-			t = newt;
-			lastt->tptr[b] = t;
-		}
-	}
-
-	b = uc & BYTMASK;
-	prev = t->tcolor[b];
-	t->tcolor[b] = co;
-	return prev;
+	/*
+	 * Find which column it's in --- this is all locale-dependent.
+	 */
+	if (cm->hiarraycols > 1)
+	{
+		colnum = cclass_column_index(cm, c);
+		return cm->hicolormap[rownum * cm->hiarraycols + colnum];
+	}
+	else
+	{
+		/* fast path if no relevant cclasses */
+		return cm->hicolormap[rownum];
+	}
 }
 
 /*
@@ -216,7 +178,7 @@ maxcolor(struct colormap * cm)
 }
 
 /*
- * newcolor - find a new color (must be subject of setcolor at once)
+ * newcolor - find a new color (must be assigned at once)
  * Beware: may relocate the colordescs.
  */
 static color					/* COLORLESS for error */
@@ -278,12 +240,12 @@ newcolor(struct colormap * cm)
 		cd = &cm->cd[cm->max];
 	}
 
-	cd->nchrs = 0;
+	cd->nschrs = 0;
+	cd->nuchrs = 0;
 	cd->sub = NOSUB;
 	cd->arcs = NULL;
 	cd->firstchr = CHR_MIN;		/* in case never set otherwise */
 	cd->flags = 0;
-	cd->block = NULL;
 
 	return (color) (cd - cm->cd);
 }
@@ -305,13 +267,9 @@ freecolor(struct colormap * cm,
 
 	assert(cd->arcs == NULL);
 	assert(cd->sub == NOSUB);
-	assert(cd->nchrs == 0);
+	assert(cd->nschrs == 0);
+	assert(cd->nuchrs == 0);
 	cd->flags = FREECOL;
-	if (cd->block != NULL)
-	{
-		FREE(cd->block);
-		cd->block = NULL;		/* just paranoia */
-	}
 
 	if ((size_t) co == cm->max)
 	{
@@ -354,17 +312,25 @@ static color
 pseudocolor(struct colormap * cm)
 {
 	color		co;
+	struct colordesc *cd;
 
 	co = newcolor(cm);
 	if (CISERR())
 		return COLORLESS;
-	cm->cd[co].nchrs = 1;
-	cm->cd[co].flags = PSEUDO;
+	cd = &cm->cd[co];
+	cd->nschrs = 0;
+	cd->nuchrs = 1;				/* pretend it is in the upper map */
+	cd->sub = NOSUB;
+	cd->arcs = NULL;
+	cd->firstchr = CHR_MIN;
+	cd->flags = PSEUDO;
 	return co;
 }
 
 /*
  * subcolor - allocate a new subcolor (if necessary) to this chr
+ *
+ * This works only for chrs that map into the low color map.
  */
 static color
 subcolor(struct colormap * cm, chr c)
@@ -372,7 +338,9 @@ subcolor(struct colormap * cm, chr c)
 	color		co;				/* current color of c */
 	color		sco;			/* new subcolor */
 
-	co = GETCOLOR(cm, c);
+	assert(c <= MAX_SIMPLE_CHR);
+
+	co = cm->locolormap[c - CHR_MIN];
 	sco = newsub(cm, co);
 	if (CISERR())
 		return COLORLESS;
@@ -380,11 +348,37 @@ subcolor(struct colormap * cm, chr c)
 
 	if (co == sco)				/* already in an open subcolor */
 		return co;				/* rest is redundant */
-	cm->cd[co].nchrs--;
-	if (cm->cd[sco].nchrs == 0)
+	cm->cd[co].nschrs--;
+	if (cm->cd[sco].nschrs == 0)
 		cm->cd[sco].firstchr = c;
-	cm->cd[sco].nchrs++;
-	setcolor(cm, c, sco);
+	cm->cd[sco].nschrs++;
+	cm->locolormap[c - CHR_MIN] = sco;
+	return sco;
+}
+
+/*
+ * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry
+ *
+ * This is the same processing as subcolor(), but for entries in the high
+ * colormap, which do not necessarily correspond to exactly one chr code.
+ */
+static color
+subcolorhi(struct colormap * cm, color *pco)
+{
+	color		co;				/* current color of entry */
+	color		sco;			/* new subcolor */
+
+	co = *pco;
+	sco = newsub(cm, co);
+	if (CISERR())
+		return COLORLESS;
+	assert(sco != COLORLESS);
+
+	if (co == sco)				/* already in an open subcolor */
+		return co;				/* rest is redundant */
+	cm->cd[co].nuchrs--;
+	cm->cd[sco].nuchrs++;
+	*pco = sco;
 	return sco;
 }
 
@@ -400,7 +394,8 @@ newsub(struct colormap * cm,
 	sco = cm->cd[co].sub;
 	if (sco == NOSUB)
 	{							/* color has no open subcolor */
-		if (cm->cd[co].nchrs == 1)		/* optimization */
+		/* optimization: singly-referenced color need not be subcolored */
+		if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1)
 			return co;
 		sco = newcolor(cm);		/* must create subcolor */
 		if (sco == COLORLESS)
...@@ -417,136 +412,500 @@ newsub(struct colormap * cm, ...@@ -417,136 +412,500 @@ newsub(struct colormap * cm,
} }
/* /*
* subrange - allocate new subcolors to this range of chrs, fill in arcs * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow
*
* Returns array index of new row. Note the array might move.
*/ */
static void static int
subrange(struct vars * v, newhicolorrow(struct colormap * cm,
chr from, int oldrow)
chr to,
struct state * lp,
struct state * rp)
{ {
uchr uf; int newrow = cm->hiarrayrows;
color *newrowptr;
int i; int i;
assert(from <= to); /* Assign a fresh array row index, enlarging storage if needed */
if (newrow >= cm->maxarrayrows)
{
color *newarray;
if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2))
{
CERR(REG_ESPACE);
return 0;
}
newarray = (color *) REALLOC(cm->hicolormap,
cm->maxarrayrows * 2 *
cm->hiarraycols * sizeof(color));
if (newarray == NULL)
{
CERR(REG_ESPACE);
return 0;
}
cm->hicolormap = newarray;
cm->maxarrayrows *= 2;
}
cm->hiarrayrows++;
/* Copy old row data */
newrowptr = &cm->hicolormap[newrow * cm->hiarraycols];
memcpy(newrowptr,
&cm->hicolormap[oldrow * cm->hiarraycols],
cm->hiarraycols * sizeof(color));
/* Increase color reference counts to reflect new colormap entries */
for (i = 0; i < cm->hiarraycols; i++)
cm->cd[newrowptr[i]].nuchrs++;
/* first, align "from" on a tree-block boundary */ return newrow;
uf = (uchr) from; }
i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf);
for (; from <= to && i > 0; i--, from++) /*
newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp); * newhicolorcols - create a new set of columns in the high colormap
if (from > to) /* didn't reach a boundary */ *
* Essentially, extends the 2-D array to the right with a copy of itself.
*/
static void
newhicolorcols(struct colormap * cm)
{
color *newarray;
int r,
c;
if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2))
{
CERR(REG_ESPACE);
return; return;
}
newarray = (color *) REALLOC(cm->hicolormap,
cm->maxarrayrows *
cm->hiarraycols * 2 * sizeof(color));
if (newarray == NULL)
{
CERR(REG_ESPACE);
return;
}
cm->hicolormap = newarray;
/* deal with whole blocks */ /* Duplicate existing columns to the right, and increase ref counts */
for (; to - from >= BYTTAB; from += BYTTAB) /* Must work backwards in the array because we realloc'd in place */
subblock(v, from, lp, rp); for (r = cm->hiarrayrows - 1; r >= 0; r--)
{
color *oldrowptr = &newarray[r * cm->hiarraycols];
color *newrowptr = &newarray[r * cm->hiarraycols * 2];
color *newrowptr2 = newrowptr + cm->hiarraycols;
for (c = 0; c < cm->hiarraycols; c++)
{
color co = oldrowptr[c];
newrowptr[c] = newrowptr2[c] = co;
cm->cd[co].nuchrs++;
}
}
/* clean up any remaining partial table */ cm->hiarraycols *= 2;
for (; from <= to; from++)
newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
} }
/* /*
* subblock - allocate new subcolors for one tree block of chrs, fill in arcs * subcolorcvec - allocate new subcolors to cvec members, fill in arcs
* *
* Note: subcolors that are created during execution of this function * For each chr "c" represented by the cvec, do the equivalent of
* will not be given a useful value of firstchr; it'll be left as CHR_MIN. * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
* For the current usage of firstchr in pg_regprefix, this does not matter *
* because such subcolors won't occur in the common prefix of a regex. * Note that in typical cases, many of the subcolors are the same.
* While newarc() would discard duplicate arc requests, we can save
* some cycles by not calling it repetitively to begin with. This is
* mechanized with the "lastsubcolor" state variable.
*/ */
static void static void
subblock(struct vars * v, subcolorcvec(struct vars * v,
chr start, /* first of BYTTAB chrs */ struct cvec * cv,
struct state * lp, struct state * lp,
struct state * rp) struct state * rp)
{ {
uchr uc = start;
struct colormap *cm = v->cm; struct colormap *cm = v->cm;
int shift; color lastsubcolor = COLORLESS;
int level; chr ch,
from,
to;
const chr *p;
int i; int i;
int b;
union tree *t;
union tree *cb;
union tree *fillt;
union tree *lastt;
int previ;
int ndone;
color co;
color sco;
assert((uc % BYTTAB) == 0); /* ordinary characters */
for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
{
ch = *p;
subcoloronechr(v, ch, lp, rp, &lastsubcolor);
NOERR();
}
/* and the ranges */
for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
{
from = *p;
to = *(p + 1);
if (from <= MAX_SIMPLE_CHR)
{
/* deal with simple chars one at a time */
chr lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR;
while (from <= lim)
{
color sco = subcolor(cm, from);
NOERR();
if (sco != lastsubcolor)
{
newarc(v->nfa, PLAIN, sco, lp, rp);
NOERR();
lastsubcolor = sco;
}
from++;
}
}
/* deal with any part of the range that's above MAX_SIMPLE_CHR */
if (from < to)
subcoloronerange(v, from, to, lp, rp, &lastsubcolor);
else if (from == to)
subcoloronechr(v, from, lp, rp, &lastsubcolor);
NOERR();
}
/* and deal with cclass if any */
if (cv->cclasscode >= 0)
{
int classbit;
color *pco;
int r,
c;
/* Enlarge array if we don't have a column bit assignment for cclass */
if (cm->classbits[cv->cclasscode] == 0)
{
cm->classbits[cv->cclasscode] = cm->hiarraycols;
newhicolorcols(cm);
NOERR();
}
/* Apply subcolorhi() and make arc for each entry in relevant cols */
classbit = cm->classbits[cv->cclasscode];
pco = cm->hicolormap;
for (r = 0; r < cm->hiarrayrows; r++)
{
for (c = 0; c < cm->hiarraycols; c++)
{
if (c & classbit)
{
color sco = subcolorhi(cm, pco);
NOERR();
/* add the arc if needed */
if (sco != lastsubcolor)
{
newarc(v->nfa, PLAIN, sco, lp, rp);
NOERR();
lastsubcolor = sco;
}
}
pco++;
}
}
}
}
/*
* subcoloronechr - do subcolorcvec's work for a singleton chr
*
* We could just let subcoloronerange do this, but it's a bit more efficient
* if we exploit the single-chr case. Also, callers find it useful for this
* to be able to handle both low and high chr codes.
*/
static void
subcoloronechr(struct vars * v,
chr ch,
struct state * lp,
struct state * rp,
color *lastsubcolor)
{
struct colormap *cm = v->cm;
colormaprange *newranges;
int numnewranges;
colormaprange *oldrange;
int oldrangen;
int newrow;
/* Easy case for low chr codes */
if (ch <= MAX_SIMPLE_CHR)
{
color sco = subcolor(cm, ch);
/* find its color block, making new pointer blocks as needed */ NOERR();
t = cm->tree; if (sco != *lastsubcolor)
fillt = NULL;
for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
level++, shift -= BYTBITS)
{ {
b = (uc >> shift) & BYTMASK; newarc(v->nfa, PLAIN, sco, lp, rp);
lastt = t; *lastsubcolor = sco;
t = lastt->tptr[b]; }
assert(t != NULL); return;
fillt = &cm->tree[level + 1]; }
if (t == fillt && shift > BYTBITS)
{ /* need new ptr block */ /*
t = (union tree *) MALLOC(sizeof(struct ptrs)); * Potentially, we could need two more colormapranges than we have now, if
if (t == NULL) * the given chr is in the middle of some existing range.
*/
newranges = (colormaprange *)
MALLOC((cm->numcmranges + 2) * sizeof(colormaprange));
if (newranges == NULL)
{ {
CERR(REG_ESPACE); CERR(REG_ESPACE);
return; return;
} }
memcpy(VS(t->tptr), VS(fillt->tptr), numnewranges = 0;
BYTTAB * sizeof(union tree *));
lastt->tptr[b] = t; /* Ranges before target are unchanged */
for (oldrange = cm->cmranges, oldrangen = 0;
oldrangen < cm->numcmranges;
oldrange++, oldrangen++)
{
if (oldrange->cmax >= ch)
break;
newranges[numnewranges++] = *oldrange;
} }
/* Match target chr against current range */
if (oldrangen >= cm->numcmranges || oldrange->cmin > ch)
{
/* chr does not belong to any existing range, make a new one */
newranges[numnewranges].cmin = ch;
newranges[numnewranges].cmax = ch;
/* row state should be cloned from the "all others" row */
newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
numnewranges++;
}
else if (oldrange->cmin == oldrange->cmax)
{
/* we have an existing singleton range matching the chr */
newranges[numnewranges++] = *oldrange;
newrow = oldrange->rownum;
/* we've now fully processed this old range */
oldrange++, oldrangen++;
}
else
{
/* chr is a subset of this existing range, must split it */
if (ch > oldrange->cmin)
{
/* emit portion of old range before chr */
newranges[numnewranges].cmin = oldrange->cmin;
newranges[numnewranges].cmax = ch - 1;
newranges[numnewranges].rownum = oldrange->rownum;
numnewranges++;
}
/* emit chr as singleton range, initially cloning from range */
newranges[numnewranges].cmin = ch;
newranges[numnewranges].cmax = ch;
newranges[numnewranges].rownum = newrow =
newhicolorrow(cm, oldrange->rownum);
numnewranges++;
if (ch < oldrange->cmax)
{
/* emit portion of old range after chr */
newranges[numnewranges].cmin = ch + 1;
newranges[numnewranges].cmax = oldrange->cmax;
/* must clone the row if we are making two new ranges from old */
newranges[numnewranges].rownum =
(ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
oldrange->rownum;
numnewranges++;
}
/* we've now fully processed this old range */
oldrange++, oldrangen++;
} }
/* special cases: fill block or solid block */ /* Update colors in newrow and create arcs as needed */
co = t->tcolor[0]; subcoloronerow(v, newrow, lp, rp, lastsubcolor);
cb = cm->cd[co].block;
if (t == fillt || t == cb) /* Ranges after target are unchanged */
for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
{ {
/* either way, we want a subcolor solid block */ newranges[numnewranges++] = *oldrange;
sco = newsub(cm, co); }
t = cm->cd[sco].block;
if (t == NULL) /* Assert our original space estimate was adequate */
{ /* must set it up */ assert(numnewranges <= (cm->numcmranges + 2));
t = (union tree *) MALLOC(sizeof(struct colors));
if (t == NULL) /* And finally, store back the updated list of ranges */
if (cm->cmranges != NULL)
FREE(cm->cmranges);
cm->cmranges = newranges;
cm->numcmranges = numnewranges;
}
/*
* subcoloronerange - do subcolorcvec's work for a high range
*/
static void
subcoloronerange(struct vars * v,
chr from,
chr to,
struct state * lp,
struct state * rp,
color *lastsubcolor)
{
struct colormap *cm = v->cm;
colormaprange *newranges;
int numnewranges;
colormaprange *oldrange;
int oldrangen;
int newrow;
/* Caller should take care of non-high-range cases */
assert(from > MAX_SIMPLE_CHR);
assert(from < to);
/*
* Potentially, if we have N non-adjacent ranges, we could need as many as
* 2N+1 result ranges (consider case where new range spans 'em all).
*/
newranges = (colormaprange *)
MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange));
if (newranges == NULL)
{ {
CERR(REG_ESPACE); CERR(REG_ESPACE);
return; return;
} }
for (i = 0; i < BYTTAB; i++) numnewranges = 0;
t->tcolor[i] = sco;
cm->cd[sco].block = t; /* Ranges before target are unchanged */
for (oldrange = cm->cmranges, oldrangen = 0;
oldrangen < cm->numcmranges;
oldrange++, oldrangen++)
{
if (oldrange->cmax >= from)
break;
newranges[numnewranges++] = *oldrange;
} }
/* find loop must have run at least once */
lastt->tptr[b] = t; /*
newarc(v->nfa, PLAIN, sco, lp, rp); * Deal with ranges that (partially) overlap the target. As we process
cm->cd[co].nchrs -= BYTTAB; * each such range, increase "from" to remove the dealt-with characters
cm->cd[sco].nchrs += BYTTAB; * from the target range.
return; */
while (oldrangen < cm->numcmranges && oldrange->cmin <= to)
{
if (from < oldrange->cmin)
{
/* Handle portion of new range that corresponds to no old range */
newranges[numnewranges].cmin = from;
newranges[numnewranges].cmax = oldrange->cmin - 1;
/* row state should be cloned from the "all others" row */
newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
numnewranges++;
/* Update colors in newrow and create arcs as needed */
subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+			/* We've now fully processed the part of new range before old */
+			from = oldrange->cmin;
+		}
-	}
-	/* general case, a mixed block to be altered */
-	i = 0;
-	while (i < BYTTAB)
-	{
-		co = t->tcolor[i];
-		sco = newsub(cm, co);
-		newarc(v->nfa, PLAIN, sco, lp, rp);
-		previ = i;
-		do
-		{
-			t->tcolor[i++] = sco;
-		} while (i < BYTTAB && t->tcolor[i] == co);
-		ndone = i - previ;
-		cm->cd[co].nchrs -= ndone;
-		cm->cd[sco].nchrs += ndone;
+		if (from <= oldrange->cmin && to >= oldrange->cmax)
+		{
+			/* old range is fully contained in new, process it in-place */
+			newranges[numnewranges++] = *oldrange;
+			newrow = oldrange->rownum;
+			from = oldrange->cmax + 1;
+		}
+		else
+		{
+			/* some part of old range does not overlap new range */
+			if (from > oldrange->cmin)
+			{
+				/* emit portion of old range before new range */
+				newranges[numnewranges].cmin = oldrange->cmin;
+				newranges[numnewranges].cmax = from - 1;
+				newranges[numnewranges].rownum = oldrange->rownum;
+				numnewranges++;
+			}
+			/* emit common subrange, initially cloning from old range */
+			newranges[numnewranges].cmin = from;
+			newranges[numnewranges].cmax =
+				(to < oldrange->cmax) ? to : oldrange->cmax;
+			newranges[numnewranges].rownum = newrow =
+				newhicolorrow(cm, oldrange->rownum);
+			numnewranges++;
+			if (to < oldrange->cmax)
+			{
+				/* emit portion of old range after new range */
+				newranges[numnewranges].cmin = to + 1;
+				newranges[numnewranges].cmax = oldrange->cmax;
+				/* must clone the row if we are making two new ranges from old */
+				newranges[numnewranges].rownum =
+					(from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
+					oldrange->rownum;
+				numnewranges++;
+			}
+			from = oldrange->cmax + 1;
+		}
+		/* Update colors in newrow and create arcs as needed */
+		subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+		/* we've now fully processed this old range */
+		oldrange++, oldrangen++;
+	}
+	if (from <= to)
+	{
+		/* Handle portion of new range that corresponds to no old range */
+		newranges[numnewranges].cmin = from;
+		newranges[numnewranges].cmax = to;
+		/* row state should be cloned from the "all others" row */
+		newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+		numnewranges++;
+		/* Update colors in newrow and create arcs as needed */
+		subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+	}
+	/* Ranges after target are unchanged */
+	for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
+	{
+		newranges[numnewranges++] = *oldrange;
+	}
+	/* Assert our original space estimate was adequate */
+	assert(numnewranges <= (cm->numcmranges * 2 + 1));
+	/* And finally, store back the updated list of ranges */
+	if (cm->cmranges != NULL)
+		FREE(cm->cmranges);
+	cm->cmranges = newranges;
+	cm->numcmranges = numnewranges;
+}
+/*
+ * subcoloronerow - do subcolorcvec's work for one new row in the high colormap
+ */
+static void
+subcoloronerow(struct vars * v,
+			   int rownum,
+			   struct state * lp,
+			   struct state * rp,
+			   color *lastsubcolor)
+{
+	struct colormap *cm = v->cm;
+	color	   *pco;
+	int			i;
+	/* Apply subcolorhi() and make arc for each entry in row */
+	pco = &cm->hicolormap[rownum * cm->hiarraycols];
+	for (i = 0; i < cm->hiarraycols; pco++, i++)
+	{
+		color		sco = subcolorhi(cm, pco);
+		NOERR();
+		/* make the arc if needed */
+		if (sco != *lastsubcolor)
+		{
+			newarc(v->nfa, PLAIN, sco, lp, rp);
+			NOERR();
+			*lastsubcolor = sco;
+		}
 	}
 }
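The range-splitting step in subcoloronerange() can be illustrated in isolation. The following is a minimal sketch, not the committed code: the names `sk_range`, `sk_piece`, `sk_add` and `sk_split` are hypothetical, row cloning and arc creation are replaced by a simple `inside` flag, and it assumes no range ends at the maximum `unsigned` value. It shows how a sorted list of non-overlapping ranges is cut against a new range so that every output piece lies either wholly inside or wholly outside the new range, using at most 2N+1 pieces.

```c
#include <assert.h>

typedef struct { unsigned cmin, cmax; } sk_range;				/* input range */
typedef struct { unsigned cmin, cmax; int inside; } sk_piece;	/* output piece */

/* Append one piece to out[] and return the new count. */
static int
sk_add(sk_piece *out, int n, unsigned cmin, unsigned cmax, int inside)
{
	out[n].cmin = cmin;
	out[n].cmax = cmax;
	out[n].inside = inside;
	return n + 1;
}

/*
 * Split sorted, non-overlapping old[0..nold-1] against from..to (inclusive).
 * out[] must have room for 2 * nold + 1 pieces.  Returns the piece count.
 */
static int
sk_split(const sk_range *old, int nold, unsigned from, unsigned to,
		 sk_piece *out)
{
	int			i = 0;
	int			n = 0;

	assert(from <= to);
	/* ranges entirely before the target are copied unchanged */
	for (; i < nold && old[i].cmax < from; i++)
		n = sk_add(out, n, old[i].cmin, old[i].cmax, 0);
	/* ranges that (partially) overlap the target */
	for (; i < nold && old[i].cmin <= to; i++)
	{
		unsigned	hi = (to < old[i].cmax) ? to : old[i].cmax;

		if (from < old[i].cmin)
		{
			/* part of the target that matches no old range */
			n = sk_add(out, n, from, old[i].cmin - 1, 1);
			from = old[i].cmin;
		}
		else if (from > old[i].cmin)
			n = sk_add(out, n, old[i].cmin, from - 1, 0);	/* old part before */
		n = sk_add(out, n, from, hi, 1);					/* common subrange */
		if (to < old[i].cmax)
			n = sk_add(out, n, hi + 1, old[i].cmax, 0);	/* old part after */
		from = old[i].cmax + 1;
	}
	/* leftover part of the target past the last overlapping old range */
	if (from <= to)
		n = sk_add(out, n, from, to, 1);
	/* ranges entirely after the target are copied unchanged */
	for (; i < nold; i++)
		n = sk_add(out, n, old[i].cmin, old[i].cmax, 0);
	return n;
}
```

For example, splitting the ranges 10..20 and 30..40 against 15..35 would yield five pieces: 10..14 (outside), 15..20, 21..29 and 30..35 (inside), and 36..40 (outside). Adjacent inside pieces are kept separate here because, in the commit, each one is backed by a different colormap row.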
@@ -575,12 +934,12 @@ okcolors(struct nfa * nfa,
 		{
 			/* is subcolor, let parent deal with it */
 		}
-		else if (cd->nchrs == 0)
+		else if (cd->nschrs == 0 && cd->nuchrs == 0)
 		{
 			/* parent empty, its arcs change color to subcolor */
 			cd->sub = NOSUB;
 			scd = &cm->cd[sco];
-			assert(scd->nchrs > 0);
+			assert(scd->nschrs > 0 || scd->nuchrs > 0);
 			assert(scd->sub == sco);
 			scd->sub = NOSUB;
 			while ((a = cd->arcs) != NULL)
@@ -597,7 +956,7 @@ okcolors(struct nfa * nfa,
 			/* parent's arcs must gain parallel subcolor arcs */
 			cd->sub = NOSUB;
 			scd = &cm->cd[sco];
-			assert(scd->nchrs > 0);
+			assert(scd->nschrs > 0 || scd->nuchrs > 0);
 			assert(scd->sub == sco);
 			scd->sub = NOSUB;
 			for (a = cd->arcs; a != NULL; a = a->colorchain)
@@ -711,62 +1070,54 @@ dumpcolors(struct colormap * cm,
 	struct colordesc *end;
 	color		co;
 	chr			c;
-	char	   *has;
 	fprintf(f, "max %ld\n", (long) cm->max);
-	if (NBYTS > 1)
-		fillcheck(cm, cm->tree, 0, f);
 	end = CDEND(cm);
 	for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
+	{
 		if (!UNUSEDCOLOR(cd))
 		{
-			assert(cd->nchrs > 0);
-			has = (cd->block != NULL) ? "#" : "";
+			assert(cd->nschrs > 0 || cd->nuchrs > 0);
 			if (cd->flags & PSEUDO)
-				fprintf(f, "#%2ld%s(ps): ", (long) co, has);
+				fprintf(f, "#%2ld(ps): ", (long) co);
 			else
-				fprintf(f, "#%2ld%s(%2d): ", (long) co,
-						has, cd->nchrs);
+				fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs);
 			/*
 			 * Unfortunately, it's hard to do this next bit more efficiently.
-			 *
-			 * Spencer's original coding has the loop iterating from CHR_MIN
-			 * to CHR_MAX, but that's utterly unusable for 32-bit chr. For
-			 * debugging purposes it seems fine to print only chr codes up to
-			 * 1000 or so.
 			 */
-			for (c = CHR_MIN; c < 1000; c++)
+			for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
 				if (GETCOLOR(cm, c) == co)
 					dumpchr(c, f);
 			fprintf(f, "\n");
 		}
-}
-/*
- * fillcheck - check proper filling of a tree
- */
-static void
-fillcheck(struct colormap * cm,
-		  union tree * tree,
-		  int level, /* level number (top == 0) of this block */
-		  FILE *f)
-{
-	int			i;
-	union tree *t;
-	union tree *fillt = &cm->tree[level + 1];
-	assert(level < NBYTS - 1); /* this level has pointers */
-	for (i = BYTTAB - 1; i >= 0; i--)
-	{
-		t = tree->tptr[i];
-		if (t == NULL)
-			fprintf(f, "NULL found in filled tree!\n");
-		else if (t == fillt)
-		{
-		}
-		else if (level < NBYTS - 2) /* more pointer blocks below */
-			fillcheck(cm, t, level + 1, f);
-	}
+	}
+	/* dump the high colormap if it contains anything interesting */
+	if (cm->hiarrayrows > 1 || cm->hiarraycols > 1)
+	{
+		int			r,
+					c;
+		const color *rowptr;
+		fprintf(f, "other:\t");
+		for (c = 0; c < cm->hiarraycols; c++)
+		{
+			fprintf(f, "\t%ld", (long) cm->hicolormap[c]);
+		}
+		fprintf(f, "\n");
+		for (r = 0; r < cm->numcmranges; r++)
+		{
+			dumpchr(cm->cmranges[r].cmin, f);
+			fprintf(f, "..");
+			dumpchr(cm->cmranges[r].cmax, f);
+			fprintf(f, ":");
+			rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols];
+			for (c = 0; c < cm->hiarraycols; c++)
+			{
+				fprintf(f, "\t%ld", (long) rowptr[c]);
+			}
+			fprintf(f, "\n");
+		}
+	}
 }
......
@@ -34,7 +34,8 @@
 /*
  * Notes:
- * Only (selected) functions in _this_ file should treat chr* as non-constant.
+ * Only (selected) functions in _this_ file should treat the chr arrays
+ * of a cvec as non-constant.
  */
 /*
@@ -67,6 +68,7 @@ clearcvec(struct cvec * cv)
 	assert(cv != NULL);
 	cv->nchrs = 0;
 	cv->nranges = 0;
+	cv->cclasscode = -1;
 	return cv;
 }
......
@@ -349,6 +349,19 @@ static const struct cname
 	}
 };
+/*
+ * The following arrays define the valid character class names.
+ */
+static const char *const classNames[NUM_CCLASSES + 1] = {
+	"alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
+	"lower", "print", "punct", "space", "upper", "xdigit", NULL
+};
+enum classes
+{
+	CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
+	CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
+};
 /*
  * We do not use the hard-wired Unicode classification tables that Tcl does.
@@ -543,21 +556,6 @@ cclass(struct vars * v, /* context */
 	int			i,
 				index;
-	/*
-	 * The following arrays define the valid character class names.
-	 */
-	static const char *const classNames[] = {
-		"alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
-		"lower", "print", "punct", "space", "upper", "xdigit", NULL
-	};
-	enum classes
-	{
-		CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
-		CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
-	};
 	/*
 	 * Map the name to the corresponding enumerated value.
 	 */
@@ -593,18 +591,20 @@ cclass(struct vars * v, /* context */
 	 * pg_ctype_get_cache so that we can cache the results. Other classes
 	 * have definitions that are hard-wired here, and for those we just
 	 * construct a transient cvec on the fly.
+	 *
+	 * NB: keep this code in sync with cclass_column_index(), below.
 	 */
 	switch ((enum classes) index)
 	{
 		case CC_PRINT:
-			cv = pg_ctype_get_cache(pg_wc_isprint);
+			cv = pg_ctype_get_cache(pg_wc_isprint, index);
 			break;
 		case CC_ALNUM:
-			cv = pg_ctype_get_cache(pg_wc_isalnum);
+			cv = pg_ctype_get_cache(pg_wc_isalnum, index);
 			break;
 		case CC_ALPHA:
-			cv = pg_ctype_get_cache(pg_wc_isalpha);
+			cv = pg_ctype_get_cache(pg_wc_isalpha, index);
 			break;
 		case CC_ASCII:
 			/* hard-wired meaning */
@@ -625,10 +625,10 @@ cclass(struct vars * v, /* context */
 			addrange(cv, 0x7f, 0x9f);
 			break;
 		case CC_DIGIT:
-			cv = pg_ctype_get_cache(pg_wc_isdigit);
+			cv = pg_ctype_get_cache(pg_wc_isdigit, index);
 			break;
 		case CC_PUNCT:
-			cv = pg_ctype_get_cache(pg_wc_ispunct);
+			cv = pg_ctype_get_cache(pg_wc_ispunct, index);
 			break;
 		case CC_XDIGIT:
@@ -646,16 +646,16 @@ cclass(struct vars * v, /* context */
 			}
 			break;
 		case CC_SPACE:
-			cv = pg_ctype_get_cache(pg_wc_isspace);
+			cv = pg_ctype_get_cache(pg_wc_isspace, index);
 			break;
 		case CC_LOWER:
-			cv = pg_ctype_get_cache(pg_wc_islower);
+			cv = pg_ctype_get_cache(pg_wc_islower, index);
 			break;
 		case CC_UPPER:
-			cv = pg_ctype_get_cache(pg_wc_isupper);
+			cv = pg_ctype_get_cache(pg_wc_isupper, index);
 			break;
 		case CC_GRAPH:
-			cv = pg_ctype_get_cache(pg_wc_isgraph);
+			cv = pg_ctype_get_cache(pg_wc_isgraph, index);
 			break;
 	}
@@ -665,6 +665,47 @@ cclass(struct vars * v, /* context */
 	return cv;
 }
+/*
+ * cclass_column_index - get appropriate high colormap column index for chr
+ */
+static int
+cclass_column_index(struct colormap * cm, chr c)
+{
+	int			colnum = 0;
+	/* Shouldn't go through all these pushups for simple chrs */
+	assert(c > MAX_SIMPLE_CHR);
+	/*
+	 * Note: we should not see requests to consider cclasses that are not
+	 * treated as locale-specific by cclass(), above.
+	 */
+	if (cm->classbits[CC_PRINT] && pg_wc_isprint(c))
+		colnum |= cm->classbits[CC_PRINT];
+	if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c))
+		colnum |= cm->classbits[CC_ALNUM];
+	if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c))
+		colnum |= cm->classbits[CC_ALPHA];
+	assert(cm->classbits[CC_ASCII] == 0);
+	assert(cm->classbits[CC_BLANK] == 0);
+	assert(cm->classbits[CC_CNTRL] == 0);
+	if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c))
+		colnum |= cm->classbits[CC_DIGIT];
+	if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c))
+		colnum |= cm->classbits[CC_PUNCT];
+	assert(cm->classbits[CC_XDIGIT] == 0);
+	if (cm->classbits[CC_SPACE] && pg_wc_isspace(c))
+		colnum |= cm->classbits[CC_SPACE];
+	if (cm->classbits[CC_LOWER] && pg_wc_islower(c))
+		colnum |= cm->classbits[CC_LOWER];
+	if (cm->classbits[CC_UPPER] && pg_wc_isupper(c))
+		colnum |= cm->classbits[CC_UPPER];
+	if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c))
+		colnum |= cm->classbits[CC_GRAPH];
+	return colnum;
+}
 /*
  * allcases - supply cvec for all case counterparts of a chr (including itself)
  *
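The column-index computation in cclass_column_index() can be tried outside the regex library. The sketch below is only an approximation under stated assumptions: it uses the standard `<wctype.h>` predicates in place of PostgreSQL's `pg_wc_*` wrappers, covers three classes instead of thirteen, and the names `SK_ALPHA`, `SK_DIGIT`, `SK_SPACE` and `sk_column_index` are hypothetical. Each class that the regex cares about owns one bit, and the bits of all classes the character belongs to are OR'd together to select a column.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

enum { SK_ALPHA, SK_DIGIT, SK_SPACE, SK_NUM_CLASSES };

/*
 * classbits[k] is zero when class k is not used by the pattern; otherwise
 * it is the single bit to OR into the column index for members of class k.
 */
static int
sk_column_index(const int classbits[SK_NUM_CLASSES], wint_t c)
{
	int			colnum = 0;

	assert(classbits != NULL);
	/* the probe function is only called for classes the pattern uses */
	if (classbits[SK_ALPHA] && iswalpha(c))
		colnum |= classbits[SK_ALPHA];
	if (classbits[SK_DIGIT] && iswdigit(c))
		colnum |= classbits[SK_DIGIT];
	if (classbits[SK_SPACE] && iswspace(c))
		colnum |= classbits[SK_SPACE];
	return colnum;
}
```

With `classbits = {1, 2, 0}` (alpha and digit in use, space ignored), a letter maps to column 1, a digit to column 2, and a space or punctuation character to column 0. Because a class with a zero bit is skipped before its predicate is evaluated, unused classes cost nothing at run time.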
......
@@ -736,7 +736,7 @@ store_match(pg_ctype_cache *pcc, pg_wchar chr1, int nchrs)
  * Note that the result must not be freed or modified by caller.
  */
 static struct cvec *
-pg_ctype_get_cache(pg_wc_probefunc probefunc)
+pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode)
 {
 	pg_ctype_cache *pcc;
 	pg_wchar	max_chr;
@@ -770,31 +770,43 @@ pg_ctype_get_cache(pg_wc_probefunc probefunc)
 	pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
 	if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
 		goto out_of_memory;
+	pcc->cv.cclasscode = cclasscode;
 	/*
-	 * Decide how many character codes we ought to look through. For C locale
-	 * there's no need to go further than 127. Otherwise, if the encoding is
-	 * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
-	 * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
-	 * Multilingual Plane) without creating significant performance issues due
-	 * to too many characters being fed through the colormap code. This will
-	 * need redesign to fix reasonably, but at least for the moment we have
-	 * all common European languages covered. Otherwise (not C, not UTF8) go
-	 * up to 255. These limits are interrelated with restrictions discussed
-	 * at the head of this file.
+	 * Decide how many character codes we ought to look through. In general
+	 * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at
+	 * runtime using the "high colormap" mechanism. However, in C locale
+	 * there's no need to go further than 127, and if we only have a 1-byte
+	 * <ctype.h> API there's no need to go further than that can handle.
+	 *
+	 * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the
+	 * output cvec as not having any locale-dependent behavior, since there
+	 * will be no need to do any run-time locale checks. (The #if's here
+	 * would always be true for production values of MAX_SIMPLE_CHR, but it's
+	 * useful to allow it to be small for testing purposes.)
 	 */
 	switch (pg_regex_strategy)
 	{
 		case PG_REGEX_LOCALE_C:
+#if MAX_SIMPLE_CHR >= 127
 			max_chr = (pg_wchar) 127;
+			pcc->cv.cclasscode = -1;
+#else
+			max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
 			break;
 		case PG_REGEX_LOCALE_WIDE:
 		case PG_REGEX_LOCALE_WIDE_L:
-			max_chr = (pg_wchar) 0x7FF;
+			max_chr = (pg_wchar) MAX_SIMPLE_CHR;
 			break;
 		case PG_REGEX_LOCALE_1BYTE:
 		case PG_REGEX_LOCALE_1BYTE_L:
+#if MAX_SIMPLE_CHR >= UCHAR_MAX
 			max_chr = (pg_wchar) UCHAR_MAX;
+			pcc->cv.cclasscode = -1;
+#else
+			max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
 			break;
 		default:
 			max_chr = 0; /* can't get here, but keep compiler quiet */
......
@@ -55,7 +55,6 @@ static void cbracket(struct vars *, struct state *, struct state *);
 static void brackpart(struct vars *, struct state *, struct state *);
 static const chr *scanplain(struct vars *);
 static void onechr(struct vars *, chr, struct state *, struct state *);
-static void dovec(struct vars *, struct cvec *, struct state *, struct state *);
 static void wordchrs(struct vars *);
 static void processlacon(struct vars *, struct state *, struct state *, int,
 			 struct state *, struct state *);
@@ -96,16 +95,19 @@ static chr chrnamed(struct vars *, const chr *, const chr *, chr);
 /* === regc_color.c === */
 static void initcm(struct vars *, struct colormap *);
 static void freecm(struct colormap *);
-static void cmtreefree(struct colormap *, union tree *, int);
-static color setcolor(struct colormap *, chr, color);
 static color maxcolor(struct colormap *);
 static color newcolor(struct colormap *);
 static void freecolor(struct colormap *, color);
 static color pseudocolor(struct colormap *);
-static color subcolor(struct colormap *, chr c);
+static color subcolor(struct colormap *, chr);
+static color subcolorhi(struct colormap *, color *);
 static color newsub(struct colormap *, color);
-static void subrange(struct vars *, chr, chr, struct state *, struct state *);
-static void subblock(struct vars *, chr, struct state *, struct state *);
+static int newhicolorrow(struct colormap *, int);
+static void newhicolorcols(struct colormap *);
+static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *);
+static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *);
+static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *);
+static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *);
 static void okcolors(struct nfa *, struct colormap *);
 static void colorchain(struct colormap *, struct arc *);
 static void uncolorchain(struct colormap *, struct arc *);
@@ -114,7 +116,6 @@ static void colorcomplement(struct nfa *, struct colormap *, int, struct state *
 #ifdef REG_DEBUG
 static void dumpcolors(struct colormap *, FILE *);
-static void fillcheck(struct colormap *, union tree *, int, FILE *);
 static void dumpchr(chr, FILE *);
 #endif
 /* === regc_nfa.c === */
@@ -215,6 +216,7 @@ static struct cvec *range(struct vars *, chr, chr, int);
 static int before(chr, chr);
 static struct cvec *eclass(struct vars *, chr, int);
 static struct cvec *cclass(struct vars *, const chr *, const chr *, int);
+static int cclass_column_index(struct colormap *, chr);
 static struct cvec *allcases(struct vars *, chr);
 static int cmp(const chr *, const chr *, size_t);
 static int casecmp(const chr *, const chr *, size_t);
@@ -1467,7 +1469,7 @@ brackpart(struct vars * v,
 			NOERR();
 			cv = eclass(v, startc, (v->cflags & REG_ICASE));
 			NOERR();
-			dovec(v, cv, lp, rp);
+			subcolorcvec(v, cv, lp, rp);
 			return;
 			break;
 		case CCLASS:
@@ -1477,7 +1479,7 @@ brackpart(struct vars * v,
 			NOERR();
 			cv = cclass(v, startp, endp, (v->cflags & REG_ICASE));
 			NOERR();
-			dovec(v, cv, lp, rp);
+			subcolorcvec(v, cv, lp, rp);
 			return;
 			break;
 		default:
@@ -1523,7 +1525,7 @@ brackpart(struct vars * v,
 		NOTE(REG_UUNPORT);
 	cv = range(v, startc, endc, (v->cflags & REG_ICASE));
 	NOERR();
-	dovec(v, cv, lp, rp);
+	subcolorcvec(v, cv, lp, rp);
 }
 /*
@@ -1565,46 +1567,14 @@ onechr(struct vars * v,
 {
 	if (!(v->cflags & REG_ICASE))
 	{
-		newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
+		color		lastsubcolor = COLORLESS;
+		subcoloronechr(v, c, lp, rp, &lastsubcolor);
 		return;
 	}
 	/* rats, need general case anyway... */
-	dovec(v, allcases(v, c), lp, rp);
-}
-/*
- * dovec - fill in arcs for each element of a cvec
- */
-static void
-dovec(struct vars * v,
-	  struct cvec * cv,
-	  struct state * lp,
-	  struct state * rp)
-{
-	chr			ch,
-				from,
-				to;
-	const chr  *p;
-	int			i;
-	/* ordinary characters */
-	for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
-	{
-		ch = *p;
-		newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp);
-		NOERR();
-	}
-	/* and the ranges */
-	for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
-	{
-		from = *p;
-		to = *(p + 1);
-		if (from <= to)
-			subrange(v, from, to, lp, rp);
-		NOERR();
-	}
+	subcolorcvec(v, allcases(v, c), lp, rp);
 }
 /*
......
@@ -28,10 +28,6 @@
 #include "regex/regexport.h"
-static void scancolormap(struct colormap * cm, int co,
-			 union tree * t, int level, chr partial,
-			 pg_wchar **chars, int *chars_len);
 /*
  * Get total number of NFA states.
@@ -187,10 +183,7 @@ pg_reg_colorisend(const regex_t *regex, int co)
  *
  * Note: we return -1 if the color number is invalid, or if it is a special
  * color (WHITE or a pseudocolor), or if the number of members is uncertain.
- * The latter case cannot arise right now but is specified to allow for future
- * improvements (see musings about run-time handling of higher character codes
- * in regex/README). Callers should not try to extract the members if -1 is
- * returned.
+ * Callers should not try to extract the members if -1 is returned.
  */
 int
 pg_reg_getnumcharacters(const regex_t *regex, int co)
@@ -205,7 +198,18 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
 	if (cm->cd[co].flags & PSEUDO) /* also pseudocolors (BOS etc) */
 		return -1;
-	return cm->cd[co].nchrs;
+	/*
+	 * If the color appears anywhere in the high colormap, treat its number of
+	 * members as uncertain. In principle we could determine all the specific
+	 * chrs corresponding to each such entry, but it would be expensive
+	 * (particularly if character class tests are required) and it doesn't
+	 * seem worth it.
+	 */
+	if (cm->cd[co].nuchrs != 0)
+		return -1;
+	/* OK, return the known number of member chrs */
+	return cm->cd[co].nschrs;
 }
 /*
@@ -222,6 +226,7 @@ pg_reg_getcharacters(const regex_t *regex, int co,
 					 pg_wchar *chars, int chars_len)
 {
 	struct colormap *cm;
+	chr			c;
 	assert(regex != NULL && regex->re_magic == REMAGIC);
 	cm = &((struct guts *) regex->re_guts)->cmap;
@@ -231,62 +236,17 @@ pg_reg_getcharacters(const regex_t *regex, int co,
 	if (cm->cd[co].flags & PSEUDO)
 		return;
-	/* Recursively search the colormap tree */
-	scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len);
-}
-/*
- * Recursively scan the colormap tree to find chrs belonging to color "co".
- * See regex/README for info about the tree structure.
- *
- * t: tree block to scan
- * level: level (from 0) of t
- * partial: partial chr code for chrs within t
- * chars, chars_len: output area
- */
-static void
-scancolormap(struct colormap * cm, int co,
-			 union tree * t, int level, chr partial,
-			 pg_wchar **chars, int *chars_len)
-{
-	int			i;
-	if (level < NBYTS - 1)
-	{
-		/* non-leaf node */
-		for (i = 0; i < BYTTAB; i++)
-		{
-			/*
-			 * We do not support search for chrs of color 0 (WHITE), so
-			 * all-white subtrees need not be searched. These can be
-			 * recognized because they are represented by the fill blocks in
-			 * the colormap struct. This typically allows us to avoid
-			 * scanning large regions of higher-numbered chrs.
-			 */
-			if (t->tptr[i] == &cm->tree[level + 1])
-				continue;
-			/* Recursively scan next level down */
-			scancolormap(cm, co,
-						 t->tptr[i], level + 1,
-						 (partial | (chr) i) << BYTBITS,
-						 chars, chars_len);
-		}
-	}
-	else
-	{
-		/* leaf node */
-		for (i = 0; i < BYTTAB; i++)
-		{
-			if (t->tcolor[i] == co)
-			{
-				if (*chars_len > 0)
-				{
-					**chars = partial | (chr) i;
-					(*chars)++;
-					(*chars_len)--;
-				}
-			}
-		}
-	}
+	/*
+	 * We need only examine the low character map; there should not be any
+	 * matching entries in the high map.
+	 */
+	for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
+	{
+		if (cm->locolormap[c - CHR_MIN] == co)
+		{
+			*chars++ = c;
+			if (--chars_len == 0)
+				break;
+		}
+	}
 }
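The replacement of the recursive tree walk by a linear scan can be sketched on its own. This is an illustration under assumptions, not the committed function: `sk_collect` and its parameters are hypothetical, the map is a plain `short` array, and, unlike the loop in the commit, it checks the remaining capacity before every store so that a zero-length output buffer is also handled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the codes whose entry in the low color map equals "co".
 * Stores at most outlen codes in out[] and returns how many were stored.
 */
static size_t
sk_collect(const short *lomap, size_t maplen, short co,
		   unsigned *out, size_t outlen)
{
	size_t		n = 0;
	size_t		c;

	assert(lomap != NULL && out != NULL);
	/* stop when the map is exhausted or the output area is full */
	for (c = 0; c < maplen && n < outlen; c++)
	{
		if (lomap[c] == co)
			out[n++] = (unsigned) c;
	}
	return n;
}
```

A caller would normally size the output area from the member count reported for the color and skip the call when that count is reported as uncertain (-1), since colors with high-colormap entries cannot be enumerated this way.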
@@ -194,7 +194,10 @@ findprefix(struct cnfa * cnfa,
 		if (thiscolor == COLORLESS)
 			break;
 		/* The color must be a singleton */
-		if (cm->cd[thiscolor].nchrs != 1)
+		if (cm->cd[thiscolor].nschrs != 1)
+			break;
+		/* Must not have any high-color-map entries */
+		if (cm->cd[thiscolor].nuchrs != 0)
 			break;
 		/*
......
@@ -77,6 +77,16 @@ typedef unsigned uchr; /* unsigned type that will hold a chr */
  */
 #define CHR_IS_IN_RANGE(c) ((c) <= CHR_MAX)
+/*
+ * MAX_SIMPLE_CHR is the cutoff between "simple" and "complicated" processing
+ * in the color map logic. It should usually be chosen high enough to ensure
+ * that all common characters are <= MAX_SIMPLE_CHR. However, very large
+ * values will be counterproductive since they cause more regex setup time.
+ * Also, small values can be helpful for testing the high-color-map logic
+ * with plain old ASCII input.
+ */
+#define MAX_SIMPLE_CHR 0x7FF /* suitable value for Unicode */
 /* functions operating on chr */
 #define iscalnum(x) pg_wc_isalnum(x)
 #define iscalpha(x) pg_wc_isalpha(x)
......
@@ -172,6 +172,5 @@ extern int pg_regexec(regex_t *, const pg_wchar *, size_t, size_t, rm_detail_t *
 extern int pg_regprefix(regex_t *, pg_wchar **, size_t *);
 extern void pg_regfree(regex_t *);
 extern size_t pg_regerror(int, const regex_t *, char *, size_t);
-extern void pg_set_regex_collation(Oid collation);
 #endif /* _REGEX_H_ */
@@ -127,23 +127,6 @@
 #define ISBSET(uv, sn) ((uv)[(sn)/UBITS] & ((unsigned)1 << ((sn)%UBITS)))
-/*
- * We dissect a chr into byts for colormap table indexing. Here we define
- * a byt, which will be the same as a byte on most machines... The exact
- * size of a byt is not critical, but about 8 bits is good, and extraction
- * of 8-bit chunks is sometimes especially fast.
- */
-#ifndef BYTBITS
-#define BYTBITS 8 /* bits in a byt */
-#endif
-#define BYTTAB (1<<BYTBITS) /* size of table with one entry per byt value */
-#define BYTMASK (BYTTAB-1) /* bit mask for byt */
-#define NBYTS ((CHRBITS+BYTBITS-1)/BYTBITS)
-/* the definition of GETCOLOR(), below, assumes NBYTS <= 4 */
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
@@ -153,42 +136,9 @@ typedef short color; /* colors of characters */
 #define MAX_COLOR 32767 /* max color (must fit in 'color' datatype) */
 #define COLORLESS (-1) /* impossible color */
 #define WHITE 0 /* default color, parent of all others */
+/* Note: various places in the code know that WHITE is zero */
-/*
- * A colormap is a tree -- more precisely, a DAG -- indexed at each level
- * by a byt of the chr, to map the chr to a color efficiently. Because
- * lower sections of the tree can be shared, it can exploit the usual
- * sparseness of such a mapping table. The tree is always NBYTS levels
- * deep (in the past it was shallower during construction but was "filled"
- * to full depth at the end of that); areas that are unaltered as yet point
- * to "fill blocks" which are entirely WHITE in color.
- *
- * Leaf-level tree blocks are of type "struct colors", while upper-level
- * blocks are of type "struct ptrs". Pointers into the tree are generally
- * declared as "union tree *" to be agnostic about what level they point to.
- */
-/* the tree itself */
-struct colors
-{
-	color		ccolor[BYTTAB];
-};
-struct ptrs
-{
-	union tree *pptr[BYTTAB];
-};
-union tree
-{
-	struct colors colors;
-	struct ptrs ptrs;
-};
-/* use these pseudo-field names when dereferencing a "union tree" pointer */
-#define tcolor colors.ccolor
-#define tptr ptrs.pptr
 /*
  * Per-color data structure for the compile-time color machinery
  *
@@ -203,26 +153,56 @@ union tree
  */
 struct colordesc
 {
-	uchr		nchrs; /* number of chars of this color */
+	int			nschrs; /* number of simple chars of this color */
+	int			nuchrs; /* number of upper map entries of this color */
 	color		sub; /* open subcolor, if any; or free-chain ptr */
 #define NOSUB COLORLESS /* value of "sub" when no open subcolor */
 	struct arc *arcs; /* chain of all arcs of this color */
-	chr			firstchr; /* char first assigned to this color */
+	chr			firstchr; /* simple char first assigned to this color */
 	int			flags; /* bit values defined next */
 #define FREECOL 01 /* currently free */
 #define PSEUDO 02 /* pseudocolor, no real chars */
 #define UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
-	union tree *block; /* block of solid color, if any */
 };
 /*
  * The color map itself
  *
- * Much of the data in the colormap struct is only used at compile time.
- * However, the bulk of the space usage is in the "tree" structure, so it's
- * not clear that there's much point in converting the rest to a more compact
- * form when compilation is finished.
+ * This struct holds both data used only at compile time, and the chr to
+ * color mapping information, used at both compile and run time. The latter
+ * is the bulk of the space, so it's not really worth separating out the
+ * compile-only portion.
+ *
+ * Ideally, the mapping data would just be an array of colors indexed by
+ * chr codes; but for large character sets that's impractical. Fortunately,
+ * common characters have smaller codes, so we can use a simple array for chr
+ * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above
+ * that, without much loss of performance. The "something more complex" is a
+ * 2-D array of color entries, where row indexes correspond to individual chrs
+ * or chr ranges that have been mentioned in the regex (with row zero
+ * representing all other chrs), and column indexes correspond to different
+ * sets of locale-dependent character classes such as "isalpha". The
+ * classbits[k] entry is zero if we do not care about the k'th character class
+ * in this regex, and otherwise it is the bit to be OR'd into the column index
+ * if the character in question is a member of that class. We find the color
+ * of a high-valued chr by identifying which colormaprange it is in to get
+ * the row index (use row zero if it's in none of them), identifying which of
+ * the interesting cclasses it's in to get the column index, and then indexing
+ * into the 2-D hicolormap array.
+ *
+ * The colormapranges are required to be nonempty, nonoverlapping, and to
+ * appear in increasing chr-value order.
  */
+#define NUM_CCLASSES 13 /* must match data in regc_locale.c */
+typedef struct colormaprange
+{
+	chr			cmin; /* range represents cmin..cmax inclusive */
+	chr			cmax;
+	int			rownum; /* row index in hicolormap array (>= 1) */
+} colormaprange;
 struct colormap
 {
 	int			magic;
@@ -233,27 +213,27 @@ struct colormap
     color free; /* beginning of free chain (if non-0) */
     struct colordesc *cd; /* pointer to array of colordescs */
 #define CDEND(cm) (&(cm)->cd[(cm)->max + 1])
+    /* mapping data for chrs <= MAX_SIMPLE_CHR: */
+    color *locolormap; /* simple array indexed by chr code */
+    /* mapping data for chrs > MAX_SIMPLE_CHR: */
+    int classbits[NUM_CCLASSES]; /* see comment above */
+    int numcmranges; /* number of colormapranges */
+    colormaprange *cmranges; /* ranges of high chrs */
+    color *hicolormap; /* 2-D array of color entries */
+    int maxarrayrows; /* number of array rows allocated */
+    int hiarrayrows; /* number of array rows in use */
+    int hiarraycols; /* number of array columns (2^N) */
     /* If we need up to NINLINECDS, we store them here to save a malloc */
-#define NINLINECDS ((size_t)10)
+#define NINLINECDS ((size_t) 10)
     struct colordesc cdspace[NINLINECDS];
-    union tree tree[NBYTS]; /* tree top, plus lower-level fill blocks */
 };
-/* optimization magic to do fast chr->color mapping */
-#define B0(c) ((c) & BYTMASK)
-#define B1(c) (((c)>>BYTBITS) & BYTMASK)
-#define B2(c) (((c)>>(2*BYTBITS)) & BYTMASK)
-#define B3(c) (((c)>>(3*BYTBITS)) & BYTMASK)
-#if NBYTS == 1
-#define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)])
-#endif
-/* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */
-#if NBYTS == 2
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)])
-#endif
-#if NBYTS == 4
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)])
-#endif
+/* fetch color for chr; beware of multiple evaluation of c argument */
+#define GETCOLOR(cm, c) \
+    ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))
 /*
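Reviewer note: the new GETCOLOR() comment warns about multiple evaluation of its c argument. The sketch below illustrates that caveat under simplified assumptions. The chr and color typedefs, the values chosen for CHR_MIN and MAX_SIMPLE_CHR, the one-field struct colormap, the counting stub for pg_reg_getcolor(), and the helper color_of_next() are all stand-ins invented for illustration; only the macro body is copied from the hunk above.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int chr;       /* stand-in for the regex chr type */
typedef short color;            /* stand-in for the regex color type */

#define CHR_MIN 0               /* assumed value, for illustration only */
#define MAX_SIMPLE_CHR 0x7FF    /* assumed cutoff, for illustration only */

struct colormap
{
    color *locolormap;          /* simple array indexed by chr code */
};

static int slow_calls = 0;      /* how often the slow path was taken */

/* Stub for the slow path: counts calls and returns a dummy color. */
static color
pg_reg_getcolor(struct colormap *cm, chr c)
{
    (void) cm;
    (void) c;
    slow_calls++;
    return 0;
}

/* Macro body as in the diff: c is evaluated more than once. */
#define GETCOLOR(cm, c) \
    ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))

/*
 * Hypothetical helper: fetch the next chr into a local variable first, so
 * that the macro argument has no side effects.  Writing
 * GETCOLOR(cm, *(*pp)++) instead would advance the pointer twice.
 */
static color
color_of_next(struct colormap *cm, const chr **pp)
{
    chr c = *(*pp)++;

    return GETCOLOR(cm, c);
}
```

Codes at or below the assumed cutoff are resolved by a single array index; anything higher falls through to the function call.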
@@ -264,6 +244,11 @@ struct colormap
  * Representation of a set of characters. chrs[] represents individual
  * code points, ranges[] represents ranges in the form min..max inclusive.
  *
+ * If the cvec represents a locale-specific character class, eg [[:alpha:]],
+ * then the chrs[] and ranges[] arrays contain only members of that class
+ * up to MAX_SIMPLE_CHR (inclusive). cclasscode is set to regc_locale.c's
+ * code for the class, rather than being -1 as it is in an ordinary cvec.
+ *
  * Note that in cvecs gotten from newcvec() and intended to be freed by
  * freecvec(), both arrays of chrs are after the end of the struct, not
  * separately malloc'd; so chrspace and rangespace are effectively immutable.
@@ -276,6 +261,7 @@ struct cvec
     int nranges; /* number of ranges (chr pairs) */
     int rangespace; /* number of ranges allocated in ranges[] */
     chr *ranges; /* pointer to vector of chr pairs */
+    int cclasscode; /* value of "enum classes", or -1 */
 };
@@ -489,3 +475,8 @@ struct guts
     int nlacons; /* size of lacons[]; note that only slots
                   * numbered 1 .. nlacons-1 are used */
 };
+/* prototypes for functions that are exported from regcomp.c to regexec.c */
+extern void pg_set_regex_collation(Oid collation);
+extern color pg_reg_getcolor(struct colormap * cm, chr c);
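Reviewer note: the colormap comment describes the lookup for chrs above MAX_SIMPLE_CHR in three steps: find the row, build the column index from classbits[], then index the 2-D array. The following is a minimal sketch of those steps, not the code in regcomp.c. The struct hicolormap_sketch, the functions ranges_are_valid(), find_row() and hi_getcolor(), the cclass_probe callback, the demo_in_class() stand-in for the locale tests (iswalpha() and friends), and the row-major array layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int chr;       /* stand-in for the regex chr type */
typedef short color;            /* stand-in for the regex color type */

#define NUM_CCLASSES 13

typedef struct colormaprange
{
    chr cmin;                   /* range represents cmin..cmax inclusive */
    chr cmax;
    int rownum;                 /* row index in hicolormap array (>= 1) */
} colormaprange;

/* Hypothetical cut-down view of the "high" part of struct colormap. */
struct hicolormap_sketch
{
    int classbits[NUM_CCLASSES];
    int numcmranges;
    const colormaprange *cmranges;
    const color *hicolormap;    /* assumed row-major: rows * hiarraycols */
    int hiarraycols;
};

/* Hypothetical callback: is chr c a member of character class k? */
typedef int (*cclass_probe) (int k, chr c);

/* Demo probe: pretends class 0 holds the circled Latin letters only. */
static int
demo_in_class(int k, chr c)
{
    return k == 0 && c >= 0x24B6 && c <= 0x24E9;
}

/* Check the stated invariant: nonempty, nonoverlapping, increasing. */
static int
ranges_are_valid(const colormaprange *r, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (r[i].cmin > r[i].cmax || r[i].rownum < 1)
            return 0;
        if (i > 0 && r[i].cmin <= r[i - 1].cmax)
            return 0;
    }
    return 1;
}

/* Binary search for the range containing c; row zero if there is none. */
static int
find_row(const struct hicolormap_sketch *cm, chr c)
{
    int lo = 0;
    int hi = cm->numcmranges;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        const colormaprange *r = &cm->cmranges[mid];

        if (c < r->cmin)
            hi = mid;
        else if (c > r->cmax)
            lo = mid + 1;
        else
            return r->rownum;
    }
    return 0;
}

/* Row from the ranges, column from the class bits, then index the array. */
static color
hi_getcolor(const struct hicolormap_sketch *cm, chr c, cclass_probe in_class)
{
    int row = find_row(cm, c);
    int col = 0;

    for (int k = 0; k < NUM_CCLASSES; k++)
    {
        if (cm->classbits[k] != 0 && in_class(k, c))
            col |= cm->classbits[k];
    }
    return cm->hicolormap[row * cm->hiarraycols + col];
}
```

The binary search relies on the invariant that the ranges are nonoverlapping and sorted. Classes the regex does not mention have classbits[k] equal to zero, so the membership test is skipped for them.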
/*
* This test is for Linux/glibc systems and others that implement proper
* locale classification of Unicode characters with high code values.
* It must be run in a database with UTF8 encoding and a Unicode-aware locale.
*/
SET client_encoding TO UTF8;
--
-- Test the "high colormap" logic with single characters and ranges that
-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
--
-- trivial cases:
SELECT 'aⓐ' ~ U&'a\24D0' AS t;
t
---
t
(1 row)
SELECT 'aⓐ' ~ U&'a\24D1' AS f;
f
---
f
(1 row)
SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
f
---
f
(1 row)
-- cases requiring splitting of ranges:
SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
f
---
f
(1 row)
SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
t
---
t
(1 row)
SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
f
---
f
(1 row)
SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
t
---
t
(1 row)
SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
f
---
f
(1 row)
SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
f
---
f
(1 row)
SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
f
---
f
(1 row)
SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
t
---
t
(1 row)
SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
f
---
f
(1 row)
-- Locale-dependent character classes
SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
t
---
t
(1 row)
SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
f
---
f
(1 row)
-- Locale-dependent character classes with high ranges
SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
t
---
t
(1 row)
SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
f
---
f
(1 row)
SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
t
---
t
(1 row)
SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;
f
---
f
(1 row)
/*
* This test is for Linux/glibc systems and others that implement proper
* locale classification of Unicode characters with high code values.
* It must be run in a database with UTF8 encoding and a Unicode-aware locale.
*/
SET client_encoding TO UTF8;
--
-- Test the "high colormap" logic with single characters and ranges that
-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
--
-- trivial cases:
SELECT 'aⓐ' ~ U&'a\24D0' AS t;
SELECT 'aⓐ' ~ U&'a\24D1' AS f;
SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
-- cases requiring splitting of ranges:
SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
-- Locale-dependent character classes
SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
-- Locale-dependent character classes with high ranges
SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;