Unicode Manipulation: functions operating on Unicode characters and UTF-8 strings
typedef | gunichar |
typedef | gunichar2 |
#define | G_UNICHAR_MAX_DECOMPOSITION_LENGTH |
enum | GUnicodeType |
#define | G_UNICODE_COMBINING_MARK |
enum | GUnicodeBreakType |
enum | GUnicodeScript |
enum | GNormalizeMode |
This section describes a number of functions for dealing with
Unicode characters and strings. There are analogues of the
traditional ctype.h
character classification and case conversion
functions, UTF-8 analogues of some string utility functions,
functions to perform normalization, case conversion and collation
on UTF-8 strings and finally functions to convert between the UTF-8,
UTF-16 and UCS-4 encodings of Unicode.
The implementations of the Unicode functions in GLib are based on the Unicode Character Data tables, which are available from www.unicode.org.
Unicode 4.0 was added in GLib 2.8
Unicode 4.1 was added in GLib 2.10
Unicode 5.0 was added in GLib 2.12
Unicode 5.1 was added in GLib 2.16.3
Unicode 6.0 was added in GLib 2.30
Unicode 6.1 was added in GLib 2.32
Unicode 6.2 was added in GLib 2.36
Unicode 6.3 was added in GLib 2.40
Unicode 7.0 was added in GLib 2.42
Unicode 8.0 was added in GLib 2.48
Unicode 9.0 was added in GLib 2.50.1
Unicode 10.0 was added in GLib 2.54
Unicode 11.0 was added in GLib 2.58
Unicode 12.0 was added in GLib 2.62
Unicode 12.1 was added in GLib 2.62
Unicode 13.0 was added in GLib 2.66
gboolean
g_unichar_validate (gunichar ch
);
Checks whether ch
is a valid Unicode character. Some possible
integer values of ch
will not be valid. 0 is considered a valid
character, though it's normally a string terminator.
gboolean
g_unichar_isalnum (gunichar c
);
Determines whether a character is alphanumeric.
Given some UTF-8 text, obtain a character value
with g_utf8_get_char()
.
gboolean
g_unichar_isalpha (gunichar c
);
Determines whether a character is alphabetic (i.e. a letter).
Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_iscntrl (gunichar c
);
Determines whether a character is a control character.
Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_isdefined (gunichar c
);
Determines if a given character is assigned in the Unicode standard.
gboolean
g_unichar_isdigit (gunichar c
);
Determines whether a character is numeric (i.e. a digit). This
covers ASCII 0-9 and also digits in other languages/scripts. Given
some UTF-8 text, obtain a character value with g_utf8_get_char()
.
gboolean
g_unichar_isgraph (gunichar c
);
Determines whether a character is printable and not a space
(returns FALSE
for control characters, format characters, and
spaces). g_unichar_isprint()
is similar, but returns TRUE
for
spaces. Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_islower (gunichar c
);
Determines whether a character is a lowercase letter.
Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_ismark (gunichar c
);
Determines whether a character is a mark (non-spacing mark,
combining mark, or enclosing mark in Unicode speak).
Given some UTF-8 text, obtain a character value
with g_utf8_get_char()
.
Note: in most cases where isalpha characters are allowed, ismark characters should be allowed too, as they are essential for writing most European languages as well as many non-Latin scripts.
Since: 2.14
gboolean
g_unichar_isprint (gunichar c
);
Determines whether a character is printable.
Unlike g_unichar_isgraph()
, returns TRUE
for spaces.
Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_ispunct (gunichar c
);
Determines whether a character is punctuation or a symbol.
Given some UTF-8 text, obtain a character value with
g_utf8_get_char()
.
gboolean
g_unichar_isspace (gunichar c
);
Determines whether a character is a space, tab, or line separator
(newline, carriage return, etc.). Given some UTF-8 text, obtain a
character value with g_utf8_get_char()
.
(Note: don't use this to do word breaking; you have to use Pango or equivalent to get word breaking right, because the algorithm is fairly complex.)
gboolean
g_unichar_istitle (gunichar c
);
Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
gboolean
g_unichar_isupper (gunichar c
);
Determines if a character is uppercase.
gboolean
g_unichar_isxdigit (gunichar c
);
Determines if a character is a hexadecimal digit.
gboolean
g_unichar_iswide (gunichar c
);
Determines if a character is typically rendered in a double-width cell.
gboolean
g_unichar_iswide_cjk (gunichar c
);
Determines if a character is typically rendered in a double-width
cell under legacy East Asian locales. If a character is wide according to
g_unichar_iswide()
, then it is also reported wide with this function, but
the converse is not necessarily true. See the
Unicode Standard Annex 11
for details.
If a character passes the g_unichar_iswide()
test then it will also pass
this test, but not the other way around. Note that some characters may
pass both this test and g_unichar_iszerowidth()
.
Since: 2.12
gboolean
g_unichar_iszerowidth (gunichar c
);
Determines if a given character typically takes zero width when rendered.
The return value is TRUE
for all non-spacing and enclosing marks
(e.g., combining accents), format characters, zero-width
space, but not U+00AD SOFT HYPHEN.
A typical use of this function is with one of g_unichar_iswide()
or
g_unichar_iswide_cjk()
to determine the number of cells a string occupies
when displayed on a grid display (terminals). However, note that not all
terminals support zero-width rendering of zero-width marks.
Since: 2.14
gunichar
g_unichar_totitle (gunichar c
);
Converts a character to titlecase.
gint
g_unichar_digit_value (gunichar c
);
Determines the numeric value of a character as a decimal digit.
If c is a decimal digit (according to g_unichar_isdigit()), returns its numeric value. Otherwise, returns -1.
gint
g_unichar_xdigit_value (gunichar c
);
Determines the numeric value of a character as a hexadecimal digit.
gboolean g_unichar_compose (gunichar a
,gunichar b
,gunichar *ch
);
Performs a single composition step of the Unicode canonical composition algorithm.
This function includes algorithmic Hangul Jamo composition,
but it is not exactly the inverse of g_unichar_decompose()
.
No composition can have either of a
or b
equal to zero.
To be precise, this function composes if and only if
there exists a Primary Composite P which is canonically
equivalent to the sequence <a
,b
>. See the Unicode
Standard for the definition of Primary Composite.
If a
and b
do not compose a new character, ch
is set to zero.
See UAX15 for details.
a |
a Unicode character |
|
b |
a Unicode character |
|
ch |
return location for the composed character. |
[out][not optional] |
Since: 2.30
gboolean g_unichar_decompose (gunichar ch
,gunichar *a
,gunichar *b
);
Performs a single decomposition step of the Unicode canonical decomposition algorithm.
This function does not include compatibility
decompositions. It does, however, include algorithmic
Hangul Jamo decomposition, as well as 'singleton'
decompositions which replace a character by a single
other character. In the case of singletons *b
will
be set to zero.
If ch
is not decomposable, *a
is set to ch
and *b
is set to zero.
Note that the way Unicode decomposition pairs are
defined, it is guaranteed that b
would not decompose
further, but a
may itself decompose. To get the full
canonical decomposition for ch
, one would need to
recursively call this function on a
. Or use
g_unichar_fully_decompose()
.
See UAX15 for details.
ch |
a Unicode character |
|
a |
return location for the first component of |
[out][not optional] |
b |
return location for the second component of |
[out][not optional] |
Since: 2.30
gsize g_unichar_fully_decompose (gunichar ch
,gboolean compat
,gunichar *result
,gsize result_len
);
Computes the canonical or compatibility decomposition of a
Unicode character. For compatibility decomposition,
pass TRUE
for compat
; for canonical decomposition
pass FALSE
for compat
.
The decomposed sequence is placed in result
. Only up to
result_len
characters are written into result
. The length
of the full decomposition (irrespective of result_len
) is
returned by the function. For canonical decomposition,
currently all decompositions are of length at most 4, but
this may change in the future (very unlikely though).
At any rate, Unicode does guarantee that a buffer of length
18 is always enough for both compatibility and canonical
decompositions, so that is the size recommended. This is provided
as G_UNICHAR_MAX_DECOMPOSITION_LENGTH
.
See UAX15 for details.
ch |
a Unicode character. |
|
compat |
whether to perform canonical or compatibility decomposition |
|
result |
location to store decomposed result, or |
[optional][out caller-allocates] |
result_len |
length of |
Since: 2.30
GUnicodeBreakType
g_unichar_break_type (gunichar c
);
Determines the break type of c
. c
should be a Unicode character
(to derive a character from UTF-8 encoded text, use
g_utf8_get_char()
). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango_break() instead of caring about break types yourself.
gint
g_unichar_combining_class (gunichar uc
);
Determines the canonical combining class of a Unicode character.
Since: 2.14
void g_unicode_canonical_ordering (gunichar *string
,gsize len
);
Computes the canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information.
gunichar * g_unicode_canonical_decomposition (gunichar ch
,gsize *result_len
);
g_unicode_canonical_decomposition
has been deprecated since version 2.30 and should not be used in newly-written code.
Use the more flexible g_unichar_fully_decompose()
instead.
Computes the canonical decomposition of a Unicode character.
gboolean g_unichar_get_mirror_char (gunichar ch
,gunichar *mirrored_ch
);
In Unicode, some characters are "mirrored". This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.
If ch has the Unicode mirrored property, there is another Unicode character whose glyph is typically the mirror image of ch's glyph, and mirrored_ch is set, this function stores that character at the address pointed to by mirrored_ch. Otherwise, the original character is stored.
Since: 2.4
GUnicodeScript
g_unichar_get_script (gunichar ch
);
Looks up the GUnicodeScript for a particular character (as defined
by Unicode Standard Annex #24). No check is made for ch
being a
valid Unicode character; if you pass in an invalid character, the
result is undefined.
This function is equivalent to pango_script_for_unichar()
and the
two are interchangeable.
Since: 2.14
GUnicodeScript
g_unicode_script_from_iso15924 (guint32 iso15924
);
Looks up the Unicode script for iso15924
. ISO 15924 assigns four-letter
codes to scripts. For example, the code for Arabic is 'Arab'.
This function accepts four letter codes encoded as a guint32
in a
big-endian fashion. That is, the code expected for Arabic is
0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
See Codes for the representation of names of scripts for details.
the Unicode script for iso15924, or G_UNICODE_SCRIPT_INVALID_CODE if iso15924 is zero and G_UNICODE_SCRIPT_UNKNOWN if iso15924 is unknown.
Since: 2.30
guint32
g_unicode_script_to_iso15924 (GUnicodeScript script
);
Looks up the ISO 15924 code for script
. ISO 15924 assigns four-letter
codes to scripts. For example, the code for Arabic is 'Arab'. The
four letter codes are encoded as a guint32
by this function in a
big-endian fashion. That is, the code returned for Arabic is
0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
See Codes for the representation of names of scripts for details.
the ISO 15924 code for script, encoded as an integer, or zero if script is G_UNICODE_SCRIPT_INVALID_CODE, or ISO 15924 code 'Zzzz' (script code for UNKNOWN) if script is not understood.
Since: 2.30
#define g_utf8_next_char(p)
Skips to the next character in a UTF-8 string.
The string must be valid; this macro is as fast as possible, and has no error-checking.
You would use this macro to iterate over a string character by character.
The macro returns the start of the next UTF-8 character.
Before using this macro, use g_utf8_validate()
to validate strings
that may contain invalid UTF-8.
gunichar
g_utf8_get_char (const gchar *p
);
Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
If p
does not point to a valid UTF-8 encoded character, results
are undefined. If you are not sure that the bytes are complete
valid Unicode characters, you should use g_utf8_get_char_validated()
instead.
gunichar g_utf8_get_char_validated (const gchar *p
,gssize max_len
);
Convert a sequence of bytes encoded as UTF-8 to a Unicode character. This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters.
Note that g_utf8_get_char_validated()
returns (gunichar)-2 if
max_len
is positive and any of the bytes in the first UTF-8 character
sequence are nul.
gchar * g_utf8_offset_to_pointer (const gchar *str
,glong offset
);
Converts from an integer character offset to a pointer to a position within the string.
Since 2.10, this function allows passing a negative offset
to
step backwards. It is usually worth stepping backwards from the end
instead of forwards if offset
is in the last fourth of the string,
since moving forward is about 3 times faster than moving backward.
Note that this function doesn't abort when reaching the end of str
.
Therefore you should be sure that offset
is within string boundaries
before calling this function. Call g_utf8_strlen()
when unsure.
This limitation exists as this function is called frequently during
text rendering and therefore has to be as fast as possible.
glong g_utf8_pointer_to_offset (const gchar *str
,const gchar *pos
);
Converts from a pointer to position within a string to an integer character offset.
Since 2.10, this function allows pos
to be before str
, and returns
a negative offset in this case.
gchar *
g_utf8_prev_char (const gchar *p
);
Finds the previous UTF-8 character in the string before p
.
p
does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte. If p
might be the first
character of the string, you must use g_utf8_find_prev_char()
instead.
gchar * g_utf8_find_next_char (const gchar *p
,const gchar *end
);
Finds the start of the next UTF-8 character in the string after p
.
p
does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte.
If end
is NULL
, the return value will never be NULL
: if the end of the
string is reached, a pointer to the terminating nul byte is returned. If
end
is non-NULL
, the return value will be NULL
if the end of the string
is reached.
p |
a pointer to a position within a UTF-8 encoded string |
|
end |
a pointer to the byte following the end of the string,
or |
[nullable] |
a pointer to the found character or NULL
if end
is
set and is reached.
[transfer none][nullable]
gchar * g_utf8_find_prev_char (const gchar *str
,const gchar *p
);
Given a position p
within a UTF-8 encoded string str
, find the start
of the previous UTF-8 character starting before p
. Returns NULL
if no
UTF-8 characters are present in str
before p
.
p
does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte.
glong g_utf8_strlen (const gchar *p
,gssize max
);
Computes the length of the string in characters, not including
the terminating nul character. If the max
'th byte falls in the
middle of a character, the last (partial) character is not counted.
p |
pointer to the start of a UTF-8 encoded string |
|
max |
the maximum number of bytes to examine. If |
gchar * g_utf8_strncpy (gchar *dest
,const gchar *src
,gsize n
);
Like the standard C strncpy()
function, but copies a given number
of characters instead of a given number of bytes. The src
string
must be valid UTF-8 encoded text. (Use g_utf8_validate()
on all
text before trying to use UTF-8 utility functions with it.)
Note you must ensure dest is at least 4 * n + 1 bytes long to fit the largest possible UTF-8 characters and the terminating nul byte.
gchar * g_utf8_strchr (const gchar *p
,gssize len
,gunichar c
);
Finds the leftmost occurrence of the given Unicode character
in a UTF-8 encoded string, while limiting the search to len
bytes.
If len
is -1, allow unbounded search.
p |
a nul-terminated UTF-8 encoded string |
|
len |
the maximum length of |
|
c |
a Unicode character |
NULL
if the string does not contain the character,
otherwise, a pointer to the start of the leftmost occurrence
of the character in the string.
[transfer none][nullable]
gchar * g_utf8_strrchr (const gchar *p
,gssize len
,gunichar c
);
Finds the rightmost occurrence of the given Unicode character
in a UTF-8 encoded string, while limiting the search to len
bytes.
If len
is -1, allow unbounded search.
p |
a nul-terminated UTF-8 encoded string |
|
len |
the maximum length of |
|
c |
a Unicode character |
NULL
if the string does not contain the character,
otherwise, a pointer to the start of the rightmost occurrence
of the character in the string.
[transfer none][nullable]
gchar * g_utf8_strreverse (const gchar *str
,gssize len
);
Reverses a UTF-8 string. str
must be valid UTF-8 encoded text.
(Use g_utf8_validate()
on all text before trying to use UTF-8
utility functions with it.)
This function is intended for programmatic uses of reversed strings. It pays no attention to decomposed characters, combining marks, byte order marks, directional indicators (LRM, LRO, etc) and similar characters which might need special handling when reversing a string for display purposes.
Note that unlike g_strreverse()
, this function returns
newly-allocated memory, which should be freed with g_free()
when
no longer needed.
str |
a UTF-8 encoded string |
|
len |
the maximum length of |
Since: 2.2
gchar * g_utf8_substring (const gchar *str
,glong start_pos
,glong end_pos
);
Copies a substring out of a UTF-8 encoded string.
The substring will contain end_pos
- start_pos
characters.
str |
a UTF-8 encoded string |
|
start_pos |
a character offset within |
|
end_pos |
another character offset within |
a newly allocated copy of the requested
substring. Free with g_free()
when no longer needed.
[transfer full]
Since: 2.30
gboolean g_utf8_validate (const gchar *str
,gssize max_len
,const gchar **end
);
Validates UTF-8 encoded text. str
is the text to validate;
if str
is nul-terminated, then max_len
can be -1, otherwise
max_len
should be the number of bytes to validate.
If end
is non-NULL
, then the end of the valid range
will be stored there (i.e. the start of the first invalid
character if some bytes were invalid, or the end of the text
being validated otherwise).
Note that g_utf8_validate()
returns FALSE
if max_len
is
positive and any of the max_len
bytes are nul.
Returns TRUE
if all of str
was valid. Many GLib and GTK+
routines require valid UTF-8 as input; so data read from a file
or the network should be checked with g_utf8_validate()
before
doing anything else with it.
gboolean g_utf8_validate_len (const gchar *str
,gsize max_len
,const gchar **end
);
Validates UTF-8 encoded text.
As with g_utf8_validate()
, but max_len
must be set, and hence this function
will always return FALSE
if any of the bytes of str
are nul.
str |
a pointer to character data. |
[array length=max_len][element-type guint8] |
max_len |
max bytes to validate |
|
end |
return location for end of valid data. |
[out][optional][transfer none] |
Since: 2.60
gchar * g_utf8_make_valid (const gchar *str
,gssize len
);
If the provided string is valid UTF-8, return a copy of it. If not, return a copy in which bytes that could not be interpreted as valid Unicode are replaced with the Unicode replacement character (U+FFFD).
For example, this is an appropriate function to use if you have received a string that was incorrectly declared to be UTF-8, and you need a valid UTF-8 version of it that can be logged or displayed to the user, with the assumption that it is close enough to ASCII or UTF-8 to be mostly readable as-is.
str |
string to coerce into UTF-8 |
|
len |
the maximum length of |
Since: 2.52
gchar * g_utf8_strup (const gchar *str
,gssize len
);
Converts all Unicode characters in the string that have a case to uppercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
gchar * g_utf8_strdown (const gchar *str
,gssize len
);
Converts all Unicode characters in the string that have a case to lowercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string changing.
gchar * g_utf8_casefold (const gchar *str
,gssize len
);
Converts a string into a form that is independent of case. The
result will not correspond to any particular case, but can be
compared for equality or ordered with the results of calling
g_utf8_casefold()
on other strings.
Note that calling g_utf8_casefold()
followed by g_utf8_collate()
is
only an approximation to the correct linguistic case insensitive
ordering, though it is a fairly good one. Getting this exactly
right would require a more sophisticated collation function that
takes case sensitivity into account. GLib does not currently
provide such a function.
gchar * g_utf8_normalize (const gchar *str
,gssize len
,GNormalizeMode mode
);
Converts a string into canonical form, standardizing
such issues as whether a character with an accent
is represented as a base character and combining
accent or as a single precomposed character. The
string has to be valid UTF-8, otherwise NULL
is
returned. You should generally call g_utf8_normalize()
before comparing two Unicode strings.
The normalization mode G_NORMALIZE_DEFAULT
only
standardizes differences that do not affect the
text content, such as the above-mentioned accent
representation. G_NORMALIZE_ALL
also standardizes
the "compatibility" characters in Unicode, such
as SUPERSCRIPT THREE to the standard forms
(in this case DIGIT THREE). Formatting information
may be lost but for most text operations such
characters should be considered the same.
G_NORMALIZE_DEFAULT_COMPOSE
and G_NORMALIZE_ALL_COMPOSE
are like G_NORMALIZE_DEFAULT
and G_NORMALIZE_ALL
,
but return a result with composed forms rather
than a maximally decomposed form. This is often
useful if you intend to convert the string to
a legacy encoding or pass it to a system with
less capable Unicode handling.
str |
a UTF-8 encoded string. |
|
len |
length of |
|
mode |
the type of normalization to perform. |
a newly allocated string, that
is the normalized form of str
, or NULL
if str
is not valid UTF-8.
[nullable]
gint g_utf8_collate (const gchar *str1
,const gchar *str2
);
Compares two strings for ordering using the linguistically
correct rules for the current locale.
When sorting a large number of strings, it will be significantly
faster to obtain collation keys with g_utf8_collate_key()
and
compare the keys with strcmp()
when sorting instead of sorting
the original strings.
gchar * g_utf8_collate_key (const gchar *str
,gssize len
);
Converts a string into a collation key that can be compared
with other collation keys produced by the same function using
strcmp()
.
The results of comparing the collation keys of two strings
with strcmp()
will always be the same as comparing the two
original strings with g_utf8_collate()
.
Note that this function depends on the current locale.
str |
a UTF-8 encoded string. |
|
len |
length of |
a newly allocated string. This string should
be freed with g_free()
when you are done with it.
gchar * g_utf8_collate_key_for_filename (const gchar *str
,gssize len
);
Converts a string into a collation key that can be compared
with other collation keys produced by the same function using strcmp()
.
In order to sort filenames correctly, this function treats the dot '.' as a special case. Most dictionary orderings seem to consider it insignificant, thus producing the ordering "event.c" "eventgenerator.c" "event.h" instead of "event.c" "event.h" "eventgenerator.c". Also, we would like to treat numbers intelligently so that "file1" "file10" "file5" is sorted as "file1" "file5" "file10".
Note that this function depends on the current locale.
str |
a UTF-8 encoded string. |
|
len |
length of |
a newly allocated string. This string should
be freed with g_free()
when you are done with it.
Since: 2.8
gunichar2 * g_utf8_to_utf16 (const gchar *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from UTF-8 to UTF-16. A 0 character will be added to the result after the converted text.
str |
a UTF-8 encoded string |
|
len |
the maximum length (number of bytes) of |
|
items_read |
location to store number of
bytes read, or |
[out][optional] |
items_written |
location to store number
of gunichar2 written, or |
[out][optional] |
error |
location to store the error occurring, or |
gunichar * g_utf8_to_ucs4 (const gchar *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4. A trailing 0 character will be added to the string after the converted text.
str |
a UTF-8 encoded string |
|
len |
the maximum length of |
|
items_read |
location to store number of
bytes read, or |
[out][optional] |
items_written |
location to store number
of characters written or |
[out][optional] |
error |
location to store the error occurring, or |
gunichar * g_utf8_to_ucs4_fast (const gchar *str
,glong len
,glong *items_written
);
Convert a string from UTF-8 to a 32-bit fixed width
representation as UCS-4, assuming valid UTF-8 input.
This function is roughly twice as fast as g_utf8_to_ucs4()
but does no error checking on the input. A trailing 0 character
will be added to the string after the converted text.
str |
a UTF-8 encoded string |
|
len |
the maximum length of |
|
items_written |
location to store the
number of characters in the result, or |
[out][optional] |
a pointer to a newly allocated UCS-4 string.
This value must be freed with g_free()
.
[transfer full]
gunichar * g_utf16_to_ucs4 (const gunichar2 *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from UTF-16 to UCS-4. The result will be nul-terminated.
str |
a UTF-16 encoded string |
|
len |
the maximum length (number of gunichar2) of |
|
items_read |
location to store number of
words read, or |
[out][optional] |
items_written |
location to store number
of characters written, or |
[out][optional] |
error |
location to store the error occurring, or |
gchar * g_utf16_to_utf8 (const gunichar2 *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from UTF-16 to UTF-8. The result will be terminated with a 0 byte.
Note that the input is expected to be already in native endianness,
an initial byte-order-mark character is not handled specially.
g_convert()
can be used to convert a byte buffer of UTF-16 data of
ambiguous endianness.
Further note that this function does not validate the result string; it may e.g. include embedded NUL characters. The only validation done by this function is to ensure that the input can be correctly interpreted as UTF-16, i.e. it doesn't contain unpaired surrogates or partial character sequences.
str |
a UTF-16 encoded string |
|
len |
the maximum length (number of gunichar2) of |
|
items_read |
location to store number of
words read, or |
[out][optional] |
items_written |
location to store number
of bytes written, or |
[out][optional] |
error |
location to store the error occurring, or |
gunichar2 * g_ucs4_to_utf16 (const gunichar *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from UCS-4 to UTF-16. A 0 character will be added to the result after the converted text.
str |
a UCS-4 encoded string |
|
len |
the maximum length (number of characters) of |
|
items_read |
location to store number of
bytes read, or |
[out][optional] |
items_written |
location to store number
of gunichar2 written, or |
[out][optional] |
error |
location to store the error occurring, or |
gchar * g_ucs4_to_utf8 (const gunichar *str
,glong len
,glong *items_read
,glong *items_written
,GError **error
);
Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8. The result will be terminated with a 0 byte.
str |
a UCS-4 encoded string |
|
len |
the maximum length (number of characters) of |
|
items_read |
location to store number of
characters read, or |
[out][optional] |
items_written |
location to store number
of bytes written or |
[out][optional] |
error |
location to store the error occurring, or |
gint g_unichar_to_utf8 (gunichar c
,gchar *outbuf
);
Converts a single character to UTF-8.
c |
a Unicode character code |
|
outbuf |
output buffer, must have at
least 6 bytes of space. If |
[out caller-allocates][optional] |
typedef guint32 gunichar;
A type which can hold any UTF-32 or UCS-4 character code, also known as a Unicode code point.
If you want to produce the UTF-8 representation of a gunichar,
use g_ucs4_to_utf8()
. See also g_utf8_to_ucs4()
for the reverse
process.
To print/scan values of this type as integer, use
G_GINT32_MODIFIER
and/or G_GUINT32_FORMAT
.
The notation to express a Unicode code point in running text is as a hexadecimal number with four to six digits and uppercase letters, prefixed by the string "U+". Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits. For example, "U+0041 LATIN CAPITAL LETTER A". To print a code point in the U+-notation, use the format string "U+%04" G_GINT32_MODIFIER "X". To scan, use the format string "U+%06" G_GINT32_MODIFIER "X".
gunichar c;
sscanf ("U+0041", "U+%06" G_GINT32_MODIFIER "X", &c);
g_print ("Read U+%04" G_GINT32_MODIFIER "X", c);
typedef guint16 gunichar2;
A type which can hold any UTF-16 code point. UTF-16 also has so-called surrogate pairs to encode characters beyond the BMP as pairs of 16-bit numbers. Surrogate pairs cannot be stored in a single gunichar2 field, but all GLib functions accepting gunichar2 arrays will correctly interpret surrogate pairs.
To print/scan values of this type to/from text you need to convert
to/from UTF-8, using g_utf16_to_utf8()
/g_utf8_to_utf16()
.
To print/scan values of this type as integer, use
G_GINT16_MODIFIER
and/or G_GUINT16_FORMAT
.
#define G_UNICHAR_MAX_DECOMPOSITION_LENGTH 18 /* codepoints */
The maximum length (in codepoints) of a compatibility or canonical decomposition of a single Unicode character.
This is as defined by Unicode 6.1.
Since: 2.32
The GUnicodeType enumeration lists the possible character classifications from the Unicode specification. See Unicode Character Database.
G_UNICODE_CONTROL | General category "Other, Control" (Cc) |
G_UNICODE_FORMAT | General category "Other, Format" (Cf) |
G_UNICODE_UNASSIGNED | General category "Other, Not Assigned" (Cn) |
G_UNICODE_PRIVATE_USE | General category "Other, Private Use" (Co) |
G_UNICODE_SURROGATE | General category "Other, Surrogate" (Cs) |
G_UNICODE_LOWERCASE_LETTER | General category "Letter, Lowercase" (Ll) |
G_UNICODE_MODIFIER_LETTER | General category "Letter, Modifier" (Lm) |
G_UNICODE_OTHER_LETTER | General category "Letter, Other" (Lo) |
G_UNICODE_TITLECASE_LETTER | General category "Letter, Titlecase" (Lt) |
G_UNICODE_UPPERCASE_LETTER | General category "Letter, Uppercase" (Lu) |
G_UNICODE_SPACING_MARK | General category "Mark, Spacing" (Mc) |
G_UNICODE_ENCLOSING_MARK | General category "Mark, Enclosing" (Me) |
G_UNICODE_NON_SPACING_MARK | General category "Mark, Nonspacing" (Mn) |
G_UNICODE_DECIMAL_NUMBER | General category "Number, Decimal Digit" (Nd) |
G_UNICODE_LETTER_NUMBER | General category "Number, Letter" (Nl) |
G_UNICODE_OTHER_NUMBER | General category "Number, Other" (No) |
G_UNICODE_CONNECT_PUNCTUATION | General category "Punctuation, Connector" (Pc) |
G_UNICODE_DASH_PUNCTUATION | General category "Punctuation, Dash" (Pd) |
G_UNICODE_CLOSE_PUNCTUATION | General category "Punctuation, Close" (Pe) |
G_UNICODE_FINAL_PUNCTUATION | General category "Punctuation, Final quote" (Pf) |
G_UNICODE_INITIAL_PUNCTUATION | General category "Punctuation, Initial quote" (Pi) |
G_UNICODE_OTHER_PUNCTUATION | General category "Punctuation, Other" (Po) |
G_UNICODE_OPEN_PUNCTUATION | General category "Punctuation, Open" (Ps) |
G_UNICODE_CURRENCY_SYMBOL | General category "Symbol, Currency" (Sc) |
G_UNICODE_MODIFIER_SYMBOL | General category "Symbol, Modifier" (Sk) |
G_UNICODE_MATH_SYMBOL | General category "Symbol, Math" (Sm) |
G_UNICODE_OTHER_SYMBOL | General category "Symbol, Other" (So) |
G_UNICODE_LINE_SEPARATOR | General category "Separator, Line" (Zl) |
G_UNICODE_PARAGRAPH_SEPARATOR | General category "Separator, Paragraph" (Zp) |
G_UNICODE_SPACE_SEPARATOR | General category "Separator, Space" (Zs) |
#define G_UNICODE_COMBINING_MARK G_UNICODE_SPACING_MARK GLIB_DEPRECATED_MACRO_IN_2_30_FOR(G_UNICODE_SPACING_MARK)
G_UNICODE_COMBINING_MARK has been deprecated since version 2.30 and should not be used in newly-written code.
Older name for G_UNICODE_SPACING_MARK.
The GUnicodeBreakType enumeration lists the possible line break classifications.
Since new Unicode versions may add new types here, applications should be ready to handle unknown values. They may be regarded as G_UNICODE_BREAK_UNKNOWN.
See Unicode Line Breaking Algorithm.
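G_UNICODE_BREAK_MANDATORY | Mandatory Break (BK) |
G_UNICODE_BREAK_CARRIAGE_RETURN | Carriage Return (CR) |
G_UNICODE_BREAK_LINE_FEED | Line Feed (LF) |
G_UNICODE_BREAK_COMBINING_MARK | Attached Characters and Combining Marks (CM) |
G_UNICODE_BREAK_SURROGATE | Surrogates (SG) |
G_UNICODE_BREAK_ZERO_WIDTH_SPACE | Zero Width Space (ZW) |
G_UNICODE_BREAK_INSEPARABLE | Inseparable (IN) |
G_UNICODE_BREAK_NON_BREAKING_GLUE | Non-breaking ("Glue") (GL) |
G_UNICODE_BREAK_CONTINGENT | Contingent Break Opportunity (CB) |
G_UNICODE_BREAK_SPACE | Space (SP) |
G_UNICODE_BREAK_AFTER | Break Opportunity After (BA) |
G_UNICODE_BREAK_BEFORE | Break Opportunity Before (BB) |
G_UNICODE_BREAK_BEFORE_AND_AFTER | Break Opportunity Before and After (B2) |
G_UNICODE_BREAK_HYPHEN | Hyphen (HY) |
G_UNICODE_BREAK_NON_STARTER | Nonstarter (NS) |
G_UNICODE_BREAK_OPEN_PUNCTUATION | Opening Punctuation (OP) |
G_UNICODE_BREAK_CLOSE_PUNCTUATION | Closing Punctuation (CL) |
G_UNICODE_BREAK_QUOTATION | Ambiguous Quotation (QU) |
G_UNICODE_BREAK_EXCLAMATION | Exclamation/Interrogation (EX) |
G_UNICODE_BREAK_IDEOGRAPHIC | Ideographic (ID) |
G_UNICODE_BREAK_NUMERIC | Numeric (NU) |
G_UNICODE_BREAK_INFIX_SEPARATOR | Infix Separator (Numeric) (IS) |
G_UNICODE_BREAK_SYMBOL | Symbols Allowing Break After (SY) |
G_UNICODE_BREAK_ALPHABETIC | Ordinary Alphabetic and Symbol Characters (AL) |
G_UNICODE_BREAK_PREFIX | Prefix (Numeric) (PR) |
G_UNICODE_BREAK_POSTFIX | Postfix (Numeric) (PO) |
G_UNICODE_BREAK_COMPLEX_CONTEXT | Complex Context Dependent (South East Asian) (SA) |
G_UNICODE_BREAK_AMBIGUOUS | Ambiguous (Alphabetic or Ideographic) (AI) |
G_UNICODE_BREAK_UNKNOWN | Unknown (XX) |
G_UNICODE_BREAK_NEXT_LINE | Next Line (NL) |
G_UNICODE_BREAK_WORD_JOINER | Word Joiner (WJ) |
G_UNICODE_BREAK_HANGUL_L_JAMO | Hangul L Jamo (JL) |
G_UNICODE_BREAK_HANGUL_V_JAMO | Hangul V Jamo (JV) |
G_UNICODE_BREAK_HANGUL_T_JAMO | Hangul T Jamo (JT) |
G_UNICODE_BREAK_HANGUL_LV_SYLLABLE | Hangul LV Syllable (H2) |
G_UNICODE_BREAK_HANGUL_LVT_SYLLABLE | Hangul LVT Syllable (H3) |
G_UNICODE_BREAK_CLOSE_PARANTHESIS | Closing Parenthesis (CP). Since: 2.28. Deprecated: 2.70: Use G_UNICODE_BREAK_CLOSE_PARENTHESIS instead. |
G_UNICODE_BREAK_CLOSE_PARENTHESIS | Closing Parenthesis (CP). Since: 2.70 |
G_UNICODE_BREAK_CONDITIONAL_JAPANESE_STARTER | Conditional Japanese Starter (CJ). Since: 2.32 |
G_UNICODE_BREAK_HEBREW_LETTER | Hebrew Letter (HL). Since: 2.32 |
G_UNICODE_BREAK_REGIONAL_INDICATOR | Regional Indicator (RI). Since: 2.36 |
G_UNICODE_BREAK_EMOJI_BASE | Emoji Base (EB). Since: 2.50 |
G_UNICODE_BREAK_EMOJI_MODIFIER | Emoji Modifier (EM). Since: 2.50 |
G_UNICODE_BREAK_ZERO_WIDTH_JOINER | Zero Width Joiner (ZWJ). Since: 2.50 |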
The GUnicodeScript enumeration identifies different writing systems. The values correspond to the names as defined in the Unicode standard. The enumeration was added in GLib 2.14, and is interchangeable with PangoScript.
Note that new types may be added in the future. Applications should be ready to handle unknown values. See Unicode Standard Annex #24: Script names.
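G_UNICODE_SCRIPT_INVALID_CODE | a value never returned from g_unichar_get_script() |
G_UNICODE_SCRIPT_COMMON | a character used by multiple different scripts |
G_UNICODE_SCRIPT_INHERITED | a mark glyph that takes its script from the base glyph to which it is attached |
G_UNICODE_SCRIPT_ARABIC | Arabic |
G_UNICODE_SCRIPT_ARMENIAN | Armenian |
G_UNICODE_SCRIPT_BENGALI | Bengali |
G_UNICODE_SCRIPT_BOPOMOFO | Bopomofo |
G_UNICODE_SCRIPT_CHEROKEE | Cherokee |
G_UNICODE_SCRIPT_COPTIC | Coptic |
G_UNICODE_SCRIPT_CYRILLIC | Cyrillic |
G_UNICODE_SCRIPT_DESERET | Deseret |
G_UNICODE_SCRIPT_DEVANAGARI | Devanagari |
G_UNICODE_SCRIPT_ETHIOPIC | Ethiopic |
G_UNICODE_SCRIPT_GEORGIAN | Georgian |
G_UNICODE_SCRIPT_GOTHIC | Gothic |
G_UNICODE_SCRIPT_GREEK | Greek |
G_UNICODE_SCRIPT_GUJARATI | Gujarati |
G_UNICODE_SCRIPT_GURMUKHI | Gurmukhi |
G_UNICODE_SCRIPT_HAN | Han |
G_UNICODE_SCRIPT_HANGUL | Hangul |
G_UNICODE_SCRIPT_HEBREW | Hebrew |
G_UNICODE_SCRIPT_HIRAGANA | Hiragana |
G_UNICODE_SCRIPT_KANNADA | Kannada |
G_UNICODE_SCRIPT_KATAKANA | Katakana |
G_UNICODE_SCRIPT_KHMER | Khmer |
G_UNICODE_SCRIPT_LAO | Lao |
G_UNICODE_SCRIPT_LATIN | Latin |
G_UNICODE_SCRIPT_MALAYALAM | Malayalam |
G_UNICODE_SCRIPT_MONGOLIAN | Mongolian |
G_UNICODE_SCRIPT_MYANMAR | Myanmar |
G_UNICODE_SCRIPT_OGHAM | Ogham |
G_UNICODE_SCRIPT_OLD_ITALIC | Old Italic |
G_UNICODE_SCRIPT_ORIYA | Oriya |
G_UNICODE_SCRIPT_RUNIC | Runic |
G_UNICODE_SCRIPT_SINHALA | Sinhala |
G_UNICODE_SCRIPT_SYRIAC | Syriac |
G_UNICODE_SCRIPT_TAMIL | Tamil |
G_UNICODE_SCRIPT_TELUGU | Telugu |
G_UNICODE_SCRIPT_THAANA | Thaana |
G_UNICODE_SCRIPT_THAI | Thai |
G_UNICODE_SCRIPT_TIBETAN | Tibetan |
G_UNICODE_SCRIPT_CANADIAN_ABORIGINAL | Canadian Aboriginal |
G_UNICODE_SCRIPT_YI | Yi |
G_UNICODE_SCRIPT_TAGALOG | Tagalog |
G_UNICODE_SCRIPT_HANUNOO | Hanunoo |
G_UNICODE_SCRIPT_BUHID | Buhid |
G_UNICODE_SCRIPT_TAGBANWA | Tagbanwa |
G_UNICODE_SCRIPT_BRAILLE | Braille |
G_UNICODE_SCRIPT_CYPRIOT | Cypriot |
G_UNICODE_SCRIPT_LIMBU | Limbu |
G_UNICODE_SCRIPT_OSMANYA | Osmanya |
G_UNICODE_SCRIPT_SHAVIAN | Shavian |
G_UNICODE_SCRIPT_LINEAR_B | Linear B |
G_UNICODE_SCRIPT_TAI_LE | Tai Le |
G_UNICODE_SCRIPT_UGARITIC | Ugaritic |
G_UNICODE_SCRIPT_NEW_TAI_LUE | New Tai Lue |
G_UNICODE_SCRIPT_BUGINESE | Buginese |
G_UNICODE_SCRIPT_GLAGOLITIC | Glagolitic |
G_UNICODE_SCRIPT_TIFINAGH | Tifinagh |
G_UNICODE_SCRIPT_SYLOTI_NAGRI | Syloti Nagri |
G_UNICODE_SCRIPT_OLD_PERSIAN | Old Persian |
G_UNICODE_SCRIPT_KHAROSHTHI | Kharoshthi |
G_UNICODE_SCRIPT_UNKNOWN | an unassigned code point |
G_UNICODE_SCRIPT_BALINESE | Balinese |
G_UNICODE_SCRIPT_CUNEIFORM | Cuneiform |
G_UNICODE_SCRIPT_PHOENICIAN | Phoenician |
G_UNICODE_SCRIPT_PHAGS_PA | Phags-pa |
G_UNICODE_SCRIPT_NKO | N'Ko |
G_UNICODE_SCRIPT_KAYAH_LI | Kayah Li. Since: 2.16.3 |
G_UNICODE_SCRIPT_LEPCHA | Lepcha. Since: 2.16.3 |
G_UNICODE_SCRIPT_REJANG | Rejang. Since: 2.16.3 |
G_UNICODE_SCRIPT_SUNDANESE | Sundanese. Since: 2.16.3 |
G_UNICODE_SCRIPT_SAURASHTRA | Saurashtra. Since: 2.16.3 |
G_UNICODE_SCRIPT_CHAM | Cham. Since: 2.16.3 |
G_UNICODE_SCRIPT_OL_CHIKI | Ol Chiki. Since: 2.16.3 |
G_UNICODE_SCRIPT_VAI | Vai. Since: 2.16.3 |
G_UNICODE_SCRIPT_CARIAN | Carian. Since: 2.16.3 |
G_UNICODE_SCRIPT_LYCIAN | Lycian. Since: 2.16.3 |
G_UNICODE_SCRIPT_LYDIAN | Lydian. Since: 2.16.3 |
G_UNICODE_SCRIPT_AVESTAN | Avestan. Since: 2.26 |
G_UNICODE_SCRIPT_BAMUM | Bamum. Since: 2.26 |
G_UNICODE_SCRIPT_EGYPTIAN_HIEROGLYPHS | Egyptian Hieroglyphs. Since: 2.26 |
G_UNICODE_SCRIPT_IMPERIAL_ARAMAIC | Imperial Aramaic. Since: 2.26 |
G_UNICODE_SCRIPT_INSCRIPTIONAL_PAHLAVI | Inscriptional Pahlavi. Since: 2.26 |
G_UNICODE_SCRIPT_INSCRIPTIONAL_PARTHIAN | Inscriptional Parthian. Since: 2.26 |
G_UNICODE_SCRIPT_JAVANESE | Javanese. Since: 2.26 |
G_UNICODE_SCRIPT_KAITHI | Kaithi. Since: 2.26 |
G_UNICODE_SCRIPT_LISU | Lisu. Since: 2.26 |
G_UNICODE_SCRIPT_MEETEI_MAYEK | Meetei Mayek. Since: 2.26 |
G_UNICODE_SCRIPT_OLD_SOUTH_ARABIAN | Old South Arabian. Since: 2.26 |
G_UNICODE_SCRIPT_OLD_TURKIC | Old Turkic. Since: 2.28 |
G_UNICODE_SCRIPT_SAMARITAN | Samaritan. Since: 2.26 |
G_UNICODE_SCRIPT_TAI_THAM | Tai Tham. Since: 2.26 |
G_UNICODE_SCRIPT_TAI_VIET | Tai Viet. Since: 2.26 |
G_UNICODE_SCRIPT_BATAK | Batak. Since: 2.28 |
G_UNICODE_SCRIPT_BRAHMI | Brahmi. Since: 2.28 |
G_UNICODE_SCRIPT_MANDAIC | Mandaic. Since: 2.28 |
G_UNICODE_SCRIPT_CHAKMA | Chakma. Since: 2.32 |
G_UNICODE_SCRIPT_MEROITIC_CURSIVE | Meroitic Cursive. Since: 2.32 |
G_UNICODE_SCRIPT_MEROITIC_HIEROGLYPHS | Meroitic Hieroglyphs. Since: 2.32 |
G_UNICODE_SCRIPT_MIAO | Miao. Since: 2.32 |
G_UNICODE_SCRIPT_SHARADA | Sharada. Since: 2.32 |
G_UNICODE_SCRIPT_SORA_SOMPENG | Sora Sompeng. Since: 2.32 |
G_UNICODE_SCRIPT_TAKRI | Takri. Since: 2.32 |
G_UNICODE_SCRIPT_BASSA_VAH | Bassa. Since: 2.42 |
G_UNICODE_SCRIPT_CAUCASIAN_ALBANIAN | Caucasian Albanian. Since: 2.42 |
G_UNICODE_SCRIPT_DUPLOYAN | Duployan. Since: 2.42 |
G_UNICODE_SCRIPT_ELBASAN | Elbasan. Since: 2.42 |
G_UNICODE_SCRIPT_GRANTHA | Grantha. Since: 2.42 |
G_UNICODE_SCRIPT_KHOJKI | Khojki. Since: 2.42 |
G_UNICODE_SCRIPT_KHUDAWADI | Khudawadi, Sindhi. Since: 2.42 |
G_UNICODE_SCRIPT_LINEAR_A | Linear A. Since: 2.42 |
G_UNICODE_SCRIPT_MAHAJANI | Mahajani. Since: 2.42 |
G_UNICODE_SCRIPT_MANICHAEAN | Manichaean. Since: 2.42 |
G_UNICODE_SCRIPT_MENDE_KIKAKUI | Mende Kikakui. Since: 2.42 |
G_UNICODE_SCRIPT_MODI | Modi. Since: 2.42 |
G_UNICODE_SCRIPT_MRO | Mro. Since: 2.42 |
G_UNICODE_SCRIPT_NABATAEAN | Nabataean. Since: 2.42 |
G_UNICODE_SCRIPT_OLD_NORTH_ARABIAN | Old North Arabian. Since: 2.42 |
G_UNICODE_SCRIPT_OLD_PERMIC | Old Permic. Since: 2.42 |
G_UNICODE_SCRIPT_PAHAWH_HMONG | Pahawh Hmong. Since: 2.42 |
G_UNICODE_SCRIPT_PALMYRENE | Palmyrene. Since: 2.42 |
G_UNICODE_SCRIPT_PAU_CIN_HAU | Pau Cin Hau. Since: 2.42 |
G_UNICODE_SCRIPT_PSALTER_PAHLAVI | Psalter Pahlavi. Since: 2.42 |
G_UNICODE_SCRIPT_SIDDHAM | Siddham. Since: 2.42 |
G_UNICODE_SCRIPT_TIRHUTA | Tirhuta. Since: 2.42 |
G_UNICODE_SCRIPT_WARANG_CITI | Warang Citi. Since: 2.42 |
G_UNICODE_SCRIPT_AHOM | Ahom. Since: 2.48 |
G_UNICODE_SCRIPT_ANATOLIAN_HIEROGLYPHS | Anatolian Hieroglyphs. Since: 2.48 |
G_UNICODE_SCRIPT_HATRAN | Hatran. Since: 2.48 |
G_UNICODE_SCRIPT_MULTANI | Multani. Since: 2.48 |
G_UNICODE_SCRIPT_OLD_HUNGARIAN | Old Hungarian. Since: 2.48 |
G_UNICODE_SCRIPT_SIGNWRITING | Signwriting. Since: 2.48 |
G_UNICODE_SCRIPT_ADLAM | Adlam. Since: 2.50 |
G_UNICODE_SCRIPT_BHAIKSUKI | Bhaiksuki. Since: 2.50 |
G_UNICODE_SCRIPT_MARCHEN | Marchen. Since: 2.50 |
G_UNICODE_SCRIPT_NEWA | Newa. Since: 2.50 |
G_UNICODE_SCRIPT_OSAGE | Osage. Since: 2.50 |
G_UNICODE_SCRIPT_TANGUT | Tangut. Since: 2.50 |
G_UNICODE_SCRIPT_MASARAM_GONDI | Masaram Gondi. Since: 2.54 |
G_UNICODE_SCRIPT_NUSHU | Nushu. Since: 2.54 |
G_UNICODE_SCRIPT_SOYOMBO | Soyombo. Since: 2.54 |
G_UNICODE_SCRIPT_ZANABAZAR_SQUARE | Zanabazar Square. Since: 2.54 |
G_UNICODE_SCRIPT_DOGRA | Dogra. Since: 2.58 |
G_UNICODE_SCRIPT_GUNJALA_GONDI | Gunjala Gondi. Since: 2.58 |
G_UNICODE_SCRIPT_HANIFI_ROHINGYA | Hanifi Rohingya. Since: 2.58 |
G_UNICODE_SCRIPT_MAKASAR | Makasar. Since: 2.58 |
G_UNICODE_SCRIPT_MEDEFAIDRIN | Medefaidrin. Since: 2.58 |
G_UNICODE_SCRIPT_OLD_SOGDIAN | Old Sogdian. Since: 2.58 |
G_UNICODE_SCRIPT_SOGDIAN | Sogdian. Since: 2.58 |
G_UNICODE_SCRIPT_ELYMAIC | Elymaic. Since: 2.62 |
G_UNICODE_SCRIPT_NANDINAGARI | Nandinagari. Since: 2.62 |
G_UNICODE_SCRIPT_NYIAKENG_PUACHUE_HMONG | Nyiakeng Puachue Hmong. Since: 2.62 |
G_UNICODE_SCRIPT_WANCHO | Wancho. Since: 2.62 |
G_UNICODE_SCRIPT_CHORASMIAN | Chorasmian. Since: 2.66 |
G_UNICODE_SCRIPT_DIVES_AKURU | Dives Akuru. Since: 2.66 |
G_UNICODE_SCRIPT_KHITAN_SMALL_SCRIPT | Khitan small script. Since: 2.66 |
G_UNICODE_SCRIPT_YEZIDI | Yezidi. Since: 2.66 |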
The GNormalizeMode enumeration defines how a Unicode string is transformed into a canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. Unicode strings should generally be normalized before comparing them.
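G_NORMALIZE_DEFAULT | standardize differences that do not affect the text content, such as the above-mentioned accent representation |
G_NORMALIZE_NFD | another name for G_NORMALIZE_DEFAULT |
G_NORMALIZE_DEFAULT_COMPOSE | like G_NORMALIZE_DEFAULT, but with composed forms rather than a maximally decomposed form |
G_NORMALIZE_NFC | another name for G_NORMALIZE_DEFAULT_COMPOSE |
G_NORMALIZE_ALL | beyond G_NORMALIZE_DEFAULT also standardize the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same |
G_NORMALIZE_NFKD | another name for G_NORMALIZE_ALL |
G_NORMALIZE_ALL_COMPOSE | like G_NORMALIZE_ALL, but with composed forms rather than a maximally decomposed form |
G_NORMALIZE_NFKC | another name for G_NORMALIZE_ALL_COMPOSE |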