General Programming |
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Wed Aug 02, 2006 4:07 pm Post subject: Unicode |
|
|
At C Board: Unicode + Name Resolution
"So I guess gethostbyname is deprecated, and I am trying to be all
polite, and trying to make a unicode compatible application with the new
name resolution technique of getaddrinfo + getnameinfo. I end up with a
dotted IP address representing the host, as a Unicode string."! |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Sat Sep 16, 2006 11:01 am Post subject: |
|
|
Unicode discussions at JoS:
1. C++ ISO8859-1 to UTF-8
"What is the best way of converting the encoding of a std::string from
ISO8859-1 to UTF-8 ?
I'm working internally with ISO8859-1, but would like to call the dll from an
Oracle extproc process, where Oracle uses UTF-8."
2. Lost in Unicode
"It is all so confusing: wchar_t, char, UTF, multibyte, ANSI, etc.. etc.." |
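For the first question above (ISO 8859-1 to UTF-8), the conversion is mechanical, because every ISO 8859-1 byte value equals its Unicode code point: bytes below 0x80 are copied, and every other byte becomes a two-byte UTF-8 sequence. A minimal sketch in C, assuming the input really is ISO 8859-1 and not Windows-1252; the function name latin1_to_utf8 is made up for illustration and does not come from the linked thread:

```c
#include <stddef.h>

/* Hypothetical helper: convert `len` ISO 8859-1 bytes from `in` to UTF-8.
   `out` must have room for 2 * len + 1 bytes (worst case plus the NUL).
   Returns the number of UTF-8 bytes written, not counting the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                                  /* ASCII maps to itself */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte 110000xx */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    out[n] = '\0';
    return n;
}
```

With a std::string, the same loop can run over its data() and size(). For encodings other than ISO 8859-1, a conversion library such as iconv is the safer route.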
|
|
|
Vic Guest
|
Posted: Tue Oct 03, 2006 6:25 pm Post subject: |
|
|
Catching up with Unicode 5.0
"Unicode 5.0 was released a week ago: congratulations to all concerned.
Unicode now has about 99,000 characters defined, though many of the
improvements in Unicode 5.0 are related to how to use characters (their
properties or display algorithms) rather than additions. There are only 1369
new characters compared to Unicode 4.1; and no milestone for
implementations such as Unicode 3.1 in 2001 when the number of
characters broke the 16-bit range." |
|
|
|
Mao Guest
|
Posted: Tue Oct 10, 2006 12:38 pm Post subject: |
|
|
JoS: why my software show chinese charactor as "????"
"Here's the rule of thumb:
1. question marks (sometimes upside down question marks): conversion
into a character set, even an intermediate character set, that does not
support the characters.
2. little squares (or other special symbol): properly encoded, but the font
does not support the characters.
3. mixture of garble and little squares (mojibake): a conversion was done
assuming the wrong character set." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Fri Jan 05, 2007 6:03 pm Post subject: |
|
|
Raymond Chen: What('s) a character!
"All documentation that previously used byte to describe the size of
textual data had to be changed to read "the size of the buffer in bytes if
calling the ANSI version of the function or in WCHARs if calling the Unicode
version of the function." A few years ago the Platform SDK team accepted
my suggestion to adopt the less cumbersome "the size of the buffer in
TCHARs." Newer documentation from the core topics of the Platform SDK
tends to use this alternate formulation." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Tue Apr 17, 2007 9:58 pm Post subject: |
|
|
RC: The Notepad file encoding problem, redux
"If a BOM is found, then life is easy, since the BOM tells you what encoding
the file uses. The problem is when there is no BOM. Now you have to guess,
and when you guess, you can guess wrong." |
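The "BOM is found" case can be sketched as a comparison against the well-known BOM byte sequences. This is only an illustration, not how Notepad does it; the enum and the function name detect_bom are made up for this sketch:

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
};

/* Inspect the first bytes of a buffer and report which BOM, if any, it starts with. */
enum bom_kind detect_bom(const unsigned char *p, size_t len)
{
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-32 is checked before UTF-16 because FF FE 00 00 also starts with
       the UTF-16 LE BOM. A UTF-16 LE file whose first character is U+0000
       would be misreported here, which is one way a guess can go wrong. */
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF32_LE;
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32_BE;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

A BOM_NONE result only means that no marker was present. The file still has to be guessed, as the quote says.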
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Fri May 04, 2007 9:40 pm Post subject: |
|
|
cboard: Wide character (unicode) and multi-byte character
"I am more confused when I saw sometimes we need codepage parameter
for wide character conversion, and sometimes we do not need for conversion.
Here are two examples, code page is used in WideCharToMultiByte() when
dealing with unicode character ... code page is not used in wcstombs() when
dealing with unicode character." |
|
|
|
3Plex
Joined: 18 May 2007 Posts: 5
|
Posted: Tue May 29, 2007 4:24 pm Post subject: |
|
|
Safe from the Losing Fight - wchar_t: Unsafe at any size
At the time that this was happening, I happened to work for Macromedia
(now Adobe). Being the most important company that implements Flash,
some of the Apple execs came down and talked to the Mac engineers at
Macromedia. When the appropriate time came, I sprang into action
demanding to know what would be done about wchar_t.
There was stunned silence.
“What’s wchar_t?” was the first answer. After explaining it, the next answer
was “We don’t implement that.” After pointing them to their own documentation,
the next answer was “Oh. Huh. Well, why did you use it? We don’t use
that crap. Use CFString instead!” After slamming my head against the table,
I attempted to explain wchar_t was used everywhere in our codebase,
and CFString wasn’t cross platform. “Sure it is! It works on both Mac OS 9
and Mac OS X!” |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Sun Jun 10, 2007 10:01 pm Post subject: |
|
|
Carbon List: How to use Unicode with Carbon
"I would like to create a Carbon application and use UTF-8 encoded strings
for its controls (if necessary I can also first convert to UTF-16 and then send
the string to Carbon)." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Wed Jul 04, 2007 12:30 am Post subject: |
|
|
RC: If the system says that an embedded string could not be converted
from Unicode to ANSI, maybe it's trying to tell you something
"No matter what ANSI code page you pick, there will be Unicode characters
that cannot be expressed in it. (And no, you can't set your ANSI code page
to UTF-8. Michael Kaplan discussed it last October, and before that, last July,
and before that, a week and a half previous (still July), and before that, two
years ago February. I think Michael might need to change the subtitle of his
blog to "Explaining why the ANSI code page can't be UTF-8 since 2005".)" |
|
|
|
XNote Kapetan
Joined: 16 Jun 2006 Posts: 532
|
Posted: Wed Aug 01, 2007 2:26 pm Post subject: |
|
|
JoS: Character encodings
"I am not sure I understand, how a web page that supports Japanese text
would be encoded in UTF-8, if the character set goes beyond 8 bits." |
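The short answer to the question is that UTF-8 spends more than one byte on a character when the code point does not fit in 7 bits, so a Japanese character typically takes three bytes. A minimal sketch of the encoding step in C; the function name utf8_encode is hypothetical and not taken from the linked discussion:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into `out`.
   Returns the number of bytes written (1 to 4), or 0 for surrogates
   (U+D800..U+DFFF) and values above U+10FFFF, which cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not characters */
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+65E5 (the first character of "Japan" in Japanese) comes out as the three bytes E6 97 A5, while plain ASCII letters stay one byte each.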
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Fri Sep 14, 2007 5:27 pm Post subject: |
|
|
RC: The code page on the server is not necessarily the code page on the client
"The correct solution is to use FormatMessageW followed by WideCharToMultiByte(x),
where x is the OEM code page of the client. You need to get this information from the
client to the server somehow so that the server knows what character set the client
is going to use for displaying strings." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Sun Sep 16, 2007 7:49 pm Post subject: |
|
|
ubuntuforums - Writing "UTF-8" programs
"Just a tip: GTK is UTF ready, in case you are going to use it."
...
"Python 3.0 will have unicode variable names, and all strings are unicode,
coming next year. You can have unicode strings (explicit, default is ASCII)
and unicode in comments right now." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Wed Oct 31, 2007 10:57 am Post subject: |
|
|
ibm developerworks - The Pango connection: Part 1
"Pango is an open-source framework for the layout and rendering of
internationalized text, including right-to-left scripts and scripts such as
Tamil where glyphs are context-sensitive. Not surprisingly, Pango uses
Unicode characters internally (represented using UTF-8), and Pango's
interfaces also use UTF-8. Other encodings can be supported by using a
translation library such as GNU iconv to convert the text to UTF-8 before
processing."
... and Part 2
Tony Graham is the author of Unicode: A Primer, the first and currently
only book about the Unicode Standard, Version 3.0, and its uses. |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Thu Nov 01, 2007 7:01 pm Post subject: |
|
|
Apple: Character Encodings and Their Internet Names
"Table A-1 lists character encodings for various languages, gives some
of their common Internet names, and identifies the version of the Text
Encoding Conversion Manager for which character encoding was first
supported for use by the Text Encoding Converter and the Unicode
Converter." |
|
|
|
XNote Kapetan
Joined: 16 Jun 2006 Posts: 532
|
Posted: Wed Nov 07, 2007 2:46 pm Post subject: |
|
|
JoS: Guessing text encoding
"I have an app that parses text log files. If the log file happens to have a
Byte Order Mark, that makes it easy to detect if the file is UTF-8, UTF-16,
UNICODE, etc. However, many log files don't have such a marking." |
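When there is no BOM, one common heuristic (not necessarily the one suggested in the linked thread) is to test whether the bytes are well-formed UTF-8. Text in a legacy 8-bit encoding that contains non-ASCII characters rarely passes by accident. A sketch in C, with the hypothetical name is_valid_utf8:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if the buffer is well-formed UTF-8.
   Rejects stray continuation bytes, truncated sequences, overlong forms,
   surrogates, and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t extra;            /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;       /* stray continuation or invalid lead byte */

        if (len - i <= extra)
            return false;        /* sequence runs past the end of the buffer */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;    /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;        /* overlong, out of range, or surrogate */
        i += extra + 1;
    }
    return true;
}
```

Pure ASCII also passes, which is harmless because ASCII reads the same in UTF-8. A failed check only says "not UTF-8"; which legacy code page was used remains a guess.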
|
|
|
XNote Kapetan
Joined: 16 Jun 2006 Posts: 532
|
Posted: Fri Feb 22, 2008 3:55 pm Post subject: |
|
|
JoS: Unicode standardization proposal
"UTF-XX encodings use variable-count-of-bytes-per-char approach. There is
an alternative UCS-XX encodings that support same-count-of-bytes-per-char,
useful for programming.
The same problem already existed with the standard one-byte-per-char ASCII
encoding, and the variable-count-of-bytes-per-char MBCS encoding...
UCS-4 ALWAYS uses 4 bytes per char, and as far as I know, it supports all
known commonly used languages, from English to Chinese. Many developers
currently use UCS-4 encoding internally in their applications, even if they have
to load and save files as their UTF-XX counterparts." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Tue May 06, 2008 7:54 pm Post subject: |
|
|
Google: Moving to Unicode 5.1
"Web pages can use a variety of different character encodings, like ASCII,
Latin-1, or Windows 1252, or Unicode. Most encodings can only represent a
few languages, but Unicode will handle anything from Chinese to French to
Arabic." |
|
|
|
XNote Kapetan
Joined: 16 Jun 2006 Posts: 532
|
Posted: Tue Jun 03, 2008 2:34 pm Post subject: |
|
|
Canonical - Counting Characters in UTF-8 Strings Is Fast
"So then I thought about how to do what Aristotle was suggesting. In UTF-8,
bytes that start new characters begin either with binary 0 or binary 11; the
second and subsequent bytes of multibyte characters have binary 10 as their
high bits. So to count the characters, you just have to count the bytes that
don't begin with binary 10." |
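The counting rule in the quote fits in a few lines of C. This is a plain byte-at-a-time sketch, not the optimized version the linked article goes on to discuss; the name utf8_count_chars is made up, and the input is assumed to be valid, NUL-terminated UTF-8:

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a
   NUL-terminated UTF-8 string by counting every byte that is not a
   continuation byte, i.e. whose top two bits are not binary 10. */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Note that this counts code points, so a base letter followed by a combining accent counts as two.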
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Mon Jun 09, 2008 6:58 pm Post subject: |
|
|
List of Locale ID (LCID) Values as Assigned by Microsoft
"The following table lists the locales/languages with an assigned LCID. The
purpose of the document is to help developers who are defining NLS services
(sorting, time/date formatting, and keyboards/IMEs) for locales that do
not yet have native support in Windows to avoid conflict."
Language Codes: ISO 639, Microsoft and Macintosh
"Macintosh constants and codes are defined in enumerations in the Mac
header file Script.h and Windows constants and codes are defined in the
Platform SDK header file winnt.h. Note that many of the Microsoft codes
have no 'Windows Name' constant (these are marked "(no constant
defined)" and refer to codes that have been reserved for the languages
in question)." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Thu Jul 03, 2008 5:57 pm Post subject: |
|
|
The Truth About Unicode In Python
"In this post, I'm going to talk about a couple of the problems with unicode
in Python. Please note that this is not intended as a criticism of Python's
unicode support or the people who designed and implemented it. Most of
those people probably know a whole lot more about unicode than I do, and
the limitations discussed here are the result of a pragmatic approach to
implementing unicode support, rather than due to a lack of knowledge." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Mon Sep 08, 2008 5:51 pm Post subject: |
|
|
uf - Unicode in linux
"In this case I have to get some unicode strings from the server and show
them in a GUI such as qt to the user !!! I guess I have to use wstring, Is it
enough or I have to use Ansi2Unicode and vice versa to parse it?" |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Tue Sep 16, 2008 10:54 pm Post subject: |
|
|
code-jam - Unicode
"Unicode is a standard way of character encoding which was designed to
replace all old encodings such as ASCII using the Unicode standard transformation." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Wed Oct 15, 2008 1:17 pm Post subject: |
|
|
CodeProject: The Complete Guide to C++ Strings, Part I - Win32
Character Encodings, by Michael Dunn
"You've undoubtedly seen all these various string types like TCHAR, std::string,
BSTR, and so on. And then there are those wacky macros starting with _tcs.
And you're staring at the screen thinking "wha?" Well stare no more, this guide
will outline the purpose of each string type, show some simple usages, and
describe how to convert to other string types when necessary." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Mon Feb 23, 2009 1:56 pm Post subject: |
|
|
cboard - Printing Unicode to console
"It's supposed to print some Japanese characters, but all I see is ????. How
do I get the characters to display properly?" |
|
|
|
XNote Kapetan
Joined: 16 Jun 2006 Posts: 532
|
Posted: Mon Mar 02, 2009 2:16 pm Post subject: |
|
|
secondlife.com - Unicode In 5 Minutes
"Unicode is a standard for digital processing of written characters and text.
By enabling the exchange of text data internationally, it is a foundation for
global software." |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Fri Apr 24, 2009 2:51 pm Post subject: |
|
|
Bell Labs - The history of UTF-8 as told by Rob Pike
"UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey
diner one night in September or so 1992." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Sat Apr 17, 2010 12:38 pm Post subject: |
|
|
MF - How to replace ASCII with Unicode/UTF-8?
"How can you convert a program written in C (using ASCII) so that it can
handle Unicode strings?
I realize there is no simple answer to this question, so I'm posting the source
code from two short programs (from Dave Mark's book on C) that I want to
convert to handle Unicode." |
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Sat Aug 20, 2011 2:43 am Post subject: |
|
|
se - Should UTF-16 be considered harmful?
"How many programmers are aware of the fact that UTF-16 is actually a
variable length encoding? By this I mean that there are code points that,
represented as surrogate pairs, take more than one element.
I know; lots of applications, frameworks and APIs use UTF-16, such as
Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode
library, etc. However, with all of that, there are lots of basic bugs in the
processing of characters out of BMP (characters that should be encoded
using two UTF-16 elements)." |
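To make the "two UTF-16 elements" point concrete, here is a small C sketch of how a code point outside the BMP is split into a surrogate pair. The function name utf16_encode is hypothetical and the code is only an illustration of the encoding rule:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns 1 for BMP characters, 2 for characters above U+FFFF (a
   surrogate pair), and 0 for values that cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;                               /* not a valid scalar value */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one unit, same value */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: low 10 bits */
    return 2;
}
```

U+1F600, for instance, becomes the pair D83D DE00. Code that assumes one unit per character will split such a pair when it truncates, reverses, or indexes a string.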
|
|
|
delovski
Joined: 14 Jun 2006 Posts: 3524 Location: Zagreb
|
Posted: Mon Aug 26, 2013 7:55 pm Post subject: |
|
|
10kloc - Plain Text Doesn’t Exist… Unicode and encodings demystified
Byte Order Mark
If you wish to transfer documents between Little and Big Endian systems
in Unicode, UTF-8 and UTF-16 support a convention known as the Byte
Order Mark. Put simply, Byte Order Mark (BOM) denotes the Endianness
of the document, in the document itself. Encoding is marked by putting
2-bytes or 1 character in UTF-16, FE FF, at the start of every document,
and depending on the Endianness of the system, this will appear as either
FF FE or FE FF, giving the reader an immediate hint of the encoding. |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Thu Dec 07, 2017 1:02 am Post subject: |
|
|
https://github.com/google/unigem-objective-c
This repository contains Unicode Gems, a Mac app, an iOS app, and an iOS keyboard that makes it easy for you to use interesting typefaces in contexts that don't allow fonted text.
As an iOS app, you get an iPhone UI, an iPad UI, and iPad split view support. |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Wed Jun 07, 2023 7:17 pm Post subject: |
|
|
r - cuneicode, and the Future of Text in C
"If you look up what 'ANSI_X3.4-1968' means, you'll find that it's the most
obnoxious and fancy way to spell a particularly old encoding. That is to say,
my default locale when I ask and use it in C or C++ -- on my brand new
Ubuntu 20.04 Focal LTS server, achieved from just pressing 'ok' to all the
setup options, installing build essentials, and then going out of my way to
get the most advanced Clang I can and combine it with the most up-to-date
glibc and libstdc++ I can -- is ASCII.
Not UTF-8. Not Latin-1!" |
|
|
|
Ike Kapetan
Joined: 17 Jun 2006 Posts: 3136 Location: Europe
|
Posted: Thu Aug 24, 2023 3:03 pm Post subject: |
|
|
Compatibility of printf with utf-8 encoded strings
"In order to get alignment, you want to count characters. Then, pass the bytes
count to printf. That can be achieved by using the * precision and passing the
count of bytes. For example, since accented e takes two bytes:
printf("'%-4.*s'\n", 6, "éléphant");" |
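The byte count does not have to be worked out by hand. A rough C sketch that computes it for the first N characters and passes it as the precision; both function names are invented for this example, and the string is assumed to be valid UTF-8:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: number of bytes taken by the first `nchars`
   characters of UTF-8 string `s` (the whole string if it is shorter). */
size_t utf8_prefix_bytes(const char *s, size_t nchars)
{
    size_t i = 0;
    while (s[i] != '\0') {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {  /* start of a character */
            if (nchars == 0)
                break;
            nchars--;
        }
        i++;
    }
    return i;
}

/* Hypothetical helper: write the first `nchars` characters of `s`, in
   single quotes, into `buf`. The precision given to %.*s is in bytes. */
int format_prefix(char *buf, size_t buflen, const char *s, size_t nchars)
{
    return snprintf(buf, buflen, "'%.*s'",
                    (int)utf8_prefix_bytes(s, nchars), s);
}
```

For "éléphant" and 4 characters the helper returns 6 bytes, matching the figure in the quote. The field width is counted in bytes too, so padding to a fixed number of columns needs the width increased by the difference between bytes and characters.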
|
|
|
|