Igor Delovski Board
My Own Personal Slashdot!
 

Unicode

 
General Programming  
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Wed Aug 02, 2006 4:07 pm    Post subject: Unicode

At C Board: Unicode + Name Resolution

"So I guess gethostbyname is deprecated, and I am trying to be all
polite, and trying to make a unicode compatible application with the new
name resolution technique of getaddrinfo + getnameinfo. I end up with a
dotted IP address representing the host, as a Unicode string."!
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Wed Aug 02, 2006 4:18 pm

Raymond Chen on Unicode and codepages:

1. On the fuzzy definition of a "Unicode application"
2. Keep your eye on the code page, practical exam
3. Why is the default console codepage called "OEM"?
4. Unicode collation is hard
5. TEXT vs. _TEXT vs. _T, and UNICODE vs. _UNICODE
6. Case mapping on Unicode is hard
7. Don't forget to #define UNICODE if you want Unicode

A reference to some other blog: Excellent blog about Windows and Unicode

"Michael Kaplan has probably forgotten more about Unicode than most people know."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Sat Sep 16, 2006 11:01 am

Unicode discussions at JoS:

1. C++ ISO8859-1 to UTF-8

"What is the best way of converting the encoding of a std::string from
ISO8859-1 to UTF-8 ?

I'm working internally with ISO8859-1, but would like to call the dll from an
Oracle extproc process, where Oracle uses UTF-8."


2. Lost in Unicode

"It is all so confusing: wchar_t, char, UTF, multibyte, ANSI, etc.. etc.."
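For the first question above (ISO 8859-1 to UTF-8), the conversion is simple because the 256 ISO 8859-1 values are the first 256 Unicode code points. The asker works with std::string in C++; what follows is only a minimal C sketch on raw buffers, and the name latin1_to_utf8 is made up for illustration.

```c
#include <stddef.h>

/* Convert n bytes of ISO 8859-1 text in src to UTF-8 in dst.
 * dst must have room for 2*n + 1 bytes (worst case: every byte >= 0x80).
 * Returns the number of bytes written, not counting the terminating NUL.
 * Hypothetical helper, not taken from the linked thread. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                  /* ASCII is unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte 110000xx */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    dst[out] = '\0';
    return out;
}
```

Every byte below 0x80 is copied as-is, and every other byte becomes exactly two bytes, so the output buffer never needs more than twice the input length plus one.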
Vic
Guest





Posted: Tue Oct 03, 2006 6:25 pm

Catching up with Unicode 5.0

"Unicode 5.0 was released a week ago: congratulations to all concerned.
Unicode now has about 99,000 characters defined, though many of the
improvements in Unicode 5.0 are related to how to use characters (their
properties or display algorithms) rather than additions. There are only 1369
new characters compared to Unicode 4.1; and no milestone for
implementations such as Unicode 3.1 in 2001 when the number of
characters broke the 16-bit range."
Mao
Guest





Posted: Tue Oct 10, 2006 12:38 pm

JoS: why my software show chinese charactor as "????"

"Here's the rule of thumb:

1. question marks (sometimes upside down question marks): conversion
into a character set, even an intermediate character set, that does not
support the characters.
2. little squares (or other special symbol): properly encoded, but the font
does not support the characters.
3. mixture of garble and little squares (mojibake): a conversion was done
assuming the wrong character set."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Fri Jan 05, 2007 6:03 pm

Raymond Chen: What('s) a character!

"All documentation that previously used byte to describe the size of
textual data had to be changed to read "the size of the buffer in bytes if
calling the ANSI version of the function or in WCHARs if calling the Unicode
version of the function." A few years ago the Platform SDK team accepted
my suggestion to adopt the less cumbersome "the size of the buffer in
TCHARs." Newer documentation from the core topics of the Platform SDK
tends to use this alternate formulation."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Tue Apr 17, 2007 9:58 pm

RC: The Notepad file encoding problem, redux

"If a BOM is found, then life is easy, since the BOM tells you what encoding
the file uses. The problem is when there is no BOM. Now you have to guess,
and when you guess, you can guess wrong."
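The "easy" half of that quote, checking for a BOM, can be sketched in C as below. The enum and the function name detect_bom are invented for this illustration; the byte patterns are the standard BOM encodings for UTF-8, UTF-16 and UTF-32.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
};

/* Look at the first bytes of a buffer and report which BOM, if any, it
 * starts with. BOM_NONE means the caller is back to guessing. */
enum bom detect_bom(const unsigned char *buf, size_t len)
{
    /* UTF-32LE must be tested before UTF-16LE: FF FE 00 00 starts with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32LE;
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32BE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;
}
```

Even this is not fully unambiguous: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE by this sketch.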
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Fri May 04, 2007 9:40 pm

cboard: Wide character (unicode) and multi-byte character

"I am more confused when I saw sometimes we need codepage parameter
for wide character conversion, and sometimes we do not need for conversion.
Here are two examples, code page is used in WideCharToMultiByte() when
dealing with Unicode character ... code page is not used in wcstombs() when
dealing with Unicode character."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Tue May 15, 2007 8:47 pm

cboard: How is REAL Unicode string included and displayed in a C program?

"The problem here is how one-byte ASCII characters and two-byte Unicodes
are mixed into a TEXT(...) string and the compiler can correctly identify them..."
3Plex



Joined: 18 May 2007
Posts: 5

Posted: Tue May 29, 2007 4:24 pm

Safe from the Losing Fight - wchar_t: Unsafe at any size

At the time that this was happening, I happened to work for Macromedia
(now Adobe). Being the most important company that implements Flash,
some of the Apple execs came down and talked to the Mac engineers at
Macromedia. When the appropriate time came, I sprang into action
demanding to know what would be done about wchar_t.

There was stunned silence.

“What’s wchar_t?” was the first answer. After explaining it, the next answer
was “We don’t implement that.” After pointing them to their own documentation,
the next answer was “Oh. Huh. Well, why did you use it? We don’t use
that crap. Use CFString instead!” After slamming my head against the table,
I attempted to explain wchar_t was used everywhere in our codebase,
and CFString wasn’t cross platform. “Sure it is! It works on both Mac OS 9
and Mac OS X!”
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Sat Jun 09, 2007 11:24 pm

JoS: Would Unicode have been more popular?

"Often I wish they'd just made Unicode equal to UTF-8 from the beginning."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Sun Jun 10, 2007 10:01 pm

Carbon List: How to use Unicode with Carbon

"I would like to create a carbon application and use UTF-8 encoded strings
for its controls (if necessary I can also first convert to UTF-16 and then send
the string to Carbon)."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Wed Jul 04, 2007 12:30 am

RC: If the system says that an embedded string could not be converted
from Unicode to ANSI, maybe it's trying to tell you something


"No matter what ANSI code page you pick, there will be Unicode characters
that cannot be expressed in it. (And no, you can't set your ANSI code page
to UTF-8. Michael Kaplan discussed it last October, and before that, last July,
and before that, a week and a half previous (still July), and before that, two
years ago February. I think Michael might need to change the subtitle of his
blog to "Explaining why the ANSI code page can't be UTF-8 since 2005".)"
XNote
Kapetan


Joined: 16 Jun 2006
Posts: 532

Posted: Wed Aug 01, 2007 2:26 pm

JoS: Character encodings

"I am not sure I understand, how a web page that supports Japanese text
would be encoded in UTF-8, if the character set goes beyond 8 bits."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Fri Sep 14, 2007 5:27 pm

RC: The code page on the server is not necessarily the code page on the client

"The correct solution is to use FormatMessageW followed by WideCharToMultiByte(x),
where x is the OEM code page of the client. You need to get this information from the
client to the server somehow so that the server knows what character set the client
is going to use for displaying strings."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Sun Sep 16, 2007 7:49 pm

ubuntuforums - Writing "UTF-8" programs

"Just a tip: GTK is UTF ready, in case you are going to use it."
...
"Python 3.0 will have unicode variable names, and all strings are unicode,
coming next year. You can have unicode strings (explicit, default is ASCII)
and unicode in comments right now."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Wed Oct 31, 2007 10:57 am

ibm developerworks - The Pango connection: Part 1

"Pango is an open-source framework for the layout and rendering of
internationalized text, including right-to-left scripts and scripts such as
Tamil where glyphs are context-sensitive. Not surprisingly, Pango uses
Unicode characters internally (represented using UTF-8), and Pango's
interfaces also use UTF-8. Other encodings can be supported by using a
translation library such as GNU iconv to convert the text to UTF-8 before
processing."


... and Part 2

Tony Graham is the author of Unicode: A Primer, the first and currently
only book about the Unicode Standard, Version 3.0, and its uses.
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Thu Nov 01, 2007 7:01 pm

Apple: Character Encodings and Their Internet Names

"Table A-1 lists character encodings for various languages, gives some
of their common Internet names, and identifies the version of the Text
Encoding Conversion Manager for which character encoding was first
supported for use by the Text Encoding Converter and the Unicode
Converter."
XNote
Kapetan


Joined: 16 Jun 2006
Posts: 532

Posted: Wed Nov 07, 2007 2:46 pm

JoS: Guessing text encoding

"I have an app that parses text log files. If the log file happens to have a
Byte Order Mark, that makes it easy to detect if the file is UTF-8, UTF-16,
UNICODE, etc. However, many log files don't have such a marking."
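When there is no BOM, one common heuristic is to test whether the bytes are well-formed UTF-8, since text in a legacy 8-bit encoding that contains non-ASCII bytes rarely happens to be valid UTF-8. Below is a minimal C sketch of such a check; the name looks_like_utf8 is made up here, and the linked thread may suggest other approaches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when buf[0..len) is well-formed UTF-8.
 * Hypothetical helper; pure ASCII input also passes. */
bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;        /* continuation bytes expected after c */
        unsigned long min;   /* smallest code point allowed at this length */
        unsigned long cp;

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = c & 0x07; }
        else return false;   /* stray continuation byte or invalid lead byte */

        if (len - i <= extra)
            return false;    /* sequence is cut off by the end of the buffer */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* Reject overlong forms, UTF-16 surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A true result is only evidence, not proof: a file that passes may still be plain ASCII or, rarely, some other encoding that happens to form valid sequences.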
XNote
Kapetan


Joined: 16 Jun 2006
Posts: 532

Posted: Fri Feb 22, 2008 3:55 pm

JoS: Unicode standardization proposal

"UTF-XX encodings use variable-count-of-bytes-per-char approach. There is
an alternative UCS-XX encodings that support same-count-of-bytes-per-char,
useful for programming.

The same problem already existed with the standard one-byte-per-char ASCII
encoding, and the variable-count-of-bytes-per-char MBCS encoding...

UCS-4 ALWAYS uses 4 bytes per char, and as far as I know, it supports all
known commonly used languages, from English to Chinese. Many developers
currently use UCS-4 encoding internally in their applications, even if they have
to load and save files as their UTF-XX counterparts."
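The "UCS-4 internally, UTF-XX on disk" approach mentioned in the quote comes down to decoding on load. A minimal C sketch for the UTF-8 case follows; utf8_to_ucs4 is a hypothetical name, and the function assumes its input is already well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode NUL-terminated, well-formed UTF-8 into 32-bit code points.
 * Writes at most cap values to out and returns the number written.
 * No validation is done: malformed input yields unspecified values. */
size_t utf8_to_ucs4(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;   /* continuation bytes that follow the lead byte */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && *p != 0)
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        out[n++] = cp;
    }
    return n;
}
```

After decoding, out[i] is the i-th code point, which is the fixed-width indexing the poster is after, at the cost of four bytes per code point.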
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Tue May 06, 2008 7:54 pm

Google: Moving to Unicode 5.1

"Web pages can use a variety of different character encodings, like ASCII,
Latin-1, or Windows 1252, or Unicode. Most encodings can only represent a
few languages, but Unicode will handle anything from Chinese to French to
Arabic."
XNote
Kapetan


Joined: 16 Jun 2006
Posts: 532

Posted: Tue Jun 03, 2008 2:34 pm

Canonical - Counting Characters in UTF-8 Strings Is Fast

"So then I thought about how to do what Aristotle was suggesting. In UTF-8,
bytes that start new characters begin either with binary 0 or binary 11; the
second and subsequent bytes of multibyte characters have binary 10 as their
high bits. So to count the characters, you just have to count the bytes that
don't begin with binary 10."
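The counting rule in that quote translates almost directly into C. This is only a plain byte-at-a-time sketch (the linked article is about doing it fast, which this does not attempt), and utf8_count is a made-up name.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 encoding of "éléphant" this returns 8, while strlen on the same bytes returns 10, because each é takes two bytes.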
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Mon Jun 09, 2008 6:58 pm

List of Locale ID (LCID) Values as Assigned by Microsoft

"The following table lists the locales/languages with an assigned LCID. The
purpose of the document is to help developers who are defining NLS services
(sorting, time/date formatting, and keyboards/IMEs) for locales that do
not yet have native support in Windows to avoid conflict."


Language Codes: ISO 639, Microsoft and Macintosh

"Macintosh constants and codes are defined in enumerations in the Mac
header file Script.h and Windows constants and codes are defined in the
Platform SDK header file winnt.h. Note that many of the Microsoft codes
have no 'Windows Name' constant (these are marked "(no constant
defined)" and refer to codes that have been reserved for the languages
in question)."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Thu Jul 03, 2008 5:57 pm

The Truth About Unicode In Python

"In this post, I'm going to talk about a couple of the problems with unicode
in Python. Please note that this is not intended as a criticism of Python's
unicode support or the people who designed and implemented it. Most of
those people probably know a whole lot more about unicode than I do, and
the limitations discussed here are the result of a pragmatic approach to
implementing unicode support, rather than due to a lack of knowledge."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Mon Sep 08, 2008 5:51 pm

uf - Unicode in linux

"In this case I have to get some unicode strings from the server and show
them in a GUI such as qt to the user !!! I guess I have to use wstring, Is it
enough or I have to use Ansi2Unicode and vice versa to parse it?"
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Tue Sep 16, 2008 10:54 pm

code-jam - Unicode

"Unicode is a standard way of character encoding which was designed to replace
all old encodings, such as ASCII, using the Unicode standard transformation."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Wed Oct 15, 2008 1:17 pm

CodeProject: The Complete Guide to C++ Strings, Part I - Win32
Character Encodings, by Michael Dunn

"You've undoubtedly seen all these various string types like TCHAR, std::string,
BSTR, and so on. And then there are those wacky macros starting with _tcs.
And you're staring at the screen thinking "wha?" Well stare no more, this guide
will outline the purpose of each string type, show some simple usages, and
describe how to convert to other string types when necessary."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Mon Feb 23, 2009 1:56 pm

cboard - Printing Unicode to console

"It's supposed to print some Japanese characters, but all I see is ????. How
do I get the characters to display properly?"
XNote
Kapetan


Joined: 16 Jun 2006
Posts: 532

Posted: Mon Mar 02, 2009 2:16 pm

secondlife.com - Unicode In 5 Minutes

"Unicode is a standard for digital processing of written characters and text.
By enabling the exchange of text data internationally, it is a foundation for
global software."
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Fri Apr 24, 2009 2:51 pm

Bell Labs - The history of UTF-8 as told by Rob Pike

"UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey
diner one night in September or so 1992."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Sat Apr 17, 2010 12:38 pm

MF - How to replace ASCII with Unicode/UTF-8?

"How can you convert a program written in C (using ASCII) so that it can
handle Unicode strings?

I realize there is no simple answer to this question, so I'm posting the source
code from two short programs (from Dave Mark's book on C) that I want to
convert to handle Unicode."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Sat Aug 20, 2011 2:43 am

se - Should UTF-16 be considered harmful?

"How many programmers are aware of the fact that UTF-16 is actually a
variable length encoding? By this I mean that there are code points that,
represented as surrogate pairs, take more than one element.

I know; lots of applications, frameworks and APIs use UTF-16, such as
Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode
library, etc. However, with all of that, there are lots of basic bugs in the
processing of characters out of BMP (characters that should be encoded
using two UTF-16 elements)."
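To make the "two UTF-16 elements" point concrete, here is a minimal C sketch of how one code point becomes one or two 16-bit units. The function name utf16_encode is hypothetical; the arithmetic is the standard surrogate-pair mapping.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes one or two 16-bit units
 * to out and returns how many were written (0 for an invalid code point). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* not a Unicode scalar value */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

U+20AC fits in a single unit, while U+1F600 comes out as the pair D83D DE00, so code that treats one 16-bit unit as one character will split it in half.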
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Mon Nov 12, 2012 11:27 pm

Kunststube - What Every Programmer Absolutely, Positively Needs
To Know About Encodings And Character Sets To Work With Text


"I hope this article can shed some more light on what exactly an encoding
is and just why all your text screws up when you least need it. This article
is aimed at developers (with a focus on PHP), but any computer user should
be able to benefit from it."
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Mon Aug 26, 2013 7:55 pm

10kloc - Plain Text Doesn’t Exist… Unicode and encodings demystified

Byte Order Mark
If you wish to transfer documents between Little and Big Endian systems
in Unicode, UTF-8 and UTF-16 support a convention known as the Byte
Order Mark. Put simply, Byte Order Mark (BOM) denotes the Endianness
of the document, in the document itself. Encoding is marked by putting
2-bytes or 1 character in UTF-16, FE FF, at the start of every document,
and depending on the Endianness of the system, this will appear as either
FF FE or FE FF, giving reader immediate hint of the encoding.
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Thu Dec 07, 2017 1:02 am

https://github.com/google/unigem-objective-c

This repository contains Unicode Gems, a Mac app, an iOS app, and an iOS keyboard that makes it easy for you to use interesting typefaces in contexts that don't allow fonted text.

As an iOS app, you get an iPhone UI, an iPad UI, and iPad split view support.
delovski



Joined: 14 Jun 2006
Posts: 3522
Location: Zagreb

Posted: Tue Feb 06, 2018 2:48 pm

ms - Code Page 1250 Windows Latin 2 (Central Europe)

App UI >> Globalization and Localization >> Appendix H Code Pages
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Wed Jun 07, 2023 7:17 pm

r - cuneicode, and the Future of Text in C

"If you look up what 'ANSI_X3.4-1968' means, you'll find that it's the most
obnoxious and fancy way to spell a particularly old encoding. That is to say,
my default locale when I ask and use it in C or C++ -- on my brand new
Ubuntu 20.04 Focal LTS server, achieved from just pressing 'ok' to all the
setup options, installing build essentials, and then going out of my way to
get the most advanced Clang I can and combine it with the most up-to-date
glibc and libstdc++ I can -- is ASCII.

Not UTF-8. Not Latin-1!"
Ike
Kapetan


Joined: 17 Jun 2006
Posts: 3025
Location: Europe

Posted: Thu Aug 24, 2023 3:03 pm

Compatibility of printf with utf-8 encoded strings

"In order to get alignment, you want to count characters. Then, pass the bytes
count to printf. That can be achieved by using the * precision and passing the
count of bytes. For example, since accented e takes two bytes:

printf("'%-4.*s'\n", 6, "éléphant");"
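The byte count in that call is worked out by hand. A minimal C sketch of computing it instead is shown below, assuming valid UTF-8 input; format_cell is a hypothetical helper, and it pads with an explicit empty-string field because printf's field width counts bytes, not characters.

```c
#include <stddef.h>
#include <stdio.h>

/* Format s into dst as a quoted cell exactly `width` characters wide:
 * truncated to `width` code points, or padded on the right with spaces.
 * Returns what snprintf returns. Assumes s is valid UTF-8. */
int format_cell(char *dst, size_t dstsize, const char *s, size_t width)
{
    size_t bytes = 0, chars = 0;
    while (s[bytes] != '\0') {
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {  /* start of a character */
            if (chars == width)
                break;                                   /* cell is full */
            chars++;
        }
        bytes++;
    }
    /* %.*s limits the output by bytes; %*s with "" supplies the padding. */
    return snprintf(dst, dstsize, "'%.*s%*s'",
                    (int)bytes, s, (int)(width - chars), "");
}
```

With width 4, "éléphant" is cut after 6 bytes (4 characters), and a lone "é" is followed by three spaces. This counts code points only; combining marks and double-width characters would still misalign.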
All times are GMT + 1 Hour