C Programming – How to Combine Two Code Pages in One Program for Optimal Efficiency

c++

I am making a program that reads from a file that has characters from two different alphabets (Cyrillic and German). However, when printed to the terminal, ö, ä and ü come out as ?.

So far, I have tried:

using system("chcp 1251")
changing the encoding of the file my program reads from

Is there any way for the program to read the characters from both alphabets? Is there some 'mixed code page' I have missed out on?

Code:

    void readG(){
        system("cls");
    
        // open the file in read mode
        fptr = fopen("C:\\Users\\pl\\projects\\sources\\lernwortschatz.txt", "r");
    
        // print title
        printf("LERNWORTSCHATZ\n");
        printf("A1\n");
        printf("-------------------------------------------------\n");
    
        // read and print the file's contents
        while(fgets(str, 10000, fptr))
        {
            printf("%s", str);
        }
    
        // close the file
        fclose(fptr);
    }

Example:
What is expected: ergänzen – попълвам

What comes out: erg?nzen – попълвам

What I am trying to do with the file:
I want to get all of its contents and then immediately print it without saving.

Part of the result in bytes:
30 20 2d 20 4d 65 69 6e 65 20 57 3f 72 74 65 72 20 69 6d 20 4b 75 72 73 0a
2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 0a
61 6e 73 65 68 65 6e 20 2d 20 ffffffe2 ffffffe8 ffffffe6 0a
64 61 73 20 42 69 6c 64 2c 2d 65 72 20 2d 20 ffffffea ffffffe0 fffffff0 fffffff2 ffffffe8 ffffffed ffffffea ffffffe0

Here is a part of the file the program reads:
ansehen – виж
das Bild,-er – картинка
hören – слушам
noch einmal – още един път
ankreuzen – зачерквам/попълвам

Here is this part of the file in hex representation:

00000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d0d 0a61 6e73  -----------..ans
00000010: 6568 656e 202d 20e2 e8e6 0d0a 6461 7320  ehen - .....das
00000020: 4269 6c64 2c2d 6572 202d 20ea e0f0 f2e8  Bild,-er - .....
00000030: edea e00d 0a68 3f72 656e 202d 20f1 ebf3  .....h?ren - ...
00000040: f8e0 ec0d 0a6e 6f63 6820 6569 6e6d 616c  .....noch einmal
00000050: 202d 20ee f9e5 20e5 e4e8 ed20 effa f20d   - ... .... ....
00000060: 0a61 6e6b 7265 757a 656e 202d 20e7 e0f7  .ankreuzen - ...

Best Answer

So, the file you have indeed does have two different single-byte character encodings on each line. That's quite the technical feat to have managed with any regular text editor! :)

Let's take the hören line 68 3f 72 65 6e 20 2d 20 f1 eb f3 f8 e0 ec as an example. I'm going to modify it a bit, because the hex dump you're showing is already broken: the byte 3F is a question mark, not F6, which is what ö would be in ISO-8859-1.

I'm going to use Python to illustrate the problems you'll face because it's good at dealing with various encodings.

>>> x = '68 f6 72 65 6e 20 2d 20 f1 eb f3 f8 e0 ec'
>>> b = bytes.fromhex(x)
>>> b
b'h\xf6ren - \xf1\xeb\xf3\xf8\xe0\xec'

If we just decode the hexadecimal encoding of those bytes into a bytestring, we can see its Python representation, where all of the printable 7-bit ASCII bytes are shown as themselves, but everything else is shown as an escape sequence. Don't be fooled, this is not human-readable text, it's just a sequence of bytes that partially looks readable.

Alright, so let's try to decode this into text as ISO-8859-1 (aka latin-1), which is close to the CP1252 code page.

>>> b.decode("latin-1")
'hören - ñëóøàì'

We can see that the ö for hören was decoded well, but the Cyrillic is unreadable mojibake.

Let's do it the other way, then:

>>> b.decode("cp1251")
'hцren - слушам'

The German comes out wrong, because the byte \xf6 is interpreted as ц in CP1251, but the Cyrillic checks out (according to Google Translate anyway).

So – if we were using Python, we'd decode this by splitting it and decoding each half:

>>> de_bytes, _, ru_bytes = b.partition(b" - ")
>>> (de_bytes.decode("latin-1"), ru_bytes.decode("cp1251"))
('hören', 'слушам')

(and this indeed prints out just fine on my Mac's terminal, and would also do so in Python UTF-8 Mode on Windows).

Now, back to C land: the issue is that fgets() and friends don't give a darn about encodings – they're all just bytes (though fgets() knows that the byte 0x0a (10 in decimal) is the newline character in ASCII encoding, and stops reading there).

When you read those bytes, you get exactly those bytes, and it's up to your app to interpret them. When you output those bytes using printf() on your regular Windows terminal, it will use the current console output codepage to translate the bytes into glyphs.

Technically, you could output these files correctly in your Windows terminal with something like

read a line
switch to codepage 1252 (SetConsoleOutputCP(1252);)
write out each Latin byte until you find space-dash-space
switch to codepage 1251 (SetConsoleOutputCP(1251);)
write out each Cyrillic byte until you're out of this line

... rinse and repeat.

Another option would be to convert your input to a Unicode encoding, e.g. UTF-8 or UTF-16. You'd still have to interpret each half of the lines differently, and UTF-8 in particular is a variable-width encoding, so strlen() gives you the byte count rather than the number of characters a reader sees, but at least your playing ground would be level enough so you could use some of the answers in Properly print utf8 characters in windows console.

C – Explanation of Obfuscated Code Contest 2006 Entry sykes2.c

The second table

Let's start by examining the second table, int shift = ";;;====~$::199"[(i*2&8) | (i/64)];. i/64 is the line number (6 to 0) and i*2&8 is 8 iff i is 4, 5, 6 or 7 mod 8.

if((i & 2) == 0) shift /= 8; shift = shift % 8 selects either the high octal digit (for i%8 = 0,1,4,5) or the low octal digit (for i%8 = 2,3,6,7) of the table value. The shift table ends up looking like this:

row col val
6   6-7 0
6   4-5 0
6   2-3 5
6   0-1 7
5   6-7 1
5   4-5 7
5   2-3 5
5   0-1 7
4   6-7 1
4   4-5 7
4   2-3 5
4   0-1 7
3   6-7 1
3   4-5 6
3   2-3 5
3   0-1 7
2   6-7 2
2   4-5 7
2   2-3 3
2   0-1 7
1   6-7 2
1   4-5 7
1   2-3 3
1   0-1 7
0   6-7 4
0   4-5 4
0   2-3 3
0   0-1 7

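or in tabular form (one line per row, 6 down to 0; columns 7 down to 0 from left to right):

    00005577
    11775577
    11775577
    11665577
    22773377
    22773377
    44443377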

Note that the author used the null terminator for the first two table entries (sneaky!).

This is designed after a seven-segment display, with 7s as blanks. So, the entries in the first table must define the segments that get lit up.

The first table

__TIME__ is a special macro defined by the preprocessor. It expands to a string constant containing the time at which the preprocessor was run, in the form "HH:MM:SS". Observe that it contains exactly 8 characters. Note that 0-9 have ASCII values 48 through 57 and : has ASCII value 58. The output is 64 characters per line, so that leaves 8 characters per character of __TIME__.

7 - i/8%8 is thus the index of __TIME__ that is presently being output (the 7- is needed because we are iterating i downwards). So, t is the character of __TIME__ being output.

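a ends up equalling the following in binary, depending on the input t:

    0 00111111
    1 00101000
    2 01110101
    3 01111001
    4 01101010
    5 01011011
    6 01011111
    7 00101001
    8 01111111
    9 01111011
    : 01000000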

Each number is a bitmap describing the segments that are lit up in our seven-segment display. Since the characters are all 7-bit ASCII, the high bit is always cleared. Thus, 7 in the segment table always prints as a blank. The second table looks like this with the 7s as blanks:

    000055
    11  55
    11  55
    116655
    22  33
    22  33
    444433

So, for example, 4 is 01101010 (bits 1, 3, 5, and 6 set), which prints as

----!!--
!!--!!--
!!--!!--
!!!!!!--
----!!--
----!!--
----!!--

To show we really understand the code, let's adjust the output a bit with this table:

      00
    11  55
    11  55
      66
    22  33
    22  33
      44

This is encoded as "?;;?==? '::799\x07". For artistic purposes, we'll add 64 to a few of the characters (since only the low 6 bits are used, this won't affect the output); this gives "?{{?}}?gg::799G" (note that the 8th character is unused, so we can actually make it whatever we want). Putting our new table in the original code:

main(_){_^448&&main(-~_);putchar(--_%64?32|-~7[__TIME__-_/8%8][">'txiZ^(~z?"-48]>>"?{{?}}?gg::799G"[_*2&8|_/64]/(_&2?1:8)%8&1:10);}

we get

          !!              !!                              !!   
    !!  !!              !!  !!  !!  !!              !!  !!  !! 
    !!  !!              !!  !!  !!  !!              !!  !!  !! 
          !!      !!              !!      !!                   
    !!  !!  !!          !!  !!      !!              !!  !!  !! 
    !!  !!  !!          !!  !!      !!              !!  !!  !! 
          !!              !!                              !!

just as we expected. It's not as solid-looking as the original, which explains why the author chose to use the table he did.

C – What Happens When You Don’t Free After malloc?

Just about every modern operating system will recover all the allocated memory space after a program exits. The only exception I can think of might be something like Palm OS where the program's static storage and runtime memory are pretty much the same thing, so not freeing might cause the program to take up more storage. (I'm only speculating here.)

So generally, there's no harm in it, except the runtime cost of having more storage than you need. Certainly in the example you give, you want to keep the memory for a variable that might be used until it's cleared.

However, it's considered good style to free memory as soon as you don't need it any more, and to free anything you still have around on program exit. It's more of an exercise in knowing what memory you're using, and thinking about whether you still need it. If you don't keep track, you might have memory leaks.

On the other hand, the similar admonition to close your files on exit has a much more concrete result: if you don't, the data you wrote to them might not get flushed, or if they're temp files, they might not get deleted when you're done. Also, database handles should have their transactions committed and then be closed when you're done with them. Similarly, if you're using an object-oriented language like C++ or Objective-C, not freeing an object when you're done with it will mean the destructor will never get called, and any resources the class is responsible for might not get cleaned up.

Related Solutions

C – Explanation of Obfuscated Code Contest 2006 Entry sykes2.c

C – What Happens When You Don’t Free After malloc?
