Andrius
2014-04-01 09:41:43 UTC
Hi all,
Recently I tried to open an MS-DOS source file (attached) from http://www.computerhistory.org/atchm/microsoft-ms-dos-early-source-code/ and bf failed with the error "Cannot display file, unknown characters found." The same file opens without any issues in many other text editors on MacOSX and Windows.
The problem originates in buffer_find_encoding() in document.c, which converts every file being opened to the internal UTF-8 encoding. I think the conversion might work differently on different platforms, which is why I want to report how it behaves on MacOSX.
Whether the conversion succeeds depends on two functions, g_convert() and g_utf8_validate(). The first issue is with g_utf8_validate(). If the string contains more than one null byte, it returns false, even if g_convert() produced no errors. This is why command.asm does not open: the file ends with a run of 0x00 bytes, so g_utf8_validate() returns false. The issue can be fixed by replacing g_utf8_validate(newbuf, wsize, NULL) with g_utf8_validate(newbuf, -1, NULL), which treats the buffer as a null-terminated string. However, some information may be lost, because the string is then truncated at the first null byte. I think we should show a message to the user about this (but we are in string freeze now, correct?).
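To make the trade-off concrete, here is a minimal sketch in plain C (no GLib), assuming the converted buffer and its byte length are available as in the g_utf8_validate(newbuf, wsize, NULL) call. Both helper names are hypothetical and are not part of the existing code. The first drops only trailing 0x00 padding, which would be enough for a file like command.asm; the second counts how many non-null bytes would disappear if the buffer were cut at the first null byte, which could decide whether a warning is needed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of buf once trailing 0x00 padding is dropped.
 * Null bytes in the middle of the buffer are left untouched. */
size_t len_without_trailing_nuls(const char *buf, size_t len)
{
    while (len > 0 && buf[len - 1] == '\0')
        len--;
    return len;
}

/* Hypothetical helper: number of non-null bytes that would be lost if buf
 * were treated as a null-terminated string, i.e. cut at the first 0x00. */
size_t bytes_lost_at_first_nul(const char *buf, size_t len)
{
    const char *nul = memchr(buf, '\0', len);
    size_t lost = 0;

    if (nul == NULL)
        return 0; /* no null byte at all, nothing is truncated */

    for (const char *p = nul; p < buf + len; p++) {
        if (*p != '\0')
            lost++;
    }
    return lost;
}
```

If bytes_lost_at_first_nul() returns 0, the null bytes are only padding at the end and truncating is harmless; a non-zero result means real content follows a null byte and the user should probably be told.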
The second issue is that when we get the "Cannot display file, unknown characters found." error, the file is still opened in a tab as a blank document. If somebody presses the "Save" button, the file is saved empty and all of its contents are lost. I think we should mark such files as read-only and not allow saving them.
Would it make sense to have a fallback conversion for when all attempts to convert the file have failed? For example, g_convert() never fails when ISO8859-1 is assumed. Unfortunately, it stops converting when it finds a null byte, or probably some other non-printable bytes such as 0x03, and produces a truncated result. This might be different on other platforms. I think a workaround would be to walk through the C string, replace control characters with something printable (for example, Unicode has control picture symbols at U+2400 to U+2420), and make such a file read-only. Then the user would at least be able to see what is wrong with the file they are trying to open. Would that make sense? Maybe such a converter is already available somewhere?
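A rough sketch of such a fallback converter, in plain C without GLib, might look like the following. The function name and the choices it makes are my assumptions, not existing code: input is treated as ISO8859-1, tab, line feed and carriage return are kept as they are, every other byte below 0x20 is mapped to the control picture at U+2400 plus the byte value, 0x7F is mapped to U+2421 (which is outside the range mentioned above), and bytes from 0x80 upwards are encoded as ordinary two-byte UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fallback: convert ISO8859-1 bytes to UTF-8, replacing
 * control characters with Unicode control pictures so nothing is lost
 * or truncated. Returns a malloc()ed, null-terminated string that the
 * caller must free(), or NULL if allocation fails. */
char *latin1_to_visible_utf8(const unsigned char *in, size_t in_len)
{
    /* Worst case: every input byte becomes a 3-byte control picture. */
    char *out = malloc(in_len * 3 + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = in[i];

        if (c == '\t' || c == '\n' || c == '\r') {
            out[o++] = (char)c;               /* keep normal whitespace */
        } else if (c < 0x20) {
            out[o++] = (char)0xE2;            /* U+2400 + c as UTF-8 */
            out[o++] = (char)0x90;
            out[o++] = (char)(0x80 + c);
        } else if (c == 0x7F) {
            out[o++] = (char)0xE2;            /* U+2421, symbol for DEL */
            out[o++] = (char)0x90;
            out[o++] = (char)0xA1;
        } else if (c < 0x80) {
            out[o++] = (char)c;               /* printable ASCII */
        } else {
            out[o++] = (char)(0xC0 | (c >> 6));   /* U+0080 to U+00FF */
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}
```

Because the length is passed explicitly, null bytes in the input do not stop the conversion, and the result contains no null bytes except the terminator, so it should pass validation as a null-terminated string.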
Looking forward to your comments.
Andrius