Discussion:
Fix problem with opening files
Andrius
2014-04-01 09:41:43 UTC
Hi all,
Recently I tried to open an MS-DOS source file (attached) from http://www.computerhistory.org/atchm/microsoft-ms-dos-early-source-code/ and bf failed with the error "Cannot display file, unknown characters found." The same file opens without any issues in many other text editors on Mac OS X and Windows.
The problem originates in buffer_find_encoding() in document.c, which converts every file being opened to the internal UTF-8 encoding. The conversion might behave differently on different platforms, which is why I want to report how it works on Mac OS X.
Whether the conversion succeeds depends on two functions, g_convert() and g_utf8_validate(). The first issue is with g_utf8_validate(). If the string contains embedded null characters, it returns false even if g_convert() produced no errors. This is why command.asm does not open: the file ends with a run of 0x00 bytes, so g_utf8_validate() returns false. This can be fixed by replacing g_utf8_validate(newbuf, wsize, NULL) with g_utf8_validate(newbuf, -1, NULL), which assumes a null-terminated string. However, some information may be lost, since the string is then effectively truncated at the first null character. I think we should show the user a message about this (but we are in string freeze now, correct?).
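To illustrate the truncation concern, here is a minimal standalone sketch in plain C. It is not GLib or Bluefish code, and the helper name bytes_hidden_by_first_nul is hypothetical. It counts how many bytes of a buffer are ignored once the buffer is treated as a null-terminated string, which is what passing -1 as the length implies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many bytes of a len-byte buffer would
 * be ignored if the buffer were read as a NUL-terminated C string.
 * The count includes the first NUL itself. */
size_t bytes_hidden_by_first_nul(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    const char *nul = (len > 0) ? memchr(buf, '\0', len) : NULL;
    if (nul == NULL)
        return 0;                       /* no embedded NUL: nothing hidden */
    return len - (size_t)(nul - buf);   /* the NUL and everything after it */
}
```

For the 5-byte buffer "ab\0cd" the function returns 3: the null itself plus the two bytes after it. A non-zero result is the case where a warning to the user would be appropriate.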
The second issue is that when we get the "Cannot display file, unknown characters found." error, the file is still opened in a tab as a blank document. If somebody presses the "Save" button, the file is saved empty and all its contents are lost. I think we should mark such files as read-only and not allow saving them.
Would it make sense to have a fallback conversion for when all attempts to convert the file have failed? For example, g_convert() never fails when ISO8859-1 is assumed. Unfortunately, it stops converting when it finds a null character (and probably some other non-printable characters such as 0x03) and produces a truncated result. This might be different on other platforms. I think a workaround would be to parse the C string, replace control characters with something printable (for example, Unicode has the control symbols U+2400 to U+2420), and make such a file read-only. Then one would at least be able to see what is wrong with the file being opened. Would that make sense? Maybe such a converter is already available somewhere?
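The control-character replacement idea can be sketched as follows. This is a standalone illustration in plain C, not the converter that was later committed to Bluefish; the function name latin1_to_utf8_visible and its signature are assumptions. It treats the input as ISO-8859-1, maps each C0 control byte other than tab, LF and CR to the matching control picture (U+2400 plus the byte value), and encodes everything else as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: converts len bytes of ISO-8859-1 input to a newly
 * allocated, NUL-terminated UTF-8 string. C0 control bytes other than
 * tab, LF and CR are replaced by the control picture U+2400 + byte.
 * *replaced (if not NULL) receives the number of replacements.
 * Returns NULL if allocation fails; the caller frees the result. */
char *latin1_to_utf8_visible(const unsigned char *src, size_t len,
                             size_t *replaced)
{
    assert(src != NULL || len == 0);

    /* Worst case: every input byte becomes a 3-byte control picture. */
    char *out = malloc(len * 3 + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0, count = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
            /* U+2400..U+241F encode as E2 90 80..9F in UTF-8 */
            out[o++] = (char)0xE2;
            out[o++] = (char)0x90;
            out[o++] = (char)(0x80 | c);
            count++;
        } else if (c < 0x80) {
            out[o++] = (char)c;                    /* plain ASCII */
        } else {
            out[o++] = (char)(0xC0 | (c >> 6));    /* U+0080..U+00FF */
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    if (replaced != NULL)
        *replaced = count;
    return out;
}
```

A non-zero replacement count tells the caller that the displayed text differs from the file on disk, which is the condition under which the document would be made read-only.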
Looking forward to your comments.
Andrius
Olivier Sessink
2014-04-01 10:21:35 UTC
Post by Andrius
Hi all,
Recently I tried to open an MS-DOS source file (attached) from
http://www.computerhistory.org/atchm/microsoft-ms-dos-early-source-code/
and bf failed with the error "Cannot display file, unknown characters
found." The same file opens without any issues in many other text
editors on Mac OS X and Windows.
The problem originates in buffer_find_encoding() in document.c, which
converts every file being opened to the internal UTF-8 encoding. The
conversion might behave differently on different platforms, which is
why I want to report how it works on Mac OS X.
Whether the conversion succeeds depends on two functions, g_convert()
and g_utf8_validate(). The first issue is with g_utf8_validate(). If
the string contains embedded null characters, it returns false even if
g_convert() produced no errors. This is why command.asm does not open:
the file ends with a run of 0x00 bytes, so g_utf8_validate() returns
false. This can be fixed by replacing g_utf8_validate(newbuf, wsize,
NULL) with g_utf8_validate(newbuf, -1, NULL), which assumes a
null-terminated string. However, some information may be lost, since
the string is then effectively truncated at the first null character.
I think we should show the user a message about this (but we are in
string freeze now, correct?).
A different way to do this: if we fail converting, as a last step we can
try to validate with g_utf8_validate(newbuf, -1, NULL), and if that
succeeds we continue. Can you send that example file so I can try and
see what happens?
Post by Andrius
The second issue is that when we get the "Cannot display file, unknown
characters found." error, the file is still opened in a tab as a blank
document. If somebody presses the "Save" button, the file is saved
empty and all its contents are lost. I think we should mark such files
as read-only and not allow saving them.
good point!
Post by Andrius
Would it make sense to have a fallback conversion for when all
attempts to convert the file have failed? For example, g_convert()
never fails when ISO8859-1 is assumed. Unfortunately, it stops
converting when it finds a null character (and probably some other
non-printable characters such as 0x03) and produces a truncated
result. This might be different on other platforms. I think a
workaround would be to parse the C string, replace control characters
with something printable (for example, Unicode has the control symbols
U+2400 to U+2420), and make such a file read-only. Then one would at
least be able to see what is wrong with the file being opened. Would
that make sense? Maybe such a converter is already available
somewhere?
Looking forward to your comments.
Conversion is a tricky business: we have had many bugs that were
introduced by a change that solved some other problem, so we should at
least be very careful. Opening a partially converted file read-only and
allowing "save as" could solve this problem, but this is currently not
implemented in Bluefish.

Olivier
--
Bluefish website http://bluefish.openoffice.nl/
Blog http://oli4444.wordpress.com/
Andrius
2014-04-01 10:51:13 UTC
Olivier,
The test case, command.asm, was attached to the first email; I am attaching it once again.
I do not know whether text files can legally contain null characters. On Windows, Notepad just replaces nulls with spaces. There might be situations where the text continues after a null...
Andrius
Olivier Sessink
2014-04-01 12:24:02 UTC
Post by Andrius
Olivier,
The test case, command.asm, was attached to the first email; I am
attaching it once again.
I do not know whether text files can legally contain null characters.
On Windows, Notepad just replaces nulls with spaces. There might be
situations where the text continues after a null...
What if we replace the calls to g_utf8_validate() with calls to this
function? Would that work?

#include <glib.h>

gboolean
utf8_validate_accept_trailing_nul(const gchar *buffer, gsize buflen)
{
    const gchar *end = NULL;
    gsize i;

    if (g_utf8_validate(buffer, buflen, &end))
        return TRUE;

    /* end points at the first invalid byte; reject if nothing validated */
    if (end <= buffer)
        return FALSE;

    /* if all characters that are not valid are NUL characters, we
       accept the conversion */
    for (i = (gsize) (end - buffer); i < buflen; i++) {
        if (buffer[i] != 0)
            return FALSE;
    }
    return TRUE;
}


Olivier
Andrius
2014-04-01 12:29:22 UTC
I will try it this evening. I think it should work.
Olivier Sessink
2014-04-01 13:16:58 UTC
Post by Andrius
I will try it this evening. I think it should work.
It works on my test files, so I committed the change. Please run an
additional test.

Olivier
Andrius
2014-04-01 18:56:58 UTC
Olivier,
Indeed, command.asm now opens correctly. However, some other files, for example copy.asm, still do not open and produce the same error. I have attached the file; please check. It seems that a more robust solution is needed...
Andrius
Olivier Sessink
2014-04-01 19:11:53 UTC
Post by Andrius
Olivier,
Indeed, command.asm now opens correctly. However, some other files,
for example copy.asm, still do not open and produce the same error. I
have attached the file; please check. It seems that a more robust
solution is needed...
Andrius
I see. It has other control characters besides NUL. But other editors
also do not open it correctly.

I've added it to the roadmap that we need to improve loading of
corrupted files.

Olivier
Andrius
2014-04-01 20:52:12 UTC
Yeah, it has 0x0D after a run of 0x00 bytes, which is why it gives the error. I tried opening it with TextEdit (Mac), Notepad and WordPad (Windows); in all cases it opens without problems. It seems these extra null characters are the problem. I would just replace them with spaces (and warn the user about the replacement...).
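A minimal sketch of the replace-with-spaces idea, in plain C and independent of Bluefish (the function name replace_nul_with_space is hypothetical). It rewrites the buffer in place and returns the number of replacements, so the caller can decide whether a warning is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: replaces every NUL byte in a len-byte buffer with
 * a space, in place. Returns the number of bytes that were replaced. */
size_t replace_nul_with_space(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t replaced = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0') {
            buf[i] = ' ';
            replaced++;
        }
    }
    return replaced;
}
```

Unlike truncating at the first null, this keeps bytes that follow a run of nulls, such as the 0x0D seen in copy.asm. Other control characters are left untouched and would still need separate handling.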
Andrius
Olivier Sessink
2014-04-02 07:25:36 UTC
Post by Andrius
Yeah, it has 0x0D after a run of 0x00 bytes, which is why it gives the
error. I tried opening it with TextEdit (Mac), Notepad and WordPad
(Windows); in all cases it opens without problems. It seems these
extra null characters are the problem. I would just replace them with
spaces (and warn the user about the replacement...).
Especially combined with your previous suggestion: open the file
read-only, and require 'save as' to make it editable.

Olivier
Andrius
2014-04-02 20:13:10 UTC
Olivier,
I just committed a function for fallback conversion of corrupted strings. Could you review it? If it is OK, I will commit the rest of the code.
This conversion seems to work well with the corrupted files I sent earlier. I also have code that makes the document read-only if any character is replaced (not committed yet).
A side effect of this approach is that bf can now open *any* file, even binary ones :-)
Andrius
Olivier Sessink
2014-04-03 19:24:48 UTC
Post by Andrius
Olivier,
I just committed a function for fallback conversion of corrupted
strings. Could you review it? If it is OK, I will commit the rest of
the code.
This conversion seems to work well with the corrupted files I sent
earlier. I also have code that makes the document read-only if any
character is replaced (not committed yet).
A side effect of this approach is that bf can now open *any* file,
even binary ones :-)
Andrius
looks good!

Olivier
Andrius
2014-04-04 04:14:26 UTC
Olivier,
On second thought, I think this feature might be confusing without explanatory messages (files are opened in read-only mode and differ from the originals), so it is probably better to wait until 2.2.6 and commit it then. What do you think?
Andrius
Olivier Sessink
2014-04-04 08:46:26 UTC
Post by Andrius
Olivier,
On second thought, I think this feature might be confusing without
explanatory messages (files are opened in read-only mode and differ
from the originals), so it is probably better to wait until 2.2.6 and
commit it then. What do you think?
Andrius
Yes, you might be right about that.

Do you need to revert anything in the code? Or perhaps put the code
inside an #ifdef?

Olivier
Andrius
2014-04-04 08:58:59 UTC
No, I did not need to revert anything. Right now the behavior is the same as before: broken files are shown empty (I only added that they are read-only, so they cannot be overwritten with the Save button).
Andrius
Olivier Sessink
2014-04-01 10:39:50 UTC
Post by Andrius
Whether the conversion succeeds depends on two functions, g_convert()
and g_utf8_validate(). The first issue is with g_utf8_validate(). If
the string contains embedded null characters, it returns false even if
g_convert() produced no errors. This is why command.asm does not open:
the file ends with a run of 0x00 bytes, so g_utf8_validate() returns
false.
Another possibility for this specific issue:

We can pass a pointer to g_utf8_validate() through which it returns the
position of the first character that was not valid. If the remainder of
the string contains only NUL bytes, we can safely truncate the file (I
think). Is that true for all situations (that we can safely truncate)?
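The "only NUL bytes remain" condition can be sketched without GLib as follows. This is plain C under stated assumptions, not the code that was committed, and the helper name length_without_trailing_nuls is hypothetical. It strips nulls only at the very end of the buffer; a null that is followed by any other byte is left in place, so truncating to the returned length discards nothing but nulls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the buffer once any run of
 * NUL bytes at its end has been dropped. Embedded NULs that are followed
 * by other data are not touched. */
size_t length_without_trailing_nuls(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    while (len > 0 && buf[len - 1] == '\0')
        len--;
    return len;
}
```

If the returned length still contains a null byte, the nulls are not purely trailing and truncation at the first invalid position would lose real data.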

Olivier