Discussion:
UTF-8 corruption bug with diff -y
Sjur Nørstebø Moshagen
2018-11-08 08:47:57 UTC
Hello

Using diff -y on text files with long lines risks corrupting UTF-8 encoded text at the default output width of 130 columns, if a multibyte character happens to fall on the border of that limit. The diff command truncates the output line in the middle of the byte sequence, producing malformed UTF-8 text.
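
The failure mode can be illustrated outside of diff. The following is only a minimal sketch, assuming the truncation happens at a fixed byte offset; it is not diff's actual code, and the function names truncate_bytes, is_valid_utf8 and truncate_safely are made up for this illustration. It cuts one line from the test data at a byte limit and checks whether the result still decodes as UTF-8.

```python
# Illustration only: NOT diff's implementation. All function names here
# are hypothetical helpers invented for this sketch.

def truncate_bytes(text: str, limit: int) -> bytes:
    """Naive truncation: keep the first `limit` bytes of the UTF-8
    encoding, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_safely(text: str, limit: int) -> bytes:
    """Truncate to at most `limit` bytes, then drop an incomplete
    trailing byte sequence so the result stays well-formed."""
    cut = text.encode("utf-8")[:limit]
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


# One line from Input-text-1.txt. The character "š" is encoded as the
# two bytes 0xC5 0xA1 and occupies byte positions 3 and 4 (0-based).
line = '"iešguhtet" Pron Indef Acc'

for limit in (3, 4, 5):
    cut = truncate_bytes(line, limit)
    print(limit, cut, is_valid_utf8(cut))
```

A limit of 4 bytes ends the output on the lead byte 0xC5 without its continuation byte, which is the kind of malformed sequence described above. Limits of 3 and 5 happen to fall on character boundaries and stay valid, which is why the problem only shows up for some inputs.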

To reproduce:

diff -y Input-text-1.txt Input-text-2.txt

The bug can be worked around by setting the column width to an arbitrarily high number, as long as it is higher than the longest diff line produced:

diff -y -W 200 Input-text-1.txt Input-text-2.txt

The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached. The text (excluding --------) is also reproduced below in case the attachments are removed during e-mail transfer.

Regards,
Sjur Moshagen


Input-text-1.txt:
--------
"<ja>"
"ja" CC
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Attr
"iešguhtet" Pron Indef Gen
"<lágan>"
"lága" N Sem/Dummytag Ess
"lága" N Sem/Dummytag Sg Loc South Err/Orth
"lágan" A Sem/Hum Attr
"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
"lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
"lágan" A Sem/Hum Sg Nom
"láhka" N Sem/Rule Sg Loc South Err/Orth
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------

Input-text-2.txt:
--------
"<ja>"
"ja" CC
"<iešguđet lágan>"
"iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------
Paul Eggert
2018-11-12 20:04:43 UTC
Thanks, could you please send that email to bug-***@gnu.org? That's the
place for diffutils bugs these days.
