Discussion:
Bug of grep -E
iPack
2017-12-06 15:02:59 UTC
[***@urain39-pc ~]$ cat test
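https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg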

[***@urain39-pc ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z%_\.\-]+'
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg

[***@urain39-pc ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+'
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
Eric Blake
2017-12-06 15:32:57 UTC
Post by iPack
https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
It is bug ? or just my syntax error ?
Your syntax error.

In the C locale,

[0-9A-Za-z%_\.\-] matches digits, letters, %, _, \ (listed twice, but
the second listing is ignored), ., and -.

[0-9A-Za-z\-%_\.] matches digits, letters, the range of ASCII bytes
between \ and % (whoops - in ASCII, \ is 92 but % is 37 - you have a
backwards range, so that portion of the range expression matches nothing
at all), then _, \, and . Hence, '-' is not one of the characters
matched, and grep's output is shorter. POSIX permits the implementation
you saw; it also permits an implementation that refuses to grep at all
by declaring your regex invalid because of the backwards range.
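
The byte values behind that backwards range can be checked from the shell. This is a minimal sketch relying on the POSIX printf rule that an argument starting with a quote character is converted to the numeric value of the character that follows it; the variable name `codes` is only for illustration.

```shell
# Print the ASCII code points of backslash and percent.
# A leading single quote in the argument makes printf emit the
# numeric value of the next character instead of parsing a number.
codes=$(printf '%d %d' "'\\" "'%")
echo "$codes"   # 92 37
```

Since 92 is greater than 37, a range written from \ to % ends before it starts.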

In non-C locales, use of - in a [] expression that is not either the
first or the last member of the set is implementation-defined, and all
bets are off on what it matches (lately, GNU tools have been moving
towards rational-range-interpretation, which means treating the range as
the same bytes as it would match in the C locale; but other
implementations, or even older versions of GNU tools, tried to get fancy
and match any character that would collate between the two endpoints,
which gets weird fast).
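
One way to sidestep locale-dependent ranges is to pin the locale for the single grep invocation. A sketch, assuming the input is plain ASCII; the sample input and the variable name `matched` are made up for illustration.

```shell
# In the C locale, [a-z] is the byte range 0x61..0x7A,
# so the uppercase B is not matched.
matched=$(printf 'a\nB\n' | LC_ALL=C grep '[a-z]')
echo "$matched"   # a
```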

It _looks_ like you were trying to use \- and \. as escape characters.
But inside [] (at least, the Extended Regular Expression syntax of 'grep
-E' as defined by POSIX), \ is not an escape character; and nothing
needs escaping (there are only special rules about where ], ^, and - are
handled). Yes, there are other flavors of regex engines (perl, for
example) where \ DOES act as an escape even inside []. Which is why it
is essential that you know the quirks of each regex engine you are
targeting.
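
For the pattern in question, a bracket expression that avoids backslashes and keeps '-' as the last member might look like the following sketch. The shortened URL and the variable names `url` and `out` are stand-ins for illustration, not the original test file.

```shell
# Shortened stand-in for the URL in the test file.
url='https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20tears.jpg'

# '.' needs no backslash inside [], and '-' is literal when it is
# the last member of the bracket expression.
out=$(printf '%s\n' "$url" | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z%_.-]+')
echo "$out"
```

Placing '-' first instead, as in [-0-9A-Za-z%_.], should behave the same way.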

By the way, bug-gnu-utils is no longer the preferred bug reporting
address for grep; it means your version of grep is probably quite
outdated. These days, 'grep --help' suggests bug-***@gnu.org for
reporting bugs.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
John Cowan
2017-12-06 15:45:25 UTC
Backslash is not an escape in character classes. The only way to get a -
in a character class is to make sure it is at the end or the beginning. So
in your second pattern, the sequence \-% means "every character from
backslash to percent", which is no characters at all.
Post by iPack
https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
It is bug ? or just my syntax error ?
Bob Proulx
2017-12-06 18:31:52 UTC
Post by iPack
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
It is bug ? or just my syntax error ?
Recent versions of grep (at least on my Debian system) report this as
an invalid expression for the reasons already noted by others. Here
using your verbatim pattern (with the invalid \- \. escaping):

$ grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+' /dev/null
grep: Invalid range end

Updating would add this improved validation reporting capability. :-)
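
To see which behaviour an installed grep has, the exit status can be inspected. A sketch: POSIX defines status 1 as "no lines selected" and a status greater than 1 as an error, so a grep that tolerates the backwards range should give 1 on empty input, while one that rejects the pattern should give a larger value. The variable name `rc` is illustrative.

```shell
# Run the original pattern against empty input and capture the exit status.
# stderr is discarded so that only the status distinguishes the two behaviours.
rc=0
grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+' /dev/null 2>/dev/null || rc=$?

if [ "$rc" -gt 1 ]; then
    echo "pattern rejected (status $rc)"
else
    echo "pattern accepted (status $rc)"
fi
```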

Bob
