Discussion:
Confusing/unclear documentation of Sed back references
Peter Kehl
2014-11-26 06:05:40 UTC
Dear GNU sed maintainers,

I believe that either Sed documentation is unclear, or its implementation
has a defect, regarding back references in -r mode.
https://www.gnu.org/software/sed/manual/sed.html#index-Backreferences_002c-in-regular-expressions-103
reads that the match has to be between \( and its matching \). However,
that only works if you don't use -r mode:

echo HELLO | sed "s/\(HELLO\)/She said:\1/"
She said:HELLO

If you use -r, then
echo HELLO | sed *-r* "s/*\(*HELLO*\)*/She said:\1"
sed: -e expression #1, char 23: unterminated `s' command

The correct way with -r is to have the match between ( and ), not between
\( and \):
echo HELLO | sed *-r* "s/*(*HELLO*)*/She said:\1/"
She said:HELLO

That may be documented somewhere in your manual, but it's not documented at
the section above. This is highly confusing and counter-productive. Please
update the manual or fix the implementation.

*sed --version*
sed (GNU sed) 4.2.2
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
and Paolo Bonzini.
GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-***@gnu.org>.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.

*bash --version*
GNU bash, version 4.2.53(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.

*Fedora 20x64:*
uname -a
Linux localhost.localdomain 3.15.10-200.fc20.x86_64 #1 SMP Thu Aug 14
15:39:24 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Best regards,

-Peter Kehl
Bob Proulx
2014-11-26 19:29:07 UTC
Post by Peter Kehl
Dear GNU sed maintainers,
I am not one of the sed maintainers. But I found your report
confusing and so am commenting upon it.
Post by Peter Kehl
If you use -r, then
echo HELLO | sed *-r* "s/*\(*HELLO*\)*/She said:\1"
sed: -e expression #1, char 23: unterminated `s' command
Look at that error. There are multiple problems. First, the trailing
slash is missing: the closing '/' at the end of the substitute command
is not there. That is what the error message is reporting.

You have many extra '*' characters in that command that should not be
there. As I am sure you know the '*' is an RE modifier that causes
the previous item to match zero or more times. It appears to me that
you have sent html to a mailing list. Never send html to mailing
lists. It looks like this list converts html to plain text and the
stars are the result of a broken conversion. Yet another reason never
to send html to a mailing list.

Broken test cases like that are terribly confusing. Removing the many
extra star characters and fixing the trailing slash and quoting is:

$ echo HELLO | sed "s/\(HELLO\)/She said:\1/"
She said:HELLO

$ echo HELLO | sed -r "s/(HELLO)/She said:\1/"
She said:HELLO

The \(...\) grouping is a BRE (basic regular expression) construct. When
using ERE (extended regular expression) the parens are not quoted
because ERE syntax uses them directly unquoted. When changing regular
expression engines from BRE to ERE with -r the ERE syntax should be
used. It is often easy to use 'grep' and 'grep -E' to double check
differences in regular expression engines. It is a different tool
using the same RE engines and can provide another input.
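
As a rough illustration of that grep cross-check, the sketch below runs
the same backreference test through both engines. The two-line sample
input is made up for the example, and it assumes a grep that accepts
backreferences with -E (GNU grep does, although POSIX does not require
them in EREs).

```shell
# Hypothetical sample input: one line with a repeated word, one without.
input=$(printf 'HELLOHELLO\nHELLO\n')

# BRE (plain grep): the group is written \( ... \)
bre_out=$(printf '%s\n' "$input" | grep '\(HELLO\)\1')

# ERE (grep -E): the group is written ( ... ); the backreference is still \1
ere_out=$(printf '%s\n' "$input" | grep -E '(HELLO)\1')

echo "BRE matched: $bre_out"
echo "ERE matched: $ere_out"
```

If both commands select the same line, the two spellings are equivalent
and the same translation should carry over to sed and sed -r.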

The invocation section documents the -r option.

https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed

-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].

The "Extended regular expressions" link points to the extended regular
expression section:

https://www.gnu.org/software/sed/manual/sed.html#Extended-regexps

\(abc*\)\1
becomes ‘(abc*)\1’ when using extended regular
expressions. Backreferences must still be escaped when using
extended regular expressions.

Hope that helps,
Bob
Eric Blake
2014-11-26 19:40:21 UTC
Post by Bob Proulx
The invocation section documents the -r option.
https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].
This is no longer entirely true. POSIX has proposed standardizing the
-E synonym of -r, which means that it IS portable to use 'sed -E' to get
extended regular expressions in modern sed implementations, and that it
is no longer a GNU-only extension:
http://austingroupbugs.net/view.php?id=528
(However, it is still true that the spelling 'sed -r' is still a GNU
extension, and you should get used to 'sed -E' instead)
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Bob Proulx
2014-11-26 19:54:57 UTC
Post by Eric Blake
Post by Bob Proulx
The invocation section documents the -r option.
https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].
This is no longer entirely true. POSIX has proposed standardizing the
-E synonym of -r, which means that it IS portable to use 'sed -E' to get
extended regular expressions in modern sed implementations, and that it
http://austingroupbugs.net/view.php?id=528
(However, it is still true that the spelling 'sed -r' is still a GNU
extension, and you should get used to 'sed -E' instead)
I don't see how you can say that isn't entirely true. As I read
things the -E is still a proposal. At this time no sed -E option yet
exists in GNU sed.

https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed

And that doesn't even mention the traditional Unix systems. Not even
in the current standards docs.

http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html

It certainly can't be considered portable. Not even in bleeding edge
systems.

Bob
Eric Blake
2014-11-26 20:43:24 UTC
Post by Bob Proulx
Post by Eric Blake
This is no longer entirely true. POSIX has proposed standardizing the
-E synonym of -r, which means that it IS portable to use 'sed -E' to get
extended regular expressions in modern sed implementations, and that it
I don't see how you can see that isn't entirely true. As I read
things the -E is still a proposal. At this time no sed -E option yet
exists in GNU sed.
It is documented in sed.git:

$ ./sed/sed --help | grep -A1 -- -E
-E, -r, --regexp-extended
use extended regular expressions in the script
(for portability use POSIX -E).
-s, --separate

and exists (albeit undocumented) in older sed:

$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo abc | sed -E 's/(b)/B/'
aBc

Hmm - that means we haven't had a sed release in quite a while; 4.2.2
came out in 2012. Maybe this thread will spur a release.
Post by Bob Proulx
It certainly can't be considered portable. Not even in bleeding edge
systems.
I agree that it is not portable to older systems, but DOES work on
existing GNU and BSD sed implementations (even if it is undocumented in
GNU sed).
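
For scripts that have to run on systems where the flag is in doubt, one
possible approach is to probe for it at run time. This is only a sketch:
the variable names ere_flag and result are invented for the example, and
it assumes that an unsupported option makes sed exit with a non-zero
status.

```shell
# Probe which extended-regex flag this sed accepts, preferring -E.
if echo abc | sed -E 's/(b)/B/' >/dev/null 2>&1; then
    ere_flag=-E
elif echo abc | sed -r 's/(b)/B/' >/dev/null 2>&1; then
    ere_flag=-r
else
    ere_flag=
fi

if [ -n "$ere_flag" ]; then
    # ERE syntax: unescaped parentheses form the group
    result=$(echo HELLO | sed "$ere_flag" 's/(HELLO)/She said:\1/')
else
    # Fall back to BRE syntax, which every sed should accept
    result=$(echo HELLO | sed 's/\(HELLO\)/She said:\1/')
fi

echo "$result"
```

The BRE fallback keeps the script working on a sed that has neither flag.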
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Jim Meyering
2014-11-26 20:58:11 UTC
On Wed, Nov 26, 2014 at 12:43 PM, Eric Blake <***@redhat.com> wrote:
...
Post by Eric Blake
Hmm - that means we haven't had a sed release in quite a while; 4.2.2
came out in 2012. Maybe this thread will spur a release.
It is already being planned by me. I hadn't said anything here until now,
since my free-software plate has been mostly full with grep.
Now that grep-2.21 is out, I will spend some time on sed and gzip.

So, yes, patches are most welcome.
Peter Kehl
2014-11-26 21:02:00 UTC
Hi Sed maintainers again,

to all of you and Bruce Korb: Thanks. However, the problem is still
there even when I have the third slash /:

in bash:
echo HELLO | sed -r "s/\(HELLO\)/She said:\1/"
sed: -e expression #1, char 24: invalid reference \1 on `s' command's RHS

I don't understand the differences between -r and -E etc. I'm just
questioning whether
https://www.gnu.org/software/sed/manual/sed.html#index-Backreferences_002c-in-regular-expressions-103
(section 3.5) is clear: The replacement can contain <skipped>
references <skipped> of the match which is contained between the nth
\( and its matching \).

Based on the above documentation section, one could assume that
prefixing the capturing parentheses with a backslash, \( ... \), still
applies in -r mode. Even if the person has used capturing by
parenthesis (..) with no backslash with other regex tools, she or he
could assume that \( ... \) still applies - since there's a lot of
variation in the world of regex tools, so she can expect this to be
yet another flavour.

Please update section 3.5 of the manual to state that capturing by
\(...\) doesn't work in -r mode, and the user should use common regex
capturing by (...).

Bob Proulx:

Those extra stars * were added by GNU mailing program, since my
original email was in HTML - I had made the relevant parts bold, and
@gnu.org transformed those into stars. Since simple HTML formatting
is commonly supported on forums etc. nowadays, I thought that
@gnu.org would support it, too....

Best regards,
-Peter Kehl
Bob Proulx
2014-11-26 21:14:12 UTC
Post by Eric Blake
Post by Bob Proulx
It certainly can't be considered portable. Not even in bleeding edge
systems.
I agree that it is not portable to older systems, but DOES work on
existing GNU and BSD sed implementations (even if it is undocumented in
GNU sed).
It only exists in GNU in unreleased software? That doesn't count.
After there is a GNU sed release with it then it exists. Until then
it is simply a proposal. Because it is possible for any particular
feature to be different or removed in the actual release.

Cool that it exists already in *BSD!

Bob
Bob Proulx
2014-11-26 21:38:24 UTC
Post by Peter Kehl
to all of you and Bruce Korb: Thanks. However, the problem is still
Unfortunately you forgot to change the syntax to extended regular
expression syntax. :-(
Post by Peter Kehl
echo HELLO | sed -r "s/\(HELLO\)/She said:\1/"
sed: -e expression #1, char 24: invalid reference \1 on `s' command's RHS
You are using -r but then you are NOT specifying the backreferences
correctly. In the above you are using \(...\) and that is for BREs
(basic regular expressions) and not EREs (extended regular
expressions). When you specify -r you are requesting to use EREs.
When you request EREs you must use ERE syntax.
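
A short sketch of why the failing command complains about \1 rather
than about the parentheses: in ERE syntax \( and \) stand for literal
parenthesis characters, so the pattern defines no group at all. The
input strings and variable names below are made up for the example.

```shell
# In ERE, \( and \) match literal parentheses; the bare ( ) form the group.
literal=$(echo '(HELLO)' | sed -E 's/\((HELLO)\)/She said:\1/')
echo "$literal"

# With only \( \) there is no group, so sed rejects the \1 reference.
if err=$(echo HELLO | sed -E 's/\(HELLO\)/She said:\1/' 2>&1); then
    status=0
else
    status=$?
fi
echo "exit status: $status"
echo "message: $err"
```

The first command strips the literal parentheses from the input while
capturing the word between them; the second fails because nothing was
captured for \1 to refer to.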
Post by Peter Kehl
I don't understand the differences between -r and -E etc. I'm just
questioning whether
https://www.gnu.org/software/sed/manual/sed.html#index-Backreferences_002c-in-regular-expressions-103
(section 3.5) is clear: The replacement can contain <skipped>
references <skipped> of the match which is contained between the nth
\( and its matching \).
That document is for the default BRE engine. Everything on that page
is true concerning the syntax of BREs. You leave that page when you
specify -r to use EREs. If you don't use -r then everything on that
page is absolutely true. If you add -r then you must also add
knowledge of what -r changes.
Post by Peter Kehl
Based on the above documentation section, one could assume that
prefixing the capturing parentheses with a backslash, \( ... \), still
applies in -r mode.
Why? This is a serious question. Please say a few words about why
you think \(...\) applies? This is the source of the confusion. If
we can get to the bottom of this point then we can fix something
fundamental.

Following through the documentation the first thing one should read
when looking at -r is the -r documentation.

https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed

-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].

Documentation is always in the mind of the reader. What changes would
you suggest in the above to make it clear to you that using -r uses a
different regular expression engine than when not using -r?

The "Extended regular expressions" link points to the extended regular
expression section:

https://www.gnu.org/software/sed/manual/sed.html#Extended-regexps

\(abc*\)\1
becomes ‘(abc*)\1’ when using extended regular
expressions. Backreferences must still be escaped when using
extended regular expressions.

I think everyone will agree that the documentation on that page is
sparse. What changes would you suggest in the above?

Compare this to the grep documentation on this same topic.

https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html#Basic-vs-Extended

Because extended regular expressions are an extension of basic regular
expressions they are not usually documented in isolation of basic
regular expressions. EREs are almost always documented in terms of
the differences from BREs. The changes are minor. All of the
documentation of BREs applies *except for the changes* that make EREs
different from BREs. The changes are small and it wouldn't make sense
to duplicate the entire BRE docs. And I think having duplicated
documentation on '^' for example would be more confusing.
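
To give a feel for how small the differences are, here is a sketch of
two of them besides grouping: interval braces and the '+' character.
The sample strings are invented for the example, and it assumes a sed
that accepts -E (GNU sed 4.2.2 does, per the transcript earlier in the
thread).

```shell
# Intervals: \{2\} in BRE, {2} in ERE. Same meaning, different spelling.
bre_interval=$(echo caat | sed 's/a\{2\}/o/')
ere_interval=$(echo caat | sed -E 's/a{2}/o/')

# '+' is an ordinary character in BRE but a quantifier in ERE,
# so the same script text matches different parts of the line.
bre_plus=$(echo 'a+b aab' | sed 's/a+b/X/')
ere_plus=$(echo 'a+b aab' | sed -E 's/a+b/X/')

echo "$bre_interval / $ere_interval"
echo "$bre_plus / $ere_plus"
```

Anchors such as '^' and '$' behave the same way in both modes, which is
why the manual describes EREs only by their differences.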
Post by Peter Kehl
Even if the person has used capturing by parenthesis (..) with no
backslash with other regex tools, she or he could assume that \(
... \) still applies - since there's a lot of variation in the world
of regex tools, so she can expect this to be yet another flavour.
Exactly! That is exactly why one using -r should be using ERE syntax.
As documented in the manual.
Post by Peter Kehl
Please update section 3.5 of the manual to state that capturing by
\(...\) doesn't work in -r mode, and the user should use common regex
capturing by (...).
Hmm... So you are suggesting that every section of the manual that
mentions regular expressions be split into two sections? One section
would document the default BRE syntax. Another split section would
document the -r ERE syntax? I think that would be tedious to maintain
and laborious to read.
Post by Peter Kehl
Those extra stars * were added by GNU mailing program, since my
original email was in HTML - I had made the relevant parts bold, and
@gnu.org transformed those into stars. Since simple HTML formatting
is commonly supported on forums etc. nowadays, I thought that
@gnu.org would support it, too....
But mailing lists are not web pages. In a web forum feel free to do
anything the web forum allows. But please no html on mailing lists.
If it is email then please use plain text.

Bob
Bruce Korb
2014-11-26 23:15:47 UTC
Post by Bob Proulx
The invocation section documents the -r option.
https://www.gnu.org/software/sed/manual/sed.html#Invoking-sed
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].
Can we fix the grammar too, please? i.e. "have *fewer* backslashes"
Thank you :)
Jim Meyering
2014-11-26 23:20:29 UTC
On Wed, Nov 26, 2014 at 3:15 PM, Bruce Korb <***@gmail.com> wrote:
...
Post by Bruce Korb
Post by Bob Proulx
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable. See [Extended regular expressions].
Can we fix the grammar too, please? i.e. "have *fewer* backslashes"
Thank you :)
Thanks. That grammar problem appears to have been fixed already:

$ g grep usually.\*backslashes
doc/sed-in.texi:usually have fewer backslashes.
doc/sed.texi:usually have fewer backslashes.
