Jay Taylor's notes

regex - AWK: Access captured group from line pattern - Stack Overflow

Original source (stackoverflow.com)
Tags: regular-expressions awk perl capturing-groups
Clipped on: 2020-01-23

Asked 9 years, 7 months ago
Viewed 155k times
216

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

asked Jun 2 '10 at 12:35
  • Sometimes (in simple cases) it's possible to adjust the field separator (FS) and pick what one would like to match with a $field. Preformatting the input could help too. – Krzysztof Jabłoński Jul 1 '15 at 17:06
  • 1
    There is a better answer on the duplicate question. – Samuel Edwin Ward Jul 8 '15 at 16:04
  • 2
    Samuel Edwin Ward: That's a nice answer too! But it also requires gawk (since it uses gensub). – rampion Jul 8 '15 at 17:39
  • 164

    That was a stroll down memory lane...

    I replaced awk by perl a long time ago.

    Apparently the AWK regular expression engine does not capture its groups.

    You might consider using something like:

    perl -n -e'/test(\d+)/ && print $1'

    The -n flag causes perl to loop over every line of input, as awk does.

    answered Jun 2 '10 at 12:50
    Apparently someone disagrees. This web page is from 2005: tek-tips.com/faqs.cfm?fid=5674 It confirms that you cannot reuse matched groups in awk. – Peter Tillemans Jun 2 '10 at 13:00
  • 3
    I prefer 'perl -n -p -e...' over awk for almost all use cases, since it is more flexible, more powerful and has a saner syntax in my opinion. – Peter Tillemans Jun 23 '11 at 18:39
  • 14
    gawk != awk. They're different tools and gawk isn't available by default in most places. – Oli Sep 4 '12 at 12:21
  • 5
    The OP specifically asked for an awk solution, so I don't think this is an answer. – Joppe Feb 22 '16 at 16:22
  • 5
    @Joppe you can't give an awk solution if there is no solution. In line 3 I explain that AWK does not support capturing groups and I gave an alternative, which the OP apparently appreciated because this answer was accepted. How could I better answer this question? – Peter Tillemans Mar 9 '16 at 7:54
  • 316

    With gawk, you can use the match function to capture parenthesized groups.

    gawk 'match($0, pattern, ary) {print ary[1]}' 

    example:

    echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 

    outputs cd.

    Note the specific use of gawk which implements the feature in question.

    For a portable alternative, you can achieve similar results with match() and substr().

    example:

    echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

    outputs cd.
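As a further sketch of the portable approach, the Perl pattern /test(\d+)/ from the earlier answer can be approximated with match() and substr() alone. The function name capture_digits and the sample input are made up for illustration. The 4 is simply the length of the literal prefix "test", so this only works when the text in front of the group has a known, fixed width.

```shell
# Hypothetical helper: print the digits that follow "test" on each matching line.
capture_digits() {
  awk '
    match($0, /test[0-9]+/) {
      # RSTART and RLENGTH describe the whole match, e.g. "test123".
      # Skip the 4-character literal prefix "test" to keep only the digits.
      print substr($0, RSTART + 4, RLENGTH - 4)
    }'
}

# Only the first line matches, so this prints a single line: 123
printf 'foo test123 bar\nno match here\n' | capture_digits
```

Lines without a match produce no output, because match() returns 0 and the action is skipped.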

    Yes, the gxxx variants have lots of additional GNU goodness and power. – Peter Tillemans Jun 23 '11 at 18:33
    30

    This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

    Definition

    Add this to your .bash_profile etc.

    function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

    Usage

    Capture regex for each line in file

    $ cat filename | regex '.*'

    Capture 1st regex capture group for each line in file

    $ cat filename | regex '(.*)' 1
    answered Dec 29 '12 at 20:32
    How is it different from using grep -o? – bfontaine Mar 28 '17 at 14:38
  • @bfontaine Could grep -o output captured groups? – Olle Härstedt Mar 7 '18 at 15:29
  • 1
    @OlleHärstedt No it couldn’t. It only covers your use-case when you don’t have capture-groups. In that case it gets ugly with chained grep -o's. – bfontaine Mar 7 '18 at 17:16
  • 14

    You can use GNU awk:

    $ cat hta
    RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
    RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]
    
    $ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
    http://www.mysite.net/
    answered Nov 28 '12 at 3:51
    +1. Also, with any awk: awk 'match($0, /.*(http.*?)\$/) { print substr($0,RSTART,RLENGTH) }' – Ed Morton Nov 28 '12 at 4:43
  • 5
  • 1
    Ed Morton: that deserves a top-level answer I'd say. edit: uhm... that prints RewriteRule (.*) http://www.mysite.net/$ for me, which is more than the subgroup. – rampion Nov 29 '12 at 13:02
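One possible way to get just the URL with any POSIX awk, assuming the part you want can be described without its surrounding context: match only the URL itself and stop at the first literal $. The function name extract_url is hypothetical, and the input mirrors the hta file shown in the answer.

```shell
# Hypothetical helper: print the text from "http" up to (not including) the first "$".
extract_url() {
  # Inside a bracket expression, $ is a literal character, so [^$]* means
  # "any run of characters that are not a dollar sign".
  awk 'match($0, /http[^$]*/) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' \
  'RewriteCond %{HTTP_HOST} !^www\.mysite\.net$' \
  'RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]' | extract_url
# prints: http://www.mysite.net/
```

The first input line is skipped because it contains only the uppercase HTTP, and awk regular expressions are case-sensitive by default.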
  • 2
  • 4

    You can simulate capturing in vanilla awk too, without extensions. It's not intuitive though:

    Step 1. Use gensub to surround matches with some character that doesn't appear in your string.
    Step 2. Use split against that character.
    Step 3. Every other element in the resulting array is your capture group.

    $ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
    ab|ad
    
    answered Mar 21 '12 at 1:58
    I'm almost certain that gensub is a gawk specific function. What do you get from your awk if you type awk --version ;-?). Good luck to all. – shellter Apr 13 '12 at 5:28
  • 6
    I'm fully certain that gensub is a gawk-ism, though BusyBox awk also has it. This answer could also be implemented using gsub, though: echo 'ab cb ad' | awk '{gsub(/a./,SUBSEP"&"SUBSEP);split($0,cap,SUBSEP);print cap[2]"|"cap[4]}' – dubiousjim Apr 19 '12 at 1:05
  • 2
    gensub() is a gawk extension; gawk's manual clearly says so. Other awk variants may also implement it, but it is still not POSIX. Try gawk --posix '{gsub(...)}' and it will complain – MestreLion Apr 21 '12 at 5:19
  • 2
    @MestreLion, you mean it will complain for gawk --posix '{gensub(...)}'. – dubiousjim Apr 24 '12 at 0:08
  • 1
    Apart from being wrong about POSIX awk having the gensub function, your example applies only to a very limited scenario: the whole pattern is grouped. It can't handle something like matching all key=(value) when I want to extract only the value parts. – Meow Sep 24 '15 at 13:24
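For the key=(value) case raised in the last comment, a minimal sketch using only POSIX awk is to match each whole key=value token and then narrow it down with index() and substr(). The function name extract_values, the token pattern, and the sample input are assumptions made for illustration, not something from the thread.

```shell
# Hypothetical helper: print the value part of every key=value token on each line.
extract_values() {
  awk '{
    rest = $0
    # Repeatedly find the next key=value token in the not-yet-scanned remainder.
    while (match(rest, /[A-Za-z_]+=[^ ]+/)) {
      token = substr(rest, RSTART, RLENGTH)
      # Emulate the capture group: keep everything after the first "=".
      print substr(token, index(token, "=") + 1)
      # Continue scanning just past the token that was consumed.
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

echo 'a=1 b=two c=3' | extract_values
# prints 1, two and 3 on separate lines
```

The loop works on a copy of the line, so $0 and the fields are left untouched.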
  • 2

    I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

    function regex { perl -n -e "/$1/ && printf \"%s\n\", "'$1'; }

    I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

    '([0-9]*)ms$'
    answered Aug 3 '16 at 19:16
    I prefer this solution, since you can see the parts of the group that delimit the capture, while also omitting them. However, could someone explain how this works? I can't get this perl syntax to work properly in BASH, because I don't understand it very well - especially the double/single-quote marks around $1 – Demis Dec 19 '17 at 18:39
  • It is not something I have done before or since, but looking back what it is doing is concatenating two strings, the first string being in double quotes (this first string contains embedded double quotes escaped with backslash) and the second string being in single quotes. Then the result of that concatenation is supplied as argument to perl -e. Also you need to know that the first $1 (the one within double quotes) is substituted with the first argument to the function, while the second $1 (the one within single quotes) is left untouched. See this example – wytten Dec 19 '17 at 23:01
  • I see, that's making a bit more sense now. So where in the perl command is the regex match/group capture definition? I see you wrote '([0-9]*)ms$' - is that supplied as an argument (and the string another argument)? And the output from perl -e is being inserted into bash's printf command then, to replace %s, is that right? Thanks, I am hoping to use this. – Demis Dec 20 '17 at 23:55
  • 1
    You pass a regular expression enclosed in single quotes as the sole argument to the regex bash function. Example – wytten Dec 21 '17 at 13:51
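The quoting described in the comments can be made visible by printing the string that the shell assembles, instead of handing it to perl. This is only an illustrative sketch; show_perl_program is a hypothetical name and is not part of the answer.

```shell
# Hypothetical helper: show the perl program that the regex function would build.
show_perl_program() {
  # The double-quoted piece is expanded by the shell, so $1 becomes the
  # function's first argument (the regular expression).
  # The single-quoted piece is left untouched, so perl sees a literal $1,
  # which it interprets as its first capture group.
  printf '%s\n' "/$1/ && printf \"%s\n\", "'$1'
}

show_perl_program '([0-9]*)ms$'
# prints: /([0-9]*)ms$/ && printf "%s\n", $1
```

Passing that string to perl -n -e is what the regex function does. For an input line such as "took 42ms", the program prints 42 and leaves out the "ms".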
  • 3
    This answer is hidden. This answer was deleted and converted to a comment 7 years ago by casperOne.

    GNU awk: accessing captured groups in replacement text

    A URL is not an answer. URLs may die. Please provide at least a brief explanation of the solution, or some usage example. – MestreLion Apr 21 '12 at 5:16