sed and Multi-Line Search and Replace

by filosofo. Posted on April 26, 2008 at 1:21 pm

I’ve been experimenting with getting regular expression patterns to match over multiple lines using sed. For example, one might want to change

<p>previous text</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>following text</p>

to

<p>previous text</p>
No title here
<p>following text</p>

sed processes its input one line at a time, so the most obvious way to match a pattern that extends over several lines is to concatenate all the lines in what is called sed’s “hold space,” then look for the pattern in that (long) string once the last line has been read. That’s what I do in the following lines:

#!/bin/sh
sed -n '
# on the first line, copy the pattern space to the hold space
1h
# on every other line, append the pattern space to the hold space
1!H
# on the last line ...
$ {
        # copy the hold space back into the pattern space
        g
        # do the search and replace (the slash in </h2> must be escaped)
        s/<h2.*<\/h2>/No title here/g
        # print
        p
}
' sample.php > sample-edited.php
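
To see what the hold space contains by the time the last line is reached, here is a minimal sketch that swaps the substitution for sed's `l` command, which prints the pattern space in an unambiguous form. The three-line input is made up for illustration.

```shell
#!/bin/sh
# Accumulate three short lines in the hold space, then dump the result with `l`,
# which shows embedded newlines as \n and marks the end of the buffer with $.
result=$(printf 'a\nb\nc\n' | sed -n '1h;1!H;${g;l;}')
printf '%s\n' "$result"
```

This prints `a\nb\nc$`: a single string with embedded newlines, which is what the substitution is then matched against.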

A more compact version:


sed -n '1h;1!H;${g;s/<h2.*<\/h2>/No title here/g;p;}' sample.php > sample-edited.php
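
As a quick check, the sketch below runs that one-liner on the sample markup from the top of the post. The scratch directory and the `result` variable are only there for the demonstration, and it assumes a sed in which `.` matches an embedded newline, as GNU sed does.

```shell
#!/bin/sh
# Recreate the sample input from the post in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sample.php" <<'EOF'
<p>previous text</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>following text</p>
EOF

# Collect every line in the hold space, then substitute once on the last line.
result=$(sed -n '1h;1!H;${g;s/<h2.*<\/h2>/No title here/g;p;}' "$workdir/sample.php")
printf '%s\n' "$result"
rm -rf "$workdir"
```

The output should be the three-line target shown earlier, with the whole h2 block replaced by "No title here".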
 

As far as I can tell, that’s the most efficient way to match general multi-line patterns. I initially thought it might be more efficient not to keep the complete input in the hold buffer, so I modified the algorithm to print out the string whenever a match is found:


#!/bin/sh
sed -n '1h
1!{
        # if the sought-after regex is not found, append the pattern space to the hold space
        /<h2.*<\/h2>/ !H
        # copy the hold space into the pattern space
        g
        # if the regex is found, then...
        /<h2.*<\/h2>/ {
                # do the search and replace
                s/<h2.*<\/h2>/No title here/g
                # print
                p
                # read the next line into the pattern space
                n
                # copy the pattern space into the hold space
                h
        }
        # copy the pattern space into the hold space
        h
}
# on the last line, print
$p
' sample.php > sample-edited.php
 

In the last example, sed concatenates lines only until it finds a match, and then it prints the accumulated text (after making the substitution). Then it starts over, concatenating the lines that follow.

However, that approach is usually massively inefficient: every new line triggers another regex test against the whole accumulated buffer, so the regex work grows roughly quadratically with the number of lines between matches instead of linearly. Unless a sed guru can point out a better way, I’m going to continue using the first approach.

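I’ve put the following script, which I call “sedml,” for sed multi-line, in my path. It takes the file to read, the sed command to run, and an optional output file; without the third argument it overwrites the input file.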

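#!/bin/sh
# usage: sedml file sed-command [outputfile]
if [ "$#" -lt 2 ]
then
        echo "usage: sedml file sed-command [outputfile]" >&2
        exit 1
fi

# overwrite the input file if there is no 3rd argument
if [ -z "$3" ]
then
        outputfile="$1"
else
        outputfile="$3"
fi
# scratch file; the process ID keeps its name distinct from the output file
tmpfile="$1.$$.tmp"
sed -n '
# on the first line, copy the pattern space to the hold space
1h
# on every other line, append the pattern space to the hold space
1!H
# on the last line ...
$ {
        # copy the hold space back into the pattern space
        g
        # do the search and replace
        '"$2"'
        # print
        p
}
' "$1" > "$tmpfile" &&
mv -f "$tmpfile" "$outputfile"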
 

So I can replace multi-line patterns in multiple files like so:

 grep -rl '<h2' * | while read -r i; do sedml "$i" 's/<h2.*<\/h2>/No title here/g' "$i.tmp"; done
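
One caveat worth sketching: sed's `.*` is greedy, so once the whole file is a single string, the pattern runs from the first `<h2` to the last `</h2>` and swallows anything in between. The two-heading input below is hypothetical, made up only to show the effect.

```shell
#!/bin/sh
# Hypothetical input with two separate headings and a paragraph between them.
input='<h2>
one
</h2>
<p>keep me</p>
<h2>
two
</h2>'

# The same slurp-and-substitute command, fed from stdin.
result=$(printf '%s\n' "$input" | sed -n '1h;1!H;${g;s/<h2.*<\/h2>/No title here/g;p;}')
printf '%s\n' "$result"
```

Here the output is the single line "No title here", and the paragraph between the headings is lost. For files that may contain several headings, a narrower pattern is probably needed.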
