sed and Multi-Line Search and Replace

by filosofo on Apr 26, 2008


I’ve been experimenting with getting regular expression patterns to match over multiple lines using sed. For example, one might want to change

<p>previous text</p>

<h2>

<a href="http://some-link.com">A title here</a>

</h2>

<p>following text</p>

to

<p>previous text</p>

No title here

<p>following text</p>

sed processes its input one line at a time, so the most obvious way to match a pattern that extends over several lines is to concatenate all the lines into what is called sed’s “hold space,” then look for the pattern in that (long) string. That’s what I do in the following lines:

#!/bin/sh
sed -n '
# if the first line copy the pattern to the hold buffer
1h
# if not the first line then append the pattern to the hold buffer
1!H
# if the last line then ...
$ {
        # copy from the hold to the pattern buffer
        g
        # do the search and replace (the "/" in "</h2>" must be escaped,
        # because "/" is also the delimiter of the s command)
        s/<h2.*<\/h2>/No title here/g
        # print
        p
}
' sample.php > sample-edited.php

A more compact version:

sed -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}' sample.php > sample-edited.php
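To try the one-liner without touching real files, here is a minimal sketch. The temporary file from `mktemp` and the `sample` and `result` variable names are assumptions for illustration; the file contents are the example from the top of this post. It assumes a sed that accepts `${;` (GNU sed does).

```shell
#!/bin/sh
# Build a throwaway sample file matching the example above (hypothetical path).
sample=$(mktemp)
printf '%s\n' \
    '<p>previous text</p>' \
    '<h2>' \
    '<a href="http://some-link.com">A title here</a>' \
    '</h2>' \
    '<p>following text</p>' > "$sample"

# Collect every line in the hold space, then substitute once on the last line.
result=$(sed -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This should print the three lines of the "after" example. Note that `.*` is greedy, so with several `<h2>` blocks in one file everything from the first `<h2` to the last `</h2>` is replaced as a single match.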

 

As far as I can tell, that’s the most efficient way to match general multi-line patterns. I initially thought it might be more efficient not to keep the complete input in the hold buffer, so I modified the algorithm to print out the string whenever a match is found:

#!/bin/sh
sed -n '1h
1!{
        # if the sought-after regex is not found, append the pattern space to hold space
        /<h2.*<\/h2>/ !H
        # copy hold space into pattern space
        g
        # if the regex is found, then...
        /<h2.*<\/h2>/ {
                # do the search and replace
                s/<h2.*<\/h2>/No title here/g
                # print
                p
                # read the next line into the pattern space
                n
                # copy the pattern space into the hold space
                h
        }
        # copy pattern buffer into hold buffer
        h
}
# if the last line then print
$p
' sample.php > sample-edited.php

In the last example, sed concatenates lines only until it finds a match, and then it prints the line (after substituting the text). Then, it starts again to concatenate the following lines.

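However, that approach is usually massively inefficient, because the regex work grows roughly quadratically: the pattern is re-tested against the ever-longer accumulated buffer on every input line until a match is found. Unless a sed guru can point out a better way, I’m going to continue using the first approach.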
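One possible alternative, sketched here on the assumption that GNU sed is available: its `-z` (`--null-data`) option splits input on NUL bytes instead of newlines, so a text file containing no NUL bytes arrives as a single record and the hold space is not needed. This is a GNU extension, not POSIX, and it still reads the whole file into memory. The `input` and `result` names are illustrative only.

```shell
#!/bin/sh
# Assumes GNU sed: -z (--null-data) splits input on NUL instead of newline,
# so a text file with no NUL bytes is read as a single record.
input='<p>previous text</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>following text</p>'

# "|" is used as the s delimiter so that "</h2>" needs no escaping.
result=$(printf '%s\n' "$input" | sed -z 's|<h2.*</h2>|No title here|g')
printf '%s\n' "$result"
```

Picking a delimiter other than `/` for the `s` command also sidesteps the escaping problem with the closing tag.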

I’ve put the following script, which I call “sedml,” for sed multi-line, in my bash path.

#!/bin/sh
# usage: sedml FILE SED_COMMAND [OUTPUT_FILE]
if [ "$#" -lt 2 ]
then
        echo "usage: $0 FILE SED_COMMAND [OUTPUT_FILE]" >&2
        exit 1
fi
# overwrite the input file if no 3rd argument (output file) is given
if [ -z "$3" ]
then
        outputfile="$1"
else
        outputfile="$3"
fi
sed -n '
# if the first line copy the pattern to the hold buffer
1h
# if not the first line then append the pattern to the hold buffer
1!H
# if the last line then ...
$ {
        # copy from the hold to the pattern buffer
        g
        # do the search and replace
        '"$2"'
        # print
        p
}
' "$1" > "$1.tmp" && mv -f "$1.tmp" "$outputfile"

So I can replace multi-line patterns in multiple files like so:

 grep -rl '<h2' * | while read i; do sedml "$i" "s/<h2.*<\/h2>/No title here/g" "$i.new"; done
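File names containing spaces or backslashes can trip up a plain `while read i` loop with unquoted variables. The following is a hedged sketch of a more defensive loop. So that it runs on its own, it uses `sedml_inplace`, a hypothetical stand-in function for the sedml script, and a throwaway directory from `mktemp -d`; neither is part of the original setup.

```shell
#!/bin/sh
# Hypothetical stand-in for the sedml script above, as a function, so this
# sketch runs on its own: applies one sed command to the whole file in place.
sedml_inplace() {
    sed -n '1h;1!H;${;g;'"$2"';p;}' "$1" > "$1.tmp" && mv -f "$1.tmp" "$1"
}

# Throwaway directory with one file whose name contains a space.
dir=$(mktemp -d)
printf '%s\n' '<p>a</p>' '<h2>' 'Title' '</h2>' '<p>b</p>' > "$dir/my page.php"
printf '%s\n' '<p>no heading here</p>' > "$dir/other.php"

# grep -l lists matching files; "IFS= read -r" keeps spaces and backslashes
# in each name intact, and quoting "$f" stops the shell from splitting it.
grep -rl '<h2' "$dir" | while IFS= read -r f; do
    sedml_inplace "$f" 's|<h2.*</h2>|No title here|g'
done

result=$(cat "$dir/my page.php")
printf '%s\n' "$result"
rm -rf "$dir"
```

File names that contain newlines would still break this loop; handling those needs NUL-separated lists, which is beyond this sketch.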

This entry was written by filosofo, posted on April 26, 2008 at 1:21 pm, filed under Computers and tagged bash, grep, Linux, regex, sed, shell.

10 Comments

David Runion wrote on April 30, 2009 at 11:45 am

Thank you for this. You are a philosopher and a poet.

-David
bence wrote on June 8, 2009 at 4:48 pm

Hi, filosofo

Maybe I misunderstood something, but I created the following bash script:

#!/bin/bash
SRCH="a\nb"
file="test.txt"
sed -i.bak -n '
# if the first line copy the pattern to the hold buffer
1h
# if not the first line then append the pattern to the hold buffer
1!H
# if the last line then ...
$ {
# copy from the hold to the pattern buffer
g
# do the search and replace
'"$SRCH"'
# print
p
}
' $file;

…and it cannot find the pattern in test.txt:

a
b

any tips?
anku wrote on June 19, 2009 at 6:00 pm

>> # do the search and replace
>> '"$SRCH"'

If you want to search and replace, do something like this:
s/'"$SRCH"'/xxx/
muhsayd wrote on August 19, 2009 at 10:02 pm

hi all,

I’ve the following issue.
some virus injecting php and html files with some malicious code,
it’s something like:

<iframe ……..
………….
………

till i could stop this virus, i want to automate some script to match and delete this pattern or replace it with an empty string,

i tried your example but it seems to have something wrong,
could you help me please with this issue ?

thanks in advance.
bun wrote on October 9, 2009 at 8:41 am

$ things i notice:
1. (i’m not sure about this since i didn’t try your sc). it looks like there is a problem in the invocation line:
grep -rl '<h2' * | while read i; do sedml $i "s/<h2.*</h2>/No title here/g" $i.tmp; done

- in the arg for the sed ‘s’ cmd, isn’t that the / in </h2> will be taken by sed as the ‘s’ delimiter? yes, it can be simply changed to another char. but what bothers me (maybe it’s just me since i barely know regex) is that the sc user must be a bit familiar with regex & moreover with sed ‘s’ cmd

- also, the ‘s’ cmd regex will go across multiple since .* match longest pattern

2. as u said: for each input file, the hold buf contains all the complete input, meaning more text more mem consumed, worse for other app, more danger for sed to be killed

3. (i’m assuming you’re concern about efficiency since u gave 2nd try to the problem for it)

- for each file found by ‘grep’ containing at least 1 of the searched html tag (<h2>), it brings up a subsh which in turn brings up sed, meaning a pair of subsh & sed is started & ended for each input file.

- it execs grep in this sh while it also starts a subsh to perform ‘while read i…’. inside the while loop, the subsh will starts another subsh (due to #!/bin/sh), then the last created subsh starts sed, & repetition begins

- yes, how process is treated after it exits (particularly whether the core image is still in the mem / not) is kernel-dependent. but 1 thing for sure is that single invocation of an app to deal with multiple input files will always be more efficient than a single invocation of it for each input file, even for *nix where its philosophy of small tools requires low rsc cost for process creation

$ the idea basically:
sed -n '
/<h2/b proc
p; b

:proc
n
s//no title here/p
t
b proc
' INPUT_FILE…

$ some minimum enhancements i can think of:
1. for the searched html tag:
– it’s variable
– user doesn’t need to know about regex
– the tag can be specific by specifying its attribute

> ex:
input:
– only replace that particular h1, other h1 with diff title including ones that don’t have title isn’t affected. also that the input need not be full. the previous input can be like in the html code
– since all paired html tag looks like: start mark <TAG… and end mark </TAG…, user need only input the start mark, this also eliminate forcing user to know about regex

2. the replacement text can be any char including ones that are special in sed's eye: & \N, the user don't need to know about it

3. for sed to be invoked only once for multiple input files while retaining the ability to save the original file / editing in place & without manual tmp file, there is only 1 way: use -i[SUFFIX]

$ here's the sc, my apology for its ugliness, i don't know much about *nix:

#! /bin/bash

usage(){
echo "usage: $0 [i][b SUFFIX] TAG TEXT [FILES…]"
exit 1
}

while getopts ":ib:" OPT; do
case $OPT in
i) OPT_I=-i
unset SUFFIX
;;
b) OPT_I=-i
SUFFIX="$OPTARG"
;;
*) usage
;;
esac
done

shift $(($OPTIND-1))

# only TAG & TEXT must be present in invocation line
if(($# < 2)); then
usage
fi

[…]
# […], so that no need to give all tag’s attrs
if [ "${START:${#START}-1}" == ">" ]; then
START=${START:0:${#START}-1}
fi

# is there syntax / tool to do this cleanly? for substr based on char
END=${START%% *} #get only up to the 1st space
END=${END:0:1}/${END:1}

TEXT=$2
# reformat & and \ to be edible for sed 's' cmd
TEXT="${TEXT//\\/\\\\}"
TEXT="${TEXT/&/\&}"
#TEXT="$(echo "$TEXT" | sed -r 's/\\/\\\\/g;')"

shift 2

# preserve *nix philosophy of 'small tool' by expecting input of files from stdin & defaultly sending output to stdout
if(($#==0)); then
while read F; do
FILES="${FILES} $F"
done
else
FILES=$*
fi

sed -n ${OPT_I}$SUFFIX '
/'"$START"'/b proc
p
b

:proc
n
s|.*'"$END"'.*|'"$TEXT"'|p
#sc goes to eof if matching closing tag is found
t
b proc
' $FILES

$ i’m sure there’re lots more than can be done to it without breaking unix philosophy, such as its ability to change a tag at a particular nested lv, but i won’t know since i’m not a web dev
bun wrote on October 10, 2009 at 11:47 pm

$ gee, i was testing my last posted sc, bugs came out:
– it doesn’t ignore spacing within <> (fixed)
– it also modifies more specific tags if given a more general one of that tag, ex: given <h1> it also modifies <h1 …>

$ ‘f’ opt is added, if ‘f’ used, it performs fixed match for the given tag (<h1> only matches <h1>), while if no ‘f’ <h1> matches <h1 …>

$ it still ignores tag nesting (gonna work on this one)

#! /bin/bash

set -o xtrace

usage(){
echo "usage: $0 [i][b SUFFIX][f] TAG TEXT [FILES...]"
exit 1
}

FIXED=0

while getopts ":ib:f" OPT; do
case $OPT in
i) OPT_I=-i
unset SUFFIX
;;
b) OPT_I=-i
SUFFIX="$OPTARG"
;;
f)
FIXED=1
;;
*) usage
;;
esac
done

shift $(($OPTIND-1))

# only TAG & TEXT must be present in invocation line
if(($# < 2)); then
usage
fi

# there must be no leading / trailing space around the tag given in cmd-line
START=$1

if [ "${START:0:1}" == "<" ]; then
START=${START:1}
fi
if [ "${START:${#START}-1}" == ">" ]; then
START=${START:0:${#START}-1}
fi

# is there syntax / tool to do this cleanly? for substr based on char
END=/${START%% *} #get only up to the 1st space

TEXT=$2
# reformat & and \ to be edible for sed 's' cmd
TEXT="${TEXT//\\/\\\\}"
TEXT="${TEXT/&/\&}"

shift 2

# preserve *nix philosophy of 'small tool' by expecting input of files from stdin & defaultly sending output to stdout
if(($#==0)); then
while read F; do
FILES="${FILES} $F"
done
else
FILES=$*
fi

if(($FIXED==0)); then
START=$START'.*'
else
START=$START' *'
fi

echo "$START"

sed -n ${OPT_I}$SUFFIX '
/[ *'"$START"']/b proc
p;b

:proc
n
s||'"$TEXT"'|p
t
b proc
' $FILES

$ btw, i was actually found this page when googled “multiline sed”. i was learning sed & got stuck with gnu ext of multi-line M for ‘s’ cmd / pattern matching. gnu doc only defines it, i need examples, anyone?
bun wrote on October 10, 2009 at 11:52 pm

$ sorry, the sed part didn’t come right

sed -n ${OPT_I}$SUFFIX '
/< *'"$START"'>/b proc
p
b

:proc
n
s|< *'"$END"' *>|'"$TEXT"'|p
t
b proc
' $FILES
Frank wrote on February 22, 2010 at 2:55 pm

When trying to replicate this example by running the command suggested:
sed -n '1h;1!H;${;g;s/<h2.*</h2>/No title here/g;p;}' sample.php

I get the following error:
sed: -e expression #1, char 26: unknown option to `s'
Frank wrote on February 22, 2010 at 3:04 pm

Never mind. I figured it out. Although I can’t show the final code because I can’t figure out how to escape all the chars. Basically, you have to escape the slash on the closing h2 tag with a backslash.

3 Trackbacks

Velo, Rapido: Going Places » Blog Archive » Grumble Grumble Grumble sed Grumble Grumble on September 4, 2008 at 2:15 pm

[…] if I wanted to wipe everything above that and substitute some include script? I’d use sed, […]
LDIF-Dateien zur Konfiguration von OpenLDAP bequem erzeugen « Abraxas on March 28, 2010 at 5:22 am

[…] Replacing text across multiple lines with sed […]
sed - testing multiline search and replace on May 10, 2010 at 1:11 pm

[…] Hi, I found a example how to do multiline search and replace. I try to make this working to my needs but with no success […]

Austin Matzko's Blog
