sed and Multi-Line Search and Replace

I’ve been experimenting with getting regular expression patterns to match over multiple lines using sed. For example, one might want to change

<p>previous text</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>following text</p>

to

<p>previous text</p>
No title here
<p>following text</p>

sed processes its input one line at a time, so the most obvious way to match a pattern that extends over several lines is to concatenate all the lines in what is called sed’s “hold space,” then look for the pattern in that (long) string. That’s what I do in the following lines:

#!/bin/sh
sed -n '
# if the first line copy the pattern to the hold buffer
1h
# if not the first line then append the pattern to the hold buffer
1!H
# if the last line then ...
$ {
        # copy from the hold to the pattern buffer
        g
        # do the search and replace
        s/<h2.*<\/h2>/No title here/g
        # print
        p
}
' sample.php > sample-edited.php

A more compact version:


sed -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}' sample.php > sample-edited.php
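
To see the one-liner at work, here is a small self-contained sketch. It writes the sample markup from the top of this post to a temporary file, runs the same hold-space command (with the slash in </h2> escaped), and keeps the output in a variable. The temporary file and the variable names are my own, for illustration only.

```shell
#!/bin/sh
# Illustrative only: the temp file and variable names are not part of the original script.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>previous text</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>following text</p>
EOF

# Collect every line in the hold space, then substitute once on the last line.
result=$(sed -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

The three <h2> lines collapse into the single line "No title here", and the two <p> lines pass through unchanged.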
 

As far as I can tell, that’s the most efficient way to match general multi-line patterns. I initially thought it might be more efficient not to keep the complete input in the hold buffer, so I modified the algorithm to print out the string whenever a match is found:


#!/bin/sh
sed -n '1h
1!{
        # if the sought-after regex is not found, append the pattern space to hold space
        /<h2.*<\/h2>/ !H
        # copy hold space into pattern space
        g
        # if the regex is found, then...
        /<h2.*<\/h2>/ {
                # the regular expression
                s/<h2.*<\/h2>/No title here/g
                # print
                p
                # read the next line into the pattern space
                n
                # copy the pattern space into the hold space
                h
        }
        # copy pattern buffer into hold buffer
        h
}
# if the last line then print
$p
' sample.php > sample-edited.php
 

In the last example, sed concatenates lines only until it finds a match, and then it prints the line (after substituting the text). Then, it starts again to concatenate the following lines.

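However, that approach is usually far less efficient: the regex is re-run against the growing buffer on every input line, so the work grows roughly quadratically with the number of lines accumulated between matches. Unless a sed guru can point out a better way, I’m going to continue using the first approach.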
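
One possible alternative, offered as a sketch rather than a definitive answer: if the opening and closing tags always sit on different lines, an address range lets sed handle each block as it streams, without buffering the whole file. This assumes whole lines can be dropped (anything else on the <h2 or </h2> lines is lost), so it solves a narrower problem than general multi-line matching. The sample data and variable names are made up for illustration.

```shell
#!/bin/sh
# Illustrative input with two separate <h2> blocks.
input=$(mktemp)
cat > "$input" <<'EOF'
<p>one</p>
<h2>
<a href="http://some-link.com">A title here</a>
</h2>
<p>two</p>
<h2>
Another title
</h2>
<p>three</p>
EOF

# Within each <h2 ... </h2> range, delete every line except the closing one,
# then replace that closing line with the new text.
ranged=$(sed '/<h2/,/<\/h2>/{
/<\/h2>/!d
s/.*/No title here/
}' "$input")
rm -f "$input"

printf '%s\n' "$ranged"
```

Because each block is handled on its own, both headings are replaced separately. With the hold-space version, the greedy .* would instead match everything from the first <h2 to the last </h2> and swallow the paragraph in between.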

I’ve put the following script, which I call “sedml,” for sed multi-line, in my bash path.

#!/bin/sh
if [ "$#" -lt 2 ]
then
        echo "usage: $0 FILE SED_COMMAND [OUTPUT_FILE]" >&2
        exit 1
fi

# change the input file if no 3rd argument
if [ -z "$3" ]
then
        outputfile="$1"
else
        outputfile="$3"
fi
sed -n '
# if the first line copy the pattern to the hold buffer
1h
# if not the first line then append the pattern to the hold buffer
1!H
# if the last line then ...
$ {
        # copy from the hold to the pattern buffer
        g
        # do the search and replace
        '"$2"'
        # print
        p
}
' "$1" > "$1.tmp"
mv -f "$1.tmp" "$outputfile"
 

So I can replace multi-line patterns in multiple files like so:

 grep -rl '<h2' * | while read i; do sedml "$i" "s/<h2.*<\/h2>/No title here/g" "$i.tmp"; done
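
A variation on that loop, as a hedged sketch: it passes all matching files to a single sed run and edits them in place, keeping backups. It assumes GNU grep (-Z), an xargs that supports -0, and GNU sed (-i with a suffix), so it is not portable to every system. The directory and file names are invented for the demonstration.

```shell
#!/bin/sh
# Sketch only: requires GNU grep (-Z), xargs -0, and GNU sed (-i).
dir=$(mktemp -d)
printf '<h2>\nTitle\n</h2>\n<p>text</p>\n' > "$dir/first page.php"
printf '<p>no heading here</p>\n' > "$dir/second.php"

# grep lists matching files NUL-separated; xargs passes them all to one sed run.
# With -i, GNU sed treats each file separately, so 1 and $ apply per file.
grep -rlZ '<h2' "$dir" |
    xargs -0 sed -i.bak -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}'

edited=$(cat "$dir/first page.php")
untouched=$(cat "$dir/second.php")
backup_exists=no
[ -f "$dir/first page.php.bak" ] && backup_exists=yes
rm -rf "$dir"

printf '%s\n' "$edited"
```

The NUL-separated file list keeps names with spaces intact, and the .bak suffix leaves the original of each edited file next to it.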

10 Comments

  1. Thank you for this. You are a philosopher and a poet.

    -David

  2. Hi, filosofo

    Maybe I misunderstood something, but I created the following bash script:

    #!/bin/bash
    SRCH="a\nb"
    file="test.txt"
    sed -i.bak -n '
    # if the first line copy the pattern to the hold buffer
    1h
    # if not the first line then append the pattern to the hold buffer
    1!H
    # if the last line then ...
    $ {
    # copy from the hold to the pattern buffer
    g
    # do the search and replace
    '"$SRCH"'
    # print
    p
    }
    ' $file;

    …and it cannot find the pattern in test.txt:

    a
    b

    any tips?

  4. >> # do the search and replace
    >> '"$SRCH"'

    If you want to search and replace, do something like this:
    s/'"$SRCH"'/xxx/

  5. hi all,

    I’ve the following issue.
    some virus injecting php and html files with some malicious code,
    it’s something like:

    <iframe ……..
    ………….
    ………

    till i could stop this virus, i want to automate some script to match and delete this pattern or replace it with some empty string,

    i tried your example but it seems to have something wrong,
    could you help me please with this issue ?

    thanks in advance.

  6. $ things i notice:
    1. (i’m not sure about this since i didn’t try your sc). it looks like there is a problem in the invocation line:
    grep -rl '<h2' * | while read i; do
    sedml $i "s/<h2.*</h2>/No title here/g" $i.tmp
    done

    - in the arg for the sed ‘s’ cmd, isn’t that the / in </h2> will be taken by sed as the ‘s’ delimiter? yes, it can be simply changed to another char. but what bothers me (maybe it’s just me since i barely know regex) is that the sc user must be a bit familiar with regex & moreover with sed ‘s’ cmd

    - also, the ‘s’ cmd regex will go across multiple since .* match longest pattern

    2. as u said: for each input file, the hold buf contains all the complete input, meaning more text more mem consumed, worse for other app, more danger for sed to be killed

    3. (i’m assuming you’re concern about efficiency since u gave 2nd try to the problem for it)

    - for each file found by ‘grep’ containing at least 1 of the searched html tag (<h2>), it brings up a subsh which in turn brings up sed, meaning a pair of subsh & sed is started & ended for each input file.

    - it execs grep in this sh while it also starts a subsh to perform ‘while read i…’. inside the while loop, the subsh will starts another subsh (due to #!/bin/sh), then the last created subsh starts sed, & repetition begins

    - yes, how process is treated after it exits (particularly whether the core image is still in the mem / not) is kernel-dependent. but 1 thing for sure is that single invocation of an app to deal with multiple input files will always be more efficient than a single invocation of it for each input file, even for *nix where its philosophy of small tools requires low rsc cost for process creation

    $ the idea basically:
    sed -n '
    /<h2/b proc
    p; b

    :proc
    n
    s//no title here/p
    t
    b proc
    ' INPUT_FILE…

    $ some minimum enhancements i can think of:
    1. for the searched html tag:
    – it’s variable
    – user doesn’t need to know about regex
    – the tag can be specific by specifying its attribute

    > ex:
    input:
    – only replace that particular h1, other h1 with diff title including ones that don’t have title isn’t affected. also that the input need not be full. the previous input can be like in the html code
    – since all paired html tag looks like: start mark <TAG… and end mark </TAG…, user need only input the start mark, this also eliminate forcing user to know about regex

    2. the replacement text can be any char including ones that are special in sed's eye: & \N, the user don't need to know about it

    3. for sed to be invoked only 1 once for multiple input files while retaining the ability to save the original file / editing in place & without manual tmp file, there is only 1 way: use -i[SUFFIX]

    $ here's the sc, my apology for its ugliness, i don't know much about *nix:

    #! /bin/bash

    usage(){
    echo "usage: $0 [i][b SUFFIX] TAG TEXT [FILES…]"
    exit 1
    }

    while getopts ":ib:" OPT; do
    case $OPT in
    i) OPT_I=-i
    unset SUFFIX
    ;;
    b) OPT_I=-i
    SUFFIX="$OPTARG"
    ;;
    *) usage
    ;;
    esac
    done

    shift $(($OPTIND-1))

    # only TAG & TEXT must be present in invocation line
    if(($#, so that no need to give all tag’s attrs
    if [ "${START:${#START}-1}" == ">" ]; then
    START=${START:0:${#START}-1}
    fi

    # is there syntax / tool to do this cleanly? for substr based on char
    END=${START%% *} #get only up to the 1st space
    END=${END:0:1}/${END:1}

    TEXT=$2
    # reformat & and \ to be edible for sed 's' cmd
    TEXT="${TEXT//\\/\\\\}"
    TEXT="${TEXT/&/\&}"
    #TEXT="$(echo "$TEXT" | sed -r 's/\\/\\\\/g;')"

    shift 2

    # preserve *nix philosophy of 'small tool' by expecting input of files from stdin & defaultly sending output to stdout
    if(($#==0)); then
    while read F; do
    FILES="${FILES} $F"
    done
    else
    FILES=$*
    fi

    sed -n ${OPT_I}$SUFFIX '
    /'"$START"'/b proc
    p
    b

    :proc
    n
    s|.*'"$END"'.*|'"$TEXT"'|p
    t #sc goes to eof if matching closing tag is found
    b proc
    ' $FILES

    $ i’m sure there’re lots more than can be done to it without breaking unix philosophy, such as its ability to change a tag at a particular nested lv, but i won’t know since i’m not a web dev

  7. $ gee, i was testing my last posted sc, bugs came out:
    – it doesn’t ignore spacing within <> (fixed)
    – it also modifies more specific tags if given a more general one of that tag, ex: given <h1> it also modifies <h1 …>

    $ ‘f’ opt is added, if ‘f’ used, it performs fixed match for the given tag (<h1> only matches <h1>), while if no ‘f’ <h1> matches <h1 …>

    $ it still ignores tag nesting (gonna work on this one)

    #! /bin/bash

    set -o xtrace

    usage(){
    echo "usage: $0 [i][b SUFFIX][f] TAG TEXT [FILES...]"
    exit 1
    }

    FIXED=0

    while getopts ":ib:f" OPT; do
    case $OPT in
    i) OPT_I=-i
    unset SUFFIX
    ;;
    b) OPT_I=-i
    SUFFIX="$OPTARG"
    ;;
    f)
    FIXED=1
    ;;
    *) usage
    ;;
    esac
    done

    shift $(($OPTIND-1))

    # only TAG & TEXT must be present in invocation line
    if(($# < 2)); then
    usage
    fi

    # there must be no leading / trailing space around the tag given in cmd-line
    START=$1

    if [ "${START:0:1}" == "<" ]; then
    START=${START:1}
    fi
    if [ "${START:${#START}-1}" == ">" ]; then
    START=${START:0:${#START}-1}
    fi

    # is there syntax / tool to do this cleanly? for substr based on char
    END=/${START%% *} #get only up to the 1st space

    TEXT=$2
    # reformat & and \ to be edible for sed 's' cmd
    TEXT="${TEXT//\\/\\\\}"
    TEXT="${TEXT/&/\&}"

    shift 2

    # preserve *nix philosophy of 'small tool' by expecting input of files from stdin & defaultly sending output to stdout
    if(($#==0)); then
    while read F; do
    FILES="${FILES} $F"
    done
    else
    FILES=$*
    fi

    if(($FIXED==0)); then
    START=$START'.*'
    else
    START=$START' *'
    fi

    echo "$START"

    sed -n ${OPT_I}$SUFFIX '
    /[ *'"$START"']/b proc
    p;b

    :proc
    n
    s||'"$TEXT"'|p
    t
    b proc
    ' $FILES

    $ btw, i was actually found this page when googled “multiline sed”. i was learning sed & got stuck with gnu ext of multi-line M for ‘s’ cmd / pattern matching. gnu doc only defines it, i need examples, anyone?

  8. $ sorry, the sed part didn’t come right

    sed -n ${OPT_I}$SUFFIX '
    /< *'"$START"'>/b proc
    p
    b

    :proc
    n
    s|< *'"$END"' *>|'"$TEXT"'|p
    t
    b proc
    ' $FILES

  9. When trying to replicate this example by running the command suggested:
    sed -n '1h;1!H;${;g;s/<h2.*</h2>/No title here/g;p;}' sample.php

    I get the following error:
    sed: -e expression #1, char 26: unknown option to `s'

  10. Never mind. I figured it out. Although I can’t show the final code because I can’t figure out how to escape all the chars. Basically, you have to escape the slash in the closing h2 tag.
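
The fix described in the last two comments presumably looks like one of the following. This is a sketch, not the commenter's actual code: either escape the slash in </h2>, or pick a different delimiter for the s command so the slash needs no escaping. The sample file and variable names are illustrative.

```shell
#!/bin/sh
# Illustrative sample page; not taken from the comments.
page=$(mktemp)
printf '<p>a</p>\n<h2>\nTitle\n</h2>\n<p>b</p>\n' > "$page"

# Option 1: escape the slash so sed does not read it as the s delimiter.
escaped=$(sed -n '1h;1!H;${;g;s/<h2.*<\/h2>/No title here/g;p;}' "$page")

# Option 2: use | as the delimiter; the slash is then an ordinary character.
piped=$(sed -n '1h;1!H;${;g;s|<h2.*</h2>|No title here|g;p;}' "$page")
rm -f "$page"

printf '%s\n' "$escaped"
```

Both forms produce the same output. The second is often easier to read when the pattern contains HTML closing tags or file paths.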
