sed and Regular Expressions

Introduction

SED is a stream editor for Linux, and it is a powerful tool for text processing in a Linux environment.

SED can be used in many different ways, such as:

Text substitution,
Selective printing of text files,
In-place editing of text files,
Non-interactive editing of text files, and many more.

SED follows a simple workflow: Read -> Execute -> Display

Read: SED reads a line from the input stream (file, pipe, or stdin) and stores it in an internal buffer called the pattern buffer.
Execute: All SED commands are applied sequentially on the pattern buffer. By default, SED commands are applied on all lines (globally) unless line addressing is specified.
Display: Send the (modified) contents to the output stream. After sending the data, the pattern buffer will be empty.
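
The three steps above can be seen with a two-line input. This is a minimal sketch; the sample text and the variable names default_out and quiet_out are made up for illustration.

```shell
# Default cycle: every line is read, edited, then printed automatically.
default_out=$(printf 'one\ntwo\n' | sed 's/one/ONE/')
printf '%s\n' "$default_out"

# With -n the Display step is suppressed; only an explicit p prints.
quiet_out=$(printf 'one\ntwo\n' | sed -n 's/one/ONE/p')
printf '%s\n' "$quiet_out"
```

The first command prints both lines, the second prints only the line that was changed.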

Points to Note

The pattern buffer is a private, in-memory, volatile storage area used by SED.
By default, all SED commands are applied to the pattern buffer, so the input file remains unchanged. GNU SED provides a way to modify the input file in place. We will explore this in later sections.
There is another memory area called the hold buffer, which is also a private, in-memory, volatile storage area. Data can be stored in the hold buffer for later retrieval. At the end of each cycle, SED clears the pattern buffer, but the contents of the hold buffer persist between cycles. However, SED commands cannot operate directly on the hold buffer, so SED provides commands that move data between the hold buffer and the pattern buffer.
Initially, both the pattern buffer and the hold buffer are empty.
If no input files are provided, SED reads from the standard input stream (stdin).
If no address or address range is provided, SED operates on every line.
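
To see that the hold buffer survives between cycles while the input file stays untouched, here is a small sketch that reverses a file. It assumes a temporary file created with mktemp and uses the G command, which appends the hold buffer to the pattern buffer; the variable names are only illustrative.

```shell
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

# 1!G : on every line except the first, append the hold buffer to the pattern buffer
# h   : copy the pattern buffer into the hold buffer (it persists into the next cycle)
# $p  : on the last line, print what has been accumulated
reversed=$(sed -n '1!G;h;$p' "$tmp")
printf '%s\n' "$reversed"

# The input file itself has not been modified.
unchanged=$(cat "$tmp")
rm -f "$tmp"
```

Each cycle stacks the new line on top of the lines saved so far, so the last line ends up first.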

Common Options

Several of the options listed below are GNU specific and may not be supported by other variants of SED.

-n, --quiet, --silent: Suppress the default printing of the pattern buffer.
-e script: script is an editing command. This option can be repeated to specify multiple commands.
-f script_file: script_file is a file containing editing commands.
--follow-symlinks: When editing files in place, follow symbolic links and edit the files they point to.
-i[SUFFIX], --in-place[=SUFFIX]: Edit the file in place. If SUFFIX is provided, SED keeps a backup of the original file under the original name plus SUFFIX; otherwise the original file is overwritten. Do not use this option unless you are sure what the script will do, because an overwritten file cannot be restored. When in doubt, redirect the output to a new file with > or keep a backup with -i.bak.
-l N, --line-length=N: Set the line-wrap length for the l command to N characters.
--posix: Disable all GNU extensions.
-r, -E, --regexp-extended: Use extended regular expressions rather than basic regular expressions.
-s, --separate: Treat multiple input files as separate files rather than as one continuous stream.
-u, --unbuffered: Load a minimal amount of data from the input files and flush the output buffers more often. This is useful for editing the output of "tail -f" when you do not want to wait for the output.
-z, --null-data: By default, SED treats the newline character as the line separator. With this option, lines are separated by NUL characters instead.
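
Here is a short sketch of the most common options in action. It assumes GNU sed and works in a throwaway directory created with mktemp; the file name data.txt and its contents are made up for the example.

```shell
work=$(mktemp -d)
printf 'foo 1\nbar 2\n' > "$work/data.txt"

# -e: chain two editing commands
multi=$(sed -e 's/foo/FOO/' -e 's/bar/BAR/' "$work/data.txt")

# -n with p: print only the lines we ask for
only_bar=$(sed -n '/bar/p' "$work/data.txt")

# -E: extended regular expressions, so ( ) and + need no backslashes
swapped=$(sed -E 's/^([a-z]+) ([0-9]+)$/\2 \1/' "$work/data.txt")

# -i.bak: edit in place, keeping the original as data.txt.bak
sed -i.bak 's/foo/baz/' "$work/data.txt"
edited=$(cat "$work/data.txt")
backup=$(cat "$work/data.txt.bak")

printf '%s\n' "$multi" "$only_bar" "$swapped" "$edited" "$backup"
rm -rf "$work"
```

Keeping the .bak copy is the safer habit while a script is still being tested.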

Pattern Flags

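g Global: replace every match on the line, not just the first one
I Ignore case: match the pattern case-insensitively (GNU extension)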

Addressing

Addressing is used to restrict SED to operate only on certain lines.
There are multiple ways to specify an address:

Exact

Specify an exact line number N ($ stands for the last line)

$ sed -n '3s/pattern/abc/p' file

$ sed -n '$ p' file

Range

M,N: Specify a range starting at line M and ending at line N

$ sed -n '3,$ p' file

M,/pattern/ or /pattern/,N: Specify a range in which one end is the first line matching a pattern

$ sed -n '1,/pattern/p' file

$ sed -n '/pattern/,6p' file

$ sed -n '/pattern/,+4 p' file

Base + Offset

M,+N: Specify a range starting at line M and continuing for the next N lines

$ sed -n '2,+4 p' file

Base + Step

M~N: Starting at line M, process every Nth line

$ sed -n '1~2 p' file

This prints only the odd lines of the file.

$ sed -n '2~2 p' file

This prints only the even lines of the file.
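
To compare the address forms side by side, we can run them against the numbers 1 to 10 generated by seq. This is a sketch assuming GNU sed (the M,+N and M~N forms are GNU extensions); the variable names are illustrative.

```shell
third=$(seq 1 10 | sed -n '3p')           # exact line number
last=$(seq 1 10 | sed -n '$p')            # last line
range=$(seq 1 10 | sed -n '3,5p')         # lines 3 to 5
to_match=$(seq 1 10 | sed -n '1,/4/p')    # line 1 up to the first line matching 4
offset=$(seq 1 10 | sed -n '2,+2p')       # line 2 and the next 2 lines
step=$(seq 1 10 | sed -n '1~4p')          # line 1, then every 4th line

printf '%s\n' "$third" "$last" "$range" "$to_match" "$offset" "$step"
```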

Commands

p Print
= Print line number

e.g. $ sed = filename will print each line number followed by the line content (separated by '\n')

Count the total number of lines of a file

$ sed -n '$ =' file

d Delete
w Write

If we want to back up (copy) a file or extract certain lines from it, we can use this command.

$ sed -n 'w redis.conf.bak' redis.conf

This has the same effect as $ cp redis.conf redis.conf.bak

$ sed -n '2~2 w emphasis.txt' book.txt

This extracts only the even lines of book.txt into emphasis.txt.
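
Both uses of w can be tried safely in a scratch directory. The sketch below assumes GNU sed (for the 2~2 address) and uses made-up file names under a mktemp directory.

```shell
work=$(mktemp -d)
printf 'l1\nl2\nl3\nl4\n' > "$work/book.txt"

# Copy the whole file: every line is written to copy.txt, nothing is printed.
sed -n "w $work/copy.txt" "$work/book.txt"
if cmp -s "$work/book.txt" "$work/copy.txt"; then copy_same=yes; else copy_same=no; fi

# Extract only the even lines into even.txt.
sed -n "2~2 w $work/even.txt" "$work/book.txt"
even=$(cat "$work/even.txt")

printf '%s\n' "$copy_same" "$even"
rm -rf "$work"
```

The file name after w runs to the end of the command, so w should be the last thing in its script.
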
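a Append

Add a line of text after the addressed line.

$ sed '2 a =====' book.txt

i Insert

Add a line of text before the addressed line.

$ sed '2 i =====' book.txt

c Change

Replace the addressed line with a line of text.

$ sed '2 c =====' book.txt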
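
The difference between the three commands is easiest to see on the same three-line input. This sketch assumes GNU sed, which accepts the one-line form of a, i and c used here; the sample lines are made up.

```shell
appended=$(printf 'one\ntwo\nthree\n' | sed '2 a =====')   # ===== goes after line 2
inserted=$(printf 'one\ntwo\nthree\n' | sed '2 i =====')   # ===== goes before line 2
changed=$(printf 'one\ntwo\nthree\n' | sed '2 c =====')    # ===== replaces line 2

printf '%s\n---\n%s\n---\n%s\n' "$appended" "$inserted" "$changed"
```
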
y Translate

Transforms characters by position.

The syntax of the translate command y:

[address1[,address2]]y/list-1/list-2/

Translation maps each character in list-1 to the character at the same position in list-2. Both lists must be explicit character lists; regular expressions and character classes are not supported. Additionally, list-1 and list-2 must be the same length.

$ echo "1 5 15 20" | sed 'y/151520/IVXVXX/'
I V IV XX

Note that list-1 here repeats the characters 1 and 5. POSIX leaves the result undefined when a character appears more than once in list-1, so avoid repeats in real scripts.
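
A sketch of y with lists that contain no repeated characters, so the mapping is unambiguous. The sample strings are made up.

```shell
# Map each lowercase letter to the uppercase letter in the same position.
upper=$(echo "hello sed" | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# Swap two characters: - becomes _ and _ becomes -.
swapped=$(echo "a-b_c" | sed 'y/-_/_-/')

printf '%s\n' "$upper" "$swapped"
```

Unlike s, the y command always affects every occurrence on the line and needs no g flag.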

l List

The l command displays hidden characters in text.

Suppose we have a file containing many tab characters (\t), created like this:

$ sed 's/ /\t/g' books.txt > junk.txt

We can display the hidden characters with the l command:

$ sed -n 'l' junk.txt

The l command can also wrap lines after a given number of characters.

The following example wraps lines after 25 characters:

$ sed -n 'l 25' books.txt

A wrap limit of 0 means never break the line unless there is a newline character.

$ sed -n 'l 0' books.txt
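
As a quick check of what l shows, here is a sketch with a line that contains a tab and a trailing space. The input is made up; l writes the tab as \t and marks the end of the line with $.

```shell
visible=$(printf 'a\tb \n' | sed -n 'l')
printf '%s\n' "$visible"
```

The $ marker makes trailing whitespace easy to spot, since the space shows up just before it.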

q Quit

By default, SED follows the read, execute, and repeat workflow; when the quit command is encountered, SED prints the current pattern buffer (unless -n is given) and exits without reading any further input.

The q command does not accept a range of addresses; it supports only a single address.

The q command can take a value, which is used as the exit status (a GNU extension).

$ sed '3 q' books.txt

This prints the first 3 lines of the file.

$ sed '/pattern/ q 100' books.txt

This exits with status code 100 at the first line matching the pattern.
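
Both behaviours can be checked on generated input. This is a sketch assuming GNU sed (the exit-code argument of q is a GNU extension); the variable names are illustrative.

```shell
# Behaves like head -n 3: line 3 is printed, then sed exits.
head3=$(seq 1 10 | sed '3 q')

# Quit at the first line matching 5 and report exit status 100.
status=0
seq 1 10 | sed -n '/5/ q 100' || status=$?

printf '%s\n' "$head3"
echo "exit status: $status"
```

A custom exit status lets a calling script tell "pattern found" apart from "end of input reached".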

r Read

Read the contents of a file and append them after the addressed line.

$ sed '3 r junk.txt' books.txt
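
A small sketch of r with two made-up files in a scratch directory: the contents of extra.txt are emitted right after line 1 of main.txt.

```shell
work=$(mktemp -d)
printf 'line1\nline2\n' > "$work/main.txt"
printf 'EXTRA\n' > "$work/extra.txt"

merged=$(sed "1 r $work/extra.txt" "$work/main.txt")
printf '%s\n' "$merged"
rm -rf "$work"
```
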
e Execute

We can execute external commands from SED using the e command.

Syntax:

[address1[,address2]]e [command]

$ sed '3 e date' books.txt
abc
Fri Nov 25 21:00:00 CST 2016
123

If we have a file containing several shell commands like this:

date 
cal 
uname

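We can execute them with the e command. With no argument, e runs the contents of the pattern buffer as a command, so each line of the file is executed in turn: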

$ sed 'e' commands.txt
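
Both forms of e can be tried with harmless commands. This sketch assumes GNU sed, since e is a GNU extension; echo and expr stand in for real commands.

```shell
# e with a command: its output is emitted before the addressed line.
stamped=$(printf 'abc\n123\n' | sed '2 e echo ---')

# e without a command: each input line is itself run as a shell command.
ran=$(printf 'echo hello\nexpr 2 + 3\n' | sed 'e')

printf '%s\n' "$stamped" "$ran"
```

Because e hands text to the shell, it should only be used on input we trust.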

Miscellaneous Commands

n

Print the current pattern space (unless -n is given), then replace it with the next line of input.

Execution order:

Sed command1
Sed command2
Sed command3
n command
Sed command4
Sed command5

In this case, SED applies the first three commands to the pattern buffer, prints and clears the pattern buffer, fetches the next line into it, and then applies the fourth and fifth commands to that line. This is a very important concept.
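
The effect of n on the commands that follow it can be seen with a short sketch on generated input. The text appended to the lines is made up.

```shell
# With -n: n silently skips to the next line, p prints it, so only even lines appear.
evens=$(seq 1 6 | sed -n 'n;p')

# Without -n: n prints the odd line as it is, and the s command only touches the even line.
tagged=$(seq 1 4 | sed 'n;s/$/ (even)/')

printf '%s\n' "$evens" "$tagged"
```

Commands placed after n never see the line that was in the pattern buffer before it.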

N

Keep the current pattern space and append the next line to it.

By default, SED operates on a single line, but it can operate on multiple lines as well. Multi-line commands are denoted by uppercase letters. For example, unlike the n command, the N command does not print and clear the pattern space. Instead, it adds a newline \n at the end of the current pattern space, appends the next line of the input file, and continues with SED's standard flow by executing the rest of the commands.

Suppose we have sample.txt whose content is:

apple
17.7
egg
29.0
skipboard
99

We can convert it to this form:

apple, 17.7
egg, 29.0
skipboard, 99

By executing:

$ sed 'N;s/\n/, /g' sample.txt

Explanation

Let’s see how it works.

SED first reads the first line, apple, into the pattern buffer. The N command then appends \n followed by the next line, so the pattern space contains apple\n17.7. Next, s/\n/, / replaces the \n with a comma and a space, giving apple, 17.7.
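
When the input has an odd number of lines, N runs on the last line with nothing left to append. GNU sed prints the pattern space and exits in that case, but other implementations may drop the line, so a common precaution is to write $!N, which skips N on the last line. A sketch with a shortened, three-line version of the sample data:

```shell
joined=$(printf 'apple\n17.7\negg\n' | sed '$!N;s/\n/, /')
printf '%s\n' "$joined"
```

The first two lines are joined as before, and the unpaired last line passes through unchanged.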

P

Print the contents of a multi-line pattern space up to the first newline character \n, such as the one created by the N command.

Let's see how we can print only the odd lines of sample.txt:

$ sed -n 'N;P' sample.txt
apple
egg
skipboard

x

Exchange the contents of the pattern buffer and the hold buffer.

Let's see how we can use x to print only the even lines of the file:

$ sed -n 'x;n;p' sample.txt
17.7
29.0
99

Let us understand how this command works.

Initially, SED reads the first line, i.e., apple, into the pattern buffer.

x moves this line to the hold buffer.
n fetches the next line, 17.7, into the pattern buffer.
Control passes to the command following n, which prints the contents of the pattern buffer.
The process repeats until the file is exhausted.

h

h copies the contents of the pattern buffer into the hold buffer, overwriting whatever the hold buffer contained.

H

H appends a newline \n to the contents of the hold buffer, and then appends the contents of the pattern buffer to it.
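
Here is a sketch of how h and H are typically combined with x. The input lines are made up; h remembers the most recent matching line, while H collects every line so they can be joined at the end.

```shell
# h: keep overwriting the hold buffer with the latest line starting with "err",
# then swap it into the pattern buffer and print it on the last line.
last_err=$(printf 'err 1\nok\nerr 2\nok\n' | sed -n '/^err/h;${x;p}')

# H: append every line to the hold buffer; on the last line, swap it in,
# turn the newlines into commas and drop the leading comma.
csv=$(printf 'a\nb\nc\n' | sed -n 'H;${x;s/\n/,/g;s/^,//;p}')

printf '%s\n' "$last_err" "$csv"
```

The leading comma appears because H adds a newline before the first line as well, the hold buffer being empty at the start.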

Sed applications

1. Print

(1) Print the whole file, like the cat command

$ sed '' a.txt

(2) Print a file with line numbers

$ sed = me.txt | sed 'N;s/\n/ /'

(3) Print a specific line or a range of lines

$ sed '4!d' me.txt

$ sed '2,4!d' me.txt

(4) Print the line number of each matching line

$ sed -n '/foo/=' me.txt

(5) Print only the matching lines

$ sed -n '/pattern/p' me.txt
$ sed '/pattern/!d' me.txt
$ sed -n 's/pattern/&/p' me.txt

(6) Match ignoring case

$ sed -n '/pattern/I p' me.txt

(7) Print the matching line preceded by its line number

$ sed '/pattern/!d;=' me.txt | sed 'N;s/\n/:/'

(8) Print only the matched string

The grouping symbol () of regular expressions is very useful for operating on a matched string.

By default, sed prints the whole line. To print only the matched string, we can capture it with () and replace the whole line with the captured group. Without -E, groups are written \( \), as in the third command below.

$ sed -nE 's/.* ([0-9.]+)$/\1/p' <<< "the price of egg is 12.7"

$ sed -E '1!d;s/.* ([0-9]+)$/\1/g' <<< "apple good 18901"

$ sed 's/^.* \([^ ][^ ]*\)/\1/g' <<< 'Apple is a very nutrient-rich fruit'

$ sed -E 's/.* ([0-9]+)$/\1/;tx;d;:x' <<< "apple good 18901"

Groups are also helpful when we want to modify the matched string.

For example, suppose we have a file me.txt and want to remove the underscore from the first field of each line:

1.A_ A_ 1.A_                        1.A A_ 1.A_
2.B_ B_ 2.B_    Modified to -->     2.B B_ 2.B_
3.C_ C_ 3.C_                        3.C C_ 3.C_

We can do this:

$ sed -E 's/^([0-9]\S+)_( .*)/\1\2/' me.txt

(9) Print the last field

$ sed 's/^.* \([^ ][^ ]*\)/\1/g' <<< 'Apple is a very nutrient-rich fruit'
$ sed -E 's/.* ([^ ]+)$/\1/g' <<< 'Apple is a very nutrient-rich fruit'
$ sed -E 's/^.* (\S*)/\1/' <<< "this is a good day"
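
To make the role of the groups concrete, here is a sketch that extracts one group and reorders three others. It assumes GNU sed with -E; the date string and variable names are made up for illustration.

```shell
# Keep only the captured number; -n plus the p flag prints just the lines where s succeeded.
price=$(echo "the price of egg is 12.7" | sed -nE 's/.* ([0-9.]+)$/\1/p')

# A line without a trailing number produces no output at all.
none=$(echo "no number here" | sed -nE 's/.* ([0-9.]+)$/\1/p')

# Reorder three groups: YYYY-MM-DD becomes DD/MM/YYYY (# is used as the s delimiter).
date_dmy=$(echo "released 2016-11-25" | sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#')

printf '%s\n' "$price" "$date_dmy"
```

Choosing a delimiter other than / avoids having to escape the slashes in the replacement.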

2. Remove

(1) Remove empty lines

$ sed '/^$/d' me.txt

(2) Remove the leading whitespace of each line

$ sed -E 's/^\s+//' <<< "     Nice tutorial!"

$ sed 's/^ *//' <<< "    hahaha"

(3) Remove the trailing whitespace of each line

$ sed -E 's/\s*$//' <<< "Nice tutorial!    "

(4) Remove the blank line that follows a matching line

$ sed '/pattern/{n;/^$/d}' me.txt

(5) Remove the leading / of a path

$ sed 's/^\///' <<< "/usr/local/bin"

(6) Remove all <> tags from an HTML file

$ sed 's/<[^>]*>//g' index.html

(7) Replace consecutive whitespace with a single space

$ sed -E 's/\s+/ /g' <<< "hello    goodbye"
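
Several of these removals are often combined in one script. The sketch below assumes GNU sed (for \s) and uses made-up input.

```shell
# Trim both ends, then squeeze inner runs of whitespace to one space.
clean=$(printf '   hello    big   world  \n' | sed -E 's/^\s+//; s/\s+$//; s/\s+/ /g')

# Delete lines that are empty or contain only whitespace.
no_blank=$(printf 'a\n\n  \nb\n' | sed '/^\s*$/d')

printf '%s\n' "$clean" "$no_blank"
```

Using /^\s*$/ instead of /^$/ also catches lines that look empty but contain spaces or tabs.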

3. Edit

& represents the whole matched string, and it can be used more than once in the replacement.

(1) Copy the matched string

$ sed 's/[0-9]*/& &/' <<< "123 abc"
123 123 abc

(2) Prepend a string to each line

$ sed 's/.*/HaHa: &/' filename

or, more simply:

$ sed 's/^/HaHa: /' filename

(3) Append a string to each line

$ sed 's/$/HoHo/' filename

(4) Uppercase, Lowercase

GNU sed provides the extensions \l, \L, \u, \U, and \E for use in the replacement part of the s command:

\U : Uppercase the replacement text that follows, until \L or \E
\u : Uppercase only the next character
\L : Lowercase the replacement text that follows, until \U or \E
\l : Lowercase only the next character
\E : Stop the case conversion started by \U or \L

Capitalize the first character of a sentence

$ sed -r 's/.*/\u&/' <<< "nick meets kim in this morning"
Nick meets kim in this morning

Capitalize each word of a sentence

$ sed -r 's/\w+/\u&/g' <<< "nick meets kim in this morning"
Nick Meets Kim In This Morning
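
The case escapes can be mixed inside one replacement. A sketch assuming GNU sed, with a made-up input string:

```shell
# \U uppercases the first word, \E switches conversion off, \u capitalizes only the next character.
mixed=$(echo "hello world" | sed -E 's/(\w+) (\w+)/\U\1\E \u\2/')

# \L lowercases everything that follows it in the replacement.
lowered=$(echo "HELLO World" | sed 's/.*/\L&/')

printf '%s\n' "$mixed" "$lowered"
```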

Whole-text case conversion can also be done with tr or awk; the following commands convert text to lowercase:

$ tr '[:upper:]' '[:lower:]' < input.txt > output.txt

$ awk '{print tolower($0)}' <<< "UPPER"

(5) Ignore case substitution

$ sed 's/He/xx/gI' <<< "hello world"
xxllo world

(6) Field processing

SED often processes delimiter-separated files.
In that case, the grouping symbols of regular expressions help us work with individual fields.

Assume a line contains three fields separated by single spaces. We can use '(.*) (.*) (.*)' to match all of them,
and then use \1, \2, \3 to refer to each of them.

$ sed -r 's/(.*) (.*) (.*)/Pre1\1 Pre3\2 Pre2\3/' <<< "One Two Three"

More generally, if fields are separated by whitespace, we can use \w+ with the g flag to match every field in turn, and use & to refer to the field currently matched.

$ sed -r 's/\w+/\u&/g' <<< "good bye"

We can also use a numeric flag to pick a specific field: s/regex/replacement/N replaces only the Nth match on the line.

Assume we have a whitespace-separated file friute.csv like this:

name     origin    price
apple     USA      18.5
orange    Vetran    26.8
cherry    Mexico    50.1

To change the price of cherry to 60.6, we can do this:

$ sed '4s/\S\+/60.6/3' friute.csv

Similarly, to capitalize the first column, we can do:

$ sed -r 's/\S+/\u&/1' friute.csv

Remove the last field

$ sed 's/\w*$//' <<< 'apple 123 beast'
$ sed -r 's/(.*) [^ ]*$/\1/' <<< "hello world haha"
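
The field operations above can be checked on a shortened copy of the table. This is a sketch assuming GNU sed (for \S, \s and \u); the temporary file and variable names are made up.

```shell
table=$(mktemp)
cat > "$table" <<'EOF'
name     origin    price
apple     USA      18.5
cherry    Mexico    50.1
EOF

# Replace the 3rd field of line 3 and print only that line.
new_price=$(sed -n '3s/\S\+/60.6/3p' "$table")

# Capitalize the 1st field of the header line.
header=$(sed -E '1!d;s/\S+/\u&/1' "$table")

# Remove the last field of line 2 together with the whitespace before it.
no_last=$(sed -n -E '2s/\s+\S+$//p' "$table")

printf '%s\n' "$new_price" "$header" "$no_last"
rm -f "$table"
```

The numeric flag counts matches of the regex, so it only works as a field index when the regex matches exactly one field at a time.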

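Operations on multiple files

By default, SED treats multiple input files as one continuous stream. The -s option makes SED treat each file separately. The -i option implies -s, so each file is edited on its own:

$ sed -s -i 's/me/he/g' *.txt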
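
The difference -s makes is visible with an address such as $, which normally means the last line of the whole input. A sketch assuming GNU sed and two made-up files in a scratch directory:

```shell
work=$(mktemp -d)
printf 'a\nb\n' > "$work/one.txt"
printf 'c\nd\ne\n' > "$work/two.txt"

# One continuous stream: $ is the last line overall, so = prints the total line count.
combined=$(sed -n '$=' "$work/one.txt" "$work/two.txt")

# Separate files: $ matches the last line of each file and numbering restarts per file.
separate=$(sed -s -n '$=' "$work/one.txt" "$work/two.txt")

printf '%s\n' "$combined" "$separate"
rm -rf "$work"
```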
