12. Two file processing
12. Two file processing 관련
This chapter focuses on solving problems which depend upon the contents of two or more files. These are usually based on comparing records and fields. Sometimes, record number plays a role too. You'll also learn about the getline
built-in function.
Info
The example_files directory has all the files used in the examples.
Comparing records
Consider the following input files which will be compared line wise to get the common and unique lines.
cat colors_1.txt
# teal
# light blue
# green
# yellow
cat colors_2.txt
# light blue
# black
# dark green
# yellow
The key features used in the solution below:
- For two files as input,
NR==FNR
will betrue
only when the first file is being processed next
will skip rest of the script and fetch the next recorda[$0]
by itself is a valid statement. It will create an uninitialized element in arraya
with$0
as the key (assuming the key doesn't exist yet)$0
in a checks if the given string ($0
here) exists as a key in the arraya
common lines
same as:
grep -Fxf colors_1.txt colors_2.txt
awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
# light blue
# yellow
lines from colors_2.txt
not present in colors_1.txt
same as:
grep -vFxf colors_1.txt colors_2.txt
awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_1.txt colors_2.txt
# black
# dark green
reversing the order of input files gives lines from colors_1.txt
not present in colors_2.txt
awk 'NR==FNR{a[$0]; next} !($0 in a)' colors_2.txt colors_1.txt
# teal
# green
Warning
Note that the NR==FNR
logic will fail if the first file is empty, since NR
wouldn't get a chance to increment. You can set a flag after the first file has been processed to avoid this issue. See this unix.stackexchange thread for more workarounds.
no output
awk 'NR==FNR{a[$0]; next} !($0 in a)' /dev/null greeting.txt
gives the expected output
awk '!f{a[$0]; next} !($0 in a)' /dev/null f=1 greeting.txt
# Hi there
# Have a nice day
# Good bye
Comparing fields
In the previous section, you saw how to compare the contents of whole records between two files. This section will focus on comparing only specific fields. The below sample file will be one of the two file inputs for examples in this section.
cat marks.txt
# Dept Name Marks
# ECE Raj 53
# ECE Joel 72
# EEE Moi 68
# CSE Surya 81
# EEE Tia 59
# ECE Om 92
# CSE Amy 67
To start with, here's a single field comparison. The problem statement is to fetch all records from marks.txt
if the first field matches any of the departments listed in the dept.txt
file.
cat dept.txt
# CSE
# ECE
note that dept.txt
is used to build the array keys first
awk 'NR==FNR{a[$1]; next} $1 in a' dept.txt marks.txt
# ECE Raj 53
# ECE Joel 72
# CSE Surya 81
# ECE Om 92
# CSE Amy 67
if the header is needed as well
awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' dept.txt marks.txt
# Dept Name Marks
# ECE Raj 53
# ECE Joel 72
# CSE Surya 81
# ECE Om 92
# CSE Amy 67
For multiple field comparison, you need to construct the key robustly. Simply concatenating field values can lead to false matches. For example, field values abc
and 123
will wrongly match ab
and c123
. To avoid this, you may introduce some string between the field values, say "_"
(if you know the field themselves cannot have this character) or FS
(safer option). You could also allow awk to bail you out. If you use the ,
symbol (not ","
as a string) between the field values, the value of the special variable SUBSEP
is inserted. SUBSEP
has a default value of the non-printing character \034
which is usually not used as part of text files.
cat dept_name.txt
# EEE Moi
# CSE Amy
# ECE Raj
uses SUBSEP
as a separator between the field values to construct the ke note the use of parentheses for key testing
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' dept_name.txt marks.txt
ECE Raj 53
EEE Moi 68
CSE Amy 67
In this example, one of the field is used for numerical comparison.
cat dept_mark.txt
ECE 70
EEE 65
CSE 80
match Dept and minimum marks specified in dept_mark.txt
awk 'NR==FNR{d[$1]=$2; next}
$1 in d && $3 >= d[$1]' dept_mark.txt marks.txt
# ECE Joel 72
# EEE Moi 68
# CSE Surya 81
# ECE Om 92
Here's an example of adding a new field.
cat role.txt
# Raj class_rep
# Amy sports_rep
# Tia placement_rep
awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
{$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' role.txt marks.txt
# Dept Name Marks Role
# ECE Raj 53 class_rep
# ECE Joel 72
# EEE Moi 68
# CSE Surya 81
# EEE Tia 59 placement_rep
# ECE Om 92
# CSE Amy 67 sports_rep
getline
As the name indicates, the getline
function allows you to read a line from a file on demand. This is easiest to use when you need something based on line numbers. The following example shows how you can replace the m
th line from a file with the n
th line from another file. There are many syntax variations with getline
, here the line read is saved in a variable.
return value handling is not shown here, but should be done ideally
awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "greeting.txt"}
FNR==m{$0=s} 1' table.txt
# brown bread mat hair 42
# blue cake mug shirt -7
# Have a nice day
Here's an example where two files are processed simultaneously. In this case, the return value of getline
is also used. It will be 1
if the line was read successfully, 0
if there's no more input to be read as end of file has already been reached and -1
if something went wrong. The ERRNO
special variable will have the error details.
print line from greeting.txt
if the last column of the corresponding line from table.txt
is a positive number
awk -v file='table.txt' '(getline line < file)==1{n=split(line, a);
if(a[n]>0) print}' greeting.txt
# Hi there
# Good bye
If a file is passed as an argument to the awk
command that cannot be opened, you get an error. For example:
awk '{print $2}' xyz.txt
# awk: fatal: cannot open file 'xyz.txt' for reading: No such file or directory
It is recommended to always check for the return value when using getline
or perhaps use techniques from the previous sections to avoid getline
altogether.
xyz.txt
doesn't exist, but output doesn't show something went wrongawk '{getline line < "xyz.txt"; print $NF, line}' table.txt
# 42
# -7
# 3.14
awk -v file='xyz.txt' '{ e=(getline line < file);
if(e<0){print file ": " ERRNO; exit}
print $NF, line }' table.txt
# xyz.txt: No such file or directory
Info
See gawk manual: getline for details, especially about corner cases and errors. See also awk.freeshell: getline caveats.
Summary
This chapter discussed a few cases where you need to compare contents between two files. The NR==FNR
trick is handy for such cases. You also saw a few examples with the getline
function.
Next chapter will discuss how to handle duplicate contents.
Exercises
Info
The exercises directory has all the files used in this section.
Exercise 1
Use the contents of match_words.txt
file to display matching lines from jumbled.txt
and sample.txt
. The matching criteria is that the second word of lines from these files should match the third word of lines from match_words.txt
.
cat match_words.txt
# %whole(Hello)--{doubt}==ado==
# just,\joint*,concession<=nice
'concession' is one of the third words from 'match_words.txt
' and second word from 'jumbled.txt
'
awk ##### add your solution here
# wavering:concession/woof\retailer
# No doubt you like it too
'concession' is one of the third words from 'match_words.txt
' and second word from 'jumbled.txt
'
awk -v FPAT='\\w+' 'NR==FNR{a[$3]; next} $2 in a' match_words.txt jumbled.txt sample.txt
# wavering:concession/woof\retailer
# No doubt you like it too
Exercise 2
Interleave the contents of secrets.txt
with the contents of a file passed via the -v
option as shown below.
awk -v f='table.txt' ##### add your solution here
# stag area row tick
# brown bread mat hair 42
# ---
# deaf chi rate tall glad
# blue cake mug shirt -7
# ---
# Bi tac toe - 42
# yellow banana window shoes 3.14
# ---
awk -v f='table.txt' '{print; getline < f; print; print "---"}' secrets.txt
# stag area row tick
# brown bread mat hair 42
# ---
# deaf chi rate tall glad
# blue cake mug shirt -7
# ---
# Bi tac toe - 42
# yellow banana window shoes 3.14
# ---
Exercise 3
The file search_terms.txt
contains one search string per line, and these terms have no regexp metacharacters. Construct an awk
command that reads this file and displays the search terms (matched case insensitively) that were found in every file passed as the arguments after search_terms.txt
. Note that these terms should be matched anywhere in the line (so, don't use word boundaries).
cat search_terms.txt
# hello
# row
# you
# is
# at
awk ##### add your solution here
##file list## search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt
# at
# row
awk ##### add your solution here
##file list## search_terms.txt addr.txt sample.txt
# is
# you
# hello
awk -v IGNORECASE=1 'NR==FNR{s[$0]; next} {for(k in s) if($0 ~ k) a[k]}
ENDFILE{for(k in a) s[k]++; delete a}
END{for(k in s) if(s[k]==(ARGC-2)) print k}
' search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt
# at
# row
awk -v IGNORECASE=1 'NR==FNR{s[$0]; next} {for(k in s) if($0 ~ k) a[k]}
ENDFILE{for(k in a) s[k]++; delete a}
END{for(k in s) if(s[k]==(ARGC-2)) print k}
' search_terms.txt addr.txt sample.txt
# is
# you
# hello
Exercise 4
Display lines from scores.csv
by matching the first field based on a list of names from the names.txt
file. Also, change the output field separator to a space character.
cat names.txt
# Lin
# Cy
# Ith
awk ##### add your solution here
# Lin 78 83 80
# Cy 97 98 95
# Ith 100 100 100
awk -F, 'NR==FNR{a[$1]; next} $1 in a{$1=$1; print}' names.txt scores.csv
# Lin 78 83 80
# Cy 97 98 95
# Ith 100 100 100
Exercise 5
What's the default value of the special variable SUBSEP
? Where is it commonly used?
Exercise 6
The result.csv
file has three columns — name, subject and mark. The criteria.txt
file has two columns — name and subject. Match lines from result.csv
based on the two columns from criteria.txt
provided the mark column is greater than 80.
cat result.csv
# Amy,maths,89
# Amy,physics,75
# Joe,maths,79
# John,chemistry,77
# John,physics,91
# Moe,maths,81
# Ravi,physics,84
# Ravi,chemistry,70
# Yui,maths,92
cat criteria.txt
# Amy maths
# John chemistry
# John physics
# Ravi chemistry
# Yui maths
awk ##### add your solution here
# Amy,maths,89
# John,physics,91
# Yui,maths,92
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a && $3 > 80' criteria.txt FS=, result.csv
# Amy,maths,89
# John,physics,91
# Yui,maths,92