Match all lines that aren’t in a list

I wanted to take the output of a grep search and use it to build a second search: any line of the second search's input file that matched any of the patterns built from the first search would be excluded from the output. I've wanted this kind of thing before, but I think I have always resorted to several single-purpose searches.

Could I have accomplished the same thing with a single cunningly devised regular expression? I suspect so; I need to get better at using regular expressions. Anyway, here's the script (I used the bash shell version 3.2.33 on Cygwin):


[sourcecode]
#!/bin/sh
# Program: find-not-in-list.sh
# Author: Daniel Meyer
# Date: 3/11/2008
# Purpose: Given an input file name and one or more awk-style patterns to match,
# this program outputs all lines that do NOT match any of the specified patterns.
#
# Example 0:
# Suppose you have a file inputfile.txt, and you want to output the lines in this file minus the ones that
# start with 422 and contain either of two values (2838 or 01523). You could do this using a series of
# "grep -v" (invert match) statements:
#
# < inputfile.txt grep -v "^422.*2838" | grep -v "^422.*01523"
#
# That's fine if the number of patterns to match is small and/or the input file size is small.
# But it's a lot of hand typing if you need to invert-match many patterns... and if the input file is large,
# passing it through a long chain of grep processes could take a while.
#
# Example 1:
# This program solves the second problem: by hand-typing patterns for the lines you *don't* want from the
# input file, you can filter it with a single awk process:
#
# echo /^422.*2838/ /^422.*01523/ | xargs find-not-in-list.sh inputfile.txt > shavedfile.txt
#
# Example 2:
# This example solves the first problem. Instead of typing in each pattern by hand as in example 1, the patterns
# could be built from the output of a search. For example, in a tab-delimited file if you want to take the second
# field of each line that starts with 143 and contains the string INTERALIA, and then show the input file excluding
# lines that start with 422 and contain the contents of that column, you could execute the following command:
#
# for x in `grep "^143.*INTERALIA" < inputfile.txt | cut --fields=2`; do echo "/^422.*$x/"; done | xargs find-not-in-list.sh inputfile.txt
#
# Notice how we build each pattern by adding on to what was in the second column of the file, using an echo
# command to write it to standard output and then piping that to xargs. The xargs command passes the contents
# of its standard input as parameters, so in this example, if the grep command found 2838 and 01523 in
# inputfile.txt, find-not-in-list.sh would be called with three parameters, as follows:
#
# find-not-in-list.sh inputfile.txt /^422.*2838/ /^422.*01523/

if [ $# -lt 2 ]; then
    echo "Usage: sh find-not-in-list.sh inputfile pattern [pattern...]"
    echo "  Each pattern should be an AWK-style regular expression beginning and ending with '/'"
    exit 1
fi

inputfile=$1
shift
tempfile=`mktemp`

# Build an awk command for not matching the first pattern
# (printf rather than "echo -n", which not every /bin/sh supports)
printf '%s' "\$0 !~ $1" >> "$tempfile"

# Add in all the rest of the patterns to not match
while [ $# -gt 1 ]; do
    shift
    printf '%s' " && \$0 !~ $1" >> "$tempfile"
done
# $tempfile now holds one awk expression, e.g.: $0 !~ /^422.*2838/ && $0 !~ /^422.*01523/

# Run the awk command
awk -f "$tempfile" < "$inputfile"
rm "$tempfile"
exit 0
[/sourcecode]

find-not-in-list.sh.txt
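The heart of the script is the awk expression it assembles from its arguments. As a rough sketch, the same expression can be built in a shell variable and handed to awk directly, without the temp file. The helper name build_filter and the sample lines below are made up for illustration; they are not part of the original script.

```shell
#!/bin/sh
# build_filter is a hypothetical helper, not part of the original script.
# It joins awk-style /patterns/ into one expression of the form:
#   $0 !~ /p1/ && $0 !~ /p2/ ...
build_filter() {
    filter="\$0 !~ $1"
    shift
    for pattern in "$@"; do
        filter="$filter && \$0 !~ $pattern"
    done
    printf '%s' "$filter"
}

# Build the expression from the two patterns used in the examples.
program=$(build_filter '/^422.*2838/' '/^422.*01523/')
printf '%s\n' "$program"

# Pass the program text to awk as an argument instead of via "awk -f tempfile".
# The four input lines are invented sample data.
filtered=$(printf '%s\n' '422 a 2838' '422 b 01523' '422 c 7777' '143 d 2838' | awk "$program")
printf '%s\n' "$filtered"
```

An awk pattern with no action prints every line for which the pattern is true, which is why the assembled expression alone is a complete program.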

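As for whether a helper script is needed at all: grep can read a list of patterns from a file with -f, and -v inverts the match, so the pipeline from Example 2 might be reduced to something like the following. This is only a sketch under stated assumptions: the sample data and temp files are invented, the patterns are grep-style (no surrounding slashes) rather than awk-style, and cut -f2 is used as the short form of --fields=2.

```shell
#!/bin/sh
# Invented tab-delimited sample data standing in for inputfile.txt.
sample=$(mktemp)
patterns=$(mktemp)
printf '%s\t%s\t%s\n' \
    143 2838 INTERALIA \
    143 01523 INTERALIA \
    143 7777 OTHER \
    422 foo 2838 \
    422 bar 01523 \
    422 baz 7777 \
    500 qux 2838 > "$sample"

# Take field 2 of each line that starts with 143 and contains INTERALIA,
# and turn each value into a grep pattern such as ^422.*2838
grep '^143.*INTERALIA' "$sample" | cut -f2 | sed 's/.*/^422.*&/' > "$patterns"

# -f reads one pattern per line; -v keeps only lines that match none of them.
result=$(grep -v -f "$patterns" "$sample")
printf '%s\n' "$result"

rm -f "$sample" "$patterns"
```

When there are only a few values, a single regular expression with alternation does the same job, for example `grep -v -E '^422.*(2838|01523)' inputfile.txt`. Either way, values containing regular-expression metacharacters would need escaping first.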