Using Unix grep to solve Crossword puzzles

Grep

Grep is a UNIX tool for finding lines in files that match a pattern. It has many uses, but here we will use it to find words that fit a particular pattern of known and unknown letters.

 

For example, finding a 10-letter crossword puzzle word matching the pattern below in a dictionary file containing hundreds of thousands of properly spelled English words:

s _ _ o _ l y _ _ d

The command that we want to enter first is one that takes us to a directory where there is a file containing 234,936 English words, one word per line.

 

 cd   /usr/share/dict

The “web2” File of English Words

Most UNIX systems provide a word list, usually named “words”, somewhere in the file system (some Linux distributions ship it only as an optional package). This file contains the recognized words of the target language, which we here assume to be English; the list used here holds several hundred thousand of them. In OS X, as well as most versions of UNIX, this file resides at:
 /usr/share/dict/words -> web2
This is exactly where the “cd” command above took us. In Mac OS X the “words” file is actually a link to a file called “web2”, so that if you reference “words”, you will actually find “web2”. This is done for several reasons, including convenience in supporting multiple languages. The “words” file links to the appropriate file for whatever language is set up on the computer, which for English-based BSD systems is “web2”.

The “README” file in the same directory has some interesting insights into the origin of the “web2” and “web2a” files, including the fact that web2 is a complete list of all 234,936 words included in Webster’s Second International dictionary, which was copyrighted in 1934 and whose copyright (according to Webster’s) has now expired. The web2a file contains additional hyphenated words.
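
To check what the word list looks like on a given machine, something along these lines may help. It is only a sketch: the path is the usual default and may not exist on every system, and count_words is a hypothetical helper name introduced here for illustration.

```shell
# Usual location of the word list; this is an assumption and may differ
# (or be missing entirely) on some systems.
dict=/usr/share/dict/words

# count_words (hypothetical helper): print the number of lines in a file.
# Since the list holds one word per line, this is also the word count.
count_words() {
    wc -l < "$1" | tr -d ' '
}

if [ -e "$dict" ]; then
    ls -l "$dict"        # on Mac OS X this shows the link to web2
    count_words "$dict"
else
    echo "no word list found at $dict" >&2
fi
```

If the file is a link, the ls -l line shows its target after an arrow, and the number printed afterwards is the size of the list actually being searched.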

 

Using grep

To run grep, type “grep” on the command line, followed by a space (press the space bar), the pattern you are trying to match, another space, and the path to and name of the file that you are going to search. Blank spaces are the standard UNIX way to indicate separate command line elements. You activate the command by pressing the “return” or “enter” key.

grep Commands

Again, we are using only the simplest capabilities of grep, so we need only the word “grep” followed by one or more spaces, followed by a pattern to match, followed by the path to a file (our “web2” file) that we are going to search.

grep pattern indicators include:

. (a period)
This means “Match any character”.
^ (a caret or “shift 6”)
This means the next character in the pattern must be at the beginning of a line.
$ (dollar sign or “shift 4”)
This means the previous character in the pattern must be at the end of a line.

Other than that, we just enter the explicit letters that we are looking for. For example the following search (the pattern is wrapped in single quotes so that the shell passes it to grep unchanged):

 grep '^.ash$' web2
would match all 4-letter words ending in ash, and produce the following list:
bash cash dash fash gash hash lash mash nash pash rash sash tash wash

We are using the beginning and end of line markers for grep rather than the beginning of word “\<” and end of word “\>” markers because the words are arranged one per line. Anchoring the pattern to the whole line guarantees that the entire entry, and not just a part of it, has the required length and letters.
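
The effect of the two line markers is easy to see on a tiny hand-made list. The sketch below uses a temporary sample file as a stand-in for web2; the six words in it are chosen purely for illustration.

```shell
# Build a small stand-in word list (a hypothetical sample, not the real web2).
sample=$(mktemp)
printf '%s\n' ash bash cash dash smash washer > "$sample"

# Without ^ and $: any line containing one character followed by "ash".
unanchored=$(grep '.ash' "$sample")

# With ^ and $: exactly one character, then "ash", and nothing else.
anchored=$(grep '^.ash$' "$sample")

echo "unanchored:"; echo "$unanchored"
echo "anchored:";   echo "$anchored"

rm -f "$sample"
```

With this sample the unanchored search also returns “smash” and “washer”, because the pattern is allowed to match in the middle of a line, while the anchored search returns only “bash”, “cash” and “dash”. Neither search returns “ash”, since the period requires a character in front of it.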

Let’s Do It!

So now let’s look at the problem originally posed at the beginning of this page. We will try to match:

s _ _ o _ l y _ _ d

The required pattern for this is:

  ^s..o.ly..d$

Indicating that it is 10 letters long, that there must be an “s” in the first position, an “o” in the fourth position, an “l” in the sixth position, a “y” in the seventh position, and a “d” in the tenth position, which must be the last letter of the 10-letter word.

After this pattern come one or more spaces, followed by the name of the file to be searched, which is “web2”, the file containing the 234,936 English words.

Our command line is thus (with the pattern in single quotes to protect it from the shell):

  grep '^s..o.ly..d$' web2
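
If you solve crosswords often, the translation from puzzle squares to a grep pattern can be wrapped in a small shell function. This is a sketch under stated assumptions: crossword is a hypothetical helper name, the default path is the usual word list location, and the sample list below is hand-made so the example runs even where web2 is not installed.

```shell
# crossword (hypothetical helper).
# Usage: crossword 's _ _ o _ l y _ _ d' [wordfile]
crossword() {
    # Drop the spaces between squares, then turn each underscore
    # (an unknown letter) into "." (any single character).
    regex=$(printf '%s' "$1" | tr -d ' ' | tr '_' '.')
    # ^ and $ force the whole line to match, which fixes the word length.
    grep "^${regex}\$" "${2:-/usr/share/dict/words}"
}

# Try it on a small stand-in list (assumed sample words, not web2).
sample=$(mktemp)
printf '%s\n' scholastic schoolyard skateboard supposedly > "$sample"
result=$(crossword 's _ _ o _ l y _ _ d' "$sample")
echo "$result"
rm -f "$sample"
```

On the sample list only “schoolyard” fits the squares. Leaving out the second argument searches the system word list instead, and what that search returns depends on the list installed.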

Assignment

To receive my assessment of how well you understood this tutorial, search for the following pattern using grep in web2:

     _ a _ _ _ f _ c _ n _ _ _   (13 letters)