
Unix lab: Generate all proper names using curl, cat, sed, awk, and sort

Here is a working definition: a proper name is a string representing an individual person, place, or organization, spelled with initial capital letters.

Of course there are many lists of names on the web, and even a website devoted to generating fake names. The government publishes lists of names drawn from collected census data at census.gov, but the entries need to be edited before they count as proper names.

Building a list of proper first names from the census.gov website using Unix tools: curl, awk, and sed

curl is a Unix tool that transfers data to or from a server using one of its supported protocols. To download the lists of last names and first names from the census.gov website, we use curl with output redirection:

$ curl http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last  > last.txt
$ curl http://www2.census.gov/topics/genealogy/1990surnames/dist.female.first > female.txt
$ curl http://www2.census.gov/topics/genealogy/1990surnames/dist.male.first > male.txt
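A slightly more defensive download can be sketched as follows. The helper name fetch_list is hypothetical, and the base URL is simply the one used in the commands above. The flags make curl fail on HTTP errors instead of quietly saving an error page into the text file.

```shell
# Assumption: same base URL as in the curl commands above.
base_url='http://www2.census.gov/topics/genealogy/1990surnames'

# fetch_list is a hypothetical helper, not part of curl.
#   -f   fail on HTTP errors instead of saving the error page
#   -sS  hide the progress meter but still print error messages
#   -L   follow redirects
#   -o   write to the named file instead of standard output
fetch_list() {
    curl -fsSL -o "$2" "$base_url/$1"
}

# Usage (requires network access):
#   fetch_list dist.all.last     last.txt
#   fetch_list dist.female.first female.txt
#   fetch_list dist.male.first   male.txt
```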

Each of the three files (dist.all.last, dist.male.first, and dist.female.first) contains four items of data per line: a name, a frequency in percent, a cumulative frequency in percent, and an overall rank.

In the file (dist.all.last) one entry appears as:

MOORE 0.312 5.312 9

This means that in the population sample, MOORE ranks 9th in terms of frequency. The surname MOORE is possessed by 0.312 percent of the sample, and 5.312 percent of the sample is covered by MOORE together with the 8 names that occur more frequently than MOORE.
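To make the four columns concrete, here is a minimal sketch that labels each field of one entry with awk. The function name describe_entry is made up for illustration, and the labels are only a convenient shorthand for the four items described above.

```shell
# describe_entry is a hypothetical helper: it prints the four
# whitespace-separated fields of one census entry with labels.
describe_entry() {
    printf '%s\n' "$1" |
        awk '{ printf "name=%s freq=%s%% cumulative=%s%% rank=%s\n", $1, $2, $3, $4 }'
}

describe_entry 'MOORE 0.312 5.312 9'
# prints: name=MOORE freq=0.312% cumulative=5.312% rank=9
```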

Clean up the list using sed and awk from the command line.

  1. Concatenate the first-name files:
    $ cat female.txt male.txt > names1.txt
  2. Remove all non-alphabetic characters (frequencies, ranks, and spaces), leaving only the name on each line:
    $ sed 's/[^a-zA-Z]//g' < names1.txt > names2.txt
  3. Remove all empty lines:
    $ sed '/^$/d' < names2.txt > names3.txt
  4. Convert all upper-case letters to lower case:
    $ sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' < names3.txt > names4.txt
  5. Replace the first letter of each word with upper case:
    $ awk '{ for (i = 1; i <= NF; i++) printf "%s%s%s", toupper(substr($i,1,1)), substr($i,2), (i == NF) ? ORS : OFS }' < names4.txt > names5.txt
  6. Sort and remove duplicates:
    $ sort -u names5.txt > names6.txt
  7. List the tail end of the names list:
    $ tail names6.txt
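The intermediate files are handy for inspecting each stage, but the same cleanup steps can also be chained with pipes. The sketch below tries this on a few sample lines in the four-column layout. The function name clean_names is hypothetical, the sample values are illustrative only, and because every line holds a single name after the first sed, the awk step here capitalizes the whole line rather than looping over fields.

```shell
# clean_names is a hypothetical helper: it reads raw census lines on
# standard input and writes capitalized, de-duplicated names.
clean_names() {
    sed 's/[^a-zA-Z]//g' |                                              # keep letters only
    sed '/^$/d' |                                                       # drop empty lines
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' |    # lower-case everything
    awk '{ print toupper(substr($0,1,1)) substr($0,2) }' |              # capitalize first letter
    sort -u                                                             # sort, remove duplicates
}

# Illustrative sample input, including a blank line and a duplicate.
printf '%s\n' 'MARY 2.629 2.629 1' '' 'JAMES 3.318 3.318 1' 'MARY 2.629 2.629 1' | clean_names
# prints:
# James
# Mary
```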

Programming assignment:
Create an executable Unix shell script called allnames that combines all the steps listed above and prints the complete list of names. Test your script by piping its output to the grep command to see whether your name is there.
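When testing with grep, keep in mind that a plain pattern matches substrings, so searching for a short name also finds longer names that contain it. The sketch below shows the difference using a hypothetical stand-in for the script's output; with the real script you would pipe ./allnames into grep instead.

```shell
# sample_output is a hypothetical stand-in for the output of ./allnames.
sample_output() {
    printf '%s\n' 'Mary' 'Maryann' 'Rosemary'
}

sample_output | grep 'Mary'      # substring match: prints Mary and Maryann
sample_output | grep -x 'Mary'   # whole-line match: prints only Mary
```

The -x option restricts matches to lines that equal the pattern exactly, which is usually what you want when checking for a single name.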



