Subversion

CVS is out, Subversion is in – by Chip Turner

Introduction

In case no one happened to tell you, CVS is dead. Bereft of life, it
rests in peace. Oh, sure, people still use it, and it is still included in
most Linux distributions, including Fedora™ Core, but it is quite dead. It
died after a long, drawn-out sickness after years of neglect. Sadly, it
died of the incurable disease ‘broken architecture.’ Nothing could be
done besides making its final days (well, years) as comfortable as
possible. But now, finally, gone it is, and its replacement is a much
younger, much healthier, much better architected, and much more capable
version control system—Subversion.

In a world where you can buy hundreds of gigabytes of storage for less
than a hundred dollars, is it really necessary to have a complex version
control system at all? After all, you can just make copies of the file
you’re changing and use the diff command to look at old versions, right?
Well, you can, but hopefully by the end of this article you will not only
see the use of version control in general, but why you absolutely,
positively must be using Subversion to manage all of your own files.

Although the name might imply otherwise, Subversion is a version control
system that will feel fairly comfortable to anyone with CVS experience.
It is not a drastic change to a whole new paradigm of version control, nor
is it an avant garde tool that revolutionizes command line version
control. No, although it is neither of those, Subversion is most
definitely an important version control tool, and, unless you need some of
the more specialized features of other modern version control software, it
is the one you should reach for by default. Again, CVS is dead.

Billed as a better CVS, Subversion is aimed at centralized, client-server
version control much like CVS, Perforce, and Visual SourceSafe. It began
with the intentions of meeting feature parity with most of CVS (most,
meaning the few areas where it diverges, it diverges for good reasons)
while having a cleaner and more extensible codebase to act as a launching
pad for more innovative features in later versions. As we shall see, the
Subversion team more than delivered on this promise.

Concepts

Like CVS, Subversion has a concept of a single, central repository (often
residing on a dedicated server) that stores all data about the projects
you are working on. The repository is best
thought of as the ultimate source of truth and history for your work. It
knows about every change you have ever committed and can instantly take
you back and forth in time to inspect those changes and build further upon
them. You never work in the repository directly, though. Instead, you
pull subsets of it into working copies that typically reside on other
systems such as your desktop computer. In these working copies, you make
your changes, and when you are pleased with them, you commit those changes
into the central repository where they become once and forever part of
history.

Each commit (also called a check-in) to the repository is called a revision,
and, in Subversion, revisions are numbered. A commit can be a change to
one file or a dozen, to directories, or to metadata (which we’ll
discuss shortly). The first change you make to your repository is
revision 1; predictably, the second is revision 2, and so forth. In
addition, we speak of HEAD when we mean the latest version of the
repository; so, when you check in revision 17, then HEAD is revision 17,
but when you check in revision 18, then HEAD is revision 18. Whether you
change one file or a hundred files, if the changes you make are part of a
single commit, then they become a single revision. Now suppose
you are in the middle of a commit and your network switch catches on fire,
your desktop is struck by lightning, or you hit Ctrl-C. In
Subversion, a commit is an atomic operation, meaning it either succeeds
entirely or fails entirely; unlike CVS, you can’t end up with half of your
files saved to the repository but the other half unchanged.

You can also undo a change you’ve made, either manually (say, deleting a
line you mistakenly added to a file) or by asking Subversion something
akin to ‘take the change associated with revision 13117, reverse it, and
apply it to my working copy.’ When you commit that change, however, the
revision number does not go down; to Subversion, it is just another change
(even if it undid a previous one), and so the revision number is a simple
increment. So time marches forever, signified by revision numbers,
always forward, never backwards. In a way, you can think of the revision
numbers like important events on a timeline; while it may be a week
between revision 7 and 8, or revisions 100 through 150 may take place in a
single minute, you are guaranteed revision 8 came after revision 7 and no
change occurred in between. In fact, if you want to undo a change and
absolutely must remove it from the repository (say, you accidentally
committed a plain text file with a password in it—bad, bad, bad!),
you must go to great lengths to banish such a file from the repository.
So such a thing is possible, but difficult. (Just asking Subversion to
remove it from the latest change in the repository isn’t
enough—Subversion, after all, lets you time travel, and it is
relatively easy to ask for yesterday’s copy of the file, even if it has
been deleted today).
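For instance, even after a file has been deleted in HEAD, you can still read its contents as of an earlier revision. A minimal sketch using Subversion's peg-revision syntax (the repository path, file name, and revision number here are purely hypothetical):

<code class="command">svn cat file:///path/to/repo/passwords.txt@12</code>

This prints the file as it existed in revision 12, whether or not it still exists today.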

Not only does Subversion offer version control of files and directories,
it also offers version control of metadata. In essence, metadata
is data about data. In the world of Subversion, such metadata is called a
property, and every file and directory can have as many properties as you
wish. Changing a property, just like changing a file, requires a
commit to the repository. Metadata like this is commonly used for
indicating if a file is binary or text (not an easy thing to do in an
automated fashion in a world of UTF-8 and other character encodings),
whether it has Windows, UNIX, or old-style Mac line endings, etc. In
addition, you can define your own metadata for your files to indicate,
say, where a file originally came from, what kind of processing it might
need, or anything else you can envision. Once you are in the mode of
thinking about metadata and file properties, you begin to see a myriad of
uses for them. Subversion’s versioning of this metadata is especially
powerful.
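
For example, one might attach a standard property and a custom one to a file and commit the change like any other. A quick sketch (the file name notes.txt and the property name doc:origin are hypothetical):

<code class="command">svn propset svn:eol-style native notes.txt
svn propset doc:origin 'imported from the old wiki' notes.txt
svn proplist -v notes.txt
svn commit -m 'set line-ending style and origin property'</code>

Here svn:eol-style is one of Subversion's built-in properties, while doc:origin is an arbitrary name of our own choosing.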

My first repository

Enough theory; let’s actually take Subversion for a test drive. Unless
you are accessing someone else’s repository, the first thing you will want
to do is create a repository. For our purposes, we will simply make one
in your home directory. If you don’t have Subversion installed,
run yum install subversion as root.

To create the repository, execute the following command
($HOME/svnrepo will be the server location of the
repository, not the location of your working copy):

<code class="command">svnadmin create --fs-type fsfs $HOME/svnrepo</code>

Simple as that—no output means everything went fine. The usage is
quite simple—svnadmin create PATH. We add the
--fs-type fsfs option for the benefit of older versions of Subversion;
as of 1.2, fsfs is the default file system type (don’t worry, this
doesn’t matter for typical use; suffice it to say, as we will see later, a
Subversion repository is effectively just a versioned, user-land file
system).

Although administration commands are performed with the
svnadmin command, the majority
of the time, you will simply use the svn command to manage your
repository. Now that we have a repository, we need to create a working
copy—the server repository directory $HOME/svnrepo is best
thought of as an opaque directory that we generally won’t need to
manipulate. To create your working copy, check out the repository with
the following command:

<code class="command">svn checkout file://$HOME/svnrepo $HOME/checkout</code>

If you see the following output, the check out was successful:

<code class="computeroutput">Checked out revision 0.</code>

It creates a (seemingly) empty directory called checkout in your
home directory. However, if you issue the ls -la command in the
checkout directory, you will see:

<code class="computeroutput">drwxrwxr-x    3 cturner cturner 4096 Aug  8 19:29 ./
drwxr-xr-x  125 cturner cturner 4096 Aug  8 19:29 ../
drwxrwxr-x    7 cturner cturner 4096 Aug  8 19:29 .svn/</code>

Ah, not quite as empty as first glance might tell us. If you have used
CVS, you are no doubt familiar with CVS/ directories
inside of every version controlled directory. The
.svn/ directory is analogous to that, though since
the name begins with a period, it is hidden from ls (and, more
practically, from wildcard expansion such as ls *).

Let’s create a file. First, create a simple file with the
echo command:

<code class="command">echo 'my first repository' &gt; README</code>

Then use the command svn status to check the status of
the new file, and you will see the following output:

<code class="computeroutput">?      README</code>

The svn status command, in this context, asks
Subversion to tell us what it knows about various files in comparison to
what the server knows. In the first invocation, it is saying it knows
absolutely nothing about the file (denoted by a
? in the first column); this means no
file named README is in HEAD of the repository, which
is what we expect as this is an empty repository. Once we run
svn add README though, the story is different, as
svn status shows us:

<code class="computeroutput">A         README</code>

In this case, A means the file has been
added to our working copy, but not yet checked in. In general,
svn status will only show us lines of output for
changes in our working copy.

Let’s go ahead and commit our single file:

<code class="command">svn commit -m 'my first file!'</code>

The commit produces the following output:

<code class="computeroutput">Adding         README
Transmitting file data .
Committed revision 1.</code>

Performing an svn update then shows:

<code class="computeroutput">At revision 1.</code>

Generally, a commit is simply svn commit. Subversion
will then pop up your editor of choice (as defined by the
EDITOR environment variable) for
you to describe your check-in—here you generally leave a message for
posterity, describing the change, why it was needed, and perhaps even
referencing a bug tracking number. For the sake of an easily read
article, though, we include that message on the command line via the
-m option.
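
For instance, an interactive commit might look like this (the choice of editor is, naturally, yours):

<code class="command">export EDITOR=vim
svn commit</code>

Subversion opens the editor with a log message template; write your message, save, and quit, and the commit proceeds.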

Notice that we performed an svn update
after our commit. This is necessary for the next
step. Generally speaking, even though our commit
created revision 1, our working copy was last synced at
revision 0. This means we need to ask the server for any changes since
our checkout (or the last time we synced our working copy). We do this with
a simple svn update command.

Let’s view our history with the svn log command:

<code class="computeroutput">------------------------------------------------------------------------
r1 | cturner | 2005-08-08 19:55:34 -0700 (Mon, 08 Aug 2005) | 1 line

my first file!
------------------------------------------------------------------------</code>

There it is, our change along with our check-in message. To see what files
were changed, though, we add two options:

<code class="command">svn log -v -r 1</code>

which gives the output:

<code class="computeroutput">------------------------------------------------------------------------
r1 | cturner | 2005-08-08 19:55:34 -0700 (Mon, 08 Aug 2005) | 1 line
Changed paths:
A /README

my first file!
------------------------------------------------------------------------</code>

The -v option tells Subversion to be verbose, which, in
the case of svn log, means to list the files changed
(the leading / in /README
indicates our change was at the root of our repository). The -r 1
parameter tells Subversion to give us just the changes for
revision 1, not all changes like svn log defaults to.
Generally you want to combine -r # with
-v so you don’t end up with page after page of changes
scrolling by. Likewise, you can do svn log -v -r HEAD
instead of the numeric revision to see the latest change.

Getting fancy

The above is enough to create files, edit files, and generally be
productive at a basic level, but Subversion offers much more. First and
foremost, Subversion will version control directories. This means, unlike
CVS, adding and removing directories are part of the repository history:

<code class="computeroutput">[gandalf@moria checkout]$ mkdir src
[gandalf@moria checkout]$ echo 'first file' &gt; src/file1.txt
[gandalf@moria checkout]$ echo 'second file' &gt; src/file2.txt
[gandalf@moria checkout]$ svn status
?      src
[gandalf@moria checkout]$ svn add src/
A         src
A         src/file1.txt
A         src/file2.txt
[gandalf@moria checkout]$ svn status
A      src
A      src/file2.txt
A      src/file1.txt
[gandalf@moria checkout]$ svn commit -m 'add some source files'
Adding         src
Adding         src/file1.txt
Adding         src/file2.txt
Transmitting file data ..
Committed revision 2.
[gandalf@moria checkout]$ svn update
At revision 2.
[gandalf@moria checkout]$ svn log -r 2 -v
------------------------------------------------------------------------
r2 | cturner | 2005-08-08 20:09:15 -0700 (Mon, 08 Aug 2005) | 1 line
Changed paths:
A /src
A /src/file1.txt
A /src/file2.txt

add some source files
------------------------------------------------------------------------</code>

As simple as that, we’ve made a directory, added it to our working copy,
and committed it. Now let’s change a file that we already have created
(which, generally, is a more common operation; after all, files are only
created once, but edited many times). After changing the contents of the
file src/file1.txt, svn stat shows
us that it has been modified:

<code class="computeroutput">M      src/file1.txt</code>

To commit it:

<code class="command">svn commit -m 'replace file1 with new content'</code>

which produces the output:

<code class="computeroutput">Sending        src/file1.txt
Transmitting file data .
Committed revision 3.</code>

and svn up produces:

<code class="computeroutput">At revision 3.</code>

Note that this time we have shortened svn status to
simply svn stat and svn update to
just svn up. svn offers a number of such
abbreviations, which are visible via svn help; it lists all of the
commands svn supports along with their abbreviations in
parentheses after each command.
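
For example, to see the full list of subcommands and then the details (including aliases) for status:

<code class="command">svn help
svn help stat</code>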

One thing that may differ from other version control systems you’ve used
is that you did not have to explicitly check a file out for editing or
otherwise mark it as being modified—you just edit the file. Also
notice that, this time, svn stat showed us the
M state. This means the file has been locally
modified. Let’s explore this change further, though. Subversion not only
lets you see the reasoning behind each change and the list of changed
files, but it also lets you see the actual change with the
svn diff command. In our case, we wish to see the changes that
occurred in going from revision 2 to revision 3:

<code class="command">svn diff -r 2:3</code>

which produces:

<code class="computeroutput">Index: src/file1.txt
===================================================================
--- src/file1.txt       (revision 2)
+++ src/file1.txt       (revision 3)
@@ -1 +1 @@
-first file
+this is the new file1</code>

The output is a unified diff of the files that have changed between
revisions 2 and 3; in our case, only one file changed
(src/file1.txt) and the change replaced the one and
only line in the file. If we omitted the :3 and just
executed svn diff -r 2, then svn would perform the diff
between revision 2 and whatever revision we had most recently synced in
our working copy. We can also view more changes at once if we
wish—we just execute svn diff -r M:N where M is
less than N. The result, again, is a diff, this time representing all
changes between revision M and N. When you are editing your working copy,
svn diff (without the -r parameter)
will show a diff between your working copy and the version of the
repository you last synced to (note, this isn’t against the latest version
in the repository—for that, just svn up and
svn diff again).
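
For example, to compare your working copy against the very latest revision, sync first and then diff:

<code class="command">svn up
svn diff</code>

(Alternatively, svn diff -r HEAD contacts the repository and compares your working copy directly against HEAD without syncing first.)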

Let’s explore our first-ever change with this new tool and see how it
looks. svn diff -r 0:1 produces:

<code class="computeroutput">Index: README
===================================================================
--- README      (revision 0)
+++ README      (revision 1)
@@ -0,0 +1 @@
+my first repository</code>

This says ‘give us the change between revision 0 and 1’ which is simply
us adding the README file. One limitation of this
view of a diff is that it isn’t obvious if the file was present before and
empty, or if it never existed—the diff simply looks like it added a
line to the file. However, svn log shows us the truth.

Suppose we decide, though, that our original README
should be named README.txt. If we were using CVS, we
would be forced to delete README and create a new
file, README.txt from the previous file’s contents.
This loses the history of the file. In Subversion, though, we
have full control. The command svn mv README README.txt produces:

<code class="computeroutput">A         README.txt
D         README</code>

And, svn stat produces:

<code class="computeroutput">A  +   README.txt
D      README</code>

There are two important things here. First, to Subversion, a rename looks
almost like an addition (represented by the
A change for
README.txt) and a deletion (represented by the
D change for
README). The only difference is the
+ next to the
A which, in this case, makes all the
difference. When we do an svn mv or an svn cp,
Subversion will actually copy the history and metadata of
the file with it.

Also worth noticing is that we did not run a bare mv on
the file ourselves. Subversion changed our working copy for us. Likewise,
when we use svn cp, Subversion will copy the file for
us (preserving history and metadata) so that we don’t have to. Committing
with the command svn commit -m 'rename README -> README.txt'
produces:


<code class="computeroutput">Deleting       README
Adding         README.txt

Committed revision 4.</code>

svn up produces:

<code class="computeroutput">At revision 4.</code>

And, svn diff -r 3:4 produces:

<code class="computeroutput">Index: README
===================================================================
--- README      (revision 3)
+++ README      (revision 4)
@@ -1 +0,0 @@
-my first repository
Index: README.txt
===================================================================
--- README.txt  (revision 0)
+++ README.txt  (revision 4)
@@ -0,0 +1 @@
+my first repository</code>

This is somewhat troubling, though. Notice that according to the commit
output and the diff, it looks like we just completely removed the
README file and added a new file called
README.txt. svn log -v -r 4, however, shows
us something different:

<code class="computeroutput">------------------------------------------------------------------------
r4 | cturner | 2005-08-08 20:27:02 -0700 (Mon, 08 Aug 2005) | 1 line
Changed paths:
D /README
A /README.txt (from /README:3)

rename README -&gt; README.txt
------------------------------------------------------------------------</code>

Notice the (from /README:3) next to the
A line. This means Subversion copied the
history and metadata of the file, basing the new file on the old. We can
also see this with a variant svn log README.txt that shows us
the sordid history of a single file:

<code class="computeroutput">------------------------------------------------------------------------
r4 | cturner | 2005-08-08 20:27:02 -0700 (Mon, 08 Aug 2005) | 1 line

rename README -&gt; README.txt
------------------------------------------------------------------------
r1 | cturner | 2005-08-08 19:55:34 -0700 (Mon, 08 Aug 2005) | 1 line

my first file!
------------------------------------------------------------------------</code>

Notice that although there was no file called
README.txt in revision 1 (r1), log shows it to us as
part of the history for README.txt.

This is an example of an important concept to remember. Sometimes, a
change is not easily represented for human consumption. Often, we are
used to looking at changes in terms of diffs of files. Some changes,
though, such as renames or metadata changes do not represent themselves
well as diffs. So even though in some ways it looks like Subversion lost
the fact that README.txt was once
README, this is actually just an artifact of how we
are looking at the changes. Rest assured, Subversion is doing the right
thing internally.

Let’s take renames a bit further, well beyond anything CVS might let us
do—let’s rename a directory! Using the command svn mv src text-files
produces:

<code class="computeroutput">A         text-files
D         src/file2.txt
D         src/file1.txt
D         src</code>

which gives the following output for svn stat:

<code class="computeroutput">A  +   text-files
D      src
D      src/file2.txt
D      src/file1.txt</code>

Now, we have to commit the directory name change:

<code class="command">svn commit -m 'rename src to text-files'</code>

which produces:


<code class="computeroutput">Deleting       src
Adding         text-files

Committed revision 5.</code>

Issuing svn up produces:

<code class="computeroutput">At revision 5.</code>

There is one major difference this time: even though we
performed svn mv on the directory, the old src/ directory remained
on disk until the commit took place. This is simply Subversion’s record
keeping (even though src/ is empty of our files, it still has the
.svn/ directory) and not actually a problem.

Now let’s make a change, but abort before we commit. Suppose in a moment
of anger, we execute the svn rm * command.

Oh no! Our working copy is empty! Remember, though, this is just a
working copy; until we perform an svn commit, nothing
has changed in the server (though as we will soon see, even if it had, we
could undo it). We have two options. One is to blow away our working
copy and start anew with a fresh checkout. This works, but there is a
more elegant option for a more civilized system such as Subversion:

<code class="command">svn revert -R .</code>

produces:

<code class="computeroutput">Reverted 'text-files'
Reverted 'text-files/file2.txt'
Reverted 'text-files/file1.txt'
Reverted 'README'</code>

And now svn up shows us:

<code class="computeroutput">At revision 5.</code>

Voila! Not only are our files back, but as you can see from svn up,
we didn’t change the repository (which is still at revision
5 from our previous change).

Alas! svn revert only works when you have yet to check
in a change. If you realize a mistake after a commit, you must do
something else. In our case, let us suppose we did make such a
mistake—we should never have renamed README
into README.txt. We need to undo that change. We
have two options. One is to simply svn mv README.txt README
and commit. That will work fine, and Subversion will
DTRT (do the right thing) and preserve history and metadata. But suppose
our change had covered hundreds of files in dozens of
directories…that could be tedious to fix by hand. Fortunately, unlike
in real life, with Subversion we can easily undo our past sins. First, we
find the change we wish to undo with svn log:

<code class="computeroutput">------------------------------------------------------------------------
r5 | cturner | 2005-08-08 20:32:29 -0700 (Mon, 08 Aug 2005) | 1 line

rename src to text-files
------------------------------------------------------------------------
r4 | cturner | 2005-08-08 20:27:02 -0700 (Mon, 08 Aug 2005) | 1 line

rename README -&gt; README.txt
------------------------------------------------------------------------
r3 | cturner | 2005-08-08 20:12:39 -0700 (Mon, 08 Aug 2005) | 1 line

replace file1 with new content
------------------------------------------------------------------------
r2 | cturner | 2005-08-08 20:09:15 -0700 (Mon, 08 Aug 2005) | 1 line

add some source files
------------------------------------------------------------------------
r1 | cturner | 2005-08-08 19:55:34 -0700 (Mon, 08 Aug 2005) | 1 line

my first file!
------------------------------------------------------------------------</code>

Ah there it is. The change from revision 3 to revision 4. Now we use the
svn merge -r 4:3 command to merge in the change we wish to
undo:

<code class="computeroutput">D    README.txt
A    README</code>

svn stat shows that the merge is set to take place:

<code class="computeroutput">D      README.txt
A  +   README</code>

The last step is to commit the change with svn commit -m 'undo change 3:4',
which produces:

<code class="computeroutput">Adding         README
Deleting       README.txt</code>

To confirm, svn up shows:

<code class="computeroutput">At revision 6.</code>

A few interesting points—first, we specified 4:3, not 3:4. This
actually makes sense as it is the change from revision 3 to revision 4 we
wish to undo, so we specify them in reverse order. We can also specify
‘backwards’ revisions like this when viewing diffs, should we find the
need. Second, the change looks identical to what we would see from just
performing an svn mv. Although internally Subversion
is being smart about the file’s metadata and contents, in practice
reverting this particular change amounts to a simple svn mv.

Conclusion

Hopefully our whirlwind tour of Subversion has left you with an
understanding of the power of version control in general and of Subversion
in particular. If you are a CVS user, you hopefully noticed two key
things. One, that the command line usage of svn is very similar to cvs.
Two, that you can do far more with Subversion than with CVS and you can
work more reliably with clearer behavior and more predictable results.

There is far, far more that Subversion has to offer, however. This is but
a quick glance. Fortunately, the resources available online are of very
high quality. In particular, there is an entire book freely available
online at http://svnbook.red-bean.com/.

If you find the book useful, don’t hesitate to order the print copy
(published by O’Reilly, no less); it is an indispensable resource both as
a tutorial and introduction and as a reference.

About the author

Chip Turner is a Site Reliability Engineer at Google, Inc. Before that,
he spent four years working at Red Hat on the Red Hat Network, perl, and
several perl packages for Fedora Core and Red Hat® Enterprise Linux®. He
also maintains a number of CPAN modules, contributes to other open source
projects, and generally abuses Linux personally and professionally on a
daily basis. In his spare time he enjoys playing with his dog and arguing
for no apparent reason.

Sendmail configuration for domains with wild card DNS entry

Problem


My SMTP server, sendmail 8.13.4, appends my local domain name to
the recipient’s email address when it cannot resolve the recipient’s
domain.

Here, for example:

**********************************************
** THIS IS A WARNING MESSAGE ONLY **
** YOU DO NOT NEED TO RESEND YOUR MESSAGE **
**********************************************

The original message was received at Mon, 8 Aug 2005 06:47:30 -0700
from superman [202.163.211.54]

----- Transcript of session follows -----
< endu...@anotherdomain.com >… Deferred: Connection timed out with
anotherdomain.com.mydomain.com.
Warning: message still undelivered after 12 hours
Will keep trying until message is 5 days old

[r…@mydomain.com cf]# sendmail -bt -C /etc/mail/sendmail.cf
ADDRESS TEST MODE (ruleset 3 NOT automatically invoked)
Enter < ruleset > < address >
> /try smtp s…@thisdomainnotexistatall.com

Trying envelope recipient address s…@thisdomainnotexistatall.com for
mailer smtp
canonify input: super @ thisdomainnotexistatall . com
Canonify2 input: super < @ thisdomainnotexistatall . com >
Canonify2 returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
canonify returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
2 input: super < @ thisdomainnotexistatall . com . mydomain . com . >
2 returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
EnvToSMTP input: super < @ thisdomainnotexistatall . com . mydomain . com . >
PseudoToReal input: super < @ thisdomainnotexistatall . com . mydomain . com . >
PseudoToReal returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
MasqSMTP input: super < @ thisdomainnotexistatall . com . mydomain . com . >
MasqSMTP returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
EnvToSMTP returns: super < @ thisdomainnotexistatall . com . mydomain . com . >
final input: super < @ thisdomainnotexistatall . com . mydomain . com . >
final returns: super @ thisdomainnotexistatall . com . mydomain . com

Now, where did anotherdomain.com.mydomain.com come from? Also, why is it adding mydomain.com?

Solution

This is most likely because the domain mydomain.com has a wildcard DNS entry (*.mydomain.com). To fix it, edit the ResolverOptions option in /etc/mail/sendmail.cf to the following:

O ResolverOptions=+AAONLY -DNSRCH HasWildcardMX

From the sendmail documentation:

The ResolverOptions (I) option allows you to tweak name server options. The command line takes a series of flags as documented in resolver(3) (with the leading RES_ deleted). Each can be preceded by an optional '+' or '-'. For example, the line

O ResolverOptions=+AAONLY -DNSRCH

turns on the AAONLY (accept authoritative answers only) and turns off the DNSRCH (search the domain path) options. Most resolver libraries default DNSRCH, DEFNAMES, and RECURSE flags on and all others off. You can also include HasWildcardMX to specify that there is a wildcard MX record matching your domain; this turns off MX matching when canonifying names, which can lead to inappropriate canonifications.
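
Note that if sendmail.cf is generated from an m4 master file (sendmail.mc), a change made directly to sendmail.cf will be lost the next time the file is rebuilt. Assuming the standard confBIND_OPTS macro (which maps to ResolverOptions), the equivalent change in /etc/mail/sendmail.mc would look roughly like this:

define(`confBIND_OPTS', `+AAONLY -DNSRCH HasWildcardMX')dnl

Then regenerate sendmail.cf (for example, make -C /etc/mail on Fedora) and restart sendmail.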

How To Look Like A UNIX Guru

Terence Parr

UNIX is an extremely popular platform for deploying server software partly because of its security and stability, but also because it has a rich set of command line and scripting tools. Programmers use these tools for manipulating the file system, processing log files, and generally automating as much as possible.

If you want to be a serious server developer, you will need to have a certain facility with a number of UNIX tools (about 15 of them). You will start to see similarities among them, particularly regular expressions, and soon you will feel very comfortable. Combining these simple commands, you can build very powerful tools very quickly–much faster than you could build the equivalent functionality in C or Java, for example.

This lecture takes you through the basic commands and then shows you how to combine them in simple patterns or idioms to provide sophisticated functionality like histogramming. This lecture assumes you know what a shell is and that you have some basic familiarity with UNIX.

Everything is a stream

The first thing you need to know is that UNIX is based upon the idea of a stream. Everything is a stream, or appears to be. Device drivers look like streams, terminals look like streams, processes communicate via streams, etc… The input and output of a program are streams that you can redirect into a device, a file, or another program.

Here is an example device, the null device, that lets you throw output away. For example, you might want to run a program but ignore the output.

$ ls &gt; /dev/null # ignore output of ls

where “# ignore output of ls” is a comment.

Most of the commands covered in this lecture process stdin and send results to stdout. In this manner, you can incrementally process a data stream by hooking the output of one tool to the input of another via a pipe. For example, the following piped sequence prints the number of files in the current directory modified in August.

$ ls -l | grep Aug | wc -l

Imagine how long it would take you to write the equivalent C or Java program. You can become an extremely productive UNIX programmer if you learn to combine the simple command-line tools. Even when programming on a PC, I use MKS’s UNIX shell and command library to make it look like a UNIX box. Worth the cash.

Getting help

If you need to know about a command, ask for the “man” page. For example, to find out about the ls command, type

$ man ls
LS(1)                   System General Commands Manual                   LS(1)

NAME
ls - list directory contents

SYNOPSIS
ls [-ACFLRSTWacdfgiklnoqrstux1] [file ...]

DESCRIPTION
For each operand that names a file of a type other than directory, ls
...

You will get a summary of the command and any arguments.

If you cannot remember the command’s name, try using apropos which finds commands and library routines related to that word. For example, to find out how to do checksums, type

$ apropos checksum
cksum(1), sum(1)         - display file checksums and block counts
md5(1)                   - calculate a message-digest fingerprint (checksum) for a file

The basics

There are 4 useful ways to display the contents or portions of a file. The first is the very commonly used command cat. For example, to display your bash initialization file, type:

$ cat ~parrt/.bash_profile

where ~parrt is user parrt’s home directory. If a file is really big, you will probably want to use more, which spits the file out in screen-size chunks.

$ more /var/log/mail.log

If you only want to see the first few lines of a file or the last few lines use head and tail.

$ head /var/log/mail.log
$ tail /var/log/mail.log

You can specify a number as an argument to get a specific number of lines:

$ head -30 /var/log/mail.log

The most useful incantation of tail prints the last few lines of a file and then waits, printing new lines as they are appended to the file. This is great for watching a log file:

$ tail -f /var/log/mail.log

If you need to know how many characters, words, or lines are in a file, use wc:


$ wc /var/log/mail.log
164    2916   37896 /var/log/mail.log

Where the numbers are, in order, lines, words, then characters. For clarity, you can use wc -l to print just the number of lines.
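
For example, using the same log file as above:

$ wc -l /var/log/mail.log
164 /var/log/mail.log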

Tarballs

Note: The name comes from a similar word, hairball (stuff that cats throw up), I’m pretty sure.

To collect a bunch of files and directories together, use tar. For example, to tar up your entire home directory and put the tarball into /tmp, do this

$ cd ~parrt
$ cd .. # go one dir above dir you want to tar
$ tar cvf /tmp/parrt.backup.tar parrt

By convention, use .tar as the extension. To untar this file use

$ cd /tmp
$ tar xvf parrt.backup.tar

tar untars things in the current directory!

After running the untar, you will find a new directory, /tmp/parrt, that is a copy of your home directory. Note that the way you tar things up dictates the directory structure when untarred. The fact that I mentioned parrt in the tar creation means that I’ll have that dir when untarred. In contrast, the following will also make a copy of my home directory, but without having a parrt root dir:

$ cd ~parrt
$ tar cvf /tmp/parrt.backup.tar *

It is a good idea to tar things up with a root directory so that when you untar you don’t generate a million files in the current directory. To see what’s in a tarball, use


$ tar tvf /tmp/parrt.backup.tar

Most of the time you can save space by using the z argument. The tarball will then be gzip‘d and you should use file extension .tar.gz:

$ cd ~parrt
$ cd .. # go one dir above dir you want to tar
$ tar cvfz /tmp/parrt.backup.tar.gz parrt

Unzipping requires the z argument also:

$ cd /tmp
$ tar xvfz parrt.backup.tar.gz

If you have a big file to compress, use gzip:

$ gzip bigfile

After execution, your file will have been renamed bigfile.gz. To uncompress, use

$ gzip -d bigfile.gz

To display a text file that is currently gzip‘d, use zcat:

$ zcat bigfile.gz

Searching streams

One of the most useful tools available on UNIX and the one you may use the most is grep. This tool matches regular expressions (which includes simple words) and prints matching lines to stdout.

The simplest incantation looks for a particular character sequence in a set of files. Here is an example that looks for any reference to System in the java files in the current directory.

$ grep System *.java

You may find the dot ‘.’ regular expression useful. It matches any
single character but is typically combined with the star, which
matches zero or more of the preceding item. Be careful to enclose the
expression in single quotes so the command-line expansion doesn’t
modify the argument. The following example looks for references to
any forum page in a server log file:

$ grep '/forum/.*' /home/public/cs601/unix/access.log

or equivalently:

$ cat /home/public/cs601/unix/access.log | grep '/forum/.*'

The second form is useful when you want to process a collection of files as a single stream as in:

$ cat /home/public/cs601/unix/access*.log | grep '/forum/.*'

If you need to look for a string at the beginning of a line, use caret ‘^’:

$ grep '^195.77.105.200' /home/public/cs601/unix/access*.log

This finds all lines in all access logs that begin with IP address
195.77.105.200.

If you would like to invert the pattern matching to find lines that do not match a pattern, use -v. Here is an example that finds references to non image GETs in a log file:

$ cat /home/public/cs601/unix/access.log | grep -v '/images'

Now imagine that you have an http log file and you would like to filter out page requests made by nonhuman spiders. If you have a file of spider IP addresses called /tmp/spider.IPs, you can find all nonspider page views via:

$ cat /home/public/cs601/unix/access.log | grep -v -f /tmp/spider.IPs

Finally, to ignore the case of the input stream, use -i.
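
For example, to match lines containing "warning" regardless of capitalization:

$ grep -i 'warning' /var/log/mail.log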

Translating streams

Morphing a text stream is a fundamental UNIX operation. PERL is a good tool for this, but since I don’t like PERL I stick with three tools: tr, sed, and awk. PERL and these tools are line-by-line tools in that they operate well only on patterns fully contained within a single line. If you need to process more complicated patterns like XML or you need to parse a programming language, use a context-free grammar tool like ANTLR.

tr

For manipulating whitespace, you will find tr very useful.

If you have columns of data separated by spaces and you would like the columns to collapse so there is a single column of data, tell tr to replace space with newline: tr ' ' '\n'. Consider input file /home/public/cs601/unix/names:


jim scott mike
bill randy tom

To get all those names in a column, use

$ cat /home/public/cs601/unix/names | tr ' ' '\n'

If you would like to collapse all sequences of spaces into one single space, use tr -s ' '.

To convert a PC file to UNIX, you have to get rid of the '\r' characters. Use tr -d '\r'.
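
Two quick sketches of those last two incantations (the DOS-format file name here is hypothetical):

$ cat /home/public/cs601/unix/names | tr -s ' '
$ tr -d '\r' &lt; dosfile.txt &gt; unixfile.txt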

sed

If dropping or translating single characters is not enough, you can
use sed (stream editor) to replace or delete text chunks matched by
regular expressions. For example, to delete all references to word
scott in the names file from above, use

$ cat /home/public/cs601/unix/names | sed 's/scott//'

which replaces scott with nothing, deleting it. If there are multiple references to scott on a single line, use the g suffix to indicate "global" on that line; otherwise only the first occurrence will be removed:

$ ... | sed 's/scott//g'

If you would like to replace references to view.jsp with index.jsp, use

$ ... | sed 's/view.jsp/index.jsp/'

If you want any .asp file converted to .jsp, you must match the file name with a regular expression and refer to it via \1:

$ ... | sed 's/\(.*\).asp/\1.jsp/'

The \(...\) grouping collects text that you can refer to with \1.

If you want to kill everything from the ‘,’ character to end of line, use the end-of-line marker $:

$ ... | sed 's/,.*$//' # kill from comma to end of line

awk

When you need to work with columns of data or execute a little bit of code for each line matching a pattern, use awk. awk programs are pattern-action pairs. While some awk programs are complicated enough to require a separate file containing the program, you can do some amazing things using an argument on the command-line.

awk thinks input lines are broken up into fields (i.e., columns) separated by whitespace. Fields are referenced in an action via $1, $2, … while $0 refers to the entire input line.

A pattern-action pair looks like:

pattern {action}

If you omit the pattern, the action is executed for each input line. Omitting the action means print the line. You can separate the pairs by newline or semicolon.

Consider input

aasghar Asghar, Ali
wchen   Chen, Wei
zchen   Chen, Zhen-Jian

If you want a list of login names, ask awk to print the first column:

$ cat /home/public/cs601/unix/emails.txt | awk '{print $1;}'

If you want to convert the login names to email addresses, use the printf C-lookalike function:

$ cat /home/public/cs601/unix/emails.txt | awk '{printf("%s@cs.usfca.edu,",$1);}'

Because of the missing \n in the printf string, you’ll see the output all on one line ready for pasting into a mail program:

aasghar@cs.usfca.edu,wchen@cs.usfca.edu,zchen@cs.usfca.edu

You might also want to reorder columns of data. To print firstname, lastname, you might try:

$ cat /home/public/cs601/unix/emails.txt | awk '{printf("%s %s\n", $3, $2);}'

but you’ll notice that the comma is still there as it is part of the column:

Ali Asghar,
Wei Chen,
Zhen-Jian Chen,

You need to pipe the output thru tr (or sed) to strip the comma:

$ cat /home/public/cs601/unix/emails.txt | \
awk '{printf("%s %s\n", $3, $2);}' | \
tr -d ','

Then you will see:


Ali Asghar
Wei Chen
Zhen-Jian Chen

You can also use awk to examine the values themselves. To sum the first column of the following data (in file /home/public/cs601/unix/coffee):

3 parrt
2 jcoker
8 tombu

use the following simple command:

$ awk '{n+=$1;} ; END {print n;}' &lt; /home/public/cs601/unix/coffee

where END is a special pattern that means “after processing the stream.”

If you want to filter and sum only the values less than or equal to, say, 3, use an if statement:

$ awk '{if ($1&lt;=3) n+=$1;} END {print n;}' &lt; /home/public/cs601/unix/coffee

In this case, you will see the output 5 (3+2).

Using awk to grab a particular column is very common when processing log files. Consider a http://www.jguru.com page view log file, /home/public/cs601/unix/pageview-20021022.log, whose lines are of the form:

date-stamp(thread-name): userID-or-IPaddr URL site-section

So, the data looks like this:

20021022_00.00.04(tcpConnection-80-3019):       203.6.152.30    /faq/subtopic.jsp?topicID=472&page=2    FAQs
20021022_00.00.07(tcpConnection-80-2981):       995134  /index.jsp      Home
20021022_00.00.08(tcpConnection-80-2901):       66.67.34.44     /faq/subtopic.jsp?topicID=364   FAQs
20021022_00.00.12(tcpConnection-80-3003):       217.65.96.13    /faq/view.jsp?EID=736437        FAQs
20021022_00.00.13(tcpConnection-80-3019):       203.124.210.98  /faq/topicindex.jsp?topic=JSP   FAQs/JSP
20021022_00.00.15(tcpConnection-80-2988):       202.56.231.154  /faq/index.jsp FAQs
20021022_00.00.19(tcpConnection-80-2976):       66.67.34.44     /faq/view.jsp?EID=225150        FAQs
20021022_00.00.21(tcpConnection-80-2974):       143.89.192.5    /forums/most_active.jsp?topic=EJB       Forums/EJB
20021022_00.00.21(tcpConnection-80-2996):       193.108.239.34  /guru/edit_account.jsp  Guru
20021022_00.00.21(tcpConnection-80-2996):       193.108.239.34  /misc/login.jsp Misc
...

When a user is logged in, the log file has their user ID rather than their IP address.

Here is how you get a list of URLs that people view on say October 22, 2002:


$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log
/faq/subtopic.jsp?topicID=472&page=2
/index.jsp
/faq/subtopic.jsp?topicID=364
/faq/view.jsp?EID=736437
/faq/topicindex.jsp?topic=JSP
/faq/index.jsp
/faq/view.jsp?EID=225150
/forums/most_active.jsp?topic=EJB
/guru/edit_account.jsp
/misc/login.jsp
...

If you want to count how many page views there were that day that were not processing pages (my processing pages are all of the form process_xxx), pipe the results through grep and wc:

$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log | \
grep -v process | \
wc -l
67850

If you want a unique list of URLs, you can sort the output and then use uniq:

$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log | \
sort | \
uniq

uniq just collapses all repeated lines into a single line–that is why you must sort the output first. You’ll get output like:

/article/index.jsp
/article/index.jsp?page=1
/article/index.jsp?page=10
/article/index.jsp?page=2
...

Moving files between machines

rsync

When you need to have a directory on one machine mirrored on another machine, use rsync. It compares all the files in a directory subtree and copies over any that have changed to the mirrored directory on the other machine. For example, here is how you could "pull" all log files from livebox.jguru.com to the box from which you execute the rsync command:

$ hostname
jazz.jguru.com
$ rsync -rabz -e ssh -v 'parrt@livebox.jguru.com:/var/log/jguru/*' \
/backup/web/logs

rsync will delete or truncate files to ensure the files stay the same. This is bad if you erase a file by mistake–it will wipe out your backup file. Add an argument called --suffix to tell rsync to make a copy of any existing file before it overwrites it:

$ hostname
jazz.jguru.com
$ rsync -rabz -e ssh -v --suffix .rsync_`date '+%Y%m%d'` \
'parrt@livebox.jguru.com:/var/log/jguru/*' /backup/web/logs

where `date '+%Y%m%d'` (in reverse single quotes) means "execute this date command".

To exclude certain patterns from the sync, use --exclude:


$ rsync -rabz --exclude=entitymanager/ --suffix .rsync_`date '+%Y%m%d'` \
-e ssh -v 'parrt@livebox.jguru.com:/var/log/jguru/*' /backup/web/logs

scp

To copy a file or directory manually, use scp:

$ scp lecture.html parrt@nexus.cs.usfca.edu:~parrt/lectures

Just like cp, use -r to copy a directory recursively.

Miscellaneous

find

Most GUIs for Linux or PCs have a search facility, but from the command-line you can use find. To find all files named .p4 starting in directory ~/antlr/depot/projects, use:

$ find  ~/antlr/depot/projects -name '.p4'

The default “action” is to -print.

You can specify a regular expression to match. For example, to look under your home directory for any xml files, use:

$ find ~ -name '*.xml' -print

Note the use of the single quotes to prevent command-line expansion–you want the ‘*’ to go to the find command.

You can execute a command for every file or directory found that matches a name. For example, to delete all xml files, do this:

$ find ~ -name '*.xml' -exec rm {} \;

where “{}” stands for “current file that matches”. The end of the command must be terminated with ‘;’ but because of the command-line expansion, you’ll need to escape the ‘;’.

You can also specify time information in your query. Here is a shell script that uses find to delete all files older than 14 days.

#!/bin/sh

BACKUP_DIR=/var/data/backup

# number of days to keep backups
AGE=14 # days
AGE_MINS=$[ $AGE * 60 * 24 ]

# delete dirs/files
find $BACKUP_DIR/* -cmin +$AGE_MINS -type d -exec rm -rf {} \;

fuser

If you want to know who is using a port such as HTTP (80), use fuser. You must be root to use this:

$ sudo /sbin/fuser -n tcp 80
80/tcp:              13476 13477 13478 13479 13480
13481 13482 13483 13484 13486 13487 13489 13490 13491
13492 13493 13495 13496 13497 13498 13499 13500 13501 13608

The output indicates the list of processes associated with that port.

whereis

Sometimes you want to use a command but it’s not in your PATH and you can’t remember where it is. Use whereis to look in standard unix locations for the command.

$ whereis fuser
fuser: /sbin/fuser /usr/man/man1/fuser.1 /usr/man/man1/fuser.1.gz
$ whereis ls
ls: /bin/ls /usr/man/man1/ls.1 /usr/man/man1/ls.1.gz

whereis also shows man pages.

which

Sometimes you might be executing the wrong version of a command and you want to know which version of the command your PATH indicates should be run. Use which to ask:

$ which ls
alias ls='ls --color=tty'
/bin/ls
$ which java
/usr/local/java/bin/java

If nothing is found in your path, you’ll see:

$ which fuser
/usr/bin/which: no fuser in (/usr/local/bin:/usr/local/java/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/X11R6/bin:/home/parrt/bin)

kill

To send a signal to a process, use kill. Typically you’ll want to just say kill pid where pid can be found from ps or top (see below).

Use kill -9 pid when you can’t get the process to die; this means kill it with “extreme prejudice”.
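
For example, using one of the PIDs from the fuser output above:

$ kill 13476       # polite: sends the default TERM signal
$ kill -9 13476    # forceful: SIGKILL cannot be caught or ignored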

traceroute

If you are having trouble getting to a site, use traceroute to watch the sequence of hops used to get to a site:

$ /usr/sbin/traceroute www.cnn.com
1  65.219.20.145 (65.219.20.145)  2.348 ms  1.87 ms  1.814 ms
2  loopback0.gw5.sfo4.alter.net (137.39.11.23)  3.667 ms  3.741 ms  3.695 ms
3  160.atm3-0.xr1.sfo4.alter.net (152.63.51.190)  3.855 ms  3.825 ms  3.993 ms
...

What is my IP address?

$ /sbin/ifconfig

Under the eth0 interface, you’ll see the inet addr:


eth0      Link encap:Ethernet  HWaddr 00:10:DC:58:B1:F0
inet addr:138.202.170.4  Bcast:138.202.170.255  Mask:255.255.255.0
...

pushd, popd

Instead of cd you can use pushd to save the current dir and then automatically cd to the specified directory. For example,

$ pwd
/Users/parrt
$ pushd /tmp
/tmp ~
$ pwd
/tmp
$ popd
~
$ pwd
/Users/parrt

top

To watch a dynamic display of the processes on your box in action, use top.

ps

To print out (wide display) all processes running on a box, use ps auxwww.

Useful combinations

How to kill a set of processes

If you want to kill all java processes running for parrt, you can
either run killall java if you are parrt or generate a “kill”
script via:

$ ps auxwww|grep java|grep parrt|awk '{print "kill -9 ",$2;}' &gt; /tmp/killparrt
$ bash /tmp/killparrt # run resulting script

The /tmp/killparrt file would look something like:

kill -9 1021
kill -9 1023
kill -9 1024

Note: you can also do this common task with:

$ killall java

How to make a histogram

A histogram is a set of (count, value) pairs indicating how often each value occurs. The basic operation will be to sort, then count how many times each value occurs in a row, and then reverse sort so that the value with the highest count is at the top of the report.

$ ... | sort |uniq -c|sort -r -n

Note that sort sorts on the whole line, but the first column is the most significant, just as the first letter in someone’s last name most significantly positions their name in a sorted list.

uniq -c collapses all repeated sequences of values but prints the number of occurrences in front of the value. Recall the previous sorting:

$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log | \
sort | \
uniq
/article/index.jsp
/article/index.jsp?page=1
/article/index.jsp?page=10
/article/index.jsp?page=2
...

Now add -c to uniq:

$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log | \
sort | \
uniq -c
623 /article/index.jsp
6 /article/index.jsp?page=1
10 /article/index.jsp?page=10
109 /article/index.jsp?page=2
...

Now all you have to do is reverse sort the lines according to the first column numerically.

$ awk '{print $3;}' &lt; /home/public/cs601/unix/pageview-20021022.log | \
sort | \
uniq -c | \
sort -r -n
6170 /index.jsp
2916 /search/results.jsp
1397 /faq/index.jsp
1018 /forums/index.jsp
884 /faq/home.jsp?topic=Tomcat
...

In practice, you might want to get a histogram that has been “despidered” and only has faq related views. You can filter out all page view lines associated with spider IPs and filter in only faq lines:

$ grep -v -f /tmp/spider.IPs /home/public/cs601/unix/pageview-20021022.log | \
awk '{print $3;}'| \
grep '/faq' | \
sort | \
uniq -c | \
sort -r -n
1397 /faq/index.jsp
884 /faq/home.jsp?topic=Tomcat
525 /faq/home.jsp?topic=Struts
501 /faq/home.jsp?topic=JSP
423 /faq/home.jsp?topic=EJB
...

If you want to only see despidered faq pages that were referenced more than 500 times, add an awk command to the end.

$ grep -v -f /tmp/spider.IPs /home/public/cs601/unix/pageview-20021022.log | \
awk '{print $3;}'| \
grep '/faq' | \
sort | \
uniq -c | \
sort -r -n | \
awk '{if ($1&gt;500) print $0;}'
1397 /faq/index.jsp
884 /faq/home.jsp?topic=Tomcat
525 /faq/home.jsp?topic=Struts
501 /faq/home.jsp?topic=JSP

Generating scripts and programs

I like to automate as much as possible. Sometimes that means writing a program that generates another program or script.

Processing mail files

I wanted to get a sequence of SQL commands that would update our database whenever someone’s email bounced. Processing the mail file is pretty easy since you can look for the error code followed by the email address. A bounced email looks like:

From MAILER-DAEMON@localhost.localdomain  Wed Jan  9 17:32:33 2002
Return-Path: &lt;&gt;
Received: from web.jguru.com (web.jguru.com [64.49.216.133])
by localhost.localdomain (8.9.3/8.9.3) with ESMTP id RAA18767
for &lt;notifications@jguru.com&gt;; Wed, 9 Jan 2002 17:32:32 -0800
Received: from localhost (localhost)
by web.jguru.com (8.11.6/8.11.6) id g0A1W2o02285;
Wed, 9 Jan 2002 17:32:02 -0800
Date: Wed, 9 Jan 2002 17:32:02 -0800
From: Mail Delivery Subsystem &lt;MAILER-DAEMON@web.jguru.com&gt;

Message-Id: &lt;200201100132.g0A1W2o02285@web.jguru.com&gt;
To: &lt;notifications@jguru.com&gt;
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
boundary="g0A1W2o02285.1010626322/web.jguru.com"
Subject: Returned mail: see transcript for details
Auto-Submitted: auto-generated (failure)

This is a MIME-encapsulated message

--g0A1W2o02285.1010626322/web.jguru.com

The original message was received at Wed, 9 Jan 2002 17:32:02 -0800
from localhost [127.0.0.1]

----- The following addresses had permanent fatal errors -----
&lt;pain@intheneck.com&gt;
(reason: 550 Host unknown)

----- Transcript of session follows -----
550 5.1.2 &lt;pain@intheneck.com&gt;... Host unknown (Name server: intheneck.com: host not found)
...

Notice the SMTP 550 error message. Look for that at the start of a line, then kill the angle brackets, remove the trailing "...", and use awk to print out the SQL:

# This script works on one email or a file full of other emails
# since it just looks for the SMTP 550 or 554 results and then
# converts them to SQL commands.
grep -E '^(550|554)' | \
sed 's/[&lt;&gt;]//g' | \
sed 's/\.\.\.//' | \
awk "{printf(\"UPDATE PERSON SET bounce=1 WHERE email='%s';\n\",\$3);}" &gt;&gt; bounces.sql

I have to escape the $3 because it means something to the surrounding bash shell script and I want awk to see the dollar sign.

Generating getter/setters

#!/bin/bash
# From a type and name (plus firstlettercap version),
# generate a Java getter and setter
#
# Example: getter.setter String name Name
#

TYPE=$1
NAME=$2
UPPER_NAME=$3

echo "public $TYPE get$UPPER_NAME() {"
echo "  return $NAME;"
echo "}"
echo
echo "void set$UPPER_NAME($TYPE $NAME) {"
echo "  this.$NAME = $NAME;"
echo "}"
echo
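
Save that as getter.setter, make it executable (chmod +x getter.setter), and running it with the arguments from the header comment prints boilerplate ready to paste into a Java class:

$ ./getter.setter String name Name
public String getName() {
  return name;
}

void setName(String name) {
  this.name = name;
}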