Here is a little program to convert files that have Windows-style line endings (carriage-return, line-feed, or "CRLF") to UNIX style. The file format doesn't matter to Tcl, which can read either kind of file (and the Macintosh file format too). However, sometimes you want to get rid of the extra carriage-return characters when working on a file that came from Windows. First we present the program and a summary of the commands in it. After that we'll look at parts of the program in more detail.
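
Assembled from the fragments discussed in the sections below, the whole program might look like the following sketch. It is a reconstruction, not necessarily the author's exact listing. The -nocomplain flag and the -- marker on glob are additions that do not appear in the fragments: without -nocomplain, glob raises an error when a directory is empty.

```tcl
#!/usr/local/bin/tclsh
# Sketch of the complete program, pieced together from the fragments
# explained in the sections that follow.

proc Dos2Unix {f} {
    if {[file isdirectory $f]} {
        # Directory: recurse into every entry.
        # -nocomplain and -- are additions (see the note above).
        foreach g [glob -nocomplain -- [file join $f *]] {
            Dos2Unix $g
        }
    } else {
        # Plain file: copy it through Tcl's end-of-line translation.
        set in [open $f]
        set out [open $f.new w]
        # Force UNIX-style line feeds on output.
        fconfigure $out -translation lf
        puts -nonewline $out [read $in]
        close $in
        close $out
        # Replace the original with the converted copy.
        file rename -force $f.new $f
    }
}

# Process every file or directory named on the command line.
foreach f $argv {
    Dos2Unix $f
}
```

Saved as, say, dos2unix.tcl (a hypothetical file name), it would be run as: tclsh dos2unix.tcl myfile.txt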

Running a Tcl script

In UNIX you can start a file with #!/usr/local/bin/tclsh and the system will automatically launch Tcl to interpret the script. On Windows you can set up an association between the ".tcl" file extension and Tcl. In practice, I often end up creating Windows shortcuts instead. For a shortcut you need to specify the path to the Tclsh program, e.g., "C:/Program Files/Tcl/bin/tclsh84.exe", and then give the script name as the first argument to Tclsh.

Tcl Procedures

The main part of the program is a procedure named Dos2Unix. This procedure takes one argument, f, that is the name of a file or directory.

    proc Dos2Unix {f} {
        # Procedure body here
    }

The f parameter is set when the Dos2Unix procedure is called. If you called Dos2Unix like this:

    Dos2Unix myfile.txt

then the f parameter gets the value myfile.txt when executing inside Dos2Unix.

Command Line Arguments

For our program, we want to pass the names of the files to process on the UNIX command line (i.e., when you are invoking the program from Bash or the C shell). Command line arguments are stored in the argv variable. The foreach loop at the end of the script calls Dos2Unix with each file given on the command line:

    foreach f $argv {
        Dos2Unix $f
    }

Testing Conditions

The Dos2Unix procedure works differently on files and directories. If it is passed the name of a file, then it reads and writes that file to do the end-of-line conversions. If it is passed a directory, then it processes all the files in that directory. But first, it must test the name to see if it refers to a directory.

    if {[file isdirectory $f]} {
        # Process the directory
    } else {
        # Process one file
    }

The if command tests the result of its expression. In this case the expression contains a call to another Tcl command, file isdirectory, which returns 1 if the file is a directory and 0 otherwise. Square brackets are used to delimit the nested command. Curly braces are used to group the expression and the two command bodies (the if-part and the else-part).
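
To try the test on its own, here is a small sketch. DescribePath is a hypothetical helper introduced only for illustration; it is not part of the program. It adds a third case, using file exists, for names that refer to nothing at all.

```tcl
# DescribePath is a hypothetical helper, not part of Dos2Unix.
proc DescribePath {f} {
    if {[file isdirectory $f]} {
        return "directory"
    } elseif {[file exists $f]} {
        return "file"
    } else {
        return "missing"
    }
}

# Report on every name given on the command line.
foreach f $argv {
    puts "$f: [DescribePath $f]"
}
```

A name that does not exist would make Dos2Unix fail in the open command, so a check like this could also be used to skip such arguments.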

Looping over a list of values

In the case of a directory, Dos2Unix loops over all the files in that directory. The glob command returns the list of files that match a file name pattern. The * matches all files. This is joined to the name of the directory in a cross-platform way with the file join command, which uses /, \, or : as the pathname separator on Unix, Windows, and Macintosh, respectively. Finally, we get to loop over the list of file names returned by glob:

    foreach g [glob [file join $f *]] {
        Dos2Unix $g
    }

Each time through the foreach loop the loop variable g takes on the next value from the list returned by glob. The easiest way to process the files is to call Dos2Unix recursively. In the recursive call, a whole new set of variables is allocated for Dos2Unix, so there is no conflict between the variables in the different instances of Dos2Unix. (This is standard recursion.)

Working with Files

The heart of the conversion done by Dos2Unix reads the file into memory and writes it back out again. First, we open the files with the open command:

    set in [open $f]
    set out [open $f.new w]

The extra w argument to open causes the file to be opened for writing. We open a different file named with a trailing ".new" suffix. Tcl lets us easily add text after the variable value. The "." in $f.new terminates the variable name and the ".new" is treated as a literal. So, if $f is myfile.txt, then $f.new is myfile.txt.new. Reading and writing the files is done in one combination of commands:

    puts -nonewline $out [read $in]

This reads the whole file into memory and passes it to the puts command for output. By default, puts will append a trailing newline (\n) to its output, but we don't want that in this case, so we pass a flag to turn off that behavior. When we are done with the files, we must close the channels. An important side effect of this is to flush any buffered data to disk:

    close $in
    close $out

The final step renames the new file to the original name.
This effectively deletes the original, because the converted copy takes its place. The -force flag is required when you are replacing an existing file with file rename.

    file rename -force $f.new $f

End-of-Line Characters

The heart of the conversion done by Dos2Unix is simply to read the file into memory and write it back out again. Tcl does automatic end-of-line character conversions. In memory, all line endings read in (UNIX-style line feeds (\n), Windows-style carriage-return, line-feed pairs (\r\n), or Macintosh-style carriage-returns (\r)) are converted to the newline character (\n). During output, Tcl converts newline characters to the native representation. So, simply by reading and writing the file you convert it to the local conventions. However, for our program we want to convert to Unix-style line endings, so we use the fconfigure command to tune the I/O channel:

    fconfigure $out -translation lf

Other Solutions

One-Liner

The shortest possible version of this program is simply:

    puts -nonewline stdout [read stdin]

This reads from the standard input channel, stdin, and writes to the standard output channel, stdout. These standard channels are opened for you by Tclsh and Wish. In addition, there is a standard error output channel called stderr. Because Tcl automatically converts end-of-line characters on input and output, the above program will generate a file in the native format of your current system.

Reading in Blocks

One problem with the one-liner and the complete program is that they buffer the whole file in memory. For really large files this might be a problem. Here is a loop that reads the file in blocks of 32 Kbytes:

    set blocksize [expr {32 * 1024}]
    while {![eof stdin]} {
        puts -nonewline stdout [read stdin $blocksize]
    }

Optional File Arguments

Suppose you want the program to either take the name of a file on the command line, or work on the standard input channel.
Here we check if the argument count is 0, in which case we operate on stdin:

    if {$argc == 0} {
        Dos2Unix stdin
    } else {
        foreach f $argv {
            Dos2Unix $f
        }
    }

Then, inside Dos2Unix you'll have to check for a file named stdin and do something special. (Of course, there is a problem here if you want to convert a file whose name is actually "stdin".)

    if {$f == "stdin"} {
        set in stdin
        set out stdout
    } else {
        set in [open $f]
        set out [open $f.new w]
    }

And later you don't need to do the file rename or necessarily close the standard channels.

    if {$in != "stdin"} {
        close $in
        close $out
        file rename -force $f.new $f
    }
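
Putting the last two ideas together, a version of Dos2Unix that accepts the special name stdin and copies in 32-Kbyte blocks might look like the sketch below. It is one possible arrangement of the fragments above, not the author's listing. The directory case is left out to keep it short, and eq and ne (string comparison operators available since Tcl 8.4) stand in for == and !=.

```tcl
# Sketch: Dos2Unix with optional stdin handling and block-wise copying.
# Directory recursion is omitted here for brevity.
proc Dos2Unix {f} {
    if {$f eq "stdin"} {
        # Special name: filter standard input to standard output.
        set in stdin
        set out stdout
    } else {
        set in [open $f]
        set out [open $f.new w]
    }

    # Always write UNIX-style line feeds, whatever the platform.
    fconfigure $out -translation lf

    # Copy in 32-Kbyte blocks instead of reading the whole file at once.
    set blocksize [expr {32 * 1024}]
    while {![eof $in]} {
        puts -nonewline $out [read $in $blocksize]
    }

    if {$in ne "stdin"} {
        # Only real files are closed and renamed.
        close $in
        close $out
        file rename -force $f.new $f
    }
}
```

The procedure would be driven by the argc check shown earlier, which calls Dos2Unix stdin when no file names are given. The end-of-line translation happens inside the channel, so the block size does not need to line up with line boundaries.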