ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Understanding Newlines

by Xavier Noria
08/17/2006

Programmers deal with text all the time:

  % perl -wpe1 alice.txt
  There was nothing so very remarkable in that; nor did Alice think
  it so very much out of the way to hear the Rabbit say to itself,
  `Oh dear! Oh dear! I shall be late!'

That's three nicely formatted lines of text. Now look at what's actually in the file alice.txt:

  % perl -w0777e 'print join ".", unpack "C*", <>' alice.txt
  84.104.101.114.101.32.119.97.115.32.110.111.116.104.105.110.103.\
  32.115.111.32.118.101.114.121.32.114.101.109.97.114.107.97.98.\
  108.101.32.105.110.32.116.104.97.116.59.32.110.111.114.32.100.\
  105.100.32.65.108.105.99.101.32.116.104.105.110.107.10.105...

It's just a bunch of codes in a row. Numbers.

What's going on? Computers only understand numbers. That's the way they work; they encode text using numbers. To interpret them as text, software maps between numbers and characters. The map used in the example--ASCII--establishes that number 84 corresponds to letter "T", number 104 to "h", number 101 to "e", and so on. Those mappings are technically called character encodings, and there are many. Most are extensions of ASCII.

Conversely, in:

  print $fh "foo";

... Perl prints the codes that correspond to letters "f", "o", and "o" into $fh. In fact, those letters are numbers already within Perl; that's the way Perl represents strings internally.

Not all codes correspond to letters, though. For instance, you may have noticed that there are no spaces in the list of numbers in the example above, while there are plenty of them in the original text. How do spaces end up in the console? Where do they come from? It turns out that in ASCII, number 32 corresponds to the space character. When printing the text, an empty spot appears whenever the number 32 comes up. This is the same principle that yields a "T" from number 84.

Similarly, when the number 10 appears, a Unix terminal starts a fresh new line and code translation continues. The result of this process is text that renders like a book. Yet, that is an illusion: there are no letters, spaces, or separated lines in files or strings.

Newlines are encoded using numbers, like everything else. Forget about the way they look; they're just codes. That's the key to understanding newlines in computers.
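As a small sketch of that mapping, Perl's built-in ord() and chr() convert between a character and its code, and unpack "C*" lists the codes of a whole string. The helper name codes() is made up for this illustration.

```perl
use strict;
use warnings;

# codes() is a hypothetical helper: it returns the decimal codes of the
# bytes in a string, joined with dots, as in the listing above.
sub codes { return join ".", unpack "C*", shift }

print codes("The"), "\n";                  # 84.104.101
print chr(84), chr(104), chr(101), "\n";   # The
print ord(" "), "\n";                      # 32
print codes("x\012y"), "\n";               # 120.10.121
```

The last line shows that a line break inside a string is one more code among the others.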

What Is "\n"?

For historical reasons, no single code is unequivocally interpreted as a newline. ASCII code 10 is technically called "newline," but unfortunately, the actual representation of newlines depends on the operating system or the application context.

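The codes used to represent newlines in ASCII-based systems are:

  LF (Line Feed):        decimal 10, octal 012, hexadecimal 0A
  CR (Carriage Return):  decimal 13, octal 015, hexadecimal 0D

Thus, there are three different conventions. Each ASCII-based platform follows one of them:

  LF:    Unix, including Linux and Mac OS X
  CRLF:  DOS and Windows
  CR:    Mac OS before Mac OS X

If you fire up an editor on three computers, each running an operating system from one of those three families, then enter x + Return + y in each one and save, the result on disk is different. Following the same nomenclature used earlier, you would see these bytes, respectively:

  120.10.121
  120.13.10.121
  120.13.121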

Text editors do that transparently. How do you produce the right code or codes from Perl (or another programming language that works similarly)? Suppose you want to print "foo" followed by a newline on Linux. According to the previous list, you could write:

  print "foo\012";

That's correct. Now, what if you want to print "foo" followed by a newline on Windows? In theory, you would need to write instead:

  print "foo\015\012"; # but not actually true!

What if you don't know in advance the operating system your script is going to run on? Imagine that you are going to publish a program that must be OS-independent. That is, you want your program to be portable. In principle, you need to write something like:

  if ($^O eq "darwin" || ...) {
      print "foo\012";
  } elsif ($^O eq "MSWin32" || ...) {
      print "foo\015\012"; # not actually true, see below
  } else {
      print "foo\015";
  }

That's ugly, but it can be encapsulated somewhere so that you only need to write:

  print "foo", newline_for_runtime_platform();

That is, in a sense, what "\n" means. Not exactly with that implementation, but the idea is to use "\n" to output a newline in a portable way; Perl knows what to do in each system:

  print "foo\n"; # does the right thing in every system

That's common in C-based languages such as Perl. Other languages have different semantics for "\n". For instance, "\n" is not a portable newline in Java; in order to print foo followed by a newline in a portable way in Java, use method calls such as System.out.println("foo").

Behind the Scenes

The previous section sacrificed accuracy in order to give you the whole picture. You are ready now to learn the details.

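There are two important facts about "\n" in Perl that you must clearly understand:

  1. "\n" is always a string of exactly one character, on every platform.
  2. "\n" eq "\012" everywhere except on Mac OS before Mac OS X, where "\n" eq "\015". In particular, "\n" is LF on Windows, not CRLF.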

Given that:

  print $fh "foo\n";

does the right thing on Windows, and given that "\n" is really LF there, you may wonder how the CRLF pair ends up correctly in $fh.

Perl inherits its approach to handling line terminators from C: a layer is responsible for all I/O operations in Perl. Since 5.8.0, that layer is, by default, PerlIO. When your script prints, PerlIO intervenes and does some magic: if Perl is running on a CRLF platform, it transforms, on the fly, every "\n" in the stream into a CRLF pair. It is totally transparent, so you won't notice it.

The C code that performs that transformation on Windows lives in the function PerlIOCrlf_write(), defined in perlio.c:

  if (*buf == '\n') {
      /* ... */
      *(b->ptr)++ = 0xd;      /* CR */
      *(b->ptr)++ = 0xa;      /* LF */
      /* ... */
  }

Note that the '\n' in that code is a C char, not a Perl string. Fortunately, the semantics coincide, and thus the condition tests what it has to test.
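The effect of that translation can be observed from Perl itself, even on a platform that does not use CRLF, by pushing the :crlf layer explicitly. The following is a minimal sketch rather than anything taken from perlio.c. Writing to an in-memory buffer instead of a real file is a choice made here so that the result is easy to inspect, and it assumes a platform where "\n" is LF.

```perl
use strict;
use warnings;

# Write through the :crlf layer into an in-memory buffer
my $buffer = "";
open my $out, ">:crlf", \$buffer or die $!;
print $out "foo\nbar\n";
close $out;

# The script printed 8 characters, but the buffer holds 10 bytes
print length($buffer), "\n";                       # 10
print join(".", unpack "C*", $buffer), "\n";
# 102.111.111.13.10.98.97.114.13.10
```

Each "\n" printed by the script shows up in the buffer as the pair 13.10.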

That transformation goes the other way around when reading text files on Windows or any other CRLF platform. The layer replaces every CRLF pair on the fly with a single '\n' character. That happens in PerlIOCrlf_get_cnt(), also in perlio.c:

  if (nl < b->end && *nl == 0xd) {
      test:
      if (nl + 1 < b->end) {
          if (nl[1] == 0xa) {
              *nl = '\n';
              c->nl = nl;
          }
          /* ... */
      }
      /* ... */
  }

Note that this handles only actual CRLF pairs. Isolated CRs or LFs will remain untouched.

Thus, if you read lines from a text file in Windows using the standard line-oriented while loop:

  while (my $line = <$fh>) {
      # ...
  }

no CRLF pair ever gets into $line, only LFs.

All that magic happens on file handles associated with files. By default, Perl opens files in text mode in CRLF platforms, which means no more and no fewer than those transformations occur. You can disable them with binmode(). Other streams--sockets, for example--are in binmode by default.

That's why you need to open images and all other non-text files in binmode on Windows. Those conventions about newlines are just for regular text files. They have nothing to do with, say, the bytes used in PNG images. A PNG image could in principle have a CRLF pair by chance somewhere, meaning something else. If you open a PNG image for reading and don't set binmode on its file handle, the I/O layer will perform those on-the-fly substitutions and may corrupt data by filtering out some bytes. The same happens with writing. If you have a buffer holding the bytes of a song in MP3 format and write it to disk in text mode on Windows, every 0x0A byte gets a 0x0D inserted before it, and the MP3 becomes garbage.
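A short sketch makes the damage visible. The 8-byte signature that starts every PNG file happens to contain both a CRLF pair and a lone LF, so it is a convenient test string. The code emulates Windows text mode by pushing the :crlf layer explicitly and writes to in-memory buffers. The helper written_length() is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# The 8-byte PNG file signature: 0x89 "PNG" CR LF 0x1A LF
my $png_signature = "\x89PNG\x0D\x0A\x1A\x0A";

# written_length() is a hypothetical helper: it writes $bytes through
# the given open mode into an in-memory buffer and returns how many
# bytes ended up there.
sub written_length {
    my ($mode, $bytes) = @_;
    my $buffer = "";
    open my $out, $mode, \$buffer or die $!;
    print $out $bytes;
    close $out;
    return length $buffer;
}

printf "raw:  %d bytes\n", written_length(">:raw",  $png_signature);  # 8
printf "crlf: %d bytes\n", written_length(">:crlf", $png_signature);  # 10
```

Through the :crlf layer each of the two 0x0A bytes gains a 0x0D in front of it, so the data on the other side is no longer a valid PNG signature.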

On the other hand, you can activate the translation between CRLF and "\n" on any platform, thanks to the :crlf PerlIO layer:

  open my $fh, "<:crlf", "alice.txt" or die $!;

With that little trick, your script will understand text with either native conventions or CRLF.
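The following sketch illustrates that claim on a platform whose native convention is LF. It reads the same two lines, once with CRLF terminators and once with LF terminators, through the :crlf layer. The in-memory file handles and the helper read_lines() are assumptions made for this example.

```perl
use strict;
use warnings;

# read_lines() is a hypothetical helper: it reads all lines of a string
# through the :crlf layer, using an in-memory file handle.
sub read_lines {
    my $data = shift;
    open my $fh, "<:crlf", \$data or die $!;
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

my @from_crlf = read_lines("one\015\012two\015\012");
my @from_lf   = read_lines("one\012two\012");

# Both inputs yield the same two LF-terminated lines
print join(".", unpack "C*", join "", @from_crlf), "\n";
print join(".", unpack "C*", join "", @from_lf), "\n";
# 111.110.101.10.116.119.111.10 in both cases
```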

What Does "Portability" Mean?

As far as newlines go, a portable program does its job well on the assumption that the newline convention of text data is that of the runtime platform.

Those conventions are only knowable at runtime, perhaps in some other machine running some unknown operating system. It could be a nightmare, but fortunately, good languages give you idioms to accomplish this effortlessly. It is good practice to always write in a portable way.

In Perl, use "\n" to print newlines. Use <> to do line-based loops over file handles associated with text files, or slurp lines in list context. Use chomp() to delete the newline from a line of text. Use "\n" in regular expressions to match line terminators, and ^ and $ for assertions about line boundaries, with /m if necessary.
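Put together, those idioms look like the following sketch. The sample text and the in-memory file handle are made up for illustration; with a real text file only the open call would change.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";

# Line-based loop plus chomp()
open my $fh, "<", \$text or die $!;
my @lines;
while (my $line = <$fh>) {
    chomp $line;              # removes the trailing "\n", if any
    push @lines, $line;
}
close $fh;
print "[$_]\n" for @lines;    # [alpha] [beta] [gamma], one per line

# ^ with /m asserts the beginning of each line, not just of the string
my @initials = $text =~ /^(\w)/mg;
print "@initials\n";          # a b g
```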

Set binmode on file handles associated with binary files, even if you develop in a system that does not distinguish text and binary files. Usually, binmode is also necessary if you plan to seek()/tell() and read(). That's because read() also transforms CRLF into "\n" in CRLF platforms in text mode, but seek()/tell() do not and byte offsets may differ.

Do not assume "\n" is "\012"--a common pitfall in sockets programming. That does not necessarily hold true. For instance, if you need a CRLF pair to generate raw HTTP headers, do not use "\r\n" as terminator. The meaning of that string depends on the system. Hardcode it instead, as in "\cM\cJ". Even better, use the variables provided by Socket.pm via the :crlf export tag: $CR, $LF, and $CRLF. These are portable solutions.
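As a minimal sketch of the Socket.pm approach, the code below builds the text of an HTTP request with $CRLF as the terminator. It only assembles the string and does not open a connection. The host name and the header set are placeholders chosen for the example.

```perl
use strict;
use warnings;
use Socket qw(:crlf);   # exports $CR, $LF, and $CRLF

# Join header lines with CRLF; the two trailing empty strings produce
# the final CRLF of the last header plus the empty line that ends the
# header block.
my $request = join $CRLF,
    "GET / HTTP/1.1",
    "Host: www.example.com",    # hypothetical host
    "Connection: close",
    "", "";

print join(".", unpack "C*", $CRLF), "\n";   # 13.10
print length($request), "\n";
```

Because $CRLF is defined in terms of fixed codes, the request contains the same bytes no matter where the script runs.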

Portability, however, does not mean readiness to accept any kind of text--only text with the convention of the runtime platform. Suppose you need to write a portable Perl script that counts the number of lines of the files passed as arguments:

  my $lines = 0;
  ++$lines while <>;
  print "$lines\n";

That program is correct and portable. It portably handles line terminators on reading, delegating to the diamond operator, and it outputs a newline in a portable way, via "\n".

That does not mean it works in any possible situation, of course. For example, suppose that one day a coworker comes to you with a MacBook and says the script worked flawlessly for weeks, but suddenly some file with multiple lines is reported to have a single line. If you understand how newlines work, you'll debug that in a couple of minutes. Otherwise, you'll be lost.

The problem must be that the input file is not using LF for newlines, which is the convention in Mac OS X. Editors such as Vim, Emacs, and TextMate let the user configure newlines so there's always a risk. The diamond operator looks for LFs in Mac OS X. If the file uses CRs, the entire file lives in a single line according to Mac OS X conventions. That coincides with the observed behavior and becomes your conjecture.

Note that if the file used CRLF, the number of lines computed would be correct in Mac OS X (by accident, but correct).
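The coworker's situation can be reproduced with a sketch like this one. It counts lines the way the diamond operator does on an LF platform, fixing the record separator to LF explicitly so that the result does not depend on where the code runs. The function count_lines() and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# count_lines() is a hypothetical helper: it counts records in a string
# using LF as the record separator, as Perl does on Unix and Mac OS X.
sub count_lines {
    my $text = shift;
    open my $fh, "<", \$text or die $!;
    local $/ = "\012";
    local $_;                 # <$fh> in the loop below assigns to $_
    my $lines = 0;
    ++$lines while <$fh>;
    close $fh;
    return $lines;
}

print count_lines("a\012b\012c\012"), "\n";             # 3, LF file
print count_lines("a\015\012b\015\012c\015\012"), "\n"; # 3, CRLF file
print count_lines("a\015b\015c\015"), "\n";             # 1, CR file
```

The CRLF file is counted correctly only because each pair happens to contain an LF; every line read from it still carries a trailing CR.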

When Conventions Don't Match

With ubiquitous networking, there are plenty of situations where you will find text with a different newline convention than that used by the runtime platform.

If you read a file in Linux coming from a FAT32 partition, you'll likely encounter CRLF pairs. If you read a file in a DOS console written using nano inside Cygwin with default settings, you'll find Unix newlines.

Some file transfer tools, such as FTP in ASCII mode, and version control systems, including Subversion, are aware of these issues and perform the necessary housekeeping. If a developer working on Windows XP checks a Perl source file into a Subversion repository, and another developer checks it out on her laptop running Debian, Subversion takes care of newlines and transforms them transparently, provided the file's svn:eol-style property is set to native. Each local copy has the corresponding native convention.

If you receive a file as an email attachment, inside a tarball, via Jabber, via Bluetooth, from a memory stick, or through any other less helpful means, no newline normalization is performed. The canonical example of this is a logfile generated on a Windows machine and emailed to a developer who works with some Unix flavor. He opens the log in a pager and sees strange ^Ms at the end of lines. What on earth?

You now have the theory to understand what's happening. The file used CRLF as the newline convention because it was generated on Windows. The developer opened the file with a Unix pager that expected Unix newline conventions and consequently broke lines at LFs. Fortunately, there's an LF in each pair, so the lines look almost correct in the pager, except that there's also a spurious CR that nobody removed. That character is the one the pager shows as ^M, and that's why it appears at the end of lines.

You have the necessary knowledge to fix this with a one-liner:

  perl -pi.bak -we 's/\015\012/\012/' error.log

On Linux, PerlIO does nothing with newlines and those control characters will pass untouched. A single "\012" represents a line terminator in that system, so in each iteration $_ ends with "\015\012". Clearly, all you need to do is to delete the "\015"s.

Sometimes, though, you want to be able to deal with text no matter what its convention is, and perhaps even if there's a mix of them. For instance, a robust CGI program may want to normalize newlines in order to apply a regex to some text coming from a text area, because the text comes from another machine:

  {
      my $ALT_NL = "\n" eq "\012" ? "\015" : "\012";
      sub normalize_newlines {
          my $text = shift;
          $text =~ s/\015\012|$ALT_NL/\n/go;
          return $text;
      }
  }
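A quick usage sketch follows, with the normalizer repeated in a slightly simplified form so that the example runs on its own. The input string, which mixes the three conventions, is invented for the demonstration.

```perl
use strict;
use warnings;

# Same idea as normalize_newlines() above, repeated here so that this
# example is self-contained.
sub normalize_newlines {
    my $text = shift;
    my $alt_nl = "\n" eq "\012" ? "\015" : "\012";
    $text =~ s/\015\012|$alt_nl/\n/g;
    return $text;
}

# One string mixing LF, CRLF, and CR terminators
my $mixed = "unix\012windows\015\012oldmac\015end";
my @lines = split /\n/, normalize_newlines($mixed);

print scalar(@lines), " lines: @lines\n";
# 4 lines: unix windows oldmac end
```

The alternation tries CRLF first, so a pair is replaced as a unit instead of producing two newlines.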

If you ever need to guess the newline convention used in some file, this is a start:

  my $filename = shift;
  open my $fh, "<", $filename or die $!;
  binmode $fh; # let line terminators pass untouched in CRLF platforms
  
  $/ = \1024; # read at most that many bytes
  my $buf = <$fh>;
  close $fh;
  
  my $convention = 'unknown';
  if ($buf =~ /\015\012/) {
      $convention = 'CRLF (Windows)';
  } elsif ($buf =~ /\012/) {
      $convention = 'LF (Unix)';
  } elsif ($buf =~ /\015/) {
      $convention = 'CR (Mac pre-OSX)';
  }
  print "$convention\n";

If this article accomplished its objective, you'll understand that program right away.

Newlines in Unicode

The Unicode Standard defines even more codes for newlines: CR (Carriage Return, 000D), LF (Line Feed, 000A), CRLF (Carriage Return and Line Feed, 000D,000A), NEL (Next Line, 0085), FF (Form Feed, 000C), LS (Line Separator, 2028), and PS (Paragraph Separator, 2029).

Unicode recommends that your program understand any of these on input, although that's not a conformance requirement and not yet a common practice. Many languages, including Perl, do not yet offer an easy way to follow that recommendation anyway. Perl's readline operator uses $/ to determine record boundaries, which cannot represent an alternation.

A partial solution is to use an extended normalizer:

  {
      my $ALT_NL = "\n" eq "\012" ? "\015" : "\012";
      sub normalize_newlines_for_unicode {
          my $text = shift;
          $text =~ s/\015\012|$ALT_NL|\x{0085}|\x{000C}|\x{2028}|\x{2029}/\n/go;
          return $text;
      }
  }

You could emulate a line-oriented loop for arbitrary Unicode text:

  {
      local $/ = "\012"; # ensure we don't halve CRLFs
      while (my $LF_line = <$fh>) {
          open my $gh, '<', \normalize_newlines_for_unicode($LF_line);
          {
              local $/ = "\n";
              while (my $line = <$gh>) {
                  # ...
              }
          }
      }
  }

but that is a hack. For example, if the text uses only LSs, the outer loop will slurp it all in during the first iteration. Generic solutions cannot allow that possibility. In addition, you may lose the original newline characters, and that may or may not be what you want when working with Unicode text.

A better approach is a low-level solution that processes input character by character--for instance, a PerlIO layer.

As for output, the recommendation is to map any of those newline characters "appropriately," which is a bit vague. As a guideline, you can just print "\n" as usual for regular newlines.

The anchors ^ and $ in Perl's regular expressions do not cope yet with LS and friends. Perhaps they will someday. On the other hand, NEL, LS, and PS do match \s in Unicode strings.

All in all, these newline codes are rather new in practice and quite unlikely to appear "out in the wild" for a while.

Acknowledgments

I would like to thank Nick Ing-Simmons for his clarifications about PerlIO, Enrique Nell for proofreading a draft of this article, and Jarkko Hietaniemi and Sadahiro Tomoyuki for their advice on Unicode.

Xavier Noria is a Perl specialist and dynamic languages enthusiast.



Copyright © 2009 O'Reilly Media, Inc.