Understanding Newlines

When Conventions Don't Match

With ubiquitous networking, there are plenty of situations where you will find text with a different newline convention than that used by the runtime platform.

If you read a file on Linux that came from a FAT32 partition, you'll likely encounter CRLF pairs. If, in a DOS console, you read a file written with nano inside Cygwin under its default settings, you'll find Unix newlines.

Some file transfer tools, such as FTP clients in ASCII mode, and some version control systems, including Subversion, are aware of these issues and perform the necessary housekeeping. If a developer working on Windows XP checks a Perl source file into a Subversion repository, and another developer checks it out on her laptop running Debian, Subversion takes care of newlines and transforms them transparently, provided the file's svn:eol-style property is set to native. Each working copy then has the corresponding native convention.
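
For the record, that property can be set from the command line (the filename here is just an example):

  svn propset svn:eol-style native script.pl
  svn commit -m "use native newlines for script.pl"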

If you receive a file as an email attachment, inside a tarball, via Jabber, via Bluetooth, from a memory stick, or through any other less helpful means, no newline normalization is performed. The canonical example of this is a logfile generated on a Windows machine and emailed to a developer who works with some Unix flavor. He opens the log in a pager and sees strange ^Ms at the end of lines. What on earth?

You now have the theory to understand what's happening. The file used CRLF as the newline convention because it was generated on Windows. The developer opened the file with a Unix pager that expected Unix newline conventions and consequently broke lines at LFs. Fortunately, there's an LF in each pair, so the lines look almost correct in the pager, except that there's also a spurious CR that nobody removed. That character is the one the pager shows as ^M, and that's why it appears at the end of lines.

You have the necessary knowledge to fix this with a one-liner:

  perl -pi.bak -we 's/\015\012/\012/' error.log

On Linux, PerlIO does nothing with newlines, so those control characters pass through untouched. A single "\012" represents a line terminator on that system, so in each iteration $_ ends with "\015\012". Clearly, all you need to do is delete the "\015"s.
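
Since each $_ ends with exactly one "\015\012", an anchored substitution is an equivalent way to write it (equivalent here, assuming CRs only appear as part of trailing CRLF pairs):

  perl -pi.bak -we 's/\015$//' error.log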

Sometimes, though, you want to be able to deal with text no matter what its convention is, and perhaps even if it mixes several of them. For instance, a robust CGI program may want to normalize newlines before applying a regex to text coming from a textarea, because that text comes from another machine with its own convention:

  {
      # the single-character newline that is not native to this platform
      my $ALT_NL = "\n" eq "\012" ? "\015" : "\012";
      sub normalize_newlines {
          my $text = shift;
          # map CRLF pairs and the foreign single newline to the native "\n"
          $text =~ s/\015\012|$ALT_NL/\n/go;
          return $text;
      }
  }
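
For example, a CGI script could run submitted text through the normalizer before any line-oriented matching (a sketch; the field name 'comment' is made up):

  use CGI;

  my $q    = CGI->new;
  my $text = normalize_newlines(scalar $q->param('comment'));

  # line-oriented regexes now behave predictably
  my @paragraphs = split /\n{2,}/, $text;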

If you ever need to guess the newline convention used in some file, this is a start:

  my $filename = shift;
  open my $fh, "<", $filename or die $!;
  binmode $fh; # let line terminators pass untouched on CRLF platforms
  
  $/ = \1024; # read at most that many bytes
  my $buf = <$fh>;
  close $fh;
  
  my $convention = 'unknown';
  if ($buf =~ /\015\012/) {
      $convention = 'CRLF (Windows)';
  } elsif ($buf =~ /\012/) {
      $convention = "LF (Unix)";
  } elsif ($buf =~ /\015/) {
      $convention = 'CR (Mac pre-OSX)';
  }
  print "$convention\n";

If this article accomplished its objective, you'll understand that program right away.
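
To try it out, you could generate a little CRLF file and feed it to the script (the script name guess_newlines.pl and the test file are made up):

  perl -we 'open my $fh, ">", "crlf.txt" or die $!; binmode $fh; print $fh "one\015\012two\015\012"'
  perl guess_newlines.pl crlf.txt   # prints "CRLF (Windows)"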

Newlines in Unicode

The Unicode Standard defines even more codes for newlines: CR (Carriage Return, 000D), LF (Line Feed, 000A), CRLF (Carriage Return and Line Feed, 000D,000A), NEL (Next Line, 0085), FF (Form Feed, 000C), LS (Line Separator, 2028), and PS (Paragraph Separator, 2029).

Unicode recommends that your program understand any of these on input, although that's not a conformance requirement and not yet a common practice. Many languages, including Perl, do not yet offer an easy way to follow that recommendation. Perl's readline operator uses $/ to determine record boundaries, and $/ holds a literal string, not a pattern, so it cannot represent an alternation.
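
One workaround is to slurp the text and split it yourself (a sketch; it assumes $fh already has a suitable :encoding layer):

  # slurp everything, then split on the full newline repertoire
  my $text  = do { local $/; <$fh> };
  my @lines = split /\015\012|[\012\015\x{0085}\x{000C}\x{2028}\x{2029}]/, $text;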

A partial solution is to use an extended normalizer:

  {
      # the single-character newline that is not native to this platform
      my $ALT_NL = "\n" eq "\012" ? "\015" : "\012";
      sub normalize_newlines_for_unicode {
          my $text = shift;
          # map CRLF, the foreign single newline, NEL, FF, LS, and PS to "\n"
          $text =~ s/\015\012|$ALT_NL|\x{0085}|\x{000C}|\x{2028}|\x{2029}/\n/go;
          return $text;
      }
  }

You could emulate a line-oriented loop for arbitrary Unicode text:

  {
      local $/ = "\012"; # ensure we don't halve CRLFs
      while (my $LF_line = <$fh>) {
          # reopen the normalized record as an in-memory file
          open my $gh, '<', \normalize_newlines_for_unicode($LF_line) or die $!;
          {
              local $/ = "\n";
              while (my $line = <$gh>) {
                  # ...
              }
          }
      }
  }

but that is a hack. For example, if the text uses only LSs, the outer loop will slurp the whole text in its first iteration, and a generic solution cannot afford that possibility. In addition, you may lose the original newline characters, and that may or may not be what you want when working with Unicode text.

A better approach is a low-level solution that processes input character by character--for instance, a PerlIO layer.
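
Here is a minimal sketch of such a layer based on PerlIO::via (the package name is made up, and a real implementation would also need to buffer a CRLF pair that happens to be split across two reads):

  package PerlIO::via::UnicodeNewlines;
  use strict;
  use warnings;

  sub PUSHED { my ($class) = @_; bless {}, $class }

  sub FILL {
      my ($self, $below) = @_;
      my $chunk = readline($below);  # data handed up by the layer below
      return undef unless defined $chunk;
      # map CRLF, lone CR, NEL, FF, LS, and PS to "\n"
      $chunk =~ s/\015\012?|\x{0085}|\x{000C}|\x{2028}|\x{2029}/\n/g;
      return $chunk;
  }

  1;

  # usage sketch -- the exact layer stack may need tuning:
  #   open my $fh, '<:encoding(UTF-8):via(UnicodeNewlines)', $filename or die $!;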

As for output, the recommendation is to map any of those newline characters "appropriately," which is a bit vague. As a guideline, you can just print "\n" as usual for regular newlines.
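
For instance (a sketch; $out is assumed to have an :encoding(UTF-8) layer, and the LS is written on purpose):

  print $out "an ordinary line\n";           # "\n" goes through the platform's layers as usual
  print $out "a literal separator\x{2028}";  # other separators have to be written out explicitly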

The anchors ^ and $ in Perl's regular expressions do not cope yet with LS and friends. Perhaps they will someday. On the other hand, NEL, LS, and PS do match \s in Unicode strings.
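
A quick illustration of both claims (a throwaway check, nothing more):

  my $ls = "\x{2028}";              # LINE SEPARATOR
  print "LS matches \\s\n"          if $ls =~ /\s/;
  print "but \$ ignores it\n"       unless "a${ls}b" =~ /a$/m;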

All in all, these newline codes are rather new in practice and quite unlikely to appear "out in the wild" for a while.

Acknowledgments

I would like to thank Nick Ing-Simmons for his clarifications about PerlIO, Enrique Nell for proofreading a draft of this article, and Jarkko Hietaniemi and Sadahiro Tomoyuki for their advice on Unicode.

Xavier Noria is a Perl specialist and dynamic languages enthusiast.

