Sanitization

/ˌsa-nə-tə-ˈzā-shən/

noun — "making input safe without necessarily changing what it means."

Sanitization is the process of modifying, filtering, escaping, encoding, or transforming data so that it can be safely processed, stored, displayed, or transmitted by a system. Unlike Input Validation, which determines whether data is acceptable, sanitization focuses on making accepted data safe to use within a particular context.

The distinction is subtle but important. Validation asks, "Should this data be allowed?" Sanitization asks, "How can this data be safely handled?" In many systems, both processes work together. Data is first validated against business rules and then sanitized before being stored or rendered.

For example, a comment system may allow users to submit text containing special characters. The input itself is valid, but displaying it directly inside a web page could create security risks if those characters are interpreted as executable markup.

// original input
<script>alert('hello')</script>

// sanitized output
&lt;script&gt;alert('hello')&lt;/script&gt;

In this case, the meaning of the text remains unchanged, but its representation is transformed so that the browser displays it as text rather than executing it as code.
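As a minimal sketch of the transformation above, Python's standard `html.escape` can produce the same output. The helper name `escape_for_html` is hypothetical, and `quote=False` is chosen here only so the result matches the example; text placed inside an HTML attribute would generally need the default `quote=True`, which also escapes quotation marks.

```python
import html


def escape_for_html(text: str) -> str:
    """Escape &, < and > so the browser shows the text instead of parsing it."""
    # quote=False leaves ' and " alone, which is enough for text placed
    # between tags but not for text placed inside an attribute value.
    return html.escape(text, quote=False)


original = "<script>alert('hello')</script>"
print(escape_for_html(original))
```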

Sanitization appears throughout computing because different systems interpret characters differently. A value that is harmless in one context may become dangerous in another. This is especially true when data crosses boundaries between users, databases, operating systems, programming languages, and web browsers.

Common forms of sanitization include:

HTML escaping
URL encoding
SQL parameterization
Character filtering
Filename normalization
Log output escaping
Unicode normalization
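To illustrate one item from this list, SQL parameterization can be sketched with Python's built-in `sqlite3` module. The `users` table and the `find_user` helper are assumptions made for the example. Strictly speaking, parameterization does not alter the value at all; it keeps the value separate from the query text so the database never parses it as SQL.

```python
import sqlite3

# In-memory database with a hypothetical "users" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(connection: sqlite3.Connection, name: str) -> list:
    """Look up a user by name using a bound parameter, not string building."""
    # The "?" placeholder sends the value separately from the SQL text,
    # so quotes inside `name` cannot change the structure of the query.
    cursor = connection.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


print(find_user(conn, "alice"))
print(find_user(conn, "alice' OR '1'='1"))
```

The second call is treated as a search for a user literally named `alice' OR '1'='1`, so it matches nothing.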

Consider a filename uploaded by a user:

// original filename
../../../config.txt

// sanitized filename
config.txt

Here, sanitization removes path traversal elements that could otherwise allow access to unintended files.
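One way to sketch this in Python is to keep only the final path component and reject names that reduce to nothing useful. The function name `sanitize_filename` is hypothetical, and the sketch is deliberately narrow: it handles directory components only, not reserved names, length limits, or other platform-specific rules.

```python
import posixpath


def sanitize_filename(name: str) -> str:
    """Return only the final path component of an uploaded filename."""
    # Treat backslashes as separators too, so Windows-style paths are
    # handled the same way regardless of the host operating system.
    normalized = name.replace("\\", "/")
    base = posixpath.basename(normalized)
    # Names that collapse to nothing, or to a directory reference, are refused.
    if base in ("", ".", ".."):
        raise ValueError(f"unusable filename: {name!r}")
    return base


print(sanitize_filename("../../../config.txt"))
```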

Sanitization is closely associated with application security because many attacks rely on special characters being interpreted in ways developers did not intend. Without proper sanitization, attackers may inject commands, markup, scripts, or malformed data structures into a system.
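Command injection is one case where this can be shown briefly. The sketch below assumes a user-supplied filename that must appear in a shell command line, and uses the standard `shlex.quote` to make the shell treat it as a single literal argument. The helper name `build_cat_command` is hypothetical, and the command is only built, never executed. Where possible, passing arguments as a list without a shell avoids the problem entirely.

```python
import shlex


def build_cat_command(filename: str) -> str:
    """Build a POSIX shell command line that prints one file."""
    # shlex.quote wraps the value in single quotes when it contains
    # characters the shell would otherwise interpret (spaces, ;, |, ...).
    return f"cat {shlex.quote(filename)}"


hostile = "notes.txt; rm -rf /"
command = build_cat_command(hostile)
print(command)

# Splitting the command the way a POSIX shell would yields exactly two
# words: the program and the untouched filename.
print(shlex.split(command))
```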

Historically, vulnerabilities such as Cross-Site Scripting (XSS), command injection, log injection, and various forms of content spoofing have frequently resulted from inadequate sanitization.

One of the most common mistakes is treating sanitization as a universal operation. In reality, sanitization is context-dependent. Data sanitized for HTML output may still be unsafe for SQL queries, shell commands, JSON documents, or XML processing.

// safe for HTML
Tom &amp; Jerry

// safe for URL
Tom%20%26%20Jerry

// safe for JSON
"Tom & Jerry"

Each destination requires a different sanitization strategy because each interpreter follows different parsing rules.
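The three encodings above can be reproduced with Python's standard library, as a rough illustration of how one value takes a different form for each destination. The variable names are illustrative only.

```python
import html
import json
import urllib.parse

value = "Tom & Jerry"

# HTML: the ampersand must become an entity so it is not read as markup.
for_html = html.escape(value)

# URL component: spaces and ampersands are percent-encoded.
# safe="" means no reserved characters are left unencoded.
for_url = urllib.parse.quote(value, safe="")

# JSON: the value becomes a quoted string literal; & needs no escaping here.
for_json = json.dumps(value)

print(for_html)
print(for_url)
print(for_json)
```

None of these outputs is interchangeable with the others, which is why a single "sanitized" copy of a value is rarely enough.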

Sanitization also improves reliability. User input often contains unexpected whitespace, control characters, inconsistent encodings, or malformed formatting. Cleaning and normalizing this data reduces ambiguity and helps systems process information consistently.
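A small sketch of this kind of cleanup, using only Python's standard library, might look like the following. The function name `normalize_text` and the specific choices (NFC normalization, dropping control characters, collapsing whitespace to single spaces) are assumptions for the example; a real system would pick rules that suit its own data.

```python
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFC composes sequences such as "e" + combining accent into one code point.
    text = unicodedata.normalize("NFC", raw)
    # Remove control characters (category "Cc") but keep tab and newline
    # long enough for them to be collapsed as ordinary whitespace below.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\t\n"
    )
    # split() with no arguments splits on any run of whitespace and
    # discards leading and trailing whitespace.
    return " ".join(text.split())


print(normalize_text("  Cafe\u0301\x00  au   lait\n"))
```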

Sanitization is not a replacement for validation. A malformed email address does not become valid simply because unsafe characters are removed. Likewise, a value outside an acceptable range remains invalid even after being sanitized.

A common defensive pattern is:

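// defensive workflow

valid = validate(input)    // reject data that violates business rules
safe = sanitize(valid)     // make it safe for the destination context
store(safe)
render(safe)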

This approach helps ensure that data satisfies business requirements while remaining safe in each environment it passes through, provided the sanitization step matches the context of each destination.
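The workflow can be sketched end to end in Python for the comment system mentioned earlier. Everything named here (`MAX_LENGTH`, `validate_comment`, `sanitize_for_html`, `submit`, `render_comments`, and the in-memory `comments` list standing in for storage) is hypothetical, and the sketch assumes an HTML page is the only destination for the stored text.

```python
import html

MAX_LENGTH = 500          # assumed business rule for this example
comments: list[str] = []  # stand-in for a real data store


def validate_comment(text: str) -> str:
    """Reject comments that break the (assumed) business rules."""
    if not text.strip():
        raise ValueError("comment is empty")
    if len(text) > MAX_LENGTH:
        raise ValueError("comment is too long")
    return text


def sanitize_for_html(text: str) -> str:
    """Escape characters that a browser would interpret as markup."""
    return html.escape(text)


def submit(text: str) -> None:
    """Validate, then sanitize, then store."""
    comments.append(sanitize_for_html(validate_comment(text)))


def render_comments() -> str:
    """Render stored comments, which are already escaped for HTML."""
    return "\n".join(f"<p>{c}</p>" for c in comments)


submit("<b>hi</b>")
print(render_comments())
```

If the same comments later had to be written into a URL, a log file, or a shell command, the stored HTML-escaped form would not be the right one, so each additional destination would need its own sanitization step.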

Conceptually, sanitization acts like a translation layer between raw input and system interpretation. Its purpose is not to decide whether information is correct, but to ensure that the information cannot accidentally or intentionally be interpreted as something else.

A useful rule of thumb is that validation protects a system from bad data, while sanitization protects a system from dangerous interpretations of otherwise acceptable data.

Ultimately, sanitization is a foundational security and reliability practice. It helps ensure that information remains data rather than unexpectedly becoming instructions, commands, markup, or executable behavior.

See Input Validation, Cross-Site Scripting, SQL Injection, Character Encoding, Software Design

