Implementing a Basic Text Sanitization Function in Python with Regular Expressions

📚 Quick Review: This practical application is built upon a fundamental programming concept. Review the Theory Lesson here first.


Building a Python Function for Basic HTML Text Sanitization

Understanding the theory behind text sanitization is crucial, but putting it into practice is where the real learning happens. This lesson will guide you through creating a simple yet effective Python function using the built-in re module to sanitize HTML content. We’ll break down each part of the code, explaining the regular expressions and their purpose in stripping potentially harmful elements like script tags and other HTML markup.

The Python Sanitization Function

Here’s the Python code snippet we’ll be dissecting:

import re

def sanitize_text(html_content):
    # Remove script tags and their content
    clean = re.sub(r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>', '', html_content)
    # Strip all remaining HTML tags, leaving only the text between them
    clean = re.sub(r'<[^>]+>', '', clean)
    # Standardize whitespace and returns
    clean = re.sub(r'\s+', ' ', clean).strip()
    return clean

Line-by-Line Code Breakdown

import re

This line imports Python’s built-in regular expression module. The re module provides operations for working with regular expressions, which are essential for pattern matching and manipulation of strings. It’s the core tool we’ll use to identify and remove unwanted parts of our HTML content.

def sanitize_text(html_content):

This defines a function named sanitize_text that takes one argument: html_content. This argument is expected to be a string containing the HTML text that needs to be sanitized. The function will return a cleaned string.

clean = re.sub(r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>', '', html_content)

This is the most complex part of the sanitization. It uses re.sub() to find and replace patterns in the html_content string. Let’s break down the regex:

  • r'<script\b': Matches the literal text <script followed by a word boundary (\b), so the tag name must end there. <script> and <script src="..."> match, but a longer name such as <scripts> does not.
  • [^<]*: Matches any character that is not <, zero or more times. This consumes the rest of the opening tag (its attributes and the closing >) along with any script body up to the next < character.
  • (?:(?!<\/script>)<[^<]*)*: A non-capturing group ((?:...)) repeated zero or more times. Each repetition first uses a negative lookahead ((?!<\/script>)) to check that the upcoming < does not begin the closing </script> tag, then consumes that < and the run of non-< characters after it. This lets the match pass over stray < characters inside the script body (such as a less-than comparison) and stop at the first </script>.
  • <\/script>: Matches the literal closing </script> tag. The backslash before / is harmless but not required in Python.

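The entire pattern finds <script> elements and removes them together with their content by replacing each match with an empty string (''). As written, the match is case-sensitive and expects the closing tag to be exactly </script>, so variants such as <SCRIPT> are not caught unless the pattern is adjusted (for example with flags=re.IGNORECASE).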
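As a quick check of the breakdown above, the following sketch runs the script pattern on its own. The name SCRIPT_RE and the sample strings are illustrative and not part of the original function. It shows that a < inside the script body does not end the match early, and how the result changes for an uppercase tag with and without re.IGNORECASE.

```python
import re

# Same pattern as in sanitize_text; SCRIPT_RE is an illustrative name.
SCRIPT_RE = r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>'

# A '<' inside the script body does not end the match early.
sample = "before<script type='text/javascript'>if (a < b) { run(); }</script>after"
print(re.sub(SCRIPT_RE, '', sample))  # beforeafter

# Uppercase tags are only matched when the IGNORECASE flag is passed.
upper = "x<SCRIPT>alert(1)</SCRIPT>y"
print(re.sub(SCRIPT_RE, '', upper))                       # x<SCRIPT>alert(1)</SCRIPT>y
print(re.sub(SCRIPT_RE, '', upper, flags=re.IGNORECASE))  # xy
```

Whether to pass re.IGNORECASE in the real function is a design choice; for HTML, where tag names are case-insensitive, it is usually the safer option.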

clean = re.sub(r'<[^>]+>', '', clean)

After removing script tags, this line targets all other HTML tags. The regex r'<[^>]+>' works as follows:

  • <: Matches the literal opening angle bracket.
  • [^>]+: Matches one or more characters that are NOT a closing angle bracket (>). This covers the tag name and any attributes.
  • >: Matches the literal closing angle bracket.

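This strips all remaining HTML tags, leaving only the text between them. Because each tag is replaced with an empty string rather than a space, text from adjacent elements is joined directly: <h1>A</h1><p>B</p> becomes AB.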
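To see what this step does and does not do, here is a small sketch using the same pattern in isolation. TAG_RE and the sample strings are illustrative. It highlights three behaviors worth knowing: adjacent elements run together, HTML entities pass through undecoded, and a > inside an attribute value ends the match early.

```python
import re

TAG_RE = r'<[^>]+>'  # same pattern as in sanitize_text; the name is illustrative

# Tags are replaced with '', so text from adjacent elements runs together.
print(re.sub(TAG_RE, '', "<h1>Title</h1><p>Body text</p>"))  # TitleBody text

# Entities are not decoded; they pass through unchanged.
print(re.sub(TAG_RE, '', "<p>Fish &amp; chips</p>"))  # Fish &amp; chips

# A '>' inside an attribute value ends the match early and leaves debris.
print(repr(re.sub(TAG_RE, '', '<img alt="a > b" src="x">')))  # ' b" src="x">'
```

If a space between block elements matters for your output, one option is to replace tags with ' ' instead of '' and let the whitespace step collapse the extras.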

clean = re.sub(r'\s+', ' ', clean).strip()

This final sanitization step focuses on standardizing whitespace:

  • r'\s+': Matches one or more whitespace characters (spaces, tabs, newlines, etc.).
  • ' ': Replaces all matched whitespace sequences with a single space. This collapses multiple spaces or newlines into a single space.
  • .strip(): This is a string method (not part of re) that removes any leading or trailing whitespace from the resulting string.

The result is a clean string with standardized single spaces between words and no leading/trailing whitespace.
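The whitespace step can be tried on its own with a deliberately messy string. The variable names below are illustrative. For str patterns in Python 3, \s also matches Unicode whitespace such as the non-breaking space (\u00a0), which the sample includes.

```python
import re

# Mixed spaces, tabs, newlines, a Windows line ending, and a non-breaking space.
messy = "  Hello\t\tworld,\n\n  this   is\r\n spaced\u00a0out.  "

# Collapse every whitespace run to one space, then trim both ends.
tidy = re.sub(r'\s+', ' ', messy).strip()
print(repr(tidy))  # 'Hello world, this is spaced out.'
```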

return clean

The function returns the final sanitized string.

Example Usage and Execution Environment

To use this function, you simply need a Python interpreter. You can save the code in a .py file (e.g., sanitizer.py) and then import and call the function:

# Example of how to use the sanitize_text function

html_input = "<h1>Welcome!</h1><p>This is <strong>some</strong> content.</p><script>alert('XSS Attack!');</script><img src='x' onerror='alert(\"Another attack!\")'>"
malicious_input = "<div>Hello <a href='javascript:alert(\'XSS\')'>Click Me</a> <script>document.cookie='stolen';</script> World!</div>"

# Assuming sanitize_text is defined or imported
sanitized_output1 = sanitize_text(html_input)
sanitized_output2 = sanitize_text(malicious_input)

print("Original 1:", html_input)
print("Sanitized 1:", sanitized_output1)
print("Original 2:", malicious_input)
print("Sanitized 2:", sanitized_output2)

When you run this Python script, you will see the original HTML content alongside its sanitized version, demonstrating how script tags and other HTML elements are effectively removed, leaving only the plain text.

Original 1: <h1>Welcome!</h1><p>This is <strong>some</strong> content.</p><script>alert('XSS Attack!');</script><img src='x' onerror='alert("Another attack!")'>
Sanitized 1: Welcome!This is some content.
Original 2: <div>Hello <a href='javascript:alert('XSS')'>Click Me</a> <script>document.cookie='stolen';</script> World!</div>
Sanitized 2: Hello Click Me World!

💡 Developer Tip: Thoroughly test your sanitization logic with various edge cases, including malformed HTML, unicode characters, and different attack vectors (e.g., event handlers like onerror, CSS expressions, data URIs in attributes) to ensure comprehensive protection. A simple regex might miss sophisticated bypasses.
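
The tip above can be turned into a small, repeatable check. The sketch below repeats sanitize_text so that it runs on its own; the edge_cases list is a set of hypothetical inputs chosen for illustration, not an exhaustive test suite. Several of these inputs slip past the regexes, which is the kind of gap the tip warns about.

```python
import re

def sanitize_text(html_content):
    # Same function as in the lesson, repeated so this block runs standalone.
    clean = re.sub(r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>', '', html_content)
    clean = re.sub(r'<[^>]+>', '', clean)
    return re.sub(r'\s+', ' ', clean).strip()

# Illustrative edge cases (hypothetical inputs, not an exhaustive list).
edge_cases = [
    "<SCRIPT>alert(1)</SCRIPT>ok",            # uppercase tag name
    "<script>alert(1)</script >done",         # space before '>' in closing tag
    "<script>alert(1)",                       # script tag never closed
    "<img src=x onerror=alert(1)",            # tag missing its closing '>'
    "&lt;script&gt;alert(1)&lt;/script&gt;",  # entity-encoded markup
]

for case in edge_cases:
    print(repr(case), "->", repr(sanitize_text(case)))
```

In the first three cases the script body survives as plain text (for example 'alert(1)ok'), and the last two inputs come back unchanged. If the sanitized output is later inserted into an HTML page, it should still be escaped at that point.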
