Building a Simple Web Application Security Scanner with Python: A Beginner's Guide
In this article, you’ll learn how to create a basic security tool that can help identify common vulnerabilities in web applications.
I have two goals here. The first is to empower you with the skills to develop tools that can help enhance the overall security posture of your websites. The second is to help you practice some Python programming.
In this guide, you will be building a Python-based security scanner that can detect XSS, SQL injection, and sensitive PII (Personally Identifiable Information).
Types of Vulnerabilities
Generally, we can categorize web security vulnerabilities into the following buckets (for even more buckets, check the OWASP Top 10):
- SQL injection: A technique where attackers insert malicious SQL code into SQL queries through unvalidated inputs, allowing them to read or modify database contents (see the schematic example after this list).
- Cross-Site Scripting (XSS): A technique where attackers inject malicious JavaScript in trusted websites. This allows them to execute the JavaScript code in the context of the browser and steal sensitive information or perform unauthorized operations.
- Sensitive information exposure: A security issue where an application unintentionally reveals sensitive data like passwords, API keys and so on through logs, insecure storage, and other vulnerabilities.
- Common security misconfigurations: Security issues that occur due to improper configuration of web servers – like default credentials for administrator accounts, enabled debug mode, publicly available administrator dashboards with weak credentials, and so on.
- Basic authentication weaknesses: Security issues that occur due to lapses in password policies, user authentication processes, improper session management, and so on.
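To make the SQL injection bullet concrete, here’s a schematic example of how unvalidated input rewrites a query (the table name and input value are invented for illustration):

# Schematic only: how unvalidated input rewrites a SQL query
user_input = "' OR '1'='1"
query = f"SELECT * FROM users WHERE id = '{user_input}'"
print(query)
# Output: SELECT * FROM users WHERE id = '' OR '1'='1'
# The WHERE clause is now always true, so every row matches.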
Prerequisites
To follow along with this tutorial, you will need:
- Python 3.x
- Basic understanding of the HTTP protocol
- Basic understanding of web applications
- Basic understanding of how XSS, SQL injection, and basic security attacks work
Setting Up Our Development Environment
Let’s install our required dependencies with the following command:
pip install requests beautifulsoup4 urllib3 colorama
We’ll use these dependencies in our code file:
# Required packages
import requests
from bs4 import BeautifulSoup
import urllib.parse
import urllib3  # lets us silence the TLS warnings triggered by verify=False
import colorama
import re
from concurrent.futures import ThreadPoolExecutor
import sys
from typing import List, Dict, Set
Building our Core Scanner Class
Once you have the dependencies, it’s time to write the core scanner class.
This will be our main class, handling the core web security scanning functionality: it tracks visited pages and stores our findings.
We have the `normalize_url` function that we’ll use to ensure we don’t rescan URLs that have already been seen. This function strips the HTTP GET parameters from the URL. For example, `https://example.com/page?id=1` will become `https://example.com/page` after normalization.
class WebSecurityScanner:
def __init__(self, target_url: str, max_depth: int = 3):
"""
Initialize the security scanner with a target URL and maximum crawl depth.
Args:
target_url: The base URL to scan
max_depth: Maximum depth for crawling links (default: 3)
"""
self.target_url = target_url
self.max_depth = max_depth
self.visited_urls: Set[str] = set()
self.vulnerabilities: List[Dict] = []
self.session = requests.Session()
# Initialize colorama for cross-platform colored output
colorama.init()
        # Silence the InsecureRequestWarning raised because we crawl with verify=False
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def normalize_url(self, url: str) -> str:
"""Normalize the URL to prevent duplicate checks"""
parsed = urllib.parse.urlparse(url)
return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
Implementing the Crawler
The first step in our scanner is to implement a web crawler that will discover pages and URLs in a given target application. Make sure you write these functions inside our `WebSecurityScanner` class.
def crawl(self, url: str, depth: int = 0) -> None:
"""
Crawl the website to discover pages and endpoints.
Args:
url: Current URL to crawl
depth: Current depth in the crawl tree
"""
if depth > self.max_depth or url in self.visited_urls:
return
try:
self.visited_urls.add(url)
response = self.session.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all links in the page
links = soup.find_all('a', href=True)
for link in links:
next_url = urllib.parse.urljoin(url, link['href'])
if next_url.startswith(self.target_url):
self.crawl(next_url, depth + 1)
except Exception as e:
print(f"Error crawling {url}: {str(e)}")
This `crawl` function performs a depth-first crawl of a website, exploring all of its pages while staying within the specified domain. For example, if you point the scanner at `https://google.com`, the function collects all the URLs on each page and checks, one by one, whether they belong to the specified domain (that is, `google.com`). If a URL does, the function recursively crawls it, up to the maximum depth configured via `max_depth` (the `depth` argument tracks the current level). We also have some exception handling so that errors during crawling are reported without halting the scan.
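If you want to test the crawler in isolation before wiring up the security checks, something like this works (a sketch; `http://localhost:8000` stands in for whatever test application you’re allowed to scan):

# Illustrative: crawl a local test app and list every discovered URL
scanner = WebSecurityScanner("http://localhost:8000", max_depth=2)
scanner.crawl(scanner.target_url)
for discovered in sorted(scanner.visited_urls):
    print(discovered)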
Designing and Implementing the Security Checks
Now let’s finally get to the juicy part and implement our security checks. We’ll start first with SQL Injection.
SQL Injection Detection Check
def check_sql_injection(self, url: str) -> None:
"""Test for potential SQL injection vulnerabilities"""
sql_payloads = ["'", "1' OR '1'='1", "' OR 1=1--", "' UNION SELECT NULL--"]
for payload in sql_payloads:
try:
# Test GET parameters
parsed = urllib.parse.urlparse(url)
params = urllib.parse.parse_qs(parsed.query)
for param in params:
                    # URL-encode the payload so quotes and spaces survive the request
                    test_url = url.replace(f"{param}={params[param][0]}",
                                           f"{param}={urllib.parse.quote(payload)}")
response = self.session.get(test_url)
# Look for SQL error messages
if any(error in response.text.lower() for error in
['sql', 'mysql', 'sqlite', 'postgresql', 'oracle']):
self.report_vulnerability({
'type': 'SQL Injection',
'url': url,
'parameter': param,
'payload': payload
})
except Exception as e:
print(f"Error testing SQL injection on {url}: {str(e)}")
This function performs basic SQL injection checks by testing the URL’s GET parameters against common SQL injection payloads and looking for error messages that might hint at a vulnerability. If the response to a simple GET request contains a database-related error message, we use the `report_vulnerability` function to record it as a security issue in the final report this script generates. For the sake of this example, we test a handful of commonly used SQL injection payloads, but you can extend the list to cover many more.
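Matching generic keywords like 'sql' anywhere in the response will flag pages that merely mention databases. One improvement (a sketch; the signature strings below are common examples I’ve chosen for illustration, not an exhaustive list) is to look for database-specific error messages instead:

# Illustrative: database-specific error signatures are far less noisy
# than matching generic words like 'sql' anywhere in the page
DB_ERROR_SIGNATURES = {
    'MySQL': ["you have an error in your sql syntax", "warning: mysql"],
    'PostgreSQL': ["syntax error at or near", "pg_query"],
    'SQLite': ["unrecognized token", "sqlite3.operationalerror"],
    'Oracle': ["ora-00933", "ora-01756"],
}

def looks_like_db_error(body: str) -> bool:
    """Return True if the response body contains a known database error string."""
    body = body.lower()
    return any(signature in body
               for signatures in DB_ERROR_SIGNATURES.values()
               for signature in signatures)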
XSS (Cross-Site Scripting) Check
Now let’s implement the second security check for XSS payloads.
def check_xss(self, url: str) -> None:
"""Test for potential Cross-Site Scripting vulnerabilities"""
xss_payloads = [
"<script>alert('XSS')</script>",
"<img src=x onerror=alert('XSS')>",
"javascript:alert('XSS')"
]
for payload in xss_payloads:
try:
# Test GET parameters
parsed = urllib.parse.urlparse(url)
params = urllib.parse.parse_qs(parsed.query)
for param in params:
test_url = url.replace(f"{param}={params[param][0]}",
f"{param}={urllib.parse.quote(payload)}")
response = self.session.get(test_url)
if payload in response.text:
self.report_vulnerability({
'type': 'Cross-Site Scripting (XSS)',
'url': url,
'parameter': param,
'payload': payload
})
except Exception as e:
print(f"Error testing XSS on {url}: {str(e)}")
This function, just like the SQL injection tester, uses a set of common XSS payloads and applies the same idea. The key difference is that instead of looking for an error message, we look for our injected payload appearing unmodified in the response. If the payload is reflected back verbatim, it would most likely execute in the victim’s browser as a reflected XSS attack.
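Reflected XSS isn’t limited to GET parameters: form submissions are a common entry point too. As a possible extension (a sketch that assumes the target’s forms accept text fields and submit via POST), you could locate forms with BeautifulSoup and inject the payload into every named input:

# Illustrative extension: test HTML forms for reflected XSS via POST
def check_xss_in_forms(self, url: str) -> None:
    """Submit an XSS payload through every form found on the page."""
    payload = "<script>alert('XSS')</script>"
    response = self.session.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for form in soup.find_all('form'):
        action = urllib.parse.urljoin(url, form.get('action', ''))
        # Fill every named input with the payload
        data = {field['name']: payload
                for field in form.find_all('input', attrs={'name': True})}
        if not data:
            continue
        result = self.session.post(action, data=data)
        if payload in result.text:
            self.report_vulnerability({
                'type': 'Cross-Site Scripting (XSS, form)',
                'url': action,
                'payload': payload
            })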
Sensitive Information Exposure Check
Now let’s implement our final check for sensitive PII.
def check_sensitive_info(self, url: str) -> None:
"""Check for exposed sensitive information"""
sensitive_patterns = {
'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
'api_key': r'api[_-]?key[_-]?([\'"|`])([a-zA-Z0-9]{32,45})\1'
}
try:
response = self.session.get(url)
for info_type, pattern in sensitive_patterns.items():
matches = re.finditer(pattern, response.text)
for match in matches:
self.report_vulnerability({
'type': 'Sensitive Information Exposure',
'url': url,
'info_type': info_type,
'pattern': pattern
})
except Exception as e:
print(f"Error checking sensitive information on {url}: {str(e)}")
This function uses a set of predefined regex patterns to search for PII like emails, phone numbers, SSNs, and API keys (here, strings of the form `api_key` followed by a quoted 32-45 character alphanumeric value). Just like the previous two checks, we fetch the response text for the URL and run our regex patterns over it. If we find any matches, we report them with the `report_vulnerability` function. Make sure all of these functions are defined in the `WebSecurityScanner` class.
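Right now the report records which pattern fired but not what it matched, which makes findings hard to verify. One option is to replace the `report_vulnerability` call inside the match loop with a version that includes a redacted preview (a sketch; the four-character preview length is arbitrary):

# Illustrative: attach a redacted preview of the match to the finding
match_text = match.group(0)
self.report_vulnerability({
    'type': 'Sensitive Information Exposure',
    'url': url,
    'info_type': info_type,
    # Show only the first four characters, mask the rest
    'preview': match_text[:4] + '*' * max(len(match_text) - 4, 0)
})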
Implementing the Main Scanning Logic
Let’s finally stitch everything together by defining the `scan` and `report_vulnerability` functions in the `WebSecurityScanner` class:
def scan(self) -> List[Dict]:
"""
Main scanning method that coordinates the security checks
Returns:
List of discovered vulnerabilities
"""
print(f"\n{colorama.Fore.BLUE}Starting security scan of {self.target_url}{colorama.Style.RESET_ALL}\n")
# First, crawl the website
self.crawl(self.target_url)
# Then run security checks on all discovered URLs
with ThreadPoolExecutor(max_workers=5) as executor:
for url in self.visited_urls:
executor.submit(self.check_sql_injection, url)
executor.submit(self.check_xss, url)
executor.submit(self.check_sensitive_info, url)
return self.vulnerabilities
def report_vulnerability(self, vulnerability: Dict) -> None:
"""Record and display found vulnerabilities"""
self.vulnerabilities.append(vulnerability)
print(f"{colorama.Fore.RED}[VULNERABILITY FOUND]{colorama.Style.RESET_ALL}")
for key, value in vulnerability.items():
print(f"{key}: {value}")
print()
This code defines our `scan` function, which invokes the `crawl` function to recursively crawl the website and then, using multithreading, applies all three security checks to every visited URL. We have also defined the `report_vulnerability` function, which prints each finding to the console and stores it in our `vulnerabilities` list.
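One caveat: because the checks run in five worker threads, `report_vulnerability` can be called concurrently and its printed lines may interleave. A lock keeps each finding’s output together (a small sketch; in CPython, list.append is already atomic, so the lock mainly protects the printing):

# Illustrative: make reporting thread-safe with a lock
import threading

# In __init__, add: self.report_lock = threading.Lock()

def report_vulnerability(self, vulnerability: Dict) -> None:
    """Record and display found vulnerabilities without interleaved output"""
    with self.report_lock:
        self.vulnerabilities.append(vulnerability)
        print(f"{colorama.Fore.RED}[VULNERABILITY FOUND]{colorama.Style.RESET_ALL}")
        for key, value in vulnerability.items():
            print(f"{key}: {value}")
        print()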
Now let’s finally use our scanner by saving it as `scanner.py`:
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: python scanner.py <target_url>")
sys.exit(1)
target_url = sys.argv[1]
scanner = WebSecurityScanner(target_url)
vulnerabilities = scanner.scan()
# Print summary
print(f"\n{colorama.Fore.GREEN}Scan Complete!{colorama.Style.RESET_ALL}")
print(f"Total URLs scanned: {len(scanner.visited_urls)}")
print(f"Vulnerabilities found: {len(vulnerabilities)}")
The target URL is supplied as a command-line argument, and at the end of the scan we get a summary of the URLs scanned and the vulnerabilities found. Now let’s discuss how you can extend the scanner and add more features.
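For example, assuming you have a test application running locally (the URL below is just a placeholder for a target you’re allowed to scan):

python scanner.py http://localhost:8000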
Extending the Security Scanner
Here are some ideas to extend this basic security scanner into something even more advanced:
- Add more vulnerability checks like CSRF detection, directory traversal, and so on.
- Improve reporting with HTML or PDF output.
- Add configuration options for scan intensity and scope (for example, specifying the crawl depth through a CLI argument).
- Implement proper rate limiting (see the sketch after this list).
- Add authentication support for testing URLs that require session-based authentication.
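For instance, rate limiting can be as simple as spacing out requests through a shared helper (a minimal sketch, assuming a fixed one-second interval is acceptable for your target; a token bucket would be more flexible):

# Illustrative: allow at most one request every min_interval seconds
import threading
import time

class RateLimiter:
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.lock = threading.Lock()
        self.last_request = 0.0

    def wait(self) -> None:
        """Block until at least min_interval seconds since the last request."""
        with self.lock:
            elapsed = time.time() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_request = time.time()

# Usage: create one limiter in __init__ and call limiter.wait()
# before every self.session.get(...) or self.session.post(...)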
Wrapping Up
Now you know how to build a basic security scanner! This scanner demonstrates a few core concepts of Web Security.
Keep in mind that this tutorial should only be used for educational purposes. There are professionally designed, enterprise-grade tools like Burp Suite and OWASP ZAP that can check for hundreds of security vulnerabilities at a much larger scale.
I hope you learned the basics of web security and a bit of Python programming as well.