🔒 feat(security): Add script to detect dangerous Unicode characters (#43)

This commit adds a new script, `check-for-unicode/run.sh`, that scans files and directories for potentially dangerous Unicode characters. These characters can be exploited in AI systems, cause display/parsing issues, or enable social engineering attacks.

The script detects a comprehensive list of harmful Unicode characters, including:

- Zero-width and invisible characters
- Bidirectional text controls (Trojan Source attacks)
- Annotation and formatting characters
- Line and paragraph separators
- Additional format characters
- Variation selectors

The script can be used to identify these characters in files and directories, helping to improve the security and reliability of systems that process text data.
2026-03-31 06:33:56 -04:00 · 2025-05-31 20:32:08 -05:00
parent 8cb380870a
commit df0ea596a5
2 changed files with 269 additions and 0 deletions
--- a/check-for-unicode/README.md
+++ b/check-for-unicode/README.md
@@ -0,0 +1,166 @@
+# Check for Unicode
+
+A security-focused script that scans files and directories for potentially dangerous Unicode characters that could be exploited in AI systems or cause display/parsing issues.
+
+## Purpose
+
+This script helps identify hidden or dangerous Unicode characters that can:
+
+- Hide instructions or payloads in text passed to AI systems
+- Make text invisible or alter it without any visible change
+- Cause text to render differently from how it is parsed
+- Enable social engineering attacks through character spoofing
+
+## Detected Unicode Characters
+
+The script scans for the following potentially harmful Unicode characters:
+
+### Zero-Width and Invisible Characters
+
+| Character                 | Code Point | Description                                         | Risk Level |
+| ------------------------- | ---------- | --------------------------------------------------- | ---------- |
+| Zero Width Space          | U+200B     | Invisible character that can hide malicious content | High       |
+| Zero Width Non-Joiner     | U+200C     | Can break text parsing logic                        | Medium     |
+| Zero Width Joiner         | U+200D     | Can create unexpected character combinations        | Medium     |
+| Word Joiner               | U+2060     | Invisible character that prevents line breaks       | Medium     |
+| Function Application      | U+2061     | Mathematical invisible operator                     | Low        |
+| Invisible Times           | U+2062     | Mathematical invisible operator                     | Low        |
+| Invisible Separator       | U+2063     | Mathematical invisible operator                     | Low        |
+| Invisible Plus            | U+2064     | Mathematical invisible operator                     | Low        |
+| Zero Width No-Break Space | U+FEFF     | Byte Order Mark, can cause parsing issues           | Medium     |
+| Combining Grapheme Joiner | U+034F     | Can create unexpected character combinations        | Medium     |
+
+### Bidirectional Text Controls (Trojan Source - CVE-2021-42574)
+
+| Character                  | Code Point | Description                              | Risk Level |
+| -------------------------- | ---------- | ---------------------------------------- | ---------- |
+| Left-to-Right Embedding    | U+202A     | Can manipulate text direction            | High       |
+| Right-to-Left Embedding    | U+202B     | Can manipulate text direction            | High       |
+| Pop Directional Formatting | U+202C     | Ends directional formatting              | High       |
+| Left-to-Right Override     | U+202D     | Forces left-to-right text direction      | High       |
+| Right-to-Left Override     | U+202E     | Can reverse text direction for spoofing  | High       |
+| Left-to-Right Isolate      | U+2066     | Isolates text direction                  | High       |
+| Right-to-Left Isolate      | U+2067     | Isolates text direction                  | High       |
+| First Strong Isolate       | U+2068     | Isolates based on first strong character | High       |
+| Pop Directional Isolate    | U+2069     | Ends directional isolation               | High       |
+| Arabic Letter Mark         | U+061C     | Marks Arabic text direction              | Medium     |
+| Left-to-Right Mark         | U+200E     | Marks left-to-right text direction       | Medium     |
+| Right-to-Left Mark         | U+200F     | Marks right-to-left text direction       | Medium     |
+
+### Annotation and Formatting Characters
+
+| Character                         | Code Point | Description                         | Risk Level |
+| --------------------------------- | ---------- | ----------------------------------- | ---------- |
+| Interlinear Annotation Anchor     | U+FFF9     | Can hide annotations                | Medium     |
+| Interlinear Annotation Separator  | U+FFFA     | Separates annotation components     | Medium     |
+| Interlinear Annotation Terminator | U+FFFB     | Terminates annotations              | Medium     |
+| Object Replacement Character      | U+FFFC     | Placeholder for embedded objects    | Medium     |
+| Replacement Character             | U+FFFD     | Used for unknown/invalid characters | Low        |
+
+### Line and Paragraph Separators
+
+| Character           | Code Point | Description             | Risk Level |
+| ------------------- | ---------- | ----------------------- | ---------- |
+| Line Separator      | U+2028     | Can break parsing logic | Medium     |
+| Paragraph Separator | U+2029     | Can break parsing logic | Medium     |
+
+### Additional Format Characters
+
+| Character                 | Code Point | Description                 | Risk Level |
+| ------------------------- | ---------- | --------------------------- | ---------- |
+| Soft Hyphen               | U+00AD     | Invisible hyphenation point | Low        |
+| Hangul Choseong Filler    | U+115F     | Korean text filler          | Low        |
+| Hangul Jungseong Filler   | U+1160     | Korean text filler          | Low        |
+| Khmer Vowel Inherent Aq   | U+17B4     | Khmer script formatting     | Low        |
+| Khmer Vowel Inherent Aa   | U+17B5     | Khmer script formatting     | Low        |
+| Mongolian Vowel Separator | U+180E     | Mongolian script formatting | Low        |
+| Hangul Filler             | U+3164     | Korean text filler          | Low        |
+
+### Variation Selectors
+
+| Character               | Code Point  | Description                     | Risk Level |
+| ----------------------- | ----------- | ------------------------------- | ---------- |
+| Variation Selector 1-16 | U+FE00-FE0F | Can change character appearance | Medium     |
+
+## Usage
+
+### Quick Run (Remote)
+
+```bash
+bash -c "$(wget -qLO - https://raw.githubusercontent.com/bigbeartechworld/big-bear-scripts/master/check-for-unicode/run.sh)" run.sh /path/to/scan
+```
+
+### Local Usage
+
+#### Scan a single file:
+
+```bash
+./run.sh /path/to/file.txt
+```
+
+#### Scan a directory recursively:
+
+```bash
+./run.sh /path/to/directory
+```
+
+#### Scan current directory:
+
+```bash
+./run.sh .
+```
+
+## Example Output
+
+```
+Scanning: ./suspicious_file.txt
+  Warning: Non-UTF8 file detected
+  [!] Found dangerous Unicode: U+200b
+
+Scanning: ./clean_file.txt
+
+Scanning: ./another_file.md
+  [!] Found dangerous Unicode: U+202e
+```
+
+## Features
+
+- **Recursive Directory Scanning**: Automatically scans all files in subdirectories
+- **File Encoding Detection**: Warns about non-UTF8 files that might contain hidden characters
+- **Comprehensive Unicode Detection**: Checks for 52 potentially dangerous Unicode code points, including:
+  - Zero-width and invisible characters
+  - Bidirectional text controls (Trojan Source attacks)
+  - Annotation and formatting characters
+  - Line and paragraph separators
+  - Variation selectors
+- **CVE-2021-42574 Detection**: Flags the bidirectional control characters used in Trojan Source attacks
+- **Clear Output**: Shows which files are being scanned and exactly which Unicode characters are found
+- **Cross-Platform**: Runs on Linux and other Unix-like systems with GNU grep; on macOS, install GNU grep, since the default BSD grep lacks `--perl-regexp`
+
+## Requirements
+
+- Bash shell
+- GNU `grep` with Perl regex support (`--perl-regexp`), run under a UTF-8 locale
+- `file` command for encoding detection
+- `find` command for directory traversal
+
+## Security Considerations
+
+This script is particularly useful for:
+
+- **Code Review**: Detecting hidden characters in source code
+- **Content Moderation**: Identifying potentially malicious text submissions
+- **AI System Security**: Flagging hidden Unicode characters used in prompt injection attempts
+- **Data Validation**: Ensuring clean text data in databases and files
+
+## Exit Codes
+
+- `0`: Scan completed successfully (may or may not have found Unicode characters)
+- `1`: Invalid usage (no file/directory specified)
+
+## Notes
+
+- The script uses Perl-compatible regular expressions for accurate Unicode detection
+- All files are scanned regardless of extension
+- Binary files may produce warnings but will still be scanned
+- Large directories may take some time to process completely
--- a/check-for-unicode/run.sh
+++ b/check-for-unicode/run.sh
@@ -0,0 +1,103 @@
+#!/usr/bin/env bash
+
+# List of dangerous Unicode characters for AI systems and security
+harmful_unicodes=(
+    # Zero-width and invisible characters
+    "\u200B"  # Zero Width Space
+    "\u200C"  # Zero Width Non-Joiner
+    "\u200D"  # Zero Width Joiner
+    "\u2060"  # Word Joiner
+    "\u2061"  # Function Application
+    "\u2062"  # Invisible Times
+    "\u2063"  # Invisible Separator
+    "\u2064"  # Invisible Plus
+    "\uFEFF"  # Zero Width No-Break Space (BOM)
+    "\u034F"  # Combining Grapheme Joiner
+    
+    # Bidirectional text controls (Trojan Source attacks - CVE-2021-42574)
+    "\u202A"  # Left-to-Right Embedding
+    "\u202B"  # Right-to-Left Embedding
+    "\u202C"  # Pop Directional Formatting
+    "\u202D"  # Left-to-Right Override
+    "\u202E"  # Right-to-Left Override
+    "\u2066"  # Left-to-Right Isolate
+    "\u2067"  # Right-to-Left Isolate
+    "\u2068"  # First Strong Isolate
+    "\u2069"  # Pop Directional Isolate
+    "\u061C"  # Arabic Letter Mark
+    "\u200E"  # Left-to-Right Mark
+    "\u200F"  # Right-to-Left Mark
+    
+    # Annotation and formatting characters
+    "\uFFF9"  # Interlinear Annotation Anchor
+    "\uFFFA"  # Interlinear Annotation Separator
+    "\uFFFB"  # Interlinear Annotation Terminator
+    "\uFFFC"  # Object Replacement Character
+    "\uFFFD"  # Replacement Character
+    
+    # Line and paragraph separators
+    "\u2028"  # Line Separator
+    "\u2029"  # Paragraph Separator
+    
+    # Additional format characters
+    "\u00AD"  # Soft Hyphen
+    "\u115F"  # Hangul Choseong Filler
+    "\u1160"  # Hangul Jungseong Filler
+    "\u17B4"  # Khmer Vowel Inherent Aq
+    "\u17B5"  # Khmer Vowel Inherent Aa
+    "\u180E"  # Mongolian Vowel Separator
+    "\u3164"  # Hangul Filler
+    
+    # Variation selectors (can change character appearance)
+    "\uFE00"  # Variation Selector-1
+    "\uFE01"  # Variation Selector-2
+    "\uFE02"  # Variation Selector-3
+    "\uFE03"  # Variation Selector-4
+    "\uFE04"  # Variation Selector-5
+    "\uFE05"  # Variation Selector-6
+    "\uFE06"  # Variation Selector-7
+    "\uFE07"  # Variation Selector-8
+    "\uFE08"  # Variation Selector-9
+    "\uFE09"  # Variation Selector-10
+    "\uFE0A"  # Variation Selector-11
+    "\uFE0B"  # Variation Selector-12
+    "\uFE0C"  # Variation Selector-13
+    "\uFE0D"  # Variation Selector-14
+    "\uFE0E"  # Variation Selector-15
+    "\uFE0F"  # Variation Selector-16
+)
+
+if [ $# -eq 0 ]; then
+    echo "Usage: $0 <file/directory>"
+    exit 1
+fi
+
+target="$1"
+
+search_file() {
+    local file="$1" code hex
+    echo "Scanning: $file"
+    
+    # Check file encoding (plain ASCII is valid UTF-8, so accept it too)
+    if ! file -b --mime-encoding "$file" | grep -Eqx 'utf-8|us-ascii'; then
+        echo "  Warning: Non-UTF8 file detected"
+    fi
+
+    # Search for each harmful character (PCRE has no \uXXXX escape; use \x{XXXX})
+    for code in "${harmful_unicodes[@]}"; do
+        hex=$(printf "%04x" "0x${code:2:4}")
+        if grep --perl-regexp -q "\\x{$hex}" "$file"; then
+            echo "  [!] Found dangerous Unicode: U+$hex"
+        fi
+    done
+}
+
+# Arrays cannot be exported to child shells, so search_file runs in this shell
+
+# Handle directories recursively
+if [ -d "$target" ]; then
+    while IFS= read -r -d '' path; do search_file "$path"; done \
+        < <(find "$target" -type f -print0)
+else
+    search_file "$target"
+fi
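
Portability note: `run.sh` relies on `grep --perl-regexp`, which BSD grep (the default on macOS) does not provide. Below is a minimal sketch of a byte-level fallback that encodes a code point as UTF-8 and searches for those bytes with a fixed-string `grep` in the C locale. The helper names `utf8_bytes` and `contains_codepoint` are hypothetical and not part of this commit, and the sketch only handles code points up to U+FFFF, which covers every entry in `harmful_unicodes`.

```shell
#!/usr/bin/env bash
# Sketch only: utf8_bytes and contains_codepoint are hypothetical helpers,
# not part of run.sh. Handles code points up to U+FFFF.

# Print the UTF-8 byte sequence for a code point given as hex (e.g. 202E).
utf8_bytes() {
    local cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then
        printf '%b' "$(printf '\\0%03o' "$cp")"
    elif [ "$cp" -lt 2048 ]; then
        printf '%b' "$(printf '\\0%03o\\0%03o' \
            $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F))))"
    else
        printf '%b' "$(printf '\\0%03o\\0%03o\\0%03o' \
            $((0xE0 | (cp >> 12))) \
            $((0x80 | ((cp >> 6) & 0x3F))) \
            $((0x80 | (cp & 0x3F))))"
    fi
}

# Succeed if the file ($2) contains the code point ($1, hex) encoded as UTF-8.
contains_codepoint() {
    local needle
    needle=$(utf8_bytes "$1")
    # LC_ALL=C makes grep compare raw bytes; -F disables regex interpretation.
    LC_ALL=C grep -qF -- "$needle" "$2"
}

# Demo: a line that hides a Right-to-Left Override (U+202E, bytes E2 80 AE).
sample=$(mktemp)
printf 'access_level = "user\342\200\256"\n' > "$sample"
for hex in 200B 202E; do
    if contains_codepoint "$hex" "$sample"; then
        echo "U+$hex: found"
    else
        echo "U+$hex: not found"
    fi
done
rm -f "$sample"
```

Because the match is done on raw bytes, this approach does not depend on the locale or on PCRE support, but it only finds characters in UTF-8 encoded files; the same code points in UTF-16 or other encodings would go undetected.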