🔒 feat(security): Add script to detect dangerous Unicode characters (#43)

This commit adds a new script `check-for-unicode/run.sh` that scans files and directories for potentially dangerous Unicode characters. These characters can be exploited in AI systems, cause display/parsing issues, or enable social engineering attacks.

The script detects a comprehensive list of harmful Unicode characters, including:

- Zero-width and invisible characters
- Bidirectional text controls (Trojan Source attacks)
- Annotation and formatting characters
- Line and paragraph separators
- Additional format characters
- Variation selectors

The script can be used to identify these characters in files and directories, helping to improve the security and reliability of systems that process text data.
Christopher
2025-05-31 20:32:08 -05:00
committed by GitHub
parent 8cb380870a
commit df0ea596a5
2 changed files with 269 additions and 0 deletions

check-for-unicode/README.md Normal file

@@ -0,0 +1,166 @@
# Check for Unicode
A security-focused script that scans files and directories for potentially dangerous Unicode characters that could be exploited in AI systems or cause display/parsing issues.
## Purpose
This script helps identify hidden or dangerous Unicode characters that can:
- Cause security vulnerabilities in AI systems
- Create invisible text manipulation
- Lead to text rendering issues
- Enable social engineering attacks through character spoofing
## Detected Unicode Characters
The script scans for the following potentially harmful Unicode characters:
### Zero-Width and Invisible Characters
| Unicode | Code Point | Description | Risk Level |
| ------------------------- | ---------- | --------------------------------------------------- | ---------- |
| Zero Width Space | U+200B | Invisible character that can hide malicious content | High |
| Zero Width Non-Joiner | U+200C | Can break text parsing logic | Medium |
| Zero Width Joiner | U+200D | Can create unexpected character combinations | Medium |
| Word Joiner | U+2060 | Invisible character that prevents line breaks | Medium |
| Function Application | U+2061 | Mathematical invisible operator | Low |
| Invisible Times | U+2062 | Mathematical invisible operator | Low |
| Invisible Separator | U+2063 | Mathematical invisible operator | Low |
| Invisible Plus | U+2064 | Mathematical invisible operator | Low |
| Zero Width No-Break Space | U+FEFF | Byte Order Mark, can cause parsing issues | Medium |
| Combining Grapheme Joiner | U+034F | Can create unexpected character combinations | Medium |
### Bidirectional Text Controls (Trojan Source - CVE-2021-42574)
| Unicode | Code Point | Description | Risk Level |
| -------------------------- | ---------- | ---------------------------------------- | ---------- |
| Left-to-Right Embedding | U+202A | Can manipulate text direction | High |
| Right-to-Left Embedding | U+202B | Can manipulate text direction | High |
| Pop Directional Formatting | U+202C | Ends directional formatting | High |
| Left-to-Right Override | U+202D | Forces left-to-right text direction | High |
| Right-to-Left Override | U+202E | Can reverse text direction for spoofing | High |
| Left-to-Right Isolate | U+2066 | Isolates text direction | High |
| Right-to-Left Isolate | U+2067 | Isolates text direction | High |
| First Strong Isolate | U+2068 | Isolates based on first strong character | High |
| Pop Directional Isolate | U+2069 | Ends directional isolation | High |
| Arabic Letter Mark | U+061C | Marks Arabic text direction | Medium |
| Left-to-Right Mark | U+200E | Marks left-to-right text direction | Medium |
| Right-to-Left Mark | U+200F | Marks right-to-left text direction | Medium |
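
To make the risk concrete, the sketch below builds a small fixture containing U+202E and checks for it by its UTF-8 byte sequence (E2 80 AE). The helper name `has_rlo` and the fixture contents are illustrative assumptions, not part of `run.sh`. The byte-level match is a stand-in for the script's regex approach and only covers UTF-8 encoded files.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of run.sh: succeed if a UTF-8 file
# contains U+202E (Right-to-Left Override), encoded as bytes E2 80 AE.
has_rlo() {
    # LC_ALL=C makes grep compare raw bytes, so no PCRE support is needed.
    LC_ALL=C grep -q -F -e $'\xe2\x80\xae' -- "$1"
}

# Build a small fixture: the override sits between "user" and "nimda",
# so a bidi-aware renderer may display the trailing text reversed.
fixture=$(mktemp)
printf 'access = "user\xe2\x80\xaenimda"\n' > "$fixture"

if has_rlo "$fixture"; then
    echo "U+202E found"
fi
rm -f "$fixture"
```

The same pattern extends to the other controls in this table by swapping in their UTF-8 byte sequences.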
### Annotation and Formatting Characters
| Unicode | Code Point | Description | Risk Level |
| --------------------------------- | ---------- | ----------------------------------- | ---------- |
| Interlinear Annotation Anchor | U+FFF9 | Can hide annotations | Medium |
| Interlinear Annotation Separator | U+FFFA | Separates annotation components | Medium |
| Interlinear Annotation Terminator | U+FFFB | Terminates annotations | Medium |
| Object Replacement Character | U+FFFC | Placeholder for embedded objects | Medium |
| Replacement Character | U+FFFD | Used for unknown/invalid characters | Low |
### Line and Paragraph Separators
| Unicode | Code Point | Description | Risk Level |
| ------------------- | ---------- | ----------------------- | ---------- |
| Line Separator | U+2028 | Can break parsing logic | Medium |
| Paragraph Separator | U+2029 | Can break parsing logic | Medium |
### Additional Format Characters
| Unicode | Code Point | Description | Risk Level |
| ------------------------- | ---------- | --------------------------- | ---------- |
| Soft Hyphen | U+00AD | Invisible hyphenation point | Low |
| Hangul Choseong Filler | U+115F | Korean text filler | Low |
| Hangul Jungseong Filler | U+1160 | Korean text filler | Low |
| Khmer Vowel Inherent Aq | U+17B4 | Khmer script formatting | Low |
| Khmer Vowel Inherent Aa | U+17B5 | Khmer script formatting | Low |
| Mongolian Vowel Separator | U+180E | Mongolian script formatting | Low |
| Hangul Filler | U+3164 | Korean text filler | Low |
### Variation Selectors
| Unicode | Code Point | Description | Risk Level |
| ----------------------- | ----------- | ------------------------------- | ---------- |
| Variation Selector 1-16 | U+FE00-FE0F | Can change character appearance | Medium |
## Usage
### Quick Run (Remote)
```bash
bash -c "$(wget -qLO - https://raw.githubusercontent.com/bigbeartechworld/big-bear-scripts/master/check-for-unicode/run.sh)" run.sh /path/to/scan
```
### Local Usage
#### Scan a single file:
```bash
./run.sh /path/to/file.txt
```
#### Scan a directory recursively:
```bash
./run.sh /path/to/directory
```
#### Scan current directory:
```bash
./run.sh .
```
## Example Output
```
Scanning: ./suspicious_file.txt
Warning: Non-UTF8 file detected
[!] Found dangerous Unicode: U+200b
Scanning: ./clean_file.txt
Scanning: ./another_file.md
[!] Found dangerous Unicode: U+202e
```
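
Because the output is line-oriented, it can be post-processed. The following is a minimal sketch, assuming the `Scanning:` and `[!] Found dangerous Unicode:` line shapes shown in the example; `summarize_findings` is a hypothetical helper, not part of `run.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read scanner output on stdin and print one
# "file: code point" line per finding, skipping clean files.
summarize_findings() {
    awk '
        # Remember the most recent file name ("Scanning: " is 10 characters).
        /^Scanning: / { file = substr($0, 11); next }
        # The code point is the last whitespace-separated field of a finding.
        /\[!\] Found dangerous Unicode: / { print file ": " $NF }
    '
}

# Usage (the path to run.sh is an assumption):
#   ./run.sh . | summarize_findings
```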
## Features
- **Recursive Directory Scanning**: Automatically scans all files in subdirectories
- **File Encoding Detection**: Warns about non-UTF8 files that might contain hidden characters
- **Comprehensive Unicode Detection**: Checks for 50+ different types of potentially dangerous Unicode characters including:
  - Zero-width and invisible characters
  - Bidirectional text controls (Trojan Source attacks)
  - Annotation and formatting characters
  - Line and paragraph separators
  - Variation selectors
- **CVE-2021-42574 Protection**: Specifically detects Trojan Source attack vectors
- **Clear Output**: Shows which files are being scanned and exactly which Unicode characters are found
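- **Cross-Platform**: Works on Linux and other Unix-like systems with GNU `grep`; on macOS, install GNU `grep` first, since the stock BSD `grep` does not support `--perl-regexp`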
## Requirements
- Bash shell
- `grep` with Perl regex support (`--perl-regexp`)
- `file` command for encoding detection
- `find` command for directory traversal
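
These requirements can be verified before a scan. The sketch below is one possible preflight check; `missing_tools` and `grep_supports_pcre` are hypothetical names, not part of `run.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check, not part of run.sh.

# Print the name of every command in "$@" that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed only if the installed grep accepts --perl-regexp.
grep_supports_pcre() {
    printf 'x\n' | grep --perl-regexp -q 'x' 2>/dev/null
}

missing=$(missing_tools grep file find)
if [ -n "$missing" ]; then
    echo "Missing required tools: $missing" >&2
elif ! grep_supports_pcre; then
    echo "grep lacks Perl regex support; a PCRE-enabled grep is needed" >&2
else
    echo "All requirements satisfied"
fi
```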
## Security Considerations
This script is particularly useful for:
- **Code Review**: Detecting hidden characters in source code
- **Content Moderation**: Identifying potentially malicious text submissions
- **AI System Security**: Preventing Unicode-based prompt injection attacks
- **Data Validation**: Ensuring clean text data in databases and files
## Exit Codes
- `0`: Scan completed successfully (may or may not have found Unicode characters)
- `1`: Invalid usage (no file/directory specified)
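
Since exit code `0` does not distinguish a clean scan from one with findings, an automated job has to inspect the output instead. A minimal sketch, assuming the `[!] Found dangerous Unicode` line format shown under Example Output; `fail_on_findings` is a hypothetical wrapper, not part of `run.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: read scanner output on stdin and return non-zero
# when at least one finding line is present.
fail_on_findings() {
    if grep -q -F '[!] Found dangerous Unicode'; then
        echo "Dangerous Unicode detected" >&2
        return 1
    fi
    return 0
}

# Usage in a CI step (the path to run.sh is an assumption):
#   ./run.sh . | fail_on_findings || exit 1
```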
## Notes
- The script uses Perl-compatible regular expressions for accurate Unicode detection
- All files are scanned regardless of extension
- Binary files may produce warnings but will still be scanned
- Large directories may take some time to process completely
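
One reason large directories are slow is that each file is searched once per character. As a sketch of a possible alternative (not what `run.sh` does), the code points could be combined into a single PCRE character class so that one `grep` pass covers all of them. `build_pcre_class` is a hypothetical helper, and the resulting pattern assumes a PCRE-enabled `grep` running in a UTF-8 locale.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn hex code points into one PCRE character class.
build_pcre_class() {
    local class="" cp
    for cp in "$@"; do
        class+="\\x{$cp}"   # appends a literal \x{HHHH} escape
    done
    printf '[%s]\n' "$class"
}

build_pcre_class 200B 202E 2066
# prints: [\x{200B}\x{202E}\x{2066}]

# Usage (file.txt is a placeholder):
#   grep --perl-regexp -q "$(build_pcre_class 200B 202E 2066)" file.txt
```

The trade-off is that a single class reports only that some listed character is present, not which one, so per-character reporting would need a second pass over the matching files.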

check-for-unicode/run.sh Normal file

@@ -0,0 +1,103 @@
#!/usr/bin/env bash
# List of dangerous Unicode characters for AI systems and security
harmful_unicodes=(
# Zero-width and invisible characters
"\u200B" # Zero Width Space
"\u200C" # Zero Width Non-Joiner
"\u200D" # Zero Width Joiner
"\u2060" # Word Joiner
"\u2061" # Function Application
"\u2062" # Invisible Times
"\u2063" # Invisible Separator
"\u2064" # Invisible Plus
"\uFEFF" # Zero Width No-Break Space (BOM)
"\u034F" # Combining Grapheme Joiner
# Bidirectional text controls (Trojan Source attacks - CVE-2021-42574)
"\u202A" # Left-to-Right Embedding
"\u202B" # Right-to-Left Embedding
"\u202C" # Pop Directional Formatting
"\u202D" # Left-to-Right Override
"\u202E" # Right-to-Left Override
"\u2066" # Left-to-Right Isolate
"\u2067" # Right-to-Left Isolate
"\u2068" # First Strong Isolate
"\u2069" # Pop Directional Isolate
"\u061C" # Arabic Letter Mark
"\u200E" # Left-to-Right Mark
"\u200F" # Right-to-Left Mark
# Annotation and formatting characters
"\uFFF9" # Interlinear Annotation Anchor
"\uFFFA" # Interlinear Annotation Separator
"\uFFFB" # Interlinear Annotation Terminator
"\uFFFC" # Object Replacement Character
"\uFFFD" # Replacement Character
# Line and paragraph separators
"\u2028" # Line Separator
"\u2029" # Paragraph Separator
# Additional format characters
"\u00AD" # Soft Hyphen
"\u115F" # Hangul Choseong Filler
"\u1160" # Hangul Jungseong Filler
"\u17B4" # Khmer Vowel Inherent Aq
"\u17B5" # Khmer Vowel Inherent Aa
"\u180E" # Mongolian Vowel Separator
"\u3164" # Hangul Filler
# Variation selectors (can change character appearance)
"\uFE00" # Variation Selector-1
"\uFE01" # Variation Selector-2
"\uFE02" # Variation Selector-3
"\uFE03" # Variation Selector-4
"\uFE04" # Variation Selector-5
"\uFE05" # Variation Selector-6
"\uFE06" # Variation Selector-7
"\uFE07" # Variation Selector-8
"\uFE08" # Variation Selector-9
"\uFE09" # Variation Selector-10
"\uFE0A" # Variation Selector-11
"\uFE0B" # Variation Selector-12
"\uFE0C" # Variation Selector-13
"\uFE0D" # Variation Selector-14
"\uFE0E" # Variation Selector-15
"\uFE0F" # Variation Selector-16
)
if [ $# -eq 0 ]; then
    # Usage errors go to stderr so they do not mix with scan output
    echo "Usage: $0 <file/directory>" >&2
    exit 1
fi
target="$1"
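search_file() {
    local file="$1" code hex
    echo "Scanning: $file"
    # Check file encoding; plain ASCII is a valid subset of UTF-8
    if ! file -bi -- "$file" | grep -qE 'utf-8|us-ascii'; then
        echo " Warning: Non-UTF8 file detected"
    fi
    # Search for each harmful character
    for code in "${harmful_unicodes[@]}"; do
        # Array entries look like \u200B. PCRE rejects \u escapes, so build
        # \x{200B} instead (needs a UTF-8 locale to match code points).
        hex="${code:2:4}"
        if grep --perl-regexp -q -- "\\x{$hex}" "$file"; then
            echo " [!] Found dangerous Unicode: U+$(printf '%04x' "0x$hex")"
        fi
    done
}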
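# Handle directories recursively. The scan runs in this shell because bash
# cannot export arrays, so a child `bash -c` would see an empty character list.
if [ -d "$target" ]; then
    while IFS= read -r -d '' path; do
        search_file "$path"
    done < <(find "$target" -type f -print0)
else
    search_file "$target"
fi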