mirror of https://github.com/bigbeartechworld/big-bear-scripts.git synced 2026-03-31 06:33:56 -04:00

Christopher 6c56f29c0a ✨ feat: Implement Unicode Security Scanner (#44)

This commit introduces a comprehensive Unicode Security Scanner that
detects a wide range of dangerous Unicode characters that can be used
in security attacks, including Trojan Source attacks (CVE-2021-42574)
and other invisible characters. The scanner uses a list of harmful
Unicode patterns in hex format for efficient grep-based scanning.

2025-06-01 11:22:15 -05:00

Check for Unicode

A security-focused script that scans files and directories for potentially dangerous Unicode characters that could be exploited in AI systems or cause display/parsing issues.

Purpose

This script helps identify hidden or dangerous Unicode characters that can:

Cause security vulnerabilities in AI systems
Create invisible text manipulation
Lead to text rendering issues
Enable social engineering attacks through character spoofing

Detected Unicode Characters

The script scans for the following potentially harmful Unicode characters:

Zero-Width and Invisible Characters

Character	Code Point	Description	Risk Level
Zero Width Space	U+200B	Invisible character that can hide malicious content	High
Zero Width Non-Joiner	U+200C	Can break text parsing logic	Medium
Zero Width Joiner	U+200D	Can create unexpected character combinations	Medium
Word Joiner	U+2060	Invisible character that prevents line breaks	Medium
Function Application	U+2061	Mathematical invisible operator	Low
Invisible Times	U+2062	Mathematical invisible operator	Low
Invisible Separator	U+2063	Mathematical invisible operator	Low
Invisible Plus	U+2064	Mathematical invisible operator	Low
Zero Width No-Break Space	U+FEFF	Byte Order Mark, can cause parsing issues	Medium
Combining Grapheme Joiner	U+034F	Can create unexpected character combinations	Medium
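
As a rough illustration of how a check for a few of these characters might look, the sketch below matches their UTF-8 byte sequences with grep -P. This is not the code in run.sh: the function name has_zero_width is hypothetical, and matching raw bytes under LC_ALL=C is an assumption made so the pattern does not depend on the caller's locale.

```bash
# Hypothetical helper (not taken from run.sh): succeed if the file contains
# U+200B, U+200C, U+200D, U+2060, or U+FEFF, matched as raw UTF-8 bytes.
has_zero_width() {
  local file=$1
  # U+200B..U+200D -> E2 80 8B..8D
  # U+2060         -> E2 81 A0
  # U+FEFF         -> EF BB BF
  # -a treats the file as text, -q suppresses output, -P enables Perl regex.
  LC_ALL=C grep -aqP '\xE2\x80[\x8B-\x8D]|\xE2\x81\xA0|\xEF\xBB\xBF' -- "$file"
}
```

A caller could then write `if has_zero_width notes.txt; then echo "zero-width character present"; fi`.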

Bidirectional Text Controls (Trojan Source - CVE-2021-42574)

Character	Code Point	Description	Risk Level
Left-to-Right Embedding	U+202A	Can manipulate text direction	High
Right-to-Left Embedding	U+202B	Can manipulate text direction	High
Pop Directional Formatting	U+202C	Ends directional formatting	High
Left-to-Right Override	U+202D	Forces left-to-right text direction	High
Right-to-Left Override	U+202E	Can reverse text direction for spoofing	High
Left-to-Right Isolate	U+2066	Isolates text direction	High
Right-to-Left Isolate	U+2067	Isolates text direction	High
First Strong Isolate	U+2068	Isolates based on first strong character	High
Pop Directional Isolate	U+2069	Ends directional isolation	High
Arabic Letter Mark	U+061C	Marks Arabic text direction	Medium
Left-to-Right Mark	U+200E	Marks left-to-right text direction	Medium
Right-to-Left Mark	U+200F	Marks right-to-left text direction	Medium
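
For code review it can help to know which lines carry the embedding, override, and isolate controls from the table above. The following is a minimal sketch under the assumption that the input is UTF-8; bidi_lines is a hypothetical name and the pattern is written here for illustration, not copied from run.sh.

```bash
# Hypothetical helper: print the line numbers that contain a bidirectional
# embedding, override, or isolate control (U+202A..U+202E, U+2066..U+2069).
bidi_lines() {
  local file=$1
  # U+202A..U+202E -> E2 80 AA..AE
  # U+2066..U+2069 -> E2 81 A6..A9
  # grep -n prefixes each matching line with "N:", and cut keeps only N.
  LC_ALL=C grep -anP '\xE2\x80[\xAA-\xAE]|\xE2\x81[\xA6-\xA9]' -- "$file" \
    | cut -d: -f1
}
```

The marks U+061C, U+200E, and U+200F are left out of this sketch on purpose, since the table rates them lower; they could be added as further alternatives in the pattern.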

Annotation and Formatting Characters

Character	Code Point	Description	Risk Level
Interlinear Annotation Anchor	U+FFF9	Can hide annotations	Medium
Interlinear Annotation Separator	U+FFFA	Separates annotation components	Medium
Interlinear Annotation Terminator	U+FFFB	Terminates annotations	Medium
Object Replacement Character	U+FFFC	Placeholder for embedded objects	Medium
Replacement Character	U+FFFD	Used for unknown/invalid characters	Low

Line and Paragraph Separators

Character	Code Point	Description	Risk Level
Line Separator	U+2028	Can break parsing logic	Medium
Paragraph Separator	U+2029	Can break parsing logic	Medium

Additional Format Characters

Character	Code Point	Description	Risk Level
Soft Hyphen	U+00AD	Invisible hyphenation point	Low
Hangul Choseong Filler	U+115F	Korean text filler	Low
Hangul Jungseong Filler	U+1160	Korean text filler	Low
Khmer Vowel Inherent Aq	U+17B4	Khmer script formatting	Low
Khmer Vowel Inherent Aa	U+17B5	Khmer script formatting	Low
Mongolian Vowel Separator	U+180E	Mongolian script formatting	Low
Hangul Filler	U+3164	Korean text filler	Low

Variation Selectors

Character	Code Point	Description	Risk Level
Variation Selectors 1 to 16	U+FE00 to U+FE0F	Can change character appearance	Medium
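
Any code point in the tables above can be turned into a byte-level hex pattern suitable for grep -P by encoding it as UTF-8. The sketch below shows that conversion in plain Bash arithmetic. The function name codepoint_to_pattern is hypothetical, and whether run.sh builds its patterns this way is an assumption.

```bash
# Hypothetical helper: convert a hex code point (for example 200B) into an
# escaped UTF-8 byte pattern such as \xE2\x80\x8B.
codepoint_to_pattern() {
  local cp=$(( 16#$1 ))   # parse the argument as a base-16 number
  if (( cp < 0x80 )); then
    # 1 byte: 0xxxxxxx
    printf '\\x%02X\n' "$cp"
  elif (( cp < 0x800 )); then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '\\x%02X\\x%02X\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\x%02X\\x%02X\\x%02X\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '\\x%02X\\x%02X\\x%02X\\x%02X\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

codepoint_to_pattern 202E   # prints \xE2\x80\xAE
```

The output can be passed straight to `LC_ALL=C grep -aP`, which matches it byte by byte.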

Usage

Quick Run (Remote)

bash -c "$(wget -qLO - https://raw.githubusercontent.com/bigbeartechworld/big-bear-scripts/master/check-for-unicode/run.sh)" -- .

Local Usage

Scan a single file:

./run.sh /path/to/file.txt

Scan a directory recursively:

./run.sh /path/to/directory

Scan current directory:

./run.sh .

Example Output

Scanning: ./suspicious_file.txt
  Warning: Non-UTF8 file detected
  [!] Found dangerous Unicode: U+200b

Scanning: ./clean_file.txt

Scanning: ./another_file.md
  [!] Found dangerous Unicode: U+202e

Features

Recursive Directory Scanning: Automatically scans all files in subdirectories
File Encoding Detection: Warns about non-UTF8 files that might contain hidden characters
Comprehensive Unicode Detection: Checks for more than 50 potentially dangerous Unicode code points, including:
- Zero-width and invisible characters
- Bidirectional text controls (Trojan Source attacks)
- Annotation and formatting characters
- Line and paragraph separators
- Variation selectors
CVE-2021-42574 Protection: Detects the bidirectional control characters used in Trojan Source attacks
Clear Output: Shows which files are being scanned and exactly which Unicode characters are found
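Cross-Platform: Works on Linux, macOS, and other Unix-like systems, provided the installed grep supports Perl regex (see Requirements); the BSD grep that ships with macOS typically does not, so GNU grep may need to be installed there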
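
The recursive scanning and output format described above can be sketched roughly as follows. This is an illustrative outline, not the actual run.sh: scan_tree and PATTERNS are hypothetical names, only two code points are listed, and the encoding warning is omitted.

```bash
# Hypothetical pattern list: "label:UTF-8 byte pattern" pairs (only two shown).
PATTERNS=(
  'U+200b:\xE2\x80\x8B'
  'U+202e:\xE2\x80\xAE'
)

# Hypothetical scanner loop: walk every regular file under the given path
# and report each listed code point that occurs in it.
scan_tree() {
  local root=$1 path entry label pattern
  # -print0 with read -d '' keeps file names containing spaces or newlines intact.
  find "$root" -type f -print0 | while IFS= read -r -d '' path; do
    printf 'Scanning: %s\n' "$path"
    for entry in "${PATTERNS[@]}"; do
      label=${entry%%:*}     # text before the first colon
      pattern=${entry#*:}    # text after the first colon
      if LC_ALL=C grep -aqP -- "$pattern" "$path"; then
        printf '  [!] Found dangerous Unicode: %s\n' "$label"
      fi
    done
  done
}
```

Because find also accepts a single file as its starting point, the same function covers both the single-file and the directory case.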

Requirements

Bash shell
grep with Perl regex support (--perl-regexp)
file command for encoding detection
find command for directory traversal
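
Before running a scan, these requirements could be verified with a short preflight check. The sketch below is an assumption about how such a check might be written; grep_supports_pcre and check_requirements are hypothetical names that do not come from run.sh.

```bash
# Hypothetical helper: succeed only if the installed grep accepts -P.
grep_supports_pcre() {
  printf 'x\n' | grep -qP 'x' 2>/dev/null
}

# Hypothetical preflight: report each missing requirement on stderr and
# return 1 if anything is missing, 0 otherwise.
check_requirements() {
  local missing=0 tool
  for tool in grep file find; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing command: $tool" >&2
      missing=1
    fi
  done
  if ! grep_supports_pcre; then
    echo "grep lacks Perl regex support (-P / --perl-regexp)" >&2
    missing=1
  fi
  return "$missing"
}
```

A wrapper could call `check_requirements || exit 1` before invoking the scanner.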

Security Considerations

This script is particularly useful for:

Code Review: Detecting hidden characters in source code
Content Moderation: Identifying potentially malicious text submissions
AI System Security: Detecting hidden Unicode characters that can carry prompt injection attacks
Data Validation: Ensuring clean text data in databases and files

Exit Codes

0: Scan completed (whether or not dangerous Unicode characters were found)
1: Invalid usage (no file/directory specified)
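
Since exit code 0 is returned whether or not anything was found, a CI job that should fail on findings has to inspect the output instead. The following sketch assumes the finding lines look like the ones under Example Output; fail_on_findings is a hypothetical name.

```bash
# Hypothetical filter: pass scanner output through unchanged and return 1
# if any line reports a finding, 0 otherwise.
fail_on_findings() {
  local line found=0
  # The "|| [ -n ... ]" part handles a final line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"
    case $line in
      *'[!] Found dangerous Unicode'*) found=1 ;;
    esac
  done
  return "$found"
}

# Possible usage in a pipeline (without pipefail, the pipeline's status is
# the filter's status):
#   ./run.sh . | fail_on_findings
```

This relies on the wording of the finding line, so it would need adjusting if the script's output format changes.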

Notes

The script uses Perl-compatible regular expressions for accurate Unicode detection
All files are scanned regardless of extension
Binary files may produce warnings but will still be scanned
Large directories may take some time to process completely