Rob van der Woude's Scripting Pages

VBScript Scripting Techniques > Files > File Encoding

File Encoding

  1. Check a file's Byte Order Mark (BOM) to determine its text encoding

 

ADODB.Stream
VBScript Code:
' Based on information from
' https://en.wikipedia.org/wiki/Byte_order_mark
'
'Encoding       Hex BOM
'========       =======
'BOCU-1         FB EE 28
'GB-18030       84 31 95 33
'SCSU           0E FE FF
'UTF-1          F7 64 4C
'UTF-7          2B 2F 76 (38|39|2B|2F)
'UTF-8          EF BB BF
'UTF-16 (BE)    FE FF
'UTF-16 (LE)    FF FE
'UTF-32 (BE)    00 00 FE FF
'UTF-32 (LE)    FF FE 00 00
'UTF-EBCDIC     DD 73 66 73

Option Explicit

Const adTypeBinary = 1
Const adTypeText   = 2

Dim i, intRC
Dim dicBOMs, objFSO, objStream
Dim strBOM, strFile, strHead, strType, strUTF7

If WScript.Arguments.Unnamed.Count <> 1 Then Syntax
If WScript.Arguments.Named.Count    > 0 Then Syntax

intRC   = 0
strFile = WScript.Arguments.Unnamed(0)
strType = "Unknown"
strUTF7 = "38;39;2B;2F" ' Allowed values for 4th byte of UTF-7 BOM

Set objFSO = CreateObject( "Scripting.FileSystemObject" )
If Not objFSO.FileExists( strFile ) Then Syntax
Set objFSO = Nothing

Set dicBOMs = CreateObject( "Scripting.Dictionary" )
dicBOMs.Add "0000FEFF", "UTF-32 (BE)"
dicBOMs.Add "0EFEFF",   "SCSU"
dicBOMs.Add "2B2F76",   "UTF-7" ' First 3 bytes of BOM only, 4th byte can have several values
dicBOMs.Add "84319533", "GB-18030"
dicBOMs.Add "DD736673", "UTF-EBCDIC"
dicBOMs.Add "EFBBBF",   "UTF-8"
dicBOMs.Add "F7644C",   "UTF-1"
dicBOMs.Add "FBEE28",   "BOCU-1"
dicBOMs.Add "FEFF",     "UTF-16 (BE)"
dicBOMs.Add "FFFE",     "UTF-16 (LE)"
dicBOMs.Add "FFFE0000", "UTF-32 (LE)"

On Error Resume Next
Set objStream = CreateObject( "ADODB.Stream" )
objStream.Open
objStream.Type = adTypeBinary
objStream.LoadFromFile strFile
If Err Then intRC = 1
objStream.Position = 0
strHead = ""
For i = 0 To 3
        strHead = strHead & UCase( Right( "0" & Hex( AscB( objStream.Read( 1 ) ) ), 2 ) )
        If Err Then intRC = 1
Next
objStream.Close
Set objStream = Nothing
On Error Goto 0

If intRC = 1 Then Syntax

For i = 8 To 4 Step -2 ' Try the longest match (4 bytes) first, next try 3 bytes, finally try 2 bytes
        If strType = "Unknown" Then
                strBOM = Left( strHead, i )
                If dicBOMs.Exists( strBOM ) Then
                        If dicBOMs( strBOM ) = "UTF-7" Then
                                If InStr( strUTF7, Right( strHead, 2 ) ) Then strType = "UTF-7"
                        Else
                                strType = dicBOMs( strBOM )
                        End If
                End If
        End If
Next

If strType = "Unknown" Then intRC = 1

WScript.Echo "File Name     : " & strFile & vbcrlf _
           & "First 4 bytes : " & strHead & vbcrlf _
           & "Matching BOM  : " & strBOM  & vbcrlf _
           & "File Encoding : " & strType

WScript.Quit intRC


Sub Syntax
        Dim strMsg
        strMsg = vbCrLf _
               & "CheckBOM.vbs,  Version 1.00" _
               & vbCrLf _
               & "Check a file's Byte Order Mark (BOM) to determine its text encoding" _
               & vbCrLf & vbCrLf _
               & "Usage:  CheckBOM.vbs  textfilename" _
               & vbCrLf & vbCrLf _
               & "Note:   The file encoding is displayed on screen, e.g. ""UTF-7"" or" _
               & vbCrLf _
               & "        ""UTF-32 (LE)"", or ""Unknown"" if not recognized." _
               & vbCrLf _
               & "        Check this script's source code for a list of recognized BOMs." _
               & vbCrLf & vbCrLf _
               & "Written by Rob van der Woude" _
               & vbCrLf _
               & "http://www.robvanderwoude.com"
        WScript.Echo strMsg
        WScript.Quit 1
End Sub

Requirements:
Windows version: 2000 and later
Network: any
Client software: MDAC 2.8 for Windows 2000
Script Engine: any
Summarized: Works in Windows 2000 or later, Windows 2000 requires MDAC 2.8.
 
[Back to the top of this page]

page last uploaded: 2017-07-26, 14:49