How can I auto-detect the encoding of a text file?

Question

2011-06-24 08:07:02 +0000

74

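I have many plain text files encoded in a variety of charsets.

I want to convert them all to UTF-8, but before running iconv I need to know each file's original encoding. Most browsers have an Auto Detect option for encodings, but there are far too many text files to check one by one.

Once the original encoding is known, I can convert the text with iconv -f DETECTED_CHARSET -t utf-8.

Is there a utility that detects the encoding of plain text files? It does not have to be 100% perfect. I don't mind if 100 files out of 1,000,000 are converted incorrectly.

Source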

Xiè Jìléi http://superuser.stackexchange.com/users/19926

Answers (9)

30

2013-06-18 12:44:37 +0000

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. The package description reads:

universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
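
A rough sketch of how the uchardet command-line tool could be wired into the batch conversion the question asks for. It assumes the uchardet binary is on PATH and prints a single encoding name for the file it is given. The helper names detect_with_uchardet, convert_to_utf8 and convert_all are made up for illustration, and the conversion uses Python's own codecs instead of iconv.

```python
import subprocess
import sys
from pathlib import Path


def detect_with_uchardet(path):
    """Ask the uchardet CLI for its best guess (assumes uchardet is installed)."""
    result = subprocess.run(
        ["uchardet", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def convert_to_utf8(src, dst, encoding):
    """Decode src with the given encoding and write the text to dst as UTF-8."""
    text = Path(src).read_bytes().decode(encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


def convert_all(paths):
    """Detect and convert each file, writing the result next to the original."""
    for name in paths:
        guess = detect_with_uchardet(name)
        try:
            convert_to_utf8(name, str(name) + ".utf8", guess)
        except (LookupError, UnicodeDecodeError) as err:
            # Unknown encoding name or a wrong guess: report it and move on.
            print(f"skipped {name} ({guess}): {err}", file=sys.stderr)
```

Calling convert_all(sys.argv[1:]) from a script would process the files named on the command line. Originals are left untouched, so a wrong guess can be redone by hand.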

Source

Xavier http://superuser.stackexchange.com/users/19926

16

2011-06-24 08:38:40 +0000

On Linux it's enca; on Solaris you can use auto_ef.

Source

cularis http://superuser.stackexchange.com/users/19926

2

2013-10-11 16:06:44 +0000

Mozilla has a nice codebase for encoding auto-detection in web pages: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Source

Martin Hennings http://superuser.stackexchange.com/users/19926

2

2018-11-06 15:42:35 +0000

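If you regularly use Emacs, you may find the following useful (it lets you inspect and validate the conversion by hand).

Moreover, I often find that Emacs's charset auto-detection works much better than other charset auto-detection tools (such as chardet).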

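;; Resolve each path to its true name; replace the entries with your files.
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

;; Open each file (Emacs guesses its encoding) and mark the buffer to be
;; saved as UTF-8 with Unix line endings. Buffers are left unsaved so that
;; you can inspect them first.
(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix))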

Then simply invoke Emacs with this script as an argument (see the "-l" option) and the job is done.

Source

Yves Lhuillier http://superuser.stackexchange.com/users/19926

1

2015-10-28 17:34:06 +0000

isutf8 (from the moreutils package) did the job.
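
Note that a validity check like this answers "is this file already UTF-8?", not "which encoding is it?", so it is mainly useful for skipping files that need no conversion. A minimal Python sketch of the same kind of check follows. It is not how isutf8 itself is implemented, and the function name is_utf8 is chosen here only for illustration.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8("öasd".encode("utf-8")))       # True
print(is_utf8("öasd".encode("iso-8859-1")))  # False
```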

Source

Ronan http://superuser.stackexchange.com/users/19926

1

2014-01-23 16:12:16 +0000

Getting back to chardet (python 2.?), this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

It is far from perfect, though:

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}
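
For this particular input the low confidence matters less than it might seem: byte 0xF6 maps to ö in both ISO-8859-1 and windows-1252, so converting with either name yields the same text. A small check, assuming the shell above ran in a UTF-8 locale so that iconv emitted the single byte 0xF6 for ö:

```python
# The bytes that the echo | iconv pipeline is assumed to have produced.
raw = "öasd\n".encode("iso-8859-1")

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("windows-1252")

print(raw)                     # b'\xf6asd\n'
print(as_latin1 == as_cp1252)  # True
```

The two charsets only disagree on bytes 0x80 to 0x9F, so the guess is harmless unless the file contains bytes in that range.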

Source

estani http://superuser.stackexchange.com/users/19926

1

2011-09-03 00:48:04 +0000

UTFCast is worth a try. It didn't work for me (maybe because my files are terrible), but it looks good. http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Source

Sameer http://superuser.stackexchange.com/users/19926

0

2019-07-12 16:39:09 +0000

Also, if file -i reports the charset as unknown, you can use the following php commands, which try to guess it.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

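In the first example, you can see that I pass a list of encodings that might match (detection follows the order of the list). For a more accurate result, you can pass every supported encoding via mb_list_encodings().

Note: mb_* functions require php-mbstring:

apt-get install php-mbstring

See answer: https://stackoverflow.com/a/57010566/3382822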
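
To illustrate why the order of the candidate list matters, here is a simplified Python sketch that returns the first candidate able to decode the bytes without error. It is not a reimplementation of mb_detect_encoding, and the function name first_matching_encoding is invented for this example.

```python
def first_matching_encoding(data: bytes, candidates):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "日本語".encode("euc-jp")

# Strict encodings first, permissive single-byte ones last.
print(first_matching_encoding(sample, ["utf-8", "euc-jp", "shift_jis", "iso-8859-1"]))
# euc-jp

# iso-8859-1 accepts any byte sequence, so listing it first hides the rest.
print(first_matching_encoding(sample, ["iso-8859-1", "utf-8", "euc-jp"]))
# iso-8859-1
```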

Source

Mohamed23gharbi http://superuser.stackexchange.com/users/19926
