chardet package¶
Submodules¶
chardet.big5freq module¶
chardet.big5prober module¶
- class chardet.big5prober.Big5Prober[source]¶
Bases: MultiByteCharSetProber
- property charset_name: str¶
- property language: str¶
chardet.chardetect module¶
chardet.chardistribution module¶
- class chardet.chardistribution.Big5DistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.CharDistributionAnalysis[source]¶
Bases: object
- ENOUGH_DATA_THRESHOLD = 1024¶
- MINIMUM_DATA_THRESHOLD = 3¶
- SURE_NO = 0.01¶
- SURE_YES = 0.99¶
- class chardet.chardistribution.EUCJPDistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.EUCKRDistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.EUCTWDistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.GB2312DistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.JOHABDistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
- class chardet.chardistribution.SJISDistributionAnalysis[source]¶
Bases: CharDistributionAnalysis
chardet.charsetgroupprober module¶
- class chardet.charsetgroupprober.CharSetGroupProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.NONE: 0>)[source]¶
Bases: CharSetProber
- property charset_name: str | None¶
- property language: str | None¶
chardet.charsetprober module¶
- class chardet.charsetprober.CharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.NONE: 0>)[source]¶
Bases: object
- SHORTCUT_THRESHOLD = 0.95¶
- property charset_name: str | None¶
- static filter_international_words(buf: bytes | bytearray) → bytearray [source]¶
We define three types of bytes:

- alphabet: English alphabet characters [a-zA-Z]
- international: international characters [\x80-\xFF]
- marker: everything else [^a-zA-Z\x80-\xFF]

The input buffer can be thought of as containing a series of words delimited by markers. This function filters the buffer, keeping only the words that contain at least one international character. All contiguous sequences of markers are replaced by a single ASCII space character.

This filter applies to all scripts which do not use English characters.
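A minimal sketch of calling this method directly (the exact bytes kept depend on the word pattern in the installed chardet version):

from chardet.charsetprober import CharSetProber

# "caf\xe9" contains the high byte 0xE9 (é in Latin-1), so it is kept;
# "plain" is pure ASCII and is dropped; the comma and surrounding spaces
# are markers and collapse to a single space.
buf = b"plain caf\xe9, text"
print(CharSetProber.filter_international_words(buf))
# expected something like: bytearray(b'caf\xe9 ')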
- property language: str | None¶
- static remove_xml_tags(buf: bytes | bytearray) → bytes [source]¶
Returns a copy of buf that retains only the sequences of English alphabet and high-byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but it is currently only used by Latin1Prober.
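A sketch of the effect on a small XML fragment (the return value may be a bytearray in practice, despite the bytes annotation):

from chardet.charsetprober import CharSetProber

# Markup between < and > is stripped; the visible text survives, with a
# space appended after each run of text that precedes a tag.
buf = b"<p>na\xefve text</p>"
print(CharSetProber.remove_xml_tags(buf))
# expected something like: bytearray(b'na\xefve text ')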
- property state: ProbingState¶
chardet.codingstatemachine module¶
- class chardet.codingstatemachine.CodingStateMachine(sm: dict)[source]¶
Bases: object
A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:

- START state: This is the state to start with, or the state reached after a legal byte sequence (i.e. a valid code point) for a character has been identified.
- ME state: This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that no other possible encoding can contain this byte sequence. This will lead to an immediate positive answer for the detector.
- ERROR state: This indicates that the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding, and the detector will exclude this encoding from consideration from here on.
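A sketch of driving one of these machines by hand, using the UTF-8 model shipped in chardet.mbcssm (names as in recent chardet releases):

from chardet.codingstatemachine import CodingStateMachine
from chardet.enums import MachineState
from chardet.mbcssm import UTF8_SM_MODEL

# Feed the two-byte UTF-8 encoding of é, one byte at a time. A legal
# sequence should end back in the START state; an illegal byte would
# drive the machine to MachineState.ERROR instead.
sm = CodingStateMachine(UTF8_SM_MODEL)
for byte in b"\xc3\xa9":
    state = sm.next_state(byte)
print(state == MachineState.START)  # expected: True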
- property language: str¶
chardet.compat module¶
chardet.constants module¶
chardet.cp949prober module¶
- class chardet.cp949prober.CP949Prober[source]¶
Bases: MultiByteCharSetProber
- property charset_name: str¶
- property language: str¶
chardet.escprober module¶
- class chardet.escprober.EscCharSetProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.NONE: 0>)[source]¶
Bases: CharSetProber
This CharSetProber uses a “code scheme” approach for detecting encodings, whereby easily recognizable escape or shift sequences are relied on to identify these encodings.
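For example, ISO-2022-JP announces itself with escape sequences, so it can be recognized without any statistical analysis (a sketch; exact confidence values vary by release):

import chardet

# ESC $ B shifts into JIS X 0208 and ESC ( B shifts back to ASCII;
# the payload between them is "こんにちは" in JIS encoding.
data = b"\x1b$B$3$s$K$A$O\x1b(B"
print(chardet.detect(data))
# expected something like: {'encoding': 'ISO-2022-JP', 'confidence': 0.99, 'language': 'Japanese'}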
- property charset_name: str | None¶
- property language: str | None¶
chardet.escsm module¶
chardet.eucjpprober module¶
chardet.euckrfreq module¶
chardet.euckrprober module¶
- class chardet.euckrprober.EUCKRProber[source]¶
Bases: MultiByteCharSetProber
- property charset_name: str¶
- property language: str¶
chardet.euctwfreq module¶
chardet.euctwprober module¶
- class chardet.euctwprober.EUCTWProber[source]¶
Bases: MultiByteCharSetProber
- property charset_name: str¶
- property language: str¶
chardet.gb2312freq module¶
chardet.gb2312prober module¶
- class chardet.gb2312prober.GB2312Prober[source]¶
Bases: MultiByteCharSetProber
- property charset_name: str¶
- property language: str¶
chardet.hebrewprober module¶
- class chardet.hebrewprober.HebrewProber[source]¶
Bases: CharSetProber
- FINAL_KAF = 234¶
- FINAL_MEM = 237¶
- FINAL_NUN = 239¶
- FINAL_PE = 243¶
- FINAL_TSADI = 245¶
- LOGICAL_HEBREW_NAME = 'windows-1255'¶
- MIN_FINAL_CHAR_DISTANCE = 5¶
- MIN_MODEL_DISTANCE = 0.01¶
- NORMAL_KAF = 235¶
- NORMAL_MEM = 238¶
- NORMAL_NUN = 240¶
- NORMAL_PE = 244¶
- NORMAL_TSADI = 246¶
- SPACE = 32¶
- VISUAL_HEBREW_NAME = 'ISO-8859-8'¶
- property charset_name: str¶
- property language: str¶
- set_model_probers(logical_prober: SingleByteCharSetProber, visual_prober: SingleByteCharSetProber) → None [source]¶
- property state: ProbingState¶
chardet.jisfreq module¶
chardet.jpcntx module¶
- class chardet.jpcntx.EUCJPContextAnalysis[source]¶
Bases: JapaneseContextAnalysis
chardet.langbulgarianmodel module¶
chardet.langcyrillicmodel module¶
chardet.langgreekmodel module¶
chardet.langhebrewmodel module¶
chardet.langhungarianmodel module¶
chardet.langthaimodel module¶
chardet.latin1prober module¶
chardet.mbcharsetprober module¶
chardet.mbcsgroupprober module¶
- class chardet.mbcsgroupprober.MBCSGroupProber(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.NONE: 0>)[source]¶
Bases: CharSetGroupProber
chardet.mbcssm module¶
chardet.sbcharsetprober module¶
- class chardet.sbcharsetprober.SingleByteCharSetModel(charset_name, language, char_to_order_map, language_model, typical_positive_ratio, keep_ascii_letters, alphabet)[source]¶
Bases: NamedTuple
- alphabet: str¶
Alias for field number 6
- char_to_order_map: Dict[int, int]¶
Alias for field number 2
- charset_name: str¶
Alias for field number 0
- keep_ascii_letters: bool¶
Alias for field number 5
- language: str¶
Alias for field number 1
- language_model: Dict[int, Dict[int, int]]¶
Alias for field number 3
- typical_positive_ratio: float¶
Alias for field number 4
- class chardet.sbcharsetprober.SingleByteCharSetProber(model: SingleByteCharSetModel, is_reversed: bool = False, name_prober: CharSetProber | None = None)[source]¶
Bases: CharSetProber
- NEGATIVE_SHORTCUT_THRESHOLD = 0.05¶
- POSITIVE_SHORTCUT_THRESHOLD = 0.95¶
- SAMPLE_SIZE = 64¶
- SB_ENOUGH_REL_THRESHOLD = 1024¶
- property charset_name: str | None¶
- property language: str | None¶
chardet.sbcsgroupprober module¶
- class chardet.sbcsgroupprober.SBCSGroupProber[source]¶
Bases: CharSetGroupProber
chardet.sjisprober module¶
chardet.universaldetector module¶
Module containing the UniversalDetector detector class, which is the primary class a user of chardet should use.
- author: Mark Pilgrim (initial port to Python)
- author: Shy Shalom (original C code)
- author: Dan Blanchard (major refactoring for 3.0)
- author: Ian Cordasco
- class chardet.universaldetector.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool = False)[source]¶
Bases: object
The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

u = UniversalDetector()
u.feed(some_bytes)
u.close()
detected = u.result
- ESC_DETECTOR = re.compile(b'(\x1b|~{)')¶
- HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')¶
- ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}¶
- LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'gb2312': 'GB18030', 'iso-8859-1': 'Windows-1252', 'iso-8859-9': 'Windows-1254', 'tis-620': 'ISO-8859-11', 'utf-16le': 'UTF-16'}¶
- MINIMUM_THRESHOLD = 0.2¶
- WIN_BYTE_DETECTOR = re.compile(b'[\x80-\x9f]')¶
- property charset_probers: List[CharSetProber]¶
- close() → dict [source]¶
Stop analyzing the current document and come up with a final prediction.
- Returns: The result attribute, a dict with the keys encoding, confidence, and language.
- feed(byte_str: bytes | bytearray) → None [source]¶
Takes a chunk of a document and feeds it through all of the relevant charset probers.
After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note: You should always call close when you're done feeding in your document if done is not already True.
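A sketch of the incremental pattern this enables (the file name is a placeholder):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# Feed the document in chunks and stop as soon as the detector is sure.
with open("document.txt", "rb") as handle:  # hypothetical input file
    for chunk in iter(lambda: handle.read(4096), b""):
        detector.feed(chunk)
        if detector.done:
            break
detector.close()
print(detector.result)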
- property has_win_bytes: bool¶
- property input_state: int¶
chardet.utf8prober module¶
Module contents¶
- class chardet.UniversalDetector(lang_filter: ~chardet.enums.LanguageFilter = <LanguageFilter.ALL: 31>, should_rename_legacy: bool = False)[source]¶
Bases: object
The UniversalDetector class underlies the chardet.detect function and coordinates all of the different charset probers.

To get a dict containing an encoding and its confidence, you can simply run:

u = UniversalDetector()
u.feed(some_bytes)
u.close()
detected = u.result
- ESC_DETECTOR = re.compile(b'(\x1b|~{)')¶
- HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xff]')¶
- ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252', 'iso-8859-13': 'Windows-1257', 'iso-8859-2': 'Windows-1250', 'iso-8859-5': 'Windows-1251', 'iso-8859-6': 'Windows-1256', 'iso-8859-7': 'Windows-1253', 'iso-8859-8': 'Windows-1255', 'iso-8859-9': 'Windows-1254'}¶
- LEGACY_MAP = {'ascii': 'Windows-1252', 'euc-kr': 'CP949', 'gb2312': 'GB18030', 'iso-8859-1': 'Windows-1252', 'iso-8859-9': 'Windows-1254', 'tis-620': 'ISO-8859-11', 'utf-16le': 'UTF-16'}¶
- MINIMUM_THRESHOLD = 0.2¶
- WIN_BYTE_DETECTOR = re.compile(b'[\x80-\x9f]')¶
- property charset_probers: List[CharSetProber]¶
- close() → dict [source]¶
Stop analyzing the current document and come up with a final prediction.
- Returns: The result attribute, a dict with the keys encoding, confidence, and language.
- feed(byte_str: bytes | bytearray) → None [source]¶
Takes a chunk of a document and feeds it through all of the relevant charset probers.
After calling feed, you can check the value of the done attribute to see if you need to continue feeding the UniversalDetector more data, or if it has made a prediction (in the result attribute).

Note: You should always call close when you're done feeding in your document if done is not already True.
- property has_win_bytes: bool¶
- property input_state: int¶
- reset() → None [source]¶
Reset the UniversalDetector and all of its probers back to their initial states. This is called by __init__, so you only need to call it directly in between analyses of different documents.
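A sketch of reusing a single detector across documents:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for raw in (b"plain ascii", b"caf\xe9 au lait"):
    detector.reset()  # clear state left over from the previous document
    detector.feed(raw)
    detector.close()
    print(detector.result)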
- result: dict¶
- chardet.detect(byte_str: bytes | bytearray, should_rename_legacy: bool = False) → dict [source]¶
Detect the encoding of the given byte string.
- Parameters:
byte_str (bytes or bytearray) – The byte sequence to examine.
should_rename_legacy (bool) – Should we rename legacy encodings to their more modern equivalents?
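A quick usage sketch (the reported confidence is illustrative):

import chardet

# Cyrillic text encoded as windows-1251 rather than UTF-8.
data = "Здравствуйте, это пример текста на русском языке.".encode("windows-1251")
print(chardet.detect(data))
# expected something like: {'encoding': 'windows-1251', 'confidence': 0.9, 'language': 'Russian'}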
- chardet.detect_all(byte_str: bytes | bytearray, ignore_threshold: bool = False, should_rename_legacy: bool = False) → List[dict] [source]¶
Detect all the possible encodings of the given byte string.
- Parameters:
byte_str (bytes or bytearray) – The byte sequence to examine.
ignore_threshold (bool) – Include encodings that are below UniversalDetector.MINIMUM_THRESHOLD in results.
should_rename_legacy (bool) – Should we rename legacy encodings to their more modern equivalents?
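And a sketch of listing every candidate, including those below the confidence threshold:

import chardet

data = "Texte accentué : café, déjà vu, entêté.".encode("latin-1")
for candidate in chardet.detect_all(data, ignore_threshold=True):
    print(candidate)  # each dict has encoding, confidence, and language keys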