Identifying extended ASCII subset

Discussion in 'C++' started by kristofvdw@matt.es, Nov 7, 2005.

  1. Guest

    Hi,

    I have to process a given text file, but I haven't got a clue which
    extended ASCII set it is using.
    Opening the file in Windows' Notepad or in DOS, all accented letters
    and symbols are wrong.
    Any idea how to identify the subset used?
    Is there some text editor which can cycle easily through all known
    subsets, or even better: cycle through subsets automatically until a
    given test string with some accents and symbols is found?
    If someone knows a solution involving VB, C++, XML or whatever,
    please don't hesitate to share it with me.

    TIA,
    K
     
    , Nov 7, 2005
    #1
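    The "cycle through subsets automatically until a test string is found"
    idea from the post above is straightforward to sketch in Python with the
    standard library alone. This is only a sketch: the candidate list, the
    sample bytes, and the name "Muñoz" are made-up examples, not data from
    the original file.

    ```python
    # Try a list of candidate code pages and keep those under which a known
    # test string (a name we expect in the file) appears after decoding.
    CANDIDATES = ["cp437", "cp850", "cp1252", "latin-1", "iso8859_15", "mac_roman"]

    def guess_codepages(raw: bytes, test_string: str) -> list[str]:
        """Return the candidate encodings under which `raw` decodes
        cleanly and contains `test_string`."""
        hits = []
        for enc in CANDIDATES:
            try:
                text = raw.decode(enc)
            except UnicodeDecodeError:
                continue
            if test_string in text:
                hits.append(enc)
        return hits

    # "Muñoz" encoded as DOS OEM bytes: ñ is 0xA4 in both CP 437 and CP 850.
    sample = b"Mu\xa4oz"
    print(guess_codepages(sample, "Muñoz"))
    ```

    Note that several code pages can survive the test (here CP 437 and
    CP 850 agree on ñ), so a longer test string with rarer accents narrows
    the field better.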

  2. Jim Mack Guest

    wrote:
    > Hi,
    >
    > I have to process a given text file, but I haven't got a clue which
    > extended ASCII set it is using.
    > Opening the file in Windows' Notepad or in DOS, all accented letters
    > and symbols are wrong.
    > Any idea how to identify the subset used?
    > Is there some text editor which can cycle easily through all known
    > subsets, or even better: cycle through subsets automatically until a
    > given test string with some accents and symbols is found?



    If you expect a computer to do this for you, you're probably dreaming. Since the actual character codes don't change, only the visual representations, someone has to look at the result to make a judgement.

    If you have OCR code that will work on a memory bitmap, you could conceivably draw out the characters using a given code page and try to OCR the result, but even then I don't see any way to tell one 'close' result from another.

    What is it you need to do to the text, that requires you to know what the codes represent?

    --

    Jim Mack
    MicroDexterity Inc
    www.microdexterity.com
     
    Jim Mack, Nov 7, 2005
    #2

  3. On Mon, 07 Nov 2005 05:08:37 -0800, kristofvdw wrote:
    > I have to process a given text file, but I haven't got a clue which
    > extended ASCII set it is using.


    Files contain bytes. Bytes are numerical values. There are no ASCII sets
    or extended ASCII sets, as far as files are concerned. It's all in _our_
    minds. To make your program understand and tell one set from another, you
    need to basically *teach* it the same "algorithm" _you_ are using to
    differentiate those sets.

    > [...]


    And avoid cross-posting to too many newsgroups at once. It makes your
    post that much more irrelevant in many newsgroups.

    V
     
    Victor Bazarov, Nov 7, 2005
    #3
  4. In article <>,
    <> wrote:
    >I have to process a given text file, but haven't got a clue which
    >extended ASCII set it is using.
    >Opening the file in Windows' Notepad or in DOS, all accented letters
    >and symbols are wrong.
    >Any idea how to identify the subset used?


    You can get Mozilla's character set guesser:

    http://www.mozilla.org/projects/intl/chardet.html

    There's a Java version too:

    http://jchardet.sourceforge.net/

    -- Richard
     
    Richard Tobin, Nov 7, 2005
    #4
  5. Peter Flynn Guest

    wrote:

    > Hi,
    >
    > I have to process a given text file, but I haven't got a clue which
    > extended ASCII set it is using.
    > Opening the file in Windows' Notepad or in DOS, all accented letters
    > and symbols are wrong.
    > Any idea how to identify the subset used?
    > Is there some text editor which can cycle easily through all known
    > subsets, or even better: cycle through subsets automatically until a
    > given test string with some accents and symbols is found?
    > If someone knows a solution involving VB, C++, XML or whatever,
    > please don't hesitate to share it with me.


    Open the file in a hexadecimal editor, pick some of the characters,
    and use the Unicode charts (www.unicode.org) to identify which
    encoding they belong to.

    Or just ask whoever created it.

    ///Peter
     
    Peter Flynn, Nov 7, 2005
    #5
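    For eyeballing the raw bytes Peter mentions, a hex editor isn't even
    required; a few lines of Python will do. This is a generic sketch (the
    sample bytes are invented, not from the file in question):

    ```python
    # Minimal hex dump: offset, hex bytes, and a printable-ASCII column,
    # so suspicious high bytes (0x80-0xFF) stand out as dots.
    def hexdump(raw: bytes, width: int = 16) -> str:
        lines = []
        for off in range(0, len(raw), width):
            chunk = raw[off:off + width]
            hexpart = " ".join(f"{b:02x}" for b in chunk)
            asciipart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            lines.append(f"{off:08x}  {hexpart:<{width * 3}} {asciipart}")
        return "\n".join(lines)

    print(hexdump(b"Mu\xa4oz, Espa\xa4a"))
    ```

    Once the offending byte values are known (here 0xA4), the Unicode charts
    or a code-page table tell you which encodings map them to plausible
    characters.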
  6. Guest

    Mmm, you're right there; automating would be quite difficult and
    probably even take longer than browsing the sets manually... any tool
    you know of to do so?

    The data are our clients' records, obtained through legacy software.
    Now I'm putting the data into an Oracle DB, but it's impossible to get
    information on which encoding the program uses. Lots of names and
    addresses have accents in them, which we can't afford to lose.
     
    , Nov 8, 2005
    #6
  7. Guest

    Thanks for the suggestion, I'll look into that.
    Unfortunately, the universal_charset_detector isn't built yet, and
    doesn't support rare sets, so I don't have much hope...
     
    , Nov 8, 2005
    #7
  8. Jim Mack Guest

    wrote:
    > Mmm, you're right there; automating would be quite difficult and
    > probably even take longer than browsing the sets manually... any tool
    > you know of to do so?
    >
    > The data are our clients' records, obtained through legacy software.
    > Now I'm putting the data into an Oracle DB, but it's impossible to get
    > information on which encoding the program uses. Lots of names and
    > addresses have accents in them, which we can't afford to lose.


    Do you know for sure that there is more than one character-set encoding in use? And what would you change these to, once you knew what they represented?

    Is this something you have to do just once, or is there a continuing need? For a one-time use, manually cycling through your choices may not be that painful.

    If this is truly an 'extended ASCII' file, which might be a legacy DOS file, you could try an OEM character set. There are several OEM code pages, but CP 437 is the most common. Just viewing the file in an OEM font (like MS Terminal or FoxPrint) will reveal whether this is the case. If it is, then applying the OemToCharBuff API will do the translation into the current code page.

    --
    Jim
     
    Jim Mack, Nov 8, 2005
    #8
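    Outside of VB/Win32, the OemToCharBuff translation Jim describes can be
    approximated with plain codec conversions. A minimal sketch, assuming
    the file really is CP 437 OEM text and the target is the Western "ANSI"
    code page (the sample bytes are made up):

    ```python
    # OEM-to-ANSI by round-tripping through Unicode, as OemToCharBuff does
    # internally: decode the OEM bytes, then re-encode for the ANSI code page.
    oem_bytes = b"Fran\x87ois"          # 0x87 is ç in CP 437
    text = oem_bytes.decode("cp437")    # -> "François"
    ansi_bytes = text.encode("cp1252")  # Windows-1252, the Western ANSI page
    print(text)
    ```

    Substituting "cp850" for "cp437" covers the multinational DOS case
    mentioned later in the thread.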
  9. Guest

    Apparently, the problem is worse than expected.
    As Peter suggested, I took a look at the hex codes.
    I discovered that some apparently extended characters were mapped to
    basic ASCII codes!
    For example, a name with "Ç" (code 199/hex C7) got exported as "G"
    (code 71/hex 47).
    So, when exporting from an apparently extended ASCII set, it used a
    basic ASCII set, folding extended codes down by 128 (for the example:
    199 - 128 = 71).
    What a moron, the programmer who managed to achieve this!

    Thanks all for your contributions, I now have to search for the
    original programmer and kill him...
     
    , Nov 8, 2005
    #9
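    The mangling described in the post above is consistent with the exporter
    stripping each byte's high bit. A sketch of that hypothesis in Python
    (the 199 -> 71 pair is taken from the post; everything else is
    illustration):

    ```python
    def strip_high_bit(raw: bytes) -> bytes:
        """Model of the suspected export bug: drop bit 7 of every byte."""
        return bytes(b & 0x7F for b in raw)

    # "Ç" is 199 (0xC7) in Latin-1; stripping the high bit gives 71, "G".
    assert 199 - 128 == 71
    print(strip_high_bit(b"\xc7"))

    # Coincidentally, 0xC7 is also "G" in EBCDIC (A-I sit at 0xC1-0xC9,
    # exactly 0x80 above their ASCII codes), so a crude EBCDIC-to-ASCII
    # table could produce the same damage:
    assert b"\xc7".decode("cp037") == "G"
    ```

    Either way, the transformation is lossy: both Ç and G come out as G, so
    the accents cannot be recovered mechanically from the exported file --
    only by going back to the source data.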
  10. On Tue, 8 Nov 2005, Jim Mack wrote, seen in comp.text.xml:

    > If this is truly an 'extended ASCII' file, which might be a legacy
    > DOS file, you could try an OEM character set. There are several OEM
    > code pages, but CP 437 is the most common.


    In the USA, perhaps; but CP850 is the DOS codepage for a multinational
    situation, at least in basically Latin-1 usage - and has been for
    quite some time.

    [f'ups proposed]
     
    Alan J. Flavell, Nov 8, 2005
    #10
  11. J French Guest

    On 7 Nov 2005 05:08:37 -0800, wrote:

    >Hi,
    >
    >I have to process a given text file, but haven't got a clue which
    >extended ASCII set it is using.


    The .es in your name is interesting

    How much do you know about where this 'legacy' data came from ?

    Was it Windows, was it DOS ... or maybe something mainframe-ish ?

    What is the 'context' - for example a Turkish directory printed in
    Spain ?
     
    J French, Nov 8, 2005
    #11
  12. Guest

    I suspect the original is from an IBM mainframe in EBCDIC, but we only
    get a flat text file export.
    Additionally, we have a tough time getting through to the original
    programmers, so we have to work with what they provide us...
     
    , Nov 8, 2005
    #12
  13. J French Guest

    On 8 Nov 2005 05:14:52 -0800, wrote:

    >I suspect the original is from an IBM mainframe in EBCDIC, but we only
    >get a flat text file export.
    >Additionally, we have a tough time getting through to the original
    >programmers, so we have to work with what they provide us...


    The original programmers will just mislead you

    - you need to look into 'inferential logic'

    Like re-inventing the rules that make sense of the mess

    BTW - this sounds like a classic case of data transfer sabotage
     
    J French, Nov 8, 2005
    #13
  14. In article <>,
    Jim Mack <> wrote:

    >If you expect a computer to do this for you, you're probably dreaming.
    >Since the actual character codes don't change, only the visual
    >representations, someone has to look at the result to make a judgement.


    It's not that bad. By comparing the frequencies of individual
    characters, and pairs and triples and so on, against those found in
    known documents, it should be possible to achieve good enough accuracy
    for many purposes.

    If the data is really random, not even a human will be able to
    answer the question.

    -- Richard
     
    Richard Tobin, Nov 8, 2005
    #14
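    Richard's frequency idea can be shown in miniature: decode under each
    candidate and prefer the decoding whose non-ASCII characters are
    plausible for the expected language. This toy uses a whitelist of
    Spanish accented letters as the "frequency model" (a real detector, like
    Mozilla's, scores character pairs and triples instead); the sample bytes
    and candidate list are invented for illustration.

    ```python
    # Characters we expect to see in Spanish-language names and addresses.
    PLAUSIBLE = set("áéíóúñüÁÉÍÓÚÑÜ¿¡")

    def score(raw: bytes, enc: str) -> int:
        """Count characters that look plausible under this decoding;
        -1 if the bytes don't decode at all."""
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            return -1
        return sum(1 for ch in text if ch.isascii() or ch in PLAUSIBLE)

    sample = b"Espa\xa4a"   # "España" in CP 437
    best = max(["cp437", "cp1252", "latin-1"], key=lambda e: score(sample, e))
    print(best)
    ```

    With enough text, this kind of statistical scoring separates candidates
    well; as Richard says, if the data is really random no method - human or
    mechanical - can answer the question.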
