Document identification

Discussion in 'Ruby' started by M. Eteum, Jun 1, 2005.

  1. M. Eteum

    Dear Ruby Guru:
    Is there a way to identify a document from its header? I have a
    bunch of documents collected over the years from multiple platforms
    (Mac, Windows, and various Unix/Linux variants), and some of them
    do not have a file extension. Is there a list that tells us what header
    to expect for certain document types, e.g. txt, rtf, pdf, jpg, mpg,
    Word, Excel, Visio, etc.?

    Thanks
    M. Eteum, Jun 1, 2005
    #1

  2. M. Eteum wrote:
    > Is there a way to identify any documents from its header? I have a
    > bunch of document collected over the year from multi platform system,
    > Mac, Windows, and various unix/linux variant where some of the document
    > does not have file extension. Are there a list that tells us what header
    > should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    > word, excel, visio, etc ...


    Hi,

    On a Unix system you could use the "file" command; it can detect
    file types even when there's no extension.
    I don't know whether a Ruby module exists for this purpose, though.

    Regards,
    Robin
    Robin Stocker, Jun 1, 2005
    #2

  3. On 6/1/05, Robin Stocker <> wrote:
    > M. Eteum wrote:
    > > Is there a way to identify any documents from its header? I have a
    > > bunch of document collected over the year from multi platform system,
    > > Mac, Windows, and various unix/linux variant where some of the document
    > > does not have file extension. Are there a list that tells us what header
    > > should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    > > word, excel, visio, etc ...

    > On a Unix system you could use the "file" command, it is able to detect
    > file types even when there's no extension.
    > I don't know if a Ruby module exists for this purpose though.


    Not yet. ;) I do plan on adding it to MIME::Types in the future.

    -austin
    Austin Ziegler, Jun 1, 2005
    #3
  4. M. Eteum

    Robin Stocker wrote:
    > M. Eteum wrote:
    >
    >> Is there a way to identify any documents from its header? I have
    >> a bunch of document collected over the year from multi platform
    >> system, Mac, Windows, and various unix/linux variant where some of the
    >> document does not have file extension. Are there a list that tells us
    >> what header should we expect for certain documents e.g. txt, rtf, pdf,
    >> jpg, mpg, word, excel, visio, etc ...

    >
    >
    > Hi,
    >
    > On a Unix system you could use the "file" command, it is able to detect
    > file types even when there's no extension.
    > I don't know if a Ruby module exists for this purpose though.
    >
    > Regards,
    > Robin
    >
    >

    Thanks for the reply.

    I'm running on Windows as well as Mac, and we exchange files between
    the two OSes. A Ruby module that handles this would have been nice, but
    I'll take anything for now.

    Thanks again
    M. Eteum, Jun 1, 2005
    #4
  5. M. Eteum

    Austin Ziegler wrote:
    > On 6/1/05, Robin Stocker <> wrote:
    >
    >>M. Eteum wrote:
    >>
    >>> Is there a way to identify any documents from its header? I have a
    >>>bunch of document collected over the year from multi platform system,
    >>>Mac, Windows, and various unix/linux variant where some of the document
    >>>does not have file extension. Are there a list that tells us what header
    >>>should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    >>>word, excel, visio, etc ...

    >>
    >>On a Unix system you could use the "file" command, it is able to detect
    >>file types even when there's no extension.
    >>I don't know if a Ruby module exists for this purpose though.

    >
    >
    > Not yet. ;) I do plan on adding it to MIME::Types in the future.
    >
    > -austin


    Super! Oh by the way, do you know if Perl or Python has it? I'm quite
    desperate to find the solution, therefore I'll take any solution while
    waiting for the Ruby modules.

    Thanks
    M. Eteum, Jun 1, 2005
    #5
  6. On Wed, 2005-06-01 at 19:00, M. Eteum wrote:
    > Dear Ruby Guru:
    > Is there a way to identify any documents from its header? I have a
    > bunch of document collected over the year from multi platform system,
    > Mac, Windows, and various unix/linux variant where some of the document
    > does not have file extension. Are there a list that tells us what header
    > should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    > word, excel, visio, etc ...
    >
    > Thanks


    Hello,

    If you have shared-mime-info database installed
    ( http://freedesktop.org/wiki/Software_2fshared_2dmime_2dinfo )
    you can use this: http://www.code-monkey.de/projects/mimeInfoRb.html
    Or my extended version: http://dark.fhtr.org/mime_info_rb.tar.gz

    From the README:


    MimeInfo class provides an interface to query freedesktop.org's
    shared-mime-info database. It can be used to guess a filename's
    Mimetype and to get the description for the Mimetype.

    require 'mime_info'

    info = MimeInfo.get('foo.xml') #=> Mimetype['text/xml']
    info.description
    #=> "eXtensible Markup Language document"
    info.description("de") #=> "XML-Dokument"

    info2 = MimeInfo.get('foo.rb') #=> Mimetype['application/x-ruby']
    info2.description #=> "Ruby script"
    info2.is_a? Mimetype['text/plain'] #=> true

    t = Mimetype['audio/x-mp3'] #=> Mimetype['audio/x-mp3']
    t.description #=> "MP3 audio"
    t.description('cy') #=> "Sain MP3"
    t.descriptions['fr'] #=> "audio MP3"
    t == Mimetype['audio']['x-mp3'] #=> true
    t.is_a? Mimetype['audio'] #=> true
    t.ancestors #=> [Mimetype['audio/x-mp3'], Mimetype['audio'],
    # Mimetype['application/octet-stream'], Mimetype,
    # Module, Object, Kernel]


    HTH,

    Ilmari
    Ilmari Heikkinen, Jun 1, 2005
    #6
  7. On 6/1/05, Ilmari Heikkinen <> wrote:
    > On Wed, 2005-06-01 at 19:00, M. Eteum wrote:
    > > Dear Ruby Guru:
    > > Is there a way to identify any documents from its header? I have a
    > > bunch of document collected over the year from multi platform system,
    > > Mac, Windows, and various unix/linux variant where some of the document
    > > does not have file extension. Are there a list that tells us what header
    > > should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    > > word, excel, visio, etc ...
    > >
    > > Thanks


    Most of this is covered by MIME::Types on RubyForge. However, the OP
    indicated that the problem was related to NOT having proper filename
    extensions. The OP wants to look for magic numbers and strings.
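
    As a rough illustration of the magic-number approach, here is a minimal
    Ruby sketch. It is not part of MIME::Types; the MAGIC_SIGNATURES table and
    the sniff_type method are hypothetical names, and the table covers only a
    handful of well-known leading byte sequences.

```ruby
# Hypothetical, deliberately incomplete table: leading bytes => label.
MAGIC_SIGNATURES = {
  "%PDF-".b                            => "pdf",
  "{\\rtf".b                           => "rtf",
  "\xFF\xD8\xFF".b                     => "jpg",
  "\x00\x00\x01\xBA".b                 => "mpg (MPEG program stream)",
  "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b => "ole2 (legacy Word/Excel/Visio container)"
}.freeze

# Read the first bytes of the file in binary mode and compare them
# against each known signature. Returns a label, or nil if nothing matches.
def sniff_type(path)
  header = File.binread(path, 16) || "".b # nil for an empty file
  MAGIC_SIGNATURES.each do |magic, label|
    return label if header.start_with?(magic)
  end
  nil
end
```

    Plain text has no signature, so such files fall through to nil, and the
    OLE2 container alone cannot tell Word, Excel, and Visio apart; a real
    detector needs to look deeper than the first few bytes.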

    -austin
    Austin Ziegler, Jun 1, 2005
    #7
  8. On Wed, 2005-06-01 at 23:33, Austin Ziegler wrote:
    > On 6/1/05, Ilmari Heikkinen <> wrote:
    > > On Wed, 2005-06-01 at 19:00, M. Eteum wrote:
    > > > Dear Ruby Guru:
    > > > Is there a way to identify any documents from its header? I have a
    > > > bunch of document collected over the year from multi platform system,
    > > > Mac, Windows, and various unix/linux variant where some of the document
    > > > does not have file extension. Are there a list that tells us what header
    > > > should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    > > > word, excel, visio, etc ...
    > > >
    > > > Thanks

    >
    > Most of this is covered by MIME::Types on RubyForge. However, the OP
    > indicated that the problem was related to NOT having proper filename
    > extensions. The OP wants to look for magic numbers and strings.
    >


    Shared-mime-info does this as well, though it may fare worse than file in
    some cases.

    kig@bauhaus:~$ mv fire.avi fire
    kig@bauhaus:~$ irb
    irb(main):001:0> require 'mime_info'
    => true
    irb(main):002:0> MimeInfo.get('fire')
    => Mimetype['video/x-msvideo']
    Ilmari Heikkinen, Jun 1, 2005
    #8
  9. M. Eteum <> wrote:
    >
    > Super! Oh by the way, do you know if Perl or Python has it? I'm quite
    > desperate to find the solution, therefore I'll take any solution while
    > waiting for the Ruby modules.


    Your best bet would be to find a Windows port of Unix's 'file' (Mac OS X
    is definitely bound to have it). Sadly, it's a very hard thing to google
    for :)

    martin
    Martin DeMello, Jun 2, 2005
    #9
  10. Martin DeMello <> wrote:
    > M. Eteum <> wrote:
    > >
    > > Super! Oh by the way, do you know if Perl or Python has it? I'm quite
    > > desperate to find the solution, therefore I'll take any solution while
    > > waiting for the Ruby modules.

    >
    > Your best bet would be to find a windows port of unix's 'file' (Mac OSX
    > is definitely bound to have it). Sadly, it's a very hard thing to google
    > for :)


    You're in luck - gnuwin32 includes a port of file.

    http://gnuwin32.sourceforge.net/summary.html

    All you need to do is a = `file.exe #{filename}`
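
    A slightly more defensive variant of the backtick call, as a sketch only:
    it passes the arguments to the program directly instead of through a
    shell, so a filename containing spaces or quotes does not break the
    command. The method name file_mime_type and the command: keyword are
    hypothetical, and it assumes the installed file binary accepts the
    --brief and --mime-type options (older ports may not).

```ruby
require "open3"

# Ask an external `file` binary for the MIME type of +path+.
# Returns the trimmed output, or nil if the command is missing or fails.
def file_mime_type(path, command: "file")
  # Separate arguments mean no shell is involved, so no quoting is needed.
  output, status = Open3.capture2(command, "--brief", "--mime-type", path)
  status.success? ? output.strip : nil
rescue Errno::ENOENT
  nil # the binary was not found on PATH
end
```

    On Windows with gnuwin32 you would presumably pass command: "file.exe"
    or a full path to the executable.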

    martin
    Martin DeMello, Jun 2, 2005
    #10
  11. M. Eteum

    Ilmari Heikkinen wrote:
    > On Wed, 2005-06-01 at 23:33, Austin Ziegler wrote:
    >
    >>On 6/1/05, Ilmari Heikkinen <> wrote:
    >>
    >>>On Wed, 2005-06-01 at 19:00, M. Eteum wrote:
    >>>
    >>>>Dear Ruby Guru:
    >>>> Is there a way to identify any documents from its header? I have a
    >>>>bunch of document collected over the year from multi platform system,
    >>>>Mac, Windows, and various unix/linux variant where some of the document
    >>>>does not have file extension. Are there a list that tells us what header
    >>>>should we expect for certain documents e.g. txt, rtf, pdf, jpg, mpg,
    >>>>word, excel, visio, etc ...
    >>>>
    >>>>Thanks

    >>
    >>Most of this is covered by MIME::Types on RubyForge. However, the OP
    >>indicated that the problem was related to NOT having proper filename
    >>extensions. The OP wants to look for magic numbers and strings.
    >>

    >
    >
    > Shared-mime-info does this aswell. Though it may fare worse than file in
    > some cases.
    >
    > kig@bauhaus:~$ mv fire.avi fire
    > kig@bauhaus:~$ irb
    > irb(main):001:0> require 'mime_info'
    > => true
    > irb(main):002:0> MimeInfo.get('fire')
    > => Mimetype['video/x-msvideo']
    >
    >
    >
    >

    Thanks, but where do you get the mime_info.rb? I'm running "ruby 1.8.2
    (2004-12-25) [i386-mswin32]" and it seems it does not have the necessary
    files.

    Thanks
    M. Eteum, Jun 2, 2005
    #11
