statistical software and data transformation?

Discussion in 'Ruby' started by Robert, Sep 19, 2007.

  1. Robert

    Robert Guest

    How many ruby-ists have to do statistical analysis or data cleaning
    prior to analysis?

    Is it not something that is done often by web developers?

    What well-known software is out there for statistical analysis or
    data transformation that is open source, or at least free of charge?
    I mean besides R; I think I understand what R's strengths and
    limitations are.

    There are a number of applications at
    http://directory.fsf.org/math/stats/

    but I do not know how mature they are (except for the one I submitted
    (vilno)).

    Is there currently a successful project for incredibly user-friendly
    open source statistical software, usually using a GUI, to compete with
    SAS (JMP) or SPSS? (R is more for research statistics, with a tough
    learning curve.)

    Appreciate your feedback,

    Robert
    Robert, Sep 19, 2007
    #1

  2. Robert wrote:
    > How many ruby-ists have to do statistical analysis or data cleaning
    > prior to analysis?
    >
    > Is it not something that is done often by web developers?
    >
    > What is the well known software out there for statistical software or
    > data transformation software? That is open source, or at least free of
    > charge? I mean besides R, I think I understand what R's strengths and
    > limitations are.
    >
    > There is a number of applications at
    > http://directory.fsf.org/math/stats/
    >
    > but I do not know how mature they are (except for the one I submitted
    > (vilno)).
    >
    > Is there currently a successful project for incredibly user-friendly
    > open source statistical software, usually using a GUI, to compete with
    > SAS (JMP) or SPSS? ( R is more for research statistics, with a tough
    > learning curve.).
    >
    > Appreciate your feedback,
    >
    > Robert


    I do a lot of data cleaning/pre-processing. Most of it is numerical data
    rather than more "traditional" business data mining like
    name/address/zip code stuff. My main current modus operandi is

    1. Do the data extraction in Perl. I'd use Ruby, but
    a) I learned Perl years ago and just learned Ruby about a year ago
    b) There are no other Ruby programmers around for backup.

    2. Load the extracted data into a PostgreSQL database. I used to use
    Access, then migrated to SQL Server, and now I'm on PostgreSQL.

    3. Do SQL queries for the easy stuff and R (via RODBC) for the fancy stuff.
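
As a rough illustration of steps 1 and 2 above, here is a minimal Ruby sketch of the extraction stage. The log format, the `parse_line` helper, and the column names are all hypothetical (the post does not specify any of them). The point is only to show messy numeric text being reduced to clean CSV rows that a bulk loader such as PostgreSQL's COPY could then ingest.

```ruby
require "csv"

# Hypothetical raw input: timestamp, host, and a numeric reading.
# Blank lines and non-numeric readings are the "dirt" to clean out.
RAW = <<~LOG
  2007-09-19 21:00:01 web01 cpu=12.5
  2007-09-19 21:00:02 web02 cpu=n/a

  2007-09-19 21:00:03 web01 cpu=  48
LOG

# Assumed line layout; a real extraction would match the real source format.
LINE = /\A(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) cpu=\s*(\d+(?:\.\d+)?)\s*\z/

# Return [timestamp, host, Float] for a parseable line, nil otherwise.
def parse_line(line)
  m = LINE.match(line.chomp)
  m && [m[1], m[2], m[3].to_f]
end

# Keep only the lines that parsed cleanly.
rows = RAW.each_line.map { |l| parse_line(l) }.compact

# Emit CSV with a header row, ready for a database bulk load.
csv_text = CSV.generate do |csv|
  csv << %w[sampled_at host cpu_pct]
  rows.each { |r| csv << r }
end

puts csv_text
```

Rejected lines are silently dropped here; in practice one would probably log them somewhere so the cleaning step can be audited.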

    Mind you, I've been doing this with minor alterations in the tools for
    something like 15 years, so I haven't really dug into the way other
    folks do it. But there are starting to be projects, both open source and
    commercial, in the so-called ETL (Extract, Transform, Load) arena, that
    promise to revolutionize this type of work. One name that sticks in my
    mind in open source is Pentaho, but I have not had a chance to check it
    out. Most of the big ETL products are Java-based, IIRC.

    As for the learning curve of R, there *are* a few GUI front-ends that
    take some of the sting out of it, but the basic underlying *philosophy*
    of R is that it *is* a language (and a damn good one!) for
    scientific/statistical/graphical computing. The GUI builders expect you
    to start with the GUI and learn the language, rather than continue using
    the GUI like you would Excel, Minitab, or some of the other packages.
    That said, the most complete and user-friendly is probably R Commander
    (Rcmdr), which works on both Windows and Linux R.

    This is something I'd like to see built in Rails -- you've got the RDBMS
    back ends, the AJAX and MVC GUI tools, the ORM, etc. There is an
    interface to R from Ruby, but IIRC the bridge logic between the two
    languages currently only works on Linux -- there's no way yet for a
    Windows Ruby program to hook up with the R DLL. There are some R DCOM
    interfaces, though -- that might be the way to do it on a Windows machine.

    By the way, I think the Windows R UI is *far* superior to the one on
    Linux. The Linux version hasn't changed substantially from its origin --
    it's a simple xterm -- X windows application.
    M. Edward (Ed) Borasky, Sep 20, 2007
    #2

  3. On Sep 19, 2007, at 9:21 PM, M. Edward (Ed) Borasky wrote:

    > Robert wrote:
    >> How many ruby-ists have to do statistical analysis or data cleaning
    >> prior to analysis?
    >>
    >> Is it not something that is done often by web developers?
    >>
    >> What is the well known software out there for statistical software or
    >> data transformation software? That is open source, or at least free of
    >> charge? I mean besides R, I think I understand what R's strengths and
    >> limitations are.
    >>
    >> There is a number of applications at
    >> http://directory.fsf.org/math/stats/
    >>
    >> but I do not know how mature they are (except for the one I submitted
    >> (vilno)).
    >>
    >> Is there currently a successful project for incredibly user-friendly
    >> open source statistical software, usually using a GUI, to compete with
    >> SAS (JMP) or SPSS? ( R is more for research statistics, with a tough
    >> learning curve.).
    >>
    >> Appreciate your feedback,
    >>
    >> Robert

    >
    > I do a lot of data cleaning/pre-processing. Most of it is numerical data
    > rather than more "traditional" business data mining like
    > name/address/zip code stuff. My main current modus operandi is
    >
    > 1. Do the data extraction in Perl. I'd use Ruby, but
    > a) I learned Perl years ago and just learned Ruby about a year ago
    > b) There are no other Ruby programmers around for backup.


    We're all hurt, Ed. You know how we enjoy those, "Help me extract
    this data with a one-liner…" posts. ;)

    James Edward Gray II
    James Edward Gray II, Sep 20, 2007
    #3
  4. On 9/20/07, M. Edward (Ed) Borasky wrote:
    >
    > Mind you, I've been doing this with minor alterations in the tools for
    > something like 15 years, so I haven't really dug into the way other
    > folks do it. But there are starting to be projects, both open source and
    > commercial, in the so-called ETL (Extract, Transform, Load) arena, that
    > promise to revolutionize this type of work. One name that sticks in my
    > mind in open source is Pentaho, but I have not had a chance to check it
    > out. Most of the big ETL products are Java-based, IIRC.


    On the Ruby front there may be ActiveWarehouse :

    http://activewarehouse.rubyforge.org/etl/

    I haven't had a chance to play with it. It seems a bit Rails-focused.
    Level of activity is high.


    Best regards,

    --
    John Mettraux -///- http://jmettraux.openwfe.org
    John Mettraux, Sep 20, 2007
    #4
  5. James Edward Gray II wrote:
    > On Sep 19, 2007, at 9:21 PM, M. Edward (Ed) Borasky wrote:
    >
    >> Robert wrote:
    >>> How many ruby-ists have to do statistical analysis or data cleaning
    >>> prior to analysis?
    >>>
    >>> Is it not something that is done often by web developers?
    >>>
    >>> What is the well known software out there for statistical software or
    >>> data transformation software? That is open source, or at least free of
    >>> charge? I mean besides R, I think I understand what R's strengths and
    >>> limitations are.
    >>>
    >>> There is a number of applications at
    >>> http://directory.fsf.org/math/stats/
    >>>
    >>> but I do not know how mature they are (except for the one I submitted
    >>> (vilno)).
    >>>
    >>> Is there currently a successful project for incredibly user-friendly
    >>> open source statistical software, usually using a GUI, to compete with
    >>> SAS (JMP) or SPSS? ( R is more for research statistics, with a tough
    >>> learning curve.).
    >>>
    >>> Appreciate your feedback,
    >>>
    >>> Robert

    >>
    >> I do a lot of data cleaning/pre-processing. Most of it is numerical data
    >> rather than more "traditional" business data mining like
    >> name/address/zip code stuff. My main current modus operandi is
    >>
    >> 1. Do the data extraction in Perl. I'd use Ruby, but
    >> a) I learned Perl years ago and just learned Ruby about a year ago
    >> b) There are no other Ruby programmers around for backup.

    >
    > We're all hurt Ed. You know how we enjoy those, "Help me extract this
    > data with a one-liner…" posts. ;)
    >
    > James Edward Gray II
    >

    Hey, I *started* with "nawk" ;)
    M. Edward (Ed) Borasky, Sep 20, 2007
    #5
  6. John Mettraux wrote:
    > On 9/20/07, M. Edward (Ed) Borasky wrote:
    >> Mind you, I've been doing this with minor alterations in the tools for
    >> something like 15 years, so I haven't really dug into the way other
    >> folks do it. But there are starting to be projects, both open source and
    >> commercial, in the so-called ETL (Extract, Transform, Load) arena, that
    >> promise to revolutionize this type of work. One name that sticks in my
    >> mind in open source is Pentaho, but I have not had a chance to check it
    >> out. Most of the big ETL products are Java-based, IIRC.

    >
    > On the Ruby front there may be ActiveWarehouse :
    >
    > http://activewarehouse.rubyforge.org/etl/
    >
    > I haven't had a chance to play with it. It seems a bit Rails-focused.
    > Level of activity is high.
    >
    >
    > Best regards,
    >

    Yeah ... I've seen that too. Then again, when it comes to databases and
    Ruby, what *isn't* Rails-focused?

    Well ... Nitro ... Iowa ... etc. ... :) I'm playing with Og at the
    moment, but not with big datasets.
    M. Edward (Ed) Borasky, Sep 20, 2007
    #6
  7. Anthony Eden

    Anthony Eden Guest

    On 9/20/07, John Mettraux wrote:
    > On 9/20/07, M. Edward (Ed) Borasky wrote:
    > >
    > > Mind you, I've been doing this with minor alterations in the tools for
    > > something like 15 years, so I haven't really dug into the way other
    > > folks do it. But there are starting to be projects, both open source and
    > > commercial, in the so-called ETL (Extract, Transform, Load) arena, that
    > > promise to revolutionize this type of work. One name that sticks in my
    > > mind in open source is Pentaho, but I have not had a chance to check it
    > > out. Most of the big ETL products are Java-based, IIRC.

    >
    > On the Ruby front there may be ActiveWarehouse :
    >
    > http://activewarehouse.rubyforge.org/etl/
    >
    > I haven't had a chance to play with it. It seems a bit Rails-focused.
    > Level of activity is high.


    FWIW, ActiveWarehouse has a Rails plugin on one side but it also has
    an ETL Gem called, not surprisingly, ActiveWarehouse ETL. The
    documentation is available here:

    http://activewarehouse.rubyforge.org/docs/activewarehouse-etl.html

    We (the contributors) have worked hard to make something that is
    pretty easy to extend and that attempts to be idiomatic Ruby as much
    as possible. Take a look and feel free to join the ActiveWarehouse
    discussion list.

    V/r
    Anthony

    Anthony Eden, Sep 20, 2007
    #7
  8. Alex Fenton

    Alex Fenton Guest

    Robert wrote:
    > How many ruby-ists have to do statistical analysis or data cleaning
    > prior to analysis?


    I use Ruby quite a lot at work for data cleaning, transformation and
    also for generating SPSS syntax. For example, I used it to create a long
    set of commands for linking together waves of the longitudinal British
    Household Panel Study.

    > What is the well known software out there for statistical software or
    > data transformation software?


    Possibly R Commander, which is a Tk interface onto R:
    http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

    I haven't ever used it myself, though; it seems to have a good feature
    set but is missing some things I use in SPSS, e.g. probit models.

    > Is there currently a successful project for incredibly user-friendly
    > open source statistical software, usually using a GUI, to compete with
    > SAS (JMP) or SPSS? ( R is more for research statistics, with a tough
    > learning curve.).


    Not that I know of. I agree re R - on numerous attempts I've never
    managed to get anywhere with it (I have 8 years programming experience
    and a postgrad in Research Methods). It also seems much more geared to
    the needs of natural rather than social science.

    There are things I don't like about SPSS too, apart from price - some
    interface aspects, and its syntax. I've written GUI software in Ruby for
    qualitative data analysis, but my inclination to create a competitor to
    SPSS on the quant side (e.g. a GUI around Ruby's R bindings) is limited.
    It's partly a frank appreciation of the difficulty of the task, and
    partly down to the fact that SPSS is provided "free" to UK academics by
    nationwide licensing agreements with universities.

    alex
    Alex Fenton, Sep 20, 2007
    #8
  9. Alex Fenton wrote:
    > I use Ruby quite a lot at work for data cleaning, transformation and
    > also for generating SPSS syntax. For example, I used it to create a long
    > set of commands for linking together waves of the longitudinal British
    > Household Panel Study.


    Interesting ... how does Ruby compare with other languages for this
    purpose? We might be getting SPSS and if it's as bizarre as I remember,
    I'm going to need some way of preserving my sanity while using it.

    > Possibly R Commander, which is a Tk interface onto R:
    > http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
    >
    > I haven't ever used it myself, though; it seems to have a good feature
    > set but missing some things I use in SPSS eg Probit models.


    Yes, R Commander is pretty good for a beginner, but it's a crutch IMHO.
    R and its ancestor S were deliberately designed to be programming
    languages and interactive environments.

    > Not that I know of. I agree re R - on numerous attempts I've never
    > managed to get anywhere with it (I have 8 years programming experience
    > and a postgrad in Research Methods). It also seems much more geared to
    > the needs of natural rather than social science.


    Outside of "pure statistics", the two most highly-developed application
    areas for R are biology (http://www.bioconductor.org) and quantitative
    finance aka "program trading". Quantitative finance, however, tends to
    jump on bandwagons and jump off onto the "next big thing" quickly as well.

    It used to be you'd walk into a quant shop and they'd all be coding in
    APL. Then you'd walk into the place a year later and they'd have
    something else. So the "golden days" of R among quants may have passed.
    I think they're into OCaml these days. Or is it Haskell? :)

    > There are things I don't like about SPSS too, apart from price - some
    > interface aspects, and its syntax.


    I was talking to a colleague about this just yesterday. I left Minitab
    for R for two reasons:

    1. It didn't have a real programming language, and
    2. The system as distributed couldn't do a non-linear regression out of
    the box.

    SPSS has been around a *long* time. As far as I can remember, the only
    thing older was the UCLA Bio-Med package from the early 1960s! Does it
    still read like a hodge-podge of FORTRAN, macro assembler, JCL and such?

    > I've written GUI software in Ruby for
    > qualitative data analysis, but my inclination to create a competitor to
    > SPSS on the quant side (eg a GUI round ruby's R bindings) is limited.
    > It's partly a frank appreciation of the difficulty of the task, and
    > partly down to the fact that SPSS is provided "free" to UK academics by
    > nationwide licensing agreements with universities.


    There are a couple of other GUI projects for R. There is an "R-gui"
    mailing list where they all hang out. But it's hard to argue with the
    basic philosophy. R is *supposed* to be a programming language, not a
    statistics package. For that matter, Ruby is *supposed* to be a
    programming language, too. :)

    I've been a programmer for a long time and it didn't take me long to
    learn R. In a sense, S and R are dialects of Lisp, so if you're used to
    procedural languages as opposed to functional languages, you'll have a
    steeper learning curve. And if you're used to object-oriented
    programming as done in Smalltalk, Java or Ruby, you'll find R's
    "objects" and "classes" totally different. They're a bit like Common
    Lisp's CLOS in some senses, but not enough that you'd be able to
    transfer any preconceived notions. I don't tend to use them -- I'm
    perfectly happy with a "define-functions-from-the-bottom-up" programming
    style I learned from Lisp 1.5.
    M. Edward (Ed) Borasky, Sep 21, 2007
    #9
  10. Alex Fenton

    Alex Fenton Guest

    M. Edward (Ed) Borasky wrote:
    > Alex Fenton wrote:
    >> I use Ruby quite a lot at work for data cleaning, transformation and
    >> also for generating SPSS syntax. For example, I used it to create a long
    >> set of commands for linking together waves of the longitudinal British
    >> Household Panel Study.

    >
    > Interesting ... how does Ruby compare with other languages for this
    > purpose? We might be getting SPSS and if it's as bizarre as I remember,
    > I'm going to need some way of preserving my sanity while using it.


    Ruby works nicely for generating SPSS syntax, mainly because of its
    highly functional String/Hash/Array/Regexp classes. For preparing data,
    Excel's also useful, because it includes basic statistical functions (e.g.
    the normal distribution), and because you can copy-n-paste data from a
    spreadsheet into SPSS's Data Editor.
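
To make the syntax-generation idea concrete, here is a minimal sketch of how Ruby's Array and String handling can assemble a repetitive SPSS command. It is not the code used for the study mentioned above: the wave letters, the `wave_X.sav` file names, the `pid` key, and the `match_files_syntax` helper are all hypothetical, chosen only to show one MATCH FILES command being built with a /FILE subcommand per wave.

```ruby
# Hypothetical wave letters and key variable; a real study's layout may differ.
WAVES = %w[a b c]
KEY   = "pid"

# Build one MATCH FILES command that joins every wave's file on the key.
# Assumes each file is already sorted by the key, as MATCH FILES /BY expects.
def match_files_syntax(waves, key)
  # One /FILE subcommand per wave, using an assumed file-naming scheme.
  file_lines = waves.map { |w| "  /FILE='wave_#{w}.sav'" }
  ["MATCH FILES", *file_lines, "  /BY #{key}.", "EXECUTE."].join("\n")
end

syntax = match_files_syntax(WAVES, KEY)
puts syntax
```

Adding another wave then means adding one letter to `WAVES` rather than hand-editing the SPSS file, which is where generating the syntax starts to pay off for long command sets.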

    You might want to evaluate Stata as an alternative to SPSS. I haven't
    used it but several more quantitatively-oriented researchers I know
    speak well of it.

    >> Not that I know of. I agree re R - on numerous attempts I've never
    >> managed to get anywhere with it (I have 8 years programming experience
    >> and a postgrad in Research Methods). It also seems much more geared to
    >> the needs of natural rather than social science.

    >
    > Outside of "pure statistics", the two most highly-developed application
    > areas for R are biology (http://www.bioconductor.org) and quantitative
    > finance aka "program trading". Quantitative finance, however, tends to
    > jump on bandwagons and jump off onto the "next big thing" quickly as well.


    I guess in the "softer" end of social science where I work the data sets
    are relatively small and the analyses not computationally intensive. So
    GUI ease-of-use for occasional users is an important distinguishing feature.

    > SPSS has been around a *long* time. As far as I can remember, the only
    > thing older was the UCLA Bio-Med package from the early 1960s! Does it
    > still read like a hodge-podge of FORTRAN, macro assembler, JCL and such?


    Don't know its heritage, but it's an ugly baby... here's some code to
    create a composite key of a four-digit year and a one-digit UK region code:

    STRING REGION_YEAR(A6).
    COMPUTE REGION_YEAR = CONCAT(
        STRING(Region, F1), '_', STRING(Year, F4)
        ).
    EXECUTE.
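
For comparison, a rough Ruby equivalent of the same composite key might look like the sketch below. The `region_year` method name and the range checks are assumptions added for illustration; the SPSS snippet itself does no validation.

```ruby
# Build a "R_YYYY" key from a one-digit region code and a four-digit year.
# The range checks are an added assumption, not part of the SPSS original.
def region_year(region, year)
  raise ArgumentError, "region must be one digit" unless (0..9).cover?(region)
  raise ArgumentError, "year must be four digits" unless (1000..9999).cover?(year)
  "#{region}_#{year}"
end

puts region_year(7, 2007) # prints 7_2007
```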

    > I've been a programmer for a long time and it didn't take me long to
    > learn R. In a sense, S and R are dialects of Lisp, so if you're used to
    > procedural languages as opposed to functional languages, you'll have a
    > steeper learning curve. And if you're used to object-oriented
    > programming as done in Smalltalk, Java or Ruby, you'll find R's
    > "objects" and "classes" totally different.


    Interesting - I've been programming eight years, but my experience is
    almost all in Ruby, Perl and Javascript, with a bit of C++. Probably why
    I find the SPSS and R syntax so uncomfortable.

    alex
    Alex Fenton, Sep 24, 2007
    #10