Help me understand use of regular expressions to validate data

Discussion in 'Perl Misc' started by Ted, May 29, 2006.

  1. Ted

    Ted Guest

    The context here is I need to create a script that validates data in
    fields in plain text files where fields may be surrounded by double
    quotes and may be separated by commas or tabs. In fact, one supplier
    of a data feed we use has been known to switch between comma separated
    values and tab delimited values, often without warning.

    In one of the FAQs, I found the following regular expressions, but I
    have some questions.

    if (/\D/) { print "has nondigits\n" }
    if (/^\d+$/) { print "is a whole number\n" }
    if (/^-?\d+$/) { print "is an integer\n" }
    if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
    if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
    if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
    if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
    { print "a C float\n" }

    The first question is "What string is the regular expression applied
    to?"

    I can recognize '\d+' as representing an arbitrary number of digits,
    but what are '^' and '$' for ?

    I don't care about distinctions between float decimal and real numbers.
    However, I may have a need to distinguish between float and double
    precision numbers. If that need materializes, how might I modify one
    of the regular expressions above to allow me to determine if the value
    in a given variable is necessarily a double (assuming that any single
    precision number can be treated as if it is a double precision number:
    for the purpose of converting strings from a text file into an
    appropriate number).

    From what I have read, I expect I can use '\w' to test whether or not a
    variable contains a string consisting only of alpha numeric characters.
    Is that right? What would I use to test, using a regular expression,
    whether a given string contains only alphanumeric characters, and that
    the total number of characters is less than or equal to 8? What about
    testing for a string containing precisely 4 letters and 3 digits?

    I will also need to be able to check to see whether or not a given
    string represents a valid date or timestamp.

    To put this back into my context, I'd be reading in the text file,
    splitting each record into its fields. I'd also read in, from a
    different file, information regarding the number of fields and the type
    of each field. I'd then verify that there is the correct number of
    fields and that each field has a valid string that contains the right
    kind of data for that field. I still haven't decided how to handle the
    fact that one of our suppliers sometimes switches between commas and
    tabs, sometimes without warning. Suggestions are welcome, though.

    Sorry if this seems basic, but it has been eons since I last looked at
    regular expressions, and I have not found sufficient detail in the
    documentation I have found.

    Thanks,

    Ted
    Ted, May 29, 2006
    #1

  2. Juha Laiho

    Juha Laiho Guest

    "Ted" <> said:
    >In one of the FAQs, I found the following regular expressions, but I
    >have some questions.
    >
    > if (/\D/) { print "has nondigits\n" }
    > if (/^\d+$/) { print "is a whole number\n" }
    > if (/^-?\d+$/) { print "is an integer\n" }
    > if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
    > if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
    > if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
    > if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
    > { print "a C float\n" }
    >
    >The first question is "What string is the regular expression applied
    >to?"


    $_ - the "default" variable. This is set in various contexts; see
    description in "perldoc perlvar".

    >I can recognize '\d+' as representing an arbitrary number of digits,
    >but what are '^' and '$' for ?


    Start and end of the string. So plain /\d+/ would match a23b as well as
    23, whereas /^\d+$/ requires that the string contain only digits.
    Described in "perldoc perlre", by the way.
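
    A minimal sketch of that difference, assuming a few made-up sample
    strings (the sub names are only for illustration):

```perl
use strict;
use warnings;

# Unanchored: true if the string contains at least one digit anywhere.
sub has_digits { my ($s) = @_; return $s =~ /\d+/ ? 1 : 0 }

# Anchored at both ends: true only if the string is digits from start to end
# (note that $ still tolerates a single trailing newline).
sub all_digits { my ($s) = @_; return $s =~ /^\d+$/ ? 1 : 0 }

for my $s ('23', 'a23b', '') {
    printf "%-6s contains digits: %d, only digits: %d\n",
        "'$s'", has_digits($s), all_digits($s);
}
```

    Only '23' passes both tests; 'a23b' passes the unanchored one alone, and
    the empty string passes neither because \d+ needs at least one digit.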

    >I don't care about distinctions between float decimal and real numbers.
    > However, I may have a need to distinguish between float and double
    >precision numbers.


    For that I don't have an answer.

    >From what I have read, I expect I can use '\w' to test whether or not a
    >variable contains a string consisting only of alpha numeric characters.


    No, /\w/ would be true whenever the variable contains at least one
    "word character". To ensure that you only have word characters, you
    could use /^\w+$/ . If you also allow empty strings, then /^\w*$/ would
    be the correct one.

    > Is that right? What would I use to test, using a regular expression,
    >whether a given string contains only alphanumeric characters, and that
    >the total number of characters is less than or equal to 8?


    /^\w{0,8}$/ or {1,8}, if you don't want empty strings. \w includes
    the underscore character, so you'll have to tune if you want to disallow
    it.
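
    One way to do that tuning, as a sketch that assumes "alphanumeric" means
    ASCII letters and digits only (is_short_alnum is a hypothetical name):

```perl
use strict;
use warnings;

# 1 to 8 ASCII letters or digits. Unlike \w this rejects "_".
# ($ tolerates one trailing newline, so chomp input lines first.)
sub is_short_alnum {
    my ($s) = @_;
    return $s =~ /^[A-Za-z0-9]{1,8}$/ ? 1 : 0;
}

print "$_: ", is_short_alnum($_), "\n" for qw(abc123 a_b toolongvalue);
```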

    >What about testing for a string containing precisely 4 letters and 3 digits?

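    /^[[:alpha:]]{4}\d{3}$/

    (POSIX classes such as [:alpha:] must sit inside a bracketed character
    class; this pattern matches 4 letters followed by 3 digits, in that order.)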

    >I will also need to be able to check to see whether or not a given
    >string represents a valid date or timestamp.


    Please start by defining all possible things you'd like to consider as
    valid dates/timestamps. Then, if you also want to parse the actual
    timestamps (i.e. know what the time/date is, in addition to just storing
    the data), check that no two allowed formats can be confused with
    each other.

    >Sorry if this seems basic, but it has been eons since I last looked at
    >regular expressions, and I have not found sufficient detail in the
    >documentation I have found.


    "perldoc perlre", distributed with your perl interpreter, and online at
    http://www.perl.com/doc/manual/html/pod/perlre.html . All my answers
    above are from the data in that one document (except what was related to
    $_).
    --
    Wolf a.k.a. Juha Laiho Espoo, Finland
    (GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
    PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
    "...cancel my subscription to the resurrection!" (Jim Morrison)
    Juha Laiho, May 29, 2006
    #2

  3. On Mon, 29 May 2006, Juha Laiho wrote:

    > "Ted" <> said:
    >
    > >From what I have read, I expect I can use '\w' to test whether or not a
    > >variable contains a string consisting only of alpha numeric characters.

    >
    > No, /\w/ would be true whenever the variable contains at least one
    > "word character". To ensure that you only have word characters, you
    > could use /^\w+$/ . If you also allow empty strings, then /^\w*$/ would
    > be the correct one.


    Pretty much, but, as the documentation (perldoc perlre) says, \w
    also includes the underscore (OK, you said that later on); also, if
    "use locale" is in effect, it includes whatever characters the locale
    defines to be alphabetic.

    regards
    Alan J. Flavell, May 29, 2006
    #3
  4. Ted <> wrote:

    > The context here is I need to create a script that validates data in



    The common idiom for validating data is:

    anchor the start
    anchor the end
    write a pattern in between that accounts for everything that
    you want to allow

    Then if the pattern matches the string, valid data, else invalid data.
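
    As a rough sketch of how that idiom could drive per-field validation: the
    type names (integer, decimal, code) and the sub validate_record below are
    made up for illustration, and the patterns are borrowed from this thread.

```perl
use strict;
use warnings;

# Hypothetical field-type names mapped to fully anchored patterns.
my %pattern_for = (
    integer => qr/^[+-]?\d+$/,
    decimal => qr/^-?(?:\d+(?:\.\d*)?|\.\d+)$/,
    code    => qr/^\w{1,8}$/,
);

# Returns a list of problems; an empty list means the record is valid.
sub validate_record {
    my ($fields, $types) = @_;
    return ("expected " . @$types . " fields, got " . @$fields)
        if @$fields != @$types;
    my @errors;
    for my $i (0 .. $#$types) {
        my $re = $pattern_for{ $types->[$i] }
            or die "unknown field type '$types->[$i]'";
        push @errors, "field $i ('$fields->[$i]') is not a valid $types->[$i]"
            unless $fields->[$i] =~ $re;
    }
    return @errors;
}

my @errors = validate_record([ '42', '3.14', 'AB_12' ], [qw(integer decimal code)]);
print @errors ? join("\n", @errors) . "\n" : "record ok\n";
```

    The field count is checked first, so a record split on the wrong
    separator is reported once instead of field by field.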


    > fields in plain text files where fields may be surrounded by double
    > quotes and may be separated by commas or tabs. In fact, one supplier
    > of a data feed we use has been known to switch between comma separated
    > values and tab delimited values, often without warning.



    In that case, I would attempt to detect what separator is being
    used, then normalize it before proceeding to splitting out the
    fields for individual validation.


    > In one of the FAQs, I found the following regular expressions, but I
    > have some questions.
    >
    > if (/\D/) { print "has nondigits\n" }
    > if (/^\d+$/) { print "is a whole number\n" }
    > if (/^-?\d+$/) { print "is an integer\n" }
    > if (/^[+-]?\d+$/) { print "is a +/- integer\n" }
    > if (/^-?\d+\.?\d*$/) { print "is a real number\n" }
    > if (/^-?(?:\d+(?:\.\d*)?|\.\d+)$/) { print "is a decimal number\n" }
    > if (/^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/)
    > { print "a C float\n" }
    >
    > The first question is "What string is the regular expression applied
    > to?"



    You should check Perl's std docs *before* posting to the Perl newsgroup.

    The description of the m// operator in perlop.pod says what string
    will be searched by default, and how to make it look somewhere
    besides that default place if you wish to.

    If no string is specified via the =~ or !~ operator,
    the $_ string is searched.
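
    A small sketch of both forms, using made-up field values:

```perl
use strict;
use warnings;

my @fields = ('42', '-7', '4x2');

# grep sets $_ to each element in turn, so a bare m// tests that element.
my @whole = grep { /^\d+$/ } @fields;

# !~ is the negated binding: keep the elements that are NOT integers.
my @bad = grep { $_ !~ /^-?\d+$/ } @fields;

# With =~ the match is bound to a named variable instead of $_.
my $value      = '-7';
my $is_integer = $value =~ /^-?\d+$/ ? 1 : 0;

print "whole numbers: @whole\n";
print "not integers: @bad\n";
print "'$value' is an integer: $is_integer\n";
```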


    > I can recognize '\d+' as representing an arbitrary number of digits,



    It does not match zero digits, so not quite an "arbitrary number".


    > but what are '^' and '$' for ?



    Once again, going to the docs is faster, more authoritative, and
    helps you to avoid wearing out your welcome before you get to
    questions that cannot be answered by a cursory search of the
    documentation.

    perldoc perlre

    ^ Match the beginning of the line
    $ Match the end of the line (or before newline at the end)


    (my code below ignores that parenthetical, \z might be better
    than $ for your application...)
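
    To illustrate that parenthetical, a sketch assuming a line that still
    carries its newline because it has not been chomp()ed:

```perl
use strict;
use warnings;

my $line = "123\n";    # e.g. read from a file and not yet chomp()ed

my $dollar_matches = $line =~ /^\d+$/  ? 1 : 0;   # $ allows one trailing newline
my $z_matches      = $line =~ /^\d+\z/ ? 1 : 0;   # \z must be the very end

print "with \$: $dollar_matches, with \\z: $z_matches\n";
```

    The $ version accepts the line and the \z version rejects it, so for
    strict validation either chomp first or anchor with \z.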


    > From what I have read, I expect I can use '\w' to test whether or not a
    > variable contains a string consisting only of alpha numeric characters.
               ^^^^^^^^          ^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^
    > Is that right?



    No.

    First it does not match only alphanumerics, just as perlre.pod says:

    \w Match a "word" character (alphanumeric plus "_")

    Secondly, \w can be used to test if the string (which may not be
    in a "variable") _contains_ an alphanumeric or "_" character.

    To get to "consisting only of", you need to apply the idiom:

    /^\w+$/

    or

    /^[a-zA-Z0-9_]+$/

    or, if you really want only alphanumerics

    /^[a-zA-Z0-9]+$/


    > What would I use to test, using a regular expression,
    > whether a given string contains only alphanumeric characters, and that
    > the total number of characters is less than or equal to 8?



    /^\w{0,8}$/

    but your spec is probably incomplete, so I think you probably want:

    /^\w{1,8}$/

    instead.


    > What about
    > testing for a string containing precisely 4 letters and 3 digits?



    One part regex, two parts NOT a regex:

    /^[a-zA-Z0-9]{7}$/ and tr/a-zA-Z// == 4 and tr/0-9// == 3
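
    The same test wrapped in a sub so it can be applied to any variable
    rather than $_ (the sub name is hypothetical, and \z is used in place
    of $ so a trailing newline is rejected):

```perl
use strict;
use warnings;

# True if $s is exactly 7 ASCII alphanumerics: 4 letters and 3 digits, any order.
sub four_letters_three_digits {
    my ($s) = @_;
    return 0 unless $s =~ /^[a-zA-Z0-9]{7}\z/;
    # tr/// with an empty replacement list only counts; it does not modify $s.
    return ( ($s =~ tr/a-zA-Z//) == 4 && ($s =~ tr/0-9//) == 3 ) ? 1 : 0;
}

print "$_: ", four_letters_three_digits($_), "\n" for qw(ab12cd3 abcde12);
```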


    > I will also need to be able to check to see whether or not a given
    > string represents a valid date or timestamp.



    You are going to need to give more precise criteria for "valid" here.

    In most of _my_ applications I usually use:

    /^\d\d\d\d-\d\d-\d\d$/

    and call it good enough.

    If you want 2006-02-30 or 2006-13-01 to be invalid, or if you want
    \d\d\d\d-02-29 to be valid for some years and invalid for other
    years, then I'd start looking for a module on CPAN...
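
    Short of pulling in a module, a hand-rolled calendar check might look
    like the sketch below. It assumes the YYYY-MM-DD format above and the
    Gregorian leap-year rule; is_valid_date is a hypothetical name, not
    something from CPAN.

```perl
use strict;
use warnings;

# True if $s is YYYY-MM-DD and names a real day in the Gregorian calendar.
sub is_valid_date {
    my ($s) = @_;
    my ($y, $m, $d) = $s =~ /^(\d{4})-(\d\d)-(\d\d)\z/ or return 0;
    return 0 if $m < 1 || $m > 12;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    # Leap years: divisible by 4, except centuries not divisible by 400.
    my $leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    my $max  = ($m == 2 && $leap) ? 29 : $days[$m - 1];
    return ($d >= 1 && $d <= $max) ? 1 : 0;
}

print "$_: ", is_valid_date($_), "\n"
    for qw(2006-02-28 2006-02-30 2006-13-01 2004-02-29);
```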


    > I still haven't decided how to handle the
    > fact that one of our suppliers sometimes switches between commas and
    > tabs, sometimes without warning. Suggestions are welcome, though.



    Insufficient information.

    When commas are used, can you have commas in fields?

    When tabs are used, can you have tabs in fields?

    If the format allows separators in quoted fields, then how are
    quotes represented in quoted fields?

    Is there a fixed and expected number of fields in a record?

    If not, then can you at least expect the _same_ number of fields
    in any particular file?



    You can perhaps "guess".

    Read the first 10 or 20 records and calculate the tabs/commas ratio
    for each, then see if most of the ratios are greater or less
    than one.

    Certainly not robust or fool-proof, but would probably work on most data...
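
    A sketch of that guess, under the assumption that comparing the two
    counts per line is an acceptable stand-in for the ratio (it avoids
    dividing by zero when a line has no commas). guess_separator is a
    hypothetical name, and quoted fields are not treated specially here.

```perl
use strict;
use warnings;

# Guess "\t" or "," from a sample of records by majority vote:
# each line votes for whichever of the two characters it contains more of.
sub guess_separator {
    my @lines = @_;
    my ($tab_votes, $comma_votes) = (0, 0);
    for my $line (@lines) {
        my $tabs   = ($line =~ tr/\t//);   # tr with empty replacement just counts
        my $commas = ($line =~ tr/,//);
        if    ($tabs > $commas) { $tab_votes++ }
        elsif ($commas > $tabs) { $comma_votes++ }
    }
    return if $tab_votes == $comma_votes;   # no clear winner: undef
    return $tab_votes > $comma_votes ? "\t" : ",";
}

my @sample = ("a\tb\tc", "d\te,f\tg", "h\ti\tj");
my $sep    = guess_separator(@sample);
print "separator looks like: ",
    (!defined $sep ? 'unknown' : $sep eq "\t" ? 'tab' : 'comma'), "\n";
```

    Returning undef on a tie lets the caller reject the file rather than
    split it on a coin flip.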


    > Sorry if this seems basic,



    "basic" is nothing to apologize for. There is no "minimum complexity"
    expected for posting here.

    Asking things that can be answered straightaway by a cursory search
    of Perl's standard documentation however is another matter.

    Have you seen the Posting Guidelines that are posted here frequently?


    > but it has been eons since I last looked at
    > regular expressions, and I have not found sufficient detail in the
    > documentation I have found.



    If you tell us what documentation you have found, then we might be
    able to tell you about some that you have not found...

    Have you found "perlop.pod" and "perlre.pod" for instance?


    See also:

    perldoc perlrequick

    perldoc perlretut

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 29, 2006
    #4
  5. Henry Law

    Henry Law Guest

    Ted wrote:
    > The context here is I need to create a script that validates data in
    > fields in plain text files where fields may be surrounded by double
    > quotes and may be separated by commas or tabs. <snip>


    Lots of good advice in other posts, Ted. Here's what I'd add:

    1. Like the good programmer I hope you are, you need to be very
    precise with the specifications of what is and is not "valid".
    For example in your question about distinguishing between
    single and double-precision numbers, can you state a rule
    that would allow you to distinguish reliably between them?
    If so then you, or a combination of you and the group, can
    code it.

    2. Don't forget CPAN. Go to http://search.cpan.org and look for
    some modules that may help. I did so on your behalf and thought
    Data::Validate and also maybe Data::FormValidator::Constraints::Dates
    looked relevant. Have a look for yourself and see what I've missed.

    3. In a complex situation like this I don't think you should expect
    to do all your validation simply and elegantly. Particularly if
    your data has some real funnies, like switching between delimiters
    in mid-stream! Also your date stamps may be quite idiosyncratic,
    such that the standard modules don't understand them. Be prepared
    to do some parsing on the data fields, and perhaps some substitution,
    before using some standard module or snippet of code.

    4. If what you're doing is really Extract/Transform/Load then bear in
    mind that the manufacturers of ETL tools make a good living out of
    the fact that it's astoundingly difficult to code up rules that will
    squash real-world data into the Procrustean bed of a fixed
    format! Good luck.

    --

    Henry Law <>< Manchester, England
    Henry Law, May 29, 2006
    #5
  6. Eric Bohlman

    Eric Bohlman Guest

    Henry Law <> wrote in news:1148940322.7138.0@proxy01.news.clara.net:

    > 2. Don't forget CPAN. Go to http://search.cpan.org and look for
    > some modules that may help. I did so on your behalf and thought
    > Data::Validate and also maybe Data::FormValidator::Constraints::Dates
    > looked relevant. Have a look for yourself and see what I've missed.


    He should also take a look at Regexp::Common which provides an impressive
    collection of already-written and exhaustively-tested (no kidding!) regular
    expressions for matching common data formats.
    Eric Bohlman, May 30, 2006
    #6
