extract character strings from displayed web page.

A Causal

I'm an experienced C programmer, but I have never worked with any sort
of internet programming. I would like to write a program to search for
certain character strings in a currently displayed web page, and then
get the string that immediately follows the one that I searched for. It
seems like an easy thing to do, after all the stuff that I want is
staring me right in the face, but I have no idea where that stuff is
stored or how to access it.


Thanks

Ron
 
Irrwahn Grausewitz

A Causal said:
I'm an experienced C programmer, but I have never worked with any sort
of internet programming. I would like to write a program to search for
certain character strings in a currently displayed web page, and then
get the string that immediately follows the one that I searched for. It
seems like an easy thing to do, after all the stuff that I want is
staring me right in the face, but I have no idea where that stuff is
stored or how to access it.

As this is highly OS/application dependent, you should ask this in a
newsgroup dedicated to that, as comp.lang.c is only about portable
ISO-C. See http://www.angelfire.com/ms3/bchambless0/welcome_to_clc.html


Regards
 
Tristan Miller

Greetings.

A Causal said:
I'm an experienced C programmer, but I have never worked with any sort
of internet programming. I would like to write a program to search for
certain character strings in a currently displayed web page, and then
get the string that immediately follows the one that I searched for. It
seems like an easy thing to do, after all the stuff that I want is
staring me right in the face, but I have no idea where that stuff is
stored or how to access it.

Interfacing with web browsers or with HTTP is not something that is
built into C, so there is no standard answer to your query. It will depend
on your particular compiler, operating system, and/or whatever third-party
libraries you use. If you assume that the user has already saved the HTML
file to disk, however, then it's just a regular text file which C can
process.
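
As a rough illustration of the saved-file approach, here is a minimal sketch
in standard C. The function names read_whole_file and find_following_token
are made up for this example, and "the string that immediately follows" is
assumed to mean the next run of non-whitespace characters, stopping early at
a '<' so that a closing tag is not swallowed. Real pages may need a
different definition.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read an entire file into a NUL-terminated heap buffer.
   Returns NULL on failure; the caller frees the result. */
char *read_whole_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (cap - len <= 1) {               /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    fclose(fp);
    buf[len] = '\0';
    return buf;
}

/* Find `key` in `text` and copy the token that follows it into `out`.
   A token here is a run of characters up to whitespace or '<'
   (an assumption made for this sketch).
   Returns 1 if a non-empty token was copied, 0 otherwise. */
int find_following_token(const char *text, const char *key,
                         char *out, size_t out_size)
{
    if (out_size == 0)
        return 0;
    out[0] = '\0';

    const char *p = strstr(text, key);
    if (p == NULL)
        return 0;
    p += strlen(key);

    while (*p != '\0' && isspace((unsigned char)*p))
        p++;                                /* skip blanks after the key */

    size_t i = 0;
    while (*p != '\0' && *p != '<' && !isspace((unsigned char)*p)
           && i + 1 < out_size)
        out[i++] = *p++;
    out[i] = '\0';
    return i > 0;
}
```

With a page containing "<p>Temperature: 21.5 C</p>", searching for
"Temperature:" would leave "21.5" in the output buffer. Note that this works
on the raw HTML source, so markup sitting between the key and the value gets
in the way.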

Note that C isn't particularly well-suited for intensive text processing,
though; unless it's being integrated in a much larger C program, it would
be better and faster to write the sort of application you describe using
some regexp-based tool such as sed or perl.

Regards,
Tristan
 
Christopher Benson-Manica

Tristan Miller said:
Note that C isn't particularly well-suited for intensive text processing,
though; unless it's being integrated in a much larger C program, it would
be better and faster to write the sort of application you describe using
some regexp-based tool such as sed or perl.

Why do you say that? (note that this is an honest question, not a challenge)
 
Tristan Miller

Greetings.

Christopher Benson-Manica said:
Why do you say that? (note that this is an honest question, not a
challenge)

It's simply a question of specialization of tools. It's certainly possible
to drive in a nail using the blunt end of a screwdriver, though it would be
faster and less accident-prone to use a hammer. Likewise, building
applications (even small ones) which deal almost exclusively with text
processing is usually more efficient (with respect to development time and
ease of debugging, not necessarily execution speed) when using a language
specifically devoted to that task. A program to do regular expression
search-and-replacement on multiple files is literally four characters long
in sed (not counting the filenames and regular expressions themselves); the
corresponding program in C would necessarily be several lines long, even if
one used a third-party regexp library. You would need to include the
regexp and stdio headers, define the main function, declare a file pointer,
open each file in argv[] for reading (including error checking), loop
through each line of the file, do the regexp replacement, write out the new
line, close the file, and finally return from main. Sure, the compiled C
program might run a hundred times faster than the corresponding interpreted
sed or perl code, but if it's just a one-off program, you've just wasted
five minutes to write the C program plus 0.00001 seconds to run it versus
spending five seconds to write the sed program plus 0.001 seconds to run
it.
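
To give a feel for the C side of that comparison, here is a sketch of the
steps listed above. It assumes the POSIX <regex.h> interface (regcomp,
regexec, regfree), which is not part of standard C, and the function names
replace_in_line, replace_in_stream and replace_in_files are invented for the
example. Lines are read into a fixed 4096-byte buffer, so a match that
straddles a longer line's chunk boundary would be missed.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Write `line` to `out`, replacing every match of `re` with `repl`. */
void replace_in_line(const regex_t *re, const char *line,
                     const char *repl, FILE *out)
{
    regmatch_t m;
    const char *p = line;
    int flags = 0;

    while (*p != '\0' && regexec(re, p, 1, &m, flags) == 0) {
        fwrite(p, 1, (size_t)m.rm_so, out);  /* text before the match */
        fputs(repl, out);
        if (m.rm_eo == m.rm_so) {            /* empty match: step forward */
            if (p[m.rm_eo] == '\0') {
                p += m.rm_eo;
                break;
            }
            fputc(p[m.rm_eo], out);
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        flags = REG_NOTBOL;                  /* no longer at line start */
    }
    fputs(p, out);                           /* remainder after last match */
}

/* Copy `in` to `out`, applying the replacement to each line.
   Returns 0 on success, -1 if the pattern does not compile. */
int replace_in_stream(FILE *in, FILE *out,
                      const char *pattern, const char *repl)
{
    regex_t re;
    char line[4096];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (fgets(line, sizeof line, in) != NULL)
        replace_in_line(&re, line, repl, out);
    regfree(&re);
    return 0;
}

/* Open each named file in turn and write the edited text to `out`.
   Returns 0 if every file was processed, -1 otherwise. */
int replace_in_files(int nfiles, char **paths, const char *pattern,
                     const char *repl, FILE *out)
{
    int status = 0;

    for (int i = 0; i < nfiles; i++) {
        FILE *in = fopen(paths[i], "r");
        if (in == NULL) {
            perror(paths[i]);
            status = -1;
            continue;
        }
        if (replace_in_stream(in, out, pattern, repl) != 0)
            status = -1;
        fclose(in);
    }
    return status;
}
```

A main function would only need to check argc and call something like
replace_in_files(argc - 3, argv + 3, argv[1], argv[2], stdout). Even so,
that is a few dozen lines of C for what is a one-liner in sed.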

Regards,
Tristan
 
Tristan Miller

Greetings.

You've clearly never hit your thumb with a hammer ;) Seriously, on
reading your original post, I thought you were speaking of execution
efficiency, which you were not. No complaints from me in that case...
Would you say, then, that C is pretty good for text processing as far as
execution efficiency is concerned?

Optimization for speed and memory use is compiler-dependent, but generally
speaking, yes, a well-written algorithm in compiled C will be faster at
running text processing applications than the same application executed in
interpreted sed. With C, you're simply "closer to the hardware", plus
there's no need to load in a potentially huge interpreter every time you
want to run your application.

In this day and age, however, you aren't going to gain that much in text
processing even if you do use C. The bottleneck in the application is more
likely to be the inherent sloth of I/O rather than inefficient code. I
work in natural-language processing, and I can attest that even those
researchers who routinely process text corpora ranging into the gigabytes
don't flinch at using high-level text- or logic-oriented languages like
Perl or Prolog to munge the data. We tend to use C more for plain old
number-crunching, as with the large co-occurrence matrices the
aforementioned mungers may produce.

Regards,
Tristan
 
Derk Gwen

A Causal wrote:
# I'm an experienced C programmer, but I have never worked with any sort
# of internet programming. I would like to write a program to search for
# certain character strings in a currently displayed web page, and then
# get the string that immediately follows the one that I searched for. It
# seems like an easy thing to do, after all the stuff that I want is
# staring me right in the face, but I have no idea where that stuff is
# stored or how to access it.

You'll need some library to open the socket and fetch the page; this is not
part of standard C. You'll also need to decide exactly what you mean by 'string'
and 'after' if you are fetching (as normal) an HTML page; you can also find
libraries to parse HTML if you need that. If you don't need to parse the HTML,
you can just read the socket stream with stdio and use a state table or
strstr() or other such techniques to scan the input.
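
A state table for this can be quite small. The sketch below is one possible
reading of that idea, not a real HTML parser: it reads any stdio stream one
character at a time, drops everything inside <...> tags (treating a '>'
inside a double-quoted attribute value as part of the tag), and writes a
space where each tag ended so that neighbouring words stay apart. The names
strip_tags, next_state and classify are made up for the example, and
comments, scripts, single-quoted attributes and character entities are all
ignored.

```c
#include <stdio.h>
#include <string.h>

enum state { TEXT, TAG, QUOTED, NSTATES };
enum cls { C_LT, C_GT, C_QUOTE, C_OTHER, NCLASSES };

/* Transition table: next_state[current state][character class]. */
static const enum state next_state[NSTATES][NCLASSES] = {
    /*             '<'     '>'     '"'     other  */
    /* TEXT   */ { TAG,    TEXT,   TEXT,   TEXT   },
    /* TAG    */ { TAG,    TEXT,   QUOTED, TAG    },
    /* QUOTED */ { QUOTED, QUOTED, TAG,    QUOTED },
};

static enum cls classify(int c)
{
    switch (c) {
    case '<': return C_LT;
    case '>': return C_GT;
    case '"': return C_QUOTE;
    default:  return C_OTHER;
    }
}

/* Copy the text outside of tags from `in` into `out` (NUL-terminated,
   at most out_size - 1 characters). Returns the number of characters
   stored. Each closing '>' of a tag is replaced by a single space. */
size_t strip_tags(FILE *in, char *out, size_t out_size)
{
    enum state st = TEXT;
    size_t len = 0;
    int c;

    if (out_size == 0)
        return 0;
    while (len + 1 < out_size && (c = getc(in)) != EOF) {
        enum state nx = next_state[st][classify(c)];
        if (st == TEXT && nx == TEXT)
            out[len++] = (char)c;        /* ordinary visible character */
        else if (st == TAG && nx == TEXT)
            out[len++] = ' ';            /* a tag just ended */
        st = nx;
    }
    out[len] = '\0';
    return len;
}
```

The resulting buffer can then be searched with strstr(). Given the input
"<tr><td>Price:</td><td>42.50</td></tr>", the stripped text still contains
"Price:" with "42.50" as the next word after it, which is closer to what is
seen on screen than the raw source is.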

You can also use something other than C. Scripting languages can do this kind
of stuff in half a dozen lines.
 
