An interview question by MS

L

Lambda

In an interview, I was asked:

Define a function to split a string with some token.

I said let the function return a string array.

He asked me how do i know the array size.
Yes, I have to traverse the string to find how large the array is for
the first time.
And traverse the string when actually doing the split for the second
time.

I think this is maybe not the correct answer.

How should i handle such situation?

I find there is a 'strtok(s, ct)' function in the standard library.
The first cll in a sequence has a non-NULL s.
Each subsequent call, indicated by a NULL value of s.
strtok returns NULL when no further token is found.

Why indeed this function defined this way?
Why not return a string array directly?
Maybe this is the correct answer??
 
R

Richard Heathfield

Lambda said:
In an interview, I was asked:

Define a function to split a string with some token.

I said let the function return a string array.

Strictly speaking, that's impossible. What you *can* do is return a pointer
to the first element in an array (which is what many people actually mean
when they say "return an array"). But I'm not convinced that this is the
best solution.

Note that "a string array" is in itself rather ambiguous. I would read it
as "an array of strings", i.e. either an array of arrays containing
strings or an array of pointers to char, each of which points at (the
first character of) a string. But you might have intended to return one
string per call, the same way that strtok does (in which case why not just
use strtok or a variant thereof?). I'll assume, however, that you meant
"an array of strings".
He asked me how do i know the array size.
Yes, I have to traverse the string to find how large the array is for
the first time.
And traverse the string when actually doing the split for the second
time.

No, you don't need to do this in two passes. You can build the array
dynamically, extending it as and when required.

How to tell the *caller* the array size bears thinking about. One way is to
wrap the whole thing in a struct, and either have your function accept a
pointer to such a struct or, perhaps better, allocate the memory for it
yourself and return a pointer to it.

The way I would do this is to start by encapsulating the tokenisation in
such a struct, and writing a function to create and initialise it. I would
then write a function that accepts a pointer to such a struct, together
with the string to be tokenised and the token delimiter list. It would
then do the tokenisation, creating strings as required, and loading them
into the array.

At every stage, I would think about the most appropriate way to encapsulate
each lower-level concept as I come across it (or, more likely, I'd re-use
something from my personal library that already encapsulates it). I would
anticipate the high-level tokenisation function doing little more than
calling existing custom routines in the right order and for the right
number of times.

<snip>
 
R

Richard Tobin

Lambda said:
He asked me how do i know the array size.
Yes, I have to traverse the string to find how large the array is for
the first time.
And traverse the string when actually doing the split for the second
time.

That's a plausible way to do it. But since you are going to have to
dynamically allocate the array, you could allocate a small size
initially and call realloc() to grow it when needed. In fact, you
could allocate 1 entry initially and call realloc() every time,
relying on the library to have a reasonable realloc() implementation.
I find there is a 'strtok(s, ct)' function in the standard library.
The first cll in a sequence has a non-NULL s.
Each subsequent call, indicated by a NULL value of s.
strtok returns NULL when no further token is found.

Why indeed this function defined this way?
Why not return a string array directly?
Maybe this is the correct answer??

It has the advantage of not allocating memory, which may be useful.
It has the substantial disadvantage that it keeps hidden state - the
current position in the string - so you can't interleave calls to
strtok() with different strings. Some implementations have a version
strtok_r() with an extra argument to avoid this.

-- Richard
 
S

santosh

In an interview, I was asked:

Define a function to split a string with some token.

The Standard library strtok does this, but if you're going to reinvent
the functionality you can do better.
I said let the function return a string array.

This is one possibility.
He asked me how do i know the array size.
Yes, I have to traverse the string to find how large the array is for
the first time.
And traverse the string when actually doing the split for the second
time.

I think this is maybe not the correct answer.

There is no "the correct" answer. Barring obviously wrong methods, each
design may be appropriate to a certain situation. strtok() design is
one possibility, though as I noted, you can improve upon it if you are
writing from scratch.

In general, other than modifying the original string like strtok(),
there are two methods. One is for the splitting function to allocate
memory itself and return the token. In this case the caller is usually
left with the responsibility of freeing the memory, but it allows for
efficient use of memory(?). The other method is to have the caller
provide a buffer to store the token into. This allows for memory to be
reused easily. Both methods can also be combined.

There are also other less favourable options like using static and
global arrays etc., usually bad design, particularly for a library
function.
How should i handle such situation?

Did they ask you to write the function then and there during the
interview?

Usually the best general answer is to discuss the pros and cons of
various methods. It shows that you are aware of the subject which is
actually a problem not just confined to strtok().
I find there is a 'strtok(s, ct)' function in the standard library.
The first cll in a sequence has a non-NULL s.
Each subsequent call, indicated by a NULL value of s.
strtok returns NULL when no further token is found.

Why indeed this function defined this way?

You should ask the original developers whoever they were.
Why not return a string array directly?
Maybe this is the correct answer??

As I said there is no one correct answer, though there are many wrong
answers.
 
D

Duncan Muirhead

In an interview, I was asked:

Define a function to split a string with some token.
I find there is a 'strtok(s, ct)' function in the standard library.
The first cll in a sequence has a non-NULL s.
Each subsequent call, indicated by a NULL value of s.
strtok returns NULL when no further token is found.

Why indeed this function defined this way?
Why not return a string array directly?
Maybe this is the correct answer??
Note that the separators can vary between calls to
strtok on the same base string. They could for example depend
on previous tokens (for example some strings might contain
numerical fields separated by spaces, others (eg times) might
contain numerical fields separated by ':'). This would be
awkward to arrange with a splitter that processed the whole
string in a oner.
 
R

robertwessel2

In an interview, I was asked:

Define a function to split a string with some token.

I said let the function return a string array.

He asked me how do i know the array size.
Yes, I have to traverse the string to find how large the array is for
the first time.
And traverse the string when actually doing the split for the second
time.

I think this is maybe not the correct answer.

How should i handle such situation?

I find there is a 'strtok(s, ct)' function in the standard library.
The first cll in a sequence has a non-NULL s.
Each subsequent call, indicated by a NULL value of s.
strtok returns NULL when no further token is found.

Why indeed this function defined this way?
Why not return a string array directly?
Maybe this is the correct answer??


I find it highly improbably that an MS interview question was intended
to elicit a C response. Surely the desired response was to create and
return a C++ container object in some fashion.
 
S

santosh

I find it highly improbably that an MS interview question was intended
to elicit a C response. Surely the desired response was to create and
return a C++ container object in some fashion.

MS is over C++. They would probably want a C#/CLI/.NET managed class
implemented in it's own Sandbox. :)
 
L

Lambda

I find it highly improbably that an MS interview question was intended
to elicit a C response. Surely the desired response was to create and
return a C++ container object in some fashion.


This question is very simple with Java or C++, and I believe it's
simple with C#.
I can use List in Java or vector<string> in C++.

When I asked him may i use that languages, he said that languages hide
the details
And insist on implementing with C :(

I'm interviewed by MS only once, maybe this is not a typical question.
From my experience, MS like to ask detail questions.
I must write the code in paper in a short time, after i discuss with
him on
what the function interface will be, there is little time for me to
complete the program. :(
 
P

Philip Potter

Lambda said:
In an interview, I was asked:

Define a function to split a string with some token.

I said let the function return a string array.

He asked me how do i know the array size.

Noone seems to have comment on this so far, so here goes:

As Richard Heathfield said elsethread, it is impossible to return a
genuine array, though you can return a pointer pointing to the first
element of an array (for further information, see chapter 6 of the
comp.lang.c FAQ.)

Such a pointer to the first element of an array of strings (each of
which is really a pointer to the first element of an array of chars)
would have type char **.

When you return that pointer, it has lost all information about how big
the array was. There are two main methods for communicating the size of
an array pointed to by a pointer: 1) have some 'sentinel' value within
the array which represents the last element, and 2) communicate the size
of the array through some other channel, perhaps by returning a struct
containing the pointer and a size_t size element.

Examples of method 1) include:
* Null-terminated strings. The '\0' character is the sentinel.
* The GLib function g_strdupv():
gchar **g_strdupv (gchar **str_array);
gchar is some character type; g_strdupv copies a NULL-terminated array
of strings of gchar. The NULL-pointer is the sentinel.

Examples of method 2) include:
* The qsort() standard library function.
* The printf() standard library function. Here the length of the
variable-length argument list is communicated through the format string.
(An argument list isn't an array, but this is the same idea.)

Each method has advantages and disadvantages.
 
C

Christopher Benson-Manica

[comp.lang.c] Richard Heathfield said:
No, you don't need to do this in two passes. You can build the array
dynamically, extending it as and when required.

I would think, though, that the work of traversing the string twice
would be less than the probable extra calls to realloc() - would you
agree?
How to tell the *caller* the array size bears thinking about. One way is to
wrap the whole thing in a struct, and either have your function accept a
pointer to such a struct or, perhaps better, allocate the memory for it
yourself and return a pointer to it.

I would say that the best solution would be to write both functions
(with the second delegating to the first) and allow the caller to
invoke whichever function is more convenient.
 
B

Ben Pfaff

Lambda said:
I find there is a 'strtok(s, ct)' function in the standard library.

strtok() has at least these problems:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.

* The identity of the delimiter is lost, because it is
changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
because it forces you to make a copy of the string if
you want to use it later. It also means that you can't
tokenize a string literal with it; this is not
necessarily something you'd want to do all the time but
it is surprising.

* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost. This isn't a
problem for small programs but it is easy to lose track
of such things in hierarchies of nested functions in
large programs. In other words, strtok() breaks
encapsulation.
 
R

Richard Heathfield

Christopher Benson-Manica said:
[comp.lang.c] Richard Heathfield said:
No, you don't need to do this in two passes. You can build the array
dynamically, extending it as and when required.

I would think, though, that the work of traversing the string twice
would be less than the probable extra calls to realloc() - would you
agree?

A two-pass algorithm lacks elegance (although of course we might do it in
two passes anyway, for any of various reasons). I would certainly be
swayed in the direction of single-pass if I had reason to think that the
inputs were likely to be colossal (or volatile).

<snip>
 
S

santosh

Christopher Benson-Manica said:
[comp.lang.c] Richard Heathfield said:
No, you don't need to do this in two passes. You can build the array
dynamically, extending it as and when required.

I would think, though, that the work of traversing the string twice
would be less than the probable extra calls to realloc() - would you
agree?

A two-pass algorithm lacks elegance (although of course we might do it
in two passes anyway, for any of various reasons). I would certainly
be swayed in the direction of single-pass if I had reason to think
that the inputs were likely to be colossal (or volatile).

If the input is volatile, (ignoring for the moment the likelyhood of
this), I think even a one-pass algorithm won't be guaranteed to produce
meaningful results.
 
R

Richard Heathfield

santosh said:
On Wednesday 07 Nov 2007 12:10 am Richard Heathfield


If the input is volatile, (ignoring for the moment the likelyhood of
this), I think even a one-pass algorithm won't be guaranteed to produce
meaningful results.

Well, evidently 'volatile' was a poor word choice, given the existence of
the C keyword of that name! To clarify: I'm thinking, for example, of data
coming in over stdin or a network connection, where you can't simply "play
back" the data whenever you like. (Also, it was a pretty lousy
parenthetical comment for another reason: it widens the discussion away
from mere strings, without saying so.)
 
S

santosh

santosh said:


Well, evidently 'volatile' was a poor word choice, given the existence
of the C keyword of that name! To clarify: I'm thinking, for example,
of data coming in over stdin or a network connection, where you can't
simply "play back" the data whenever you like. (Also, it was a pretty
lousy parenthetical comment for another reason: it widens the
discussion away from mere strings, without saying so.)

OK. That makes sense. Yes, a two pass algorithm is not applicable unless
you manage to buffer the data.
 
R

Richard Tobin

No, you don't need to do this in two passes. You can build the array
dynamically, extending it as and when required.
[/QUOTE]
I would think, though, that the work of traversing the string twice
would be less than the probable extra calls to realloc() - would you
agree?

Not necessarily. You only call realloc() at most once per token,
rather than once per character. And most implementations don't
allocate the exact size requested, so most calls to realloc() are
likely to be no-ops with a bit of overhead, even if you call realloc()
for each token.

-- Richard
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,774
Messages
2,569,599
Members
45,175
Latest member
Vinay Kumar_ Nevatia
Top