Python 3.0 has more duplication in its source code than Python 2.5


Terry

I used CPD (the copy/paste detector that comes with PMD) to analyze code
duplication in the Python source code. I found that Python 3.0 contains
more duplicated code than the previous versions. The CPD tool is far
from perfect, but I still feel the analysis makes some sense.

Source Code        |  NLOC | Dup60 | Dup30 | Rate60 | Rate30
Python1.5 (Core)   | 19418 |  1072 |  3023 |     6% |    16%
Python2.5 (Core)   | 35797 |  1656 |  6441 |     5% |    18%
Python3.0 (Core)   | 40737 |  3460 |  9076 |     8% |    22%
Apache (server)    | 18693 |  1114 |  2553 |     6% |    14%

NLOC: net lines of code
Dup60: lines of code that are part of a run of at least 60 consecutive
tokens that also appears elsewhere (i.e. the run occurs twice or more)
Dup30: the same, with a 30-token threshold
Rate60: Dup60/NLOC
Rate30: Dup30/NLOC
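
The last two columns are just those ratios; as a quick sanity check, the
small C sketch below (raw counts copied from the table) reproduces them:

#include <stdio.h>

/* Recompute Rate60 and Rate30 from the raw counts in the table above:
   Rate60 = Dup60 / NLOC, Rate30 = Dup30 / NLOC. */
struct row { const char *name; int nloc, dup60, dup30; };

int main(void)
{
    static const struct row rows[] = {
        { "Python1.5 (Core)", 19418, 1072, 3023 },
        { "Python2.5 (Core)", 35797, 1656, 6441 },
        { "Python3.0 (Core)", 40737, 3460, 9076 },
        { "Apache (server)",  18693, 1114, 2553 },
    };
    size_t i;

    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-18s Rate60 = %4.1f%%  Rate30 = %4.1f%%\n",
               rows[i].name,
               100.0 * rows[i].dup60 / rows[i].nloc,
               100.0 * rows[i].dup30 / rows[i].nloc);
    return 0;
}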

We can see that the duplication rate tends to be fairly stable across
these code bases, but Python 3.0 is noticeably higher. Considering the
relatively small increase in NLOC, the duplication rate of Python 3.0
looks too high.

Does that say something about the code quality of Python3.0?
 

Martin v. Löwis

> Does that say something about the code quality of Python3.0?

Not necessarily. IIUC, copying a single file with 2000 lines
completely could already account for that increase (Dup60 grew by
3460 - 1656 = 1804 lines from 2.5 to 3.0, roughly the size of one
such file).

It would be interesting to see which specific files have gained
large numbers of additional duplicated lines, compared to 2.5.

Regards,
Martin
 

Terry

> Not necessarily. IIUC, copying a single file with 2000 lines
> completely could already account for that increase.
>
> It would be interesting to see which specific files have gained
> large numbers of additional duplicated lines, compared to 2.5.
>
> Regards,
> Martin

But the duplications are mostly not very big, ranging from about 100
lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
bigger than Rate60, which means there are a lot of small duplications.
 

Diez B. Roggisch

Terry said:
> But the duplications are mostly not very big, ranging from about 100
> lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
> bigger than Rate60, which means there are a lot of small duplications.

Do you by any chance have a few examples of these? There is a lot of
idiomatic code in Python to e.g. acquire and release the GIL or do
refcount stuff. If that happens to be done with rather generic names as
arguments, I can well imagine that being the cause.
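
Purely for illustration (this is not code from the CPython sources, and
the function and variable names here are made up), the kind of
boilerplate I mean looks roughly like this:

/* Illustrative sketch only, not taken from CPython.  The shape of this,
   parse arguments, release the GIL around a blocking call, re-acquire
   it, and clean up on every error path, repeats all over the
   interpreter with only the names changing, which is just what a
   token-based detector reports as duplication. */
#include <Python.h>

static PyObject *
example_method(PyObject *self, PyObject *args)
{
    PyObject *result;
    int fd;

    if (!PyArg_ParseTuple(args, "i", &fd))
        return NULL;

    Py_BEGIN_ALLOW_THREADS      /* release the GIL around blocking work */
    /* ... some blocking C call using fd ... */
    Py_END_ALLOW_THREADS        /* re-acquire the GIL */

    result = PyLong_FromLong(0);
    if (result == NULL)
        return NULL;            /* error already set by PyLong_FromLong */
    return result;
}

Two such functions can differ in only a handful of tokens, so CPD may
count them as copies even though nobody actually pasted anything.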

Diez
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example 1:
Found a 64 line (153 tokens) duplication in the following files:
Starting at line 73 of D:\DOWNLOADS\Python-3.0\Python\thread_pth.h
Starting at line 222 of D:\DOWNLOADS\Python-3.0\Python\thread_pthread.h

    return (long) threadid;
#else
    return (long) *(long *) &threadid;
#endif
}

static void
do_PyThread_exit_thread(int no_cleanup)
{
    dprintf(("PyThread_exit_thread called\n"));
    if (!initialized) {
        if (no_cleanup)
            _exit(0);
        else
            exit(0);
    }
}

void
PyThread_exit_thread(void)
{
    do_PyThread_exit_thread(0);
}

void
PyThread__exit_thread(void)
{
    do_PyThread_exit_thread(1);
}

#ifndef NO_EXIT_PROG
static void
do_PyThread_exit_prog(int status, int no_cleanup)
{
    dprintf(("PyThread_exit_prog(%d) called\n", status));
    if (!initialized)
        if (no_cleanup)
            _exit(status);
        else
            exit(status);
}

void
PyThread_exit_prog(int status)
{
    do_PyThread_exit_prog(status, 0);
}

void
PyThread__exit_prog(int status)
{
    do_PyThread_exit_prog(status, 1);
}
#endif /* NO_EXIT_PROG */

#ifdef USE_SEMAPHORES

/*
 * Lock support.
 */

PyThread_type_lock
PyThread_allocate_lock(void)
{
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example 2:
Found a 16 line (106 tokens) duplication in the following files:
Starting at line 4970 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5015 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5073 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

        PyErr_Format(PyExc_TypeError,
                     "GeneratorExp field \"generators\" must be a list, not a %.200s",
                     tmp->ob_type->tp_name);
        goto failed;
    }
    len = PyList_GET_SIZE(tmp);
    generators = asdl_seq_new(len, arena);
    if (generators == NULL) goto failed;
    for (i = 0; i < len; i++) {
        comprehension_ty value;
        res = obj2ast_comprehension(PyList_GET_ITEM(tmp, i), &value, arena);
        if (res != 0) goto failed;
        asdl_seq_SET(generators, i, value);
    }
    Py_XDECREF(tmp);
    tmp = NULL;
} else {
    PyErr_SetString(PyExc_TypeError,
                    "required field \"generators\" missing from GeneratorExp");
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example of a small one (61 tokens duplicated):
Found a 19 line (61 tokens) duplication in the following files:
Starting at line 132 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c
Starting at line 179 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c

        PyTuple_SET_ITEM(v, i, w);
    }
    if (itemfailed) {
        /* do_mkvalue() should have already set an error */
        Py_DECREF(v);
        return NULL;
    }
    if (**p_format != endchar) {
        Py_DECREF(v);
        PyErr_SetString(PyExc_SystemError,
                        "Unmatched paren in format");
        return NULL;
    }
    if (endchar)
        ++*p_format;
    return v;
}

static PyObject *
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example of an even smaller one (30 tokens duplicated):
Found a 11 line (30 tokens) duplication in the following files:
Starting at line 2551 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 3173 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

    if (PyObject_SetAttrString(result, "ifs", value) == -1)
        goto failed;
    Py_DECREF(value);
    return result;
failed:
    Py_XDECREF(value);
    Py_XDECREF(result);
    return NULL;
}

PyObject*
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

And I'm not saying that you cannot have any duplication in code. But it
seems that stable & successful software releases tend to have a
relatively stable duplication rate.
 

Martin v. Löwis

> And I'm not saying that you cannot have any duplication in code. But it
> seems that stable & successful software releases tend to have a
> relatively stable duplication rate.

So if some software has an unstable duplication rate, it probably
means that it is either not stable, or not successful.

In the case of Python 3.0, it's fairly obvious which one it is:
it's not stable. Indeed, Python 3.0 is a significant change from
Python 2.x. Of course, anybody following the Python 3 development
process could have told you so even without any code metrics.

I still find the raw numbers fairly useless. What matters more to
me is which specific code duplications have been added. Furthermore,
your Dup30 classification is not that important to me; I'm rather
after the nearly 2000 new lines of code that have more than 60
consecutive tokens duplicated.

Regards,
Martin
 

Martin v. Löwis

> But the duplications are mostly not very big, ranging from about 100
> lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
> bigger than Rate60, which means there are a lot of small duplications.

I don't find that important for code quality. It's the large chunks
that I would like to see de-duplicated (unless, of course, they are
in generated code, in which case I couldn't care less).

Unfortunately, none of the examples you have posted so far are
- large chunks, and
- new in 3.0.

Regards,
Martin
 

Jeroen Ruigrok van der Werven

-On [20090207 18:25] said:
> This analysis overlooks the fact that 3.0 _was_ a major change, and is
> likely to grow cut-and-paste solutions to some problems as we switch to
> Unicode strings from byte strings.

You'd best hope the copied section was thoroughly reviewed, otherwise you're
duplicating a flaw across X other sections. And then you'd also best hope that
whoever finds said flaw and fixes it is also smart enough to check for
similar constructs around the code base.
 

Steve Holden

Jeroen said:
> -On [20090207 18:25] said:
>> This analysis overlooks the fact that 3.0 _was_ a major change, and is
>> likely to grow cut-and-paste solutions to some problems as we switch to
>> Unicode strings from byte strings.
>
> You'd best hope the copied section was thoroughly reviewed, otherwise you're
> duplicating a flaw across X other sections. And then you'd also best hope that
> whoever finds said flaw and fixes it is also smart enough to check for
> similar constructs around the code base.

This is probably preferable to five different developers solving the
same problem five different ways and introducing three *different* bugs, no?

regards
Steve
 

Martin v. Löwis

> This is probably preferable to five different developers solving the
> same problem five different ways and introducing three *different* bugs, no?

With the examples presented, I'm not convinced that there is actually
significant code duplication going on in the first place.

Regards,
Martin
 

Jeroen Ruigrok van der Werven

-On [20090207 21:07] said:
> This is probably preferable to five different developers solving the
> same problem five different ways and introducing three *different* bugs, no?

I guess the answer would be 'that depends', but in most cases you would be
correct, yes.
 

Martin v. Löwis

> yet the general tone of the responses has been more defensive than i would
> have expected. i don't really understand why. nothing really terrible,
> given the extremes you get on the net in general, but still a little
> disappointing.

I think this is fairly easy to explain. The OP closes with the question
"Does that say something about the code quality of Python3.0?",
thus suggesting that the quality of Python 3 is poor.

Nobody likes to hear that the quality of his work is poor. He then goes
on to say

"But it seems that stable & successful software releases tend to
have a relatively stable duplication rate."

suggesting that Python 3.0 cannot be successful, because it doesn't have
a relatively stable duplication rate.

Nobody likes to hear that a project one has put many months into cannot
be successful.

Hence the defensive responses.

> i'm not saying there is such a solution. i'm not even saying that there
> is certainly a problem. i'm just making the quiet observation that the
> original information is interesting, might be useful, and should be
> welcomed.

The information is interesting. I question whether it is useful as-is,
as it doesn't tell me *what* code got duplicated (and it seems it is
also incorrect, since it includes analysis of generated code). While I
can welcome the information, I cannot welcome the conclusion that the
OP apparently draws from it.

Regards,
Martin
 

Terry

> This isn't really fair because Python-ast.c is auto generated. ;)

Oops! I didn't know that! Then the analysis is not valid, since
too many of the duplications come from there.
 

Terry

> Oops! I didn't know that! Then the analysis is not valid, since
> too many of the duplications come from there.

Hey!

I have to say sorry because I found I made a mistake. Since
Python-ast.c is auto-generated and shouldn't be counted here, the
corrected duplication rate of Python 3.0 is actually quite small (5%).
And I found the remaining duplications are quite trivial; I would not
say that all of them are acceptable, but they are certainly not strong
enough evidence about code quality.

I have run the same analysis on some commercial source code, and the
Dup60 rate there is quite often significantly larger than 15%.
 

Gabriel Genellina

> I don't think the code duplication rate has a strong relationship to
> code quality.

Not directly; but large chunks of identical code repeated in many places
aren't a good sign. I'd ask myself: are all of them equally tested?
What if someone fixes a bug - will the change be propagated everywhere?
Should the code be refactored?
 
