Python 3.0 has more duplication in its source code than Python 2.5


Terry

I used CPD (the copy/paste detector that comes with PMD) to analyze code
duplication in the Python source code. I found that Python 3.0 contains
more duplicated code than the previous versions. The CPD tool is far
from perfect, but I still feel the analysis makes some sense.

Source Code        |  NLOC | Dup60 | Dup30 | Rate60 | Rate30
Python1.5 (Core)   | 19418 |  1072 |  3023 |     6% |    16%
Python2.5 (Core)   | 35797 |  1656 |  6441 |     5% |    18%
Python3.0 (Core)   | 40737 |  3460 |  9076 |     8% |    22%
Apache (server)    | 18693 |  1114 |  2553 |     6% |    14%

NLOC: net lines of code
Dup60: lines of code that are part of a run of at least 60 consecutive
tokens that also appears elsewhere (i.e. the run occurs twice or more)
Dup30: the same, with a 30-token threshold
Rate60: Dup60/NLOC
Rate30: Dup30/NLOC
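
The last two columns are just those ratios; as a quick sanity check, the
small C sketch below (raw counts copied from the table) reproduces them:

#include <stdio.h>

/* Recompute Rate60 and Rate30 from the raw counts in the table above:
   Rate60 = Dup60 / NLOC, Rate30 = Dup30 / NLOC. */
struct row { const char *name; int nloc, dup60, dup30; };

int main(void)
{
    static const struct row rows[] = {
        { "Python1.5 (Core)", 19418, 1072, 3023 },
        { "Python2.5 (Core)", 35797, 1656, 6441 },
        { "Python3.0 (Core)", 40737, 3460, 9076 },
        { "Apache (server)",  18693, 1114, 2553 },
    };
    size_t i;

    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-18s Rate60 = %4.1f%%  Rate30 = %4.1f%%\n",
               rows[i].name,
               100.0 * rows[i].dup60 / rows[i].nloc,
               100.0 * rows[i].dup30 / rows[i].nloc);
    return 0;
}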

We can see that the duplication rate tends to be fairly stable across
these code bases, but Python 3.0 is noticeably higher. Considering the
relatively small increase in NLOC, the duplication rate of Python 3.0
looks too high.

Does that say something about the code quality of Python3.0?
 

Martin v. Löwis

> Does that say something about the code quality of Python3.0?

Not necessarily. IIUC, copying a single file with 2000 lines
completely could already account for that increase (Dup60 grew by
3460 - 1656 = 1804 lines from 2.5 to 3.0, roughly the size of one
such file).

It would be interesting to see which specific files have gained
large numbers of additional duplicated lines, compared to 2.5.

Regards,
Martin
 

Terry

> Not necessarily. IIUC, copying a single file with 2000 lines
> completely could already account for that increase.
>
> It would be interesting to see which specific files have gained
> large numbers of additional duplicated lines, compared to 2.5.
>
> Regards,
> Martin

But the duplications are mostly not very big, ranging from about 100
lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
bigger than Rate60, which means there are a lot of small duplications.
 

Diez B. Roggisch

Terry said:
> But the duplications are mostly not very big, ranging from about 100
> lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
> bigger than Rate60, which means there are a lot of small duplications.

Do you by any chance have a few examples of these? There is a lot of
idiomatic code in Python to e.g. acquire and release the GIL or do
refcount stuff. If that happens to be done with rather generic names as
arguments, I can well imagine that being the cause.
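
Purely for illustration (this is not code from the CPython sources, and
the function and variable names here are made up), the kind of
boilerplate I mean looks roughly like this:

/* Illustrative sketch only, not taken from CPython.  The shape of this,
   parse arguments, release the GIL around a blocking call, re-acquire
   it, and clean up on every error path, repeats all over the
   interpreter with only the names changing, which is just what a
   token-based detector reports as duplication. */
#include <Python.h>

static PyObject *
example_method(PyObject *self, PyObject *args)
{
    PyObject *result;
    int fd;

    if (!PyArg_ParseTuple(args, "i", &fd))
        return NULL;

    Py_BEGIN_ALLOW_THREADS      /* release the GIL around blocking work */
    /* ... some blocking C call using fd ... */
    Py_END_ALLOW_THREADS        /* re-acquire the GIL */

    result = PyLong_FromLong(0);
    if (result == NULL)
        return NULL;            /* error already set by PyLong_FromLong */
    return result;
}

Two such functions can differ in only a handful of tokens, so CPD may
count them as copies even though nobody actually pasted anything.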

Diez
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example 1:
Found a 64 line (153 tokens) duplication in the following files:
Starting at line 73 of D:\DOWNLOADS\Python-3.0\Python\thread_pth.h
Starting at line 222 of D:\DOWNLOADS\Python-3.0\Python\thread_pthread.h

    return (long) threadid;
#else
    return (long) *(long *) &threadid;
#endif
}

static void
do_PyThread_exit_thread(int no_cleanup)
{
    dprintf(("PyThread_exit_thread called\n"));
    if (!initialized) {
        if (no_cleanup)
            _exit(0);
        else
            exit(0);
    }
}

void
PyThread_exit_thread(void)
{
    do_PyThread_exit_thread(0);
}

void
PyThread__exit_thread(void)
{
    do_PyThread_exit_thread(1);
}

#ifndef NO_EXIT_PROG
static void
do_PyThread_exit_prog(int status, int no_cleanup)
{
    dprintf(("PyThread_exit_prog(%d) called\n", status));
    if (!initialized)
        if (no_cleanup)
            _exit(status);
        else
            exit(status);
}

void
PyThread_exit_prog(int status)
{
    do_PyThread_exit_prog(status, 0);
}

void
PyThread__exit_prog(int status)
{
    do_PyThread_exit_prog(status, 1);
}
#endif /* NO_EXIT_PROG */

#ifdef USE_SEMAPHORES

/*
 * Lock support.
 */

PyThread_type_lock
PyThread_allocate_lock(void)
{
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example 2:
Found a 16 line (106 tokens) duplication in the following files:
Starting at line 4970 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5015 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5073 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

        PyErr_Format(PyExc_TypeError,
                     "GeneratorExp field \"generators\" must be a list, not a %.200s",
                     tmp->ob_type->tp_name);
        goto failed;
    }
    len = PyList_GET_SIZE(tmp);
    generators = asdl_seq_new(len, arena);
    if (generators == NULL) goto failed;
    for (i = 0; i < len; i++) {
        comprehension_ty value;
        res = obj2ast_comprehension(PyList_GET_ITEM(tmp, i), &value, arena);
        if (res != 0) goto failed;
        asdl_seq_SET(generators, i, value);
    }
    Py_XDECREF(tmp);
    tmp = NULL;
} else {
    PyErr_SetString(PyExc_TypeError,
                    "required field \"generators\" missing from GeneratorExp");
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example of a small one (61 tokens duplicated):
Found a 19 line (61 tokens) duplication in the following files:
Starting at line 132 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c
Starting at line 179 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c

        PyTuple_SET_ITEM(v, i, w);
    }
    if (itemfailed) {
        /* do_mkvalue() should have already set an error */
        Py_DECREF(v);
        return NULL;
    }
    if (**p_format != endchar) {
        Py_DECREF(v);
        PyErr_SetString(PyExc_SystemError,
                        "Unmatched paren in format");
        return NULL;
    }
    if (endchar)
        ++*p_format;
    return v;
}

static PyObject *
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

Example of an even smaller one (30 tokens duplicated):
Found a 11 line (30 tokens) duplication in the following files:
Starting at line 2551 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
Starting at line 3173 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

    if (PyObject_SetAttrString(result, "ifs", value) == -1)
        goto failed;
    Py_DECREF(value);
    return result;
failed:
    Py_XDECREF(value);
    Py_XDECREF(result);
    return NULL;
}

PyObject*
 

Terry

> Do you by any chance have a few examples of these? There is a lot of
> idiomatic code in Python to e.g. acquire and release the GIL or do
> refcount stuff. If that happens to be done with rather generic names as
> arguments, I can well imagine that being the cause.
>
> Diez

And I'm not saying that you cannot have any duplication in code. But it
seems that stable & successful software releases tend to have a
relatively stable duplication rate.
 

Martin v. Löwis

> And I'm not saying that you cannot have any duplication in code. But it
> seems that stable & successful software releases tend to have a
> relatively stable duplication rate.

So if some software has an unstable duplication rate, it probably
means that it is either not stable, or not successful.

In the case of Python 3.0, it's fairly obvious which one it is:
it's not stable. Indeed, Python 3.0 is a significant change from
Python 2.x. Of course, anybody following the Python 3 development
process could have told you so even without any code metrics.

I still find the raw numbers fairly useless. What matters more to
me is which specific code duplications have been added. Furthermore,
your Dup30 classification is not that important to me; I'm rather
after the nearly 2000 new lines of code that have more than 60
consecutive tokens duplicated.

Regards,
Martin
 

Martin v. Löwis

> But the duplications are mostly not very big, ranging from about 100
> lines (rare) down to fewer than 5 lines. As you can see, Rate30 is much
> bigger than Rate60, which means there are a lot of small duplications.

I don't find that important for code quality. It's the large chunks
that I would like to see de-duplicated (unless, of course, they are
in generated code, in which case I couldn't care less).

Unfortunately, none of the examples you have posted so far are
- large chunks, and
- new in 3.0.

Regards,
Martin
 

Jeroen Ruigrok van der Werven

-On [20090207 18:25] said:
> This analysis overlooks the fact that 3.0 _was_ a major change, and is
> likely to grow cut-and-paste solutions to some problems as we switch to
> Unicode strings from byte strings.

You'd best hope the copied section was thoroughly reviewed, otherwise you're
duplicating a flaw across X other sections. And then you'd also best hope that
whoever finds said flaw and fixes it is also smart enough to check for
similar constructs around the code base.
 

Steve Holden

Jeroen said:
> -On [20090207 18:25] said:
>> This analysis overlooks the fact that 3.0 _was_ a major change, and is
>> likely to grow cut-and-paste solutions to some problems as we switch to
>> Unicode strings from byte strings.
>
> You'd best hope the copied section was thoroughly reviewed, otherwise you're
> duplicating a flaw across X other sections. And then you'd also best hope that
> whoever finds said flaw and fixes it is also smart enough to check for
> similar constructs around the code base.

This is probably preferable to five different developers solving the
same problem five different ways and introducing three *different* bugs, no?

regards
Steve
 

Martin v. Löwis

> This is probably preferable to five different developers solving the
> same problem five different ways and introducing three *different* bugs, no?

With the examples presented, I'm not convinced that there is actually
significant code duplication going on in the first place.

Regards,
Martin
 

Jeroen Ruigrok van der Werven

-On [20090207 21:07] said:
> This is probably preferable to five different developers solving the
> same problem five different ways and introducing three *different* bugs, no?

I guess the answer would be 'that depends', but in most cases you would be
correct, yes.
 

Martin v. Löwis

> yet the general tone of the responses has been more defensive than i would
> have expected. i don't really understand why. nothing really terrible,
> given the extremes you get on the net in general, but still a little
> disappointing.

I think this is fairly easy to explain. The OP closes with the question
"Does that say something about the code quality of Python3.0?",
thus suggesting that the quality of Python 3 is poor.

Nobody likes to hear that the quality of his work is poor. He then goes
on to say

"But it seems that stable & successful software releases tend to
have a relatively stable duplication rate."

suggesting that Python 3.0 cannot be successful, because it doesn't have
a relatively stable duplication rate.

Nobody likes to hear that a project one has put many months into cannot
be successful.

Hence the defensive responses.

> i'm not saying there is such a solution. i'm not even saying that there
> is certainly a problem. i'm just making the quiet observation that the
> original information is interesting, might be useful, and should be
> welcomed.

The information is interesting. I question whether it is useful as-is,
as it doesn't tell me *what* code got duplicated (and it seems it is
also incorrect, since it includes analysis of generated code). While I
can welcome the information, I cannot welcome the conclusion that the
OP apparently draws from it.

Regards,
Martin
 

Terry

> This isn't really fair because Python-ast.c is auto generated. ;)

Oops! I didn't know that! Then the analysis is not valid, since
too many of the duplications come from there.
 

Terry

> Oops! I didn't know that! Then the analysis is not valid, since
> too many of the duplications come from there.

Hey!

I have to say sorry because I found I made a mistake. Since
Python-ast.c is auto-generated and shouldn't be counted here, the
corrected duplication rate of Python 3.0 is actually quite small (5%).
And I found the remaining duplications are quite trivial; I would not
say that all of them are acceptable, but they are certainly not strong
enough evidence about code quality.

I have run the same analysis on some commercial source code, and the
Dup60 rate there is quite often significantly larger than 15%.
 

Gabriel Genellina

> I don't think the code duplication rate has a strong relationship to
> code quality.

Not directly; but large chunks of identical code repeated in many places
aren't a good sign. I'd ask myself: are all of them equally tested?
What if someone fixes a bug - will the change be propagated everywhere?
Should the code be refactored?
 
