Python3.0 has more duplication in source code than Python2.5

Discussion in 'Python' started by Terry, Feb 7, 2009.

  1. Terry

    Terry Guest

    I used CPD (the copy/paste detector in PMD) to analyze code
    duplication in the Python source code. I found that Python3.0 contains
    more duplicated code than the previous versions. The CPD tool is far
    from perfect, but I still feel the analysis makes some sense.

    Source Code     | NLOC  | Dup60 | Dup30 | Rate60 | Rate30
    Python1.5(Core) | 19418 | 1072  | 3023  | 6%     | 16%
    Python2.5(Core) | 35797 | 1656  | 6441  | 5%     | 18%
    Python3.0(Core) | 40737 | 3460  | 9076  | 8%     | 22%
    Apache(server)  | 18693 | 1114  | 2553  | 6%     | 14%

    NLOC: net lines of code
    Dup60: lines of code in which 60 consecutive tokens are duplicated
    elsewhere in the code (counted twice or more)
    Dup30: the same, with 30 consecutive tokens
    Rate60: Dup60/NLOC
    Rate30: Dup30/NLOC
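
As a quick check of the definitions above, the two rates can be recomputed from the NLOC, Dup60 and Dup30 figures in the table. This is only a small sketch: the name `duplication_rates` and the `STATS` dictionary are made up for illustration, and the numbers are copied from the post.

```python
# Figures copied from the table in the post: (NLOC, Dup60, Dup30).
STATS = {
    "Python1.5(Core)": (19418, 1072, 3023),
    "Python2.5(Core)": (35797, 1656, 6441),
    "Python3.0(Core)": (40737, 3460, 9076),
    "Apache(server)": (18693, 1114, 2553),
}


def duplication_rates(nloc, dup60, dup30):
    """Return (Rate60, Rate30) as fractions of NLOC."""
    return dup60 / nloc, dup30 / nloc


if __name__ == "__main__":
    for name, (nloc, dup60, dup30) in STATS.items():
        rate60, rate30 = duplication_rates(nloc, dup60, dup30)
        # Print each rate as a percentage with one decimal place.
        print(f"{name:<16} Rate60={rate60:.1%}  Rate30={rate30:.1%}")
```

Rounded to whole percentages, the results agree with the Rate60 and Rate30 columns of the table.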

    We can see that the duplication rate tends to be stable across code
    bases, but Python3.0's rate is somewhat higher. Considering the small
    increase in NLOC, the duplication rate of Python3.0 might be too high.

    Does that say something about the code quality of Python3.0?
     
    Terry, Feb 7, 2009
    #1

  2. > Does that say something about the code quality of Python3.0?

    Not necessarily. IIUC, copying a single file with 2000 lines
    completely could already account for that increase.

    It would be interesting to see what specific files have gained
    large numbers of additional files, compared to 2.5.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 7, 2009
    #2

  3. Terry

    Terry Guest

    On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > > Does that say something about the code quality of Python3.0?

    >
    > Not necessarily. IIUC, copying a single file with 2000 lines
    > completely could already account for that increase.
    >
    > It would be interesting to see what specific files have gained
    > large numbers of additional files, compared to 2.5.
    >
    > Regards,
    > Martin


    But the duplications are usually not very big, ranging from about 100
    lines (rare) to fewer than 5 lines. As you can see, Rate30 is much
    bigger than Rate60, which means there are a lot of small duplications.
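
To illustrate why a lower token threshold reports more (and smaller) duplications, here is a minimal sketch of a token-window duplicate finder. It is not PMD's CPD algorithm: the crude regex tokenizer and the names `tokenize` and `duplicated_windows` are assumptions made for this example only.

```python
import re
from collections import defaultdict

# Very rough tokenizer (an assumption for this sketch): identifiers,
# integers, and any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def duplicated_windows(source, min_tokens):
    """Map each run of min_tokens consecutive tokens that occurs more
    than once to the list of token offsets where it starts."""
    tokens = tokenize(source)
    seen = defaultdict(list)
    for start in range(len(tokens) - min_tokens + 1):
        window = tuple(tokens[start:start + min_tokens])
        seen[window].append(start)
    return {w: starts for w, starts in seen.items() if len(starts) > 1}


if __name__ == "__main__":
    snippet = "Py_DECREF(v); return NULL; } Py_DECREF(v); return NULL; }"
    for threshold in (3, 9, 10):
        hits = duplicated_windows(snippet, threshold)
        print(threshold, "tokens:", len(hits), "duplicated windows")
```

With a small window, short idioms such as an error-cleanup tail already count as duplicates; with a larger window, only longer identical runs are reported.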
     
    Terry, Feb 7, 2009
    #3
  4. Terry wrote:
    > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    >>> Does that say something about the code quality of Python3.0?

    >> Not necessarily. IIUC, copying a single file with 2000 lines
    >> completely could already account for that increase.
    >>
    >> It would be interesting to see what specific files have gained
    >> large numbers of additional files, compared to 2.5.
    >>
    >> Regards,
    >> Martin

    >
    > But the duplication are always not very big, from about 100 lines
    > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > than Rate60, that means there are a lot of small duplications.


    Do you by any chance have a few examples of these? There is a lot of
    idiomatic code in Python, e.g. to acquire and release the GIL or to do
    refcount stuff. If that happens to be done with rather generic names as
    arguments, I can well imagine that being the cause.

    Diez
     
    Diez B. Roggisch, Feb 7, 2009
    #4
  5. Terry

    Terry Guest

    On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > Terry wrote:
    >
    > > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > >>> Does that say something about the code quality of Python3.0?
    > >> Not necessarily. IIUC, copying a single file with 2000 lines
    > >> completely could already account for that increase.

    >
    > >> It would be interesting to see what specific files have gained
    > >> large numbers of additional files, compared to 2.5.

    >
    > >> Regards,
    > >> Martin

    >
    > > But the duplication are always not very big, from about 100 lines
    > > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > > than Rate60, that means there are a lot of small duplications.

    >
    > Do you by any chance have a few examples of these? There is a lot of
    > idiomatic code in python to e.g. acquire and release the GIL or doing
    > refcount-stuff. If that happens to be done with rather generic names as
    > arguments, I can well imagine that as being the cause.
    >
    > Diez


    Example 1:
    Found a 64 line (153 tokens) duplication in the following files:
    Starting at line 73 of D:\DOWNLOADS\Python-3.0\Python\thread_pth.h
    Starting at line 222 of D:\DOWNLOADS\Python-3.0\Python\thread_pthread.h

        return (long) threadid;
    #else
        return (long) *(long *) &threadid;
    #endif
    }

    static void
    do_PyThread_exit_thread(int no_cleanup)
    {
        dprintf(("PyThread_exit_thread called\n"));
        if (!initialized) {
            if (no_cleanup)
                _exit(0);
            else
                exit(0);
        }
    }

    void
    PyThread_exit_thread(void)
    {
        do_PyThread_exit_thread(0);
    }

    void
    PyThread__exit_thread(void)
    {
        do_PyThread_exit_thread(1);
    }

    #ifndef NO_EXIT_PROG
    static void
    do_PyThread_exit_prog(int status, int no_cleanup)
    {
        dprintf(("PyThread_exit_prog(%d) called\n", status));
        if (!initialized)
            if (no_cleanup)
                _exit(status);
            else
                exit(status);
    }

    void
    PyThread_exit_prog(int status)
    {
        do_PyThread_exit_prog(status, 0);
    }

    void
    PyThread__exit_prog(int status)
    {
        do_PyThread_exit_prog(status, 1);
    }
    #endif /* NO_EXIT_PROG */

    #ifdef USE_SEMAPHORES

    /*
     * Lock support.
     */

    PyThread_type_lock
    PyThread_allocate_lock(void)
    {
     
    Terry, Feb 7, 2009
    #5
  6. Terry

    Terry Guest

    On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > Terry wrote:
    >
    > > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > >>> Does that say something about the code quality of Python3.0?
    > >> Not necessarily. IIUC, copying a single file with 2000 lines
    > >> completely could already account for that increase.

    >
    > >> It would be interesting to see what specific files have gained
    > >> large numbers of additional files, compared to 2.5.

    >
    > >> Regards,
    > >> Martin

    >
    > > But the duplication are always not very big, from about 100 lines
    > > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > > than Rate60, that means there are a lot of small duplications.

    >
    > Do you by any chance have a few examples of these? There is a lot of
    > idiomatic code in python to e.g. acquire and release the GIL or doing
    > refcount-stuff. If that happens to be done with rather generic names as
    > arguments, I can well imagine that as being the cause.
    >
    > Diez


    Example 2:
    Found a 16 line (106 tokens) duplication in the following files:
    Starting at line 4970 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
    Starting at line 5015 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
    Starting at line 5073 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
    Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

            PyErr_Format(PyExc_TypeError,
                "GeneratorExp field \"generators\" must be a list, not a %.200s",
                tmp->ob_type->tp_name);
            goto failed;
        }
        len = PyList_GET_SIZE(tmp);
        generators = asdl_seq_new(len, arena);
        if (generators == NULL) goto failed;
        for (i = 0; i < len; i++) {
            comprehension_ty value;
            res = obj2ast_comprehension(PyList_GET_ITEM(tmp, i), &value, arena);
            if (res != 0) goto failed;
            asdl_seq_SET(generators, i, value);
        }
        Py_XDECREF(tmp);
        tmp = NULL;
    } else {
        PyErr_SetString(PyExc_TypeError,
            "required field \"generators\" missing from GeneratorExp");
     
    Terry, Feb 7, 2009
    #6
  7. Terry

    Terry Guest

    On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > Terry wrote:
    >
    > > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > >>> Does that say something about the code quality of Python3.0?
    > >> Not necessarily. IIUC, copying a single file with 2000 lines
    > >> completely could already account for that increase.

    >
    > >> It would be interesting to see what specific files have gained
    > >> large numbers of additional files, compared to 2.5.

    >
    > >> Regards,
    > >> Martin

    >
    > > But the duplication are always not very big, from about 100 lines
    > > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > > than Rate60, that means there are a lot of small duplications.

    >
    > Do you by any chance have a few examples of these? There is a lot of
    > idiomatic code in python to e.g. acquire and release the GIL or doing
    > refcount-stuff. If that happens to be done with rather generic names as
    > arguments, I can well imagine that as being the cause.
    >
    > Diez


    Example of a small one (61 tokens duplicated):
    Found a 19 line (61 tokens) duplication in the following files:
    Starting at line 132 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c
    Starting at line 179 of D:\DOWNLOADS\Python-3.0\Python\modsupport.c

            PyTuple_SET_ITEM(v, i, w);
        }
        if (itemfailed) {
            /* do_mkvalue() should have already set an error */
            Py_DECREF(v);
            return NULL;
        }
        if (**p_format != endchar) {
            Py_DECREF(v);
            PyErr_SetString(PyExc_SystemError,
                            "Unmatched paren in format");
            return NULL;
        }
        if (endchar)
            ++*p_format;
        return v;
    }

    static PyObject *
     
    Terry, Feb 7, 2009
    #7
  8. Terry

    Terry Guest

    On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > Terry wrote:
    >
    > > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > >>> Does that say something about the code quality of Python3.0?
    > >> Not necessarily. IIUC, copying a single file with 2000 lines
    > >> completely could already account for that increase.

    >
    > >> It would be interesting to see what specific files have gained
    > >> large numbers of additional files, compared to 2.5.

    >
    > >> Regards,
    > >> Martin

    >
    > > But the duplication are always not very big, from about 100 lines
    > > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > > than Rate60, that means there are a lot of small duplications.

    >
    > Do you by any chance have a few examples of these? There is a lot of
    > idiomatic code in python to e.g. acquire and release the GIL or doing
    > refcount-stuff. If that happens to be done with rather generic names as
    > arguments, I can well imagine that as being the cause.
    >
    > Diez


    Example of an even smaller one (30 tokens duplicated):
    Found a 11 line (30 tokens) duplication in the following files:
    Starting at line 2551 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c
    Starting at line 3173 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

        if (PyObject_SetAttrString(result, "ifs", value) == -1)
            goto failed;
        Py_DECREF(value);
        return result;
    failed:
        Py_XDECREF(value);
        Py_XDECREF(result);
        return NULL;
    }

    PyObject*
     
    Terry, Feb 7, 2009
    #8
  9. Terry

    Terry Guest

    On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > Terry wrote:
    >
    > > On Feb 7, 3:36 pm, "Martin v. Löwis" <> wrote:
    > >>> Does that say something about the code quality of Python3.0?
    > >> Not necessarily. IIUC, copying a single file with 2000 lines
    > >> completely could already account for that increase.

    >
    > >> It would be interesting to see what specific files have gained
    > >> large numbers of additional files, compared to 2.5.

    >
    > >> Regards,
    > >> Martin

    >
    > > But the duplication are always not very big, from about 100 lines
    > > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > > than Rate60, that means there are a lot of small duplications.

    >
    > Do you by any chance have a few examples of these? There is a lot of
    > idiomatic code in python to e.g. acquire and release the GIL or doing
    > refcount-stuff. If that happens to be done with rather generic names as
    > arguments, I can well imagine that as being the cause.
    >
    > Diez


    And I'm not saying that you cannot have duplication in code. But it
    seems that stable and successful software releases tend to have a
    relatively stable duplication rate.
     
    Terry, Feb 7, 2009
    #9
  10. Terry <terry.yinzhe <at> gmail.com> writes:
    > On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > > Do you by any chance have a few examples of these? There is a lot of
    > > idiomatic code in python to e.g. acquire and release the GIL or doing
    > > refcount-stuff. If that happens to be done with rather generic names as
    > > arguments, I can well imagine that as being the cause.


    > Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c


    This isn't really fair because Python-ast.c is auto generated. ;)
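
One way to act on this point is to drop generated sources from the per-file duplication counts before computing any rate. The following is a hedged sketch: the `GENERATED_FILES` set and both function names are hypothetical, and the only generated file named in this thread is Python-ast.c.

```python
from pathlib import PureWindowsPath

# Assumption: Python-ast.c is the only generated file mentioned in this
# thread; extend the set if other generated sources are known.
GENERATED_FILES = {"Python-ast.c"}


def is_generated(path):
    """True if the file name of path is listed as generated."""
    # PureWindowsPath accepts both backslash and forward-slash separators,
    # matching the D:\... paths in the CPD reports quoted earlier.
    return PureWindowsPath(path).name in GENERATED_FILES


def drop_generated(dup_lines_by_file):
    """Return a copy of {path: duplicated_line_count} without generated files."""
    return {
        path: count
        for path, count in dup_lines_by_file.items()
        if not is_generated(path)
    }
```

Summing the values of the filtered mapping and dividing by the NLOC of the remaining files would give a rate that ignores generated code.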
     
    Benjamin Peterson, Feb 7, 2009
    #10
  11. > And I'm not saying that you can not have duplication in code. But it
    > seems that the stable & successful software releases tend to have
    > relatively stable duplication rate.


    So if some software has an unstable duplication rate, it probably
    means that it is either not stable, or not successful.

    In the case of Python 3.0, it's fairly obvious which one it is:
    it's not stable. Indeed, Python 3.0 is a significant change from
    Python 2.x. Of course, anybody following the Python 3 development
    process could have told you so even without any code metrics.

    I still find the raw numbers fairly useless. What matters more to
    me is what specific code duplications have been added. Furthermore,
    your Dup30 classification is not important to me, but I'm rather
    after the nearly 2000 new chunks of code that have more than 60
    consecutive tokens duplicated.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 7, 2009
    #11
  12. > But the duplication are always not very big, from about 100 lines
    > (rare) to less the 5 lines. As you can see the Rate30 is much bigger
    > than Rate60, that means there are a lot of small duplications.


    I don't find that important for code quality. It's the large chunks
    that I would like to see de-duplicated (unless, of course, they are
    in generated code, in which case I couldn't care less).

    Unfortunately, none of the examples you have posted so far are
    - large chunks, and
    - new in 3.0.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 7, 2009
    #12
  13. -On [20090207 18:25], Scott David Daniels () wrote:
    >This analysis overlooks the fact that 3.0 _was_ a major change, and is
    >likely to grow cut-and-paste solutions to some problems as we switch to
    >Unicode strings from byte strings.


    You'd best hope the copied section was thoroughly reviewed otherwise you're
    duplicating a flaw across X other sections. And then you also best hope that
    whoever finds said flaw and fixes it is also smart enough to check for
    similar constructs around the code base.

    --
    Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
    イェルーン ラウフロック ヴァン デル ウェルヴェン
    http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
    Earth to earth, ashes to ashes, dust to dust...
     
    Jeroen Ruigrok van der Werven, Feb 7, 2009
    #13
  14. Steve Holden

    Steve Holden Guest

    Jeroen Ruigrok van der Werven wrote:
    > -On [20090207 18:25], Scott David Daniels () wrote:
    >> This analysis overlooks the fact that 3.0 _was_ a major change, and is
    >> likely to grow cut-and-paste solutions to some problems as we switch to
    >> Unicode strings from byte strings.

    >
    > You'd best hope the copied section was thoroughly reviewed otherwise you're
    > duplicating a flaw across X other sections. And then you also best hope that
    > whoever finds said flaw and fixes it is also smart enough to check for
    > similar constructs around the code base.
    >

    This is probably preferable to five different developers solving the
    same problem five different ways and introducing three *different* bugs, no?

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
     
    Steve Holden, Feb 7, 2009
    #14
  15. > This is probably preferable to five different developers solving the
    > same problem five different ways and introducing three *different* bugs, no?


    With the examples presented, I'm not convinced that there is actually
    significant code duplication going on in the first place.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 7, 2009
    #15
  16. -On [20090207 21:07], Steve Holden () wrote:
    >This is probably preferable to five different developers solving the
    >same problem five different ways and introducing three *different* bugs, no?


    I guess the answer would be 'that depends', but in most cases you would be
    correct, yes.

    --
    Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
    イェルーン ラウフロック ヴァン デル ウェルヴェン
    http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
    Earth to earth, ashes to ashes, dust to dust...
     
    Jeroen Ruigrok van der Werven, Feb 7, 2009
    #16
  17. > yet the general tone of the responses has been more defensive than i would
    > have expected. i don't really understand why. nothing really terrible,
    > given the extremes you get on the net in general, but still a little
    > disappointing.


    I think this is fairly easy to explain. The OP closes with the question
    "Does that say something about the code quality of Python3.0?"
    thus suggesting that the quality of Python 3 is poor.

    Nobody likes to hear that the quality of his work is poor. He then goes
    on to say

    "But it seems that the stable & successful software releases tend to
    have relatively stable duplication rate."

    suggesting that Python 3.0 cannot be successful, because it doesn't have
    a relatively stable duplication rate.

    Nobody likes to hear that a project one has put many months into cannot
    be successful.

    Hence the defensive responses.

    > i'm not saying there is such a solution. i'm not even saying that there
    > is certainly a problem. i'm just making the quiet observation that the
    > original information is interesting, might be useful, and should be
    > welcomed.


    The information is interesting. I question whether it is useful as-is,
    as it doesn't tell me *what* code got duplicated (and it seems it is
    also incorrect, since it includes analysis of generated code). While I
    can welcome the information, I cannot welcome the conclusion that the
    OP apparently draws from it.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 7, 2009
    #17
  18. Terry

    Terry Guest

    On Feb 8, 12:20 am, Benjamin Peterson <> wrote:
    > Terry <terry.yinzhe <at> gmail.com> writes:
    >
    > > On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > > > Do you by any chance have a few examples of these? There is a lot of
    > > > idiomatic code in python to e.g. acquire and release the GIL or doing
    > > > refcount-stuff. If that happens to be done with rather generic names as
    > > > arguments, I can well imagine that as being the cause.

    > > Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

    >
    > This isn't really fair because Python-ast.c is auto generated. ;)


    Oops! I didn't know that! Then the analysis is not valid, since
    too many of the duplications come from there.
     
    Terry, Feb 8, 2009
    #18
  19. Terry

    Terry Guest

    On Feb 8, 8:51 am, Terry <> wrote:
    > On Feb 8, 12:20 am, Benjamin Peterson <> wrote:
    >
    > > Terry <terry.yinzhe <at> gmail.com> writes:

    >
    > > > On Feb 7, 7:10 pm, "Diez B. Roggisch" <> wrote:
    > > > > Do you by any chance have a few examples of these? There is a lot of
    > > > > idiomatic code in python to e.g. acquire and release the GIL or doing
    > > > > refcount-stuff. If that happens to be done with rather generic names as
    > > > > arguments, I can well imagine that as being the cause.
    > > > Starting at line 5119 of D:\DOWNLOADS\Python-3.0\Python\Python-ast.c

    >
    > > This isn't really fair because Python-ast.c is auto generated. ;)

    >
    > Oops! I don't know that! Then the analysis will not be valid, since
    > too many duplications are from there.


    Hey!

    I have to apologize, because I found that I made a mistake. Because
    Python-ast.c is auto-generated and shouldn't be counted here, the
    correct duplication rate of Python3.0 is very small (5%).
    And I found that the duplications are quite trivial. I would not say
    that all of them are acceptable, but they are certainly not strong
    enough evidence about code quality.

    I have run the same analysis on some commercial source code; there
    the Dup60 rate is quite often significantly larger than 15%.
     
    Terry, Feb 8, 2009
    #19
  20. On Sun, Feb 8, 2009 at 9:12 AM, Terry <> wrote:

    >> I have made the same analysis to some commercial source code, the
    >> dup60 rate is quite often significantly larger than 15%.


    On Sun, 08 Feb 2009 07:10:12 -0200, Henry Read <> wrote:

    > I don't think code duplication rate has strong relationship towards code
    > quality.


    Not directly; but large chunks of identical code repeated in many places
    aren't a good sign. I'd ask myself: are all of them equally tested?
    What if someone fixes a bug - will the change be propagated everywhere?
    Should the code be refactored?


    --
    Gabriel Genellina
     
    Gabriel Genellina, Feb 8, 2009
    #20
