Writing Python extensions in assembly


inhahe

Can anyone give me pointers/instructions/a template for writing a Python
extension in assembly (or better, HLA)?
 

Diez B. Roggisch

inhahe said:
Can anyone give me pointers/instructions/a template for writing a Python
extension in assembly (or better, HLA)?

You could write a C extension and embed assembly. See the docs for how
to write one. If you know how to implement a C-calling-convention-based
shared library in assembly (being an assembler guru, you surely know how
that works), you could mimic a C extension.
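
For reference, a minimal sketch of what such a C extension looks like (the
module and function names are made up for illustration, and this uses the
Python 2.x C API of the era; Python 3 uses PyModuleDef/PyInit_ instead):

#include <Python.h>

/* Hypothetical example function: adds two integers.
   A real extension would put the hot loop here. */
static PyObject *
fastmod_add(PyObject *self, PyObject *args)
{
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyInt_FromLong(a + b);
}

static PyMethodDef FastmodMethods[] = {
    {"add", fastmod_add, METH_VARARGS, "Add two integers."},
    {NULL, NULL, 0, NULL}
};

/* Python 2.x module initialisation. */
PyMODINIT_FUNC
initfastmod(void)
{
    Py_InitModule("fastmod", FastmodMethods);
}

Built into a shared library (e.g. with distutils), that becomes importable
as "import fastmod".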

Diez
 

inhahe

Well, the problem is that I'm actually not an assembler guru, so I don't know
how to implement a DLL in asm or use a C calling convention, although I'm
sure those instructions are available on the web. I was just afraid of
trying to learn that AND making Python-specific extensions at the same time.
I thought of making a C extension with embedded asm, but that just seemed
less than ideal. But if somebody thinks that's the Right Way to do it,
that's good enough.
 

D'Arcy J.M. Cain

inhahe said:
Can anyone give me pointers/instructions/a template for writing a Python
extension in assembly (or better, HLA)?

I am trying to imagine the requirements document for your project.

- Must be error prone and hard to debug
- Must take an order of magnitude longer to write
- Must be unportable to other systems

Really, why don't you write your extension in C? Do you really think
you will improve your code by writing it in assembler? Even embedding
assembler in C code makes it unportable, and I find it hard to imagine
anything that you want to do in a Python context that can't be done at
least as well in C, if not in pure Python.

Just curious is all.
 

inhahe

You could be right, but here are my reasons.

I need to make something that's very CPU-intensive and as fast as possible.
The faster, the better, and if it's not fast enough it won't even work.

They say that the C++ optimizer can usually optimize better than a person
coding in assembler by hand can, but I just can't believe that, at least for
me, because when I code in assembler, I feel like I can see the best way to
do it and I just can't imagine AI would even be smart enough to do it that
way...

For portability, I'd simply write different asm routines for different
systems. How wide a variety of systems I'd support I don't know. As a bare
minimum, 32-bit x86, 64-bit x86, and one or more of their available forms of
SIMD.
 

Diez B. Roggisch

inhahe said:
Well, the problem is that I'm actually not an assembler guru, so I don't know
how to implement a DLL in asm or use a C calling convention, although I'm
sure those instructions are available on the web. I was just afraid of
trying to learn that AND making Python-specific extensions at the same time.
I thought of making a C extension with embedded asm, but that just seemed
less than ideal. But if somebody thinks that's the Right Way to do it,
that's good enough.

I think the right thing to do, if you are not that fluent in assembly, is to
not do anything in it at all. What do you need it for?

Diez
 

D'Arcy J.M. Cain

inhahe said:
You could be right, but here are my reasons.

I need to make something that's very CPU-intensive and as fast as possible.
The faster, the better, and if it's not fast enough it won't even work.

They say that the C++ optimizer can usually optimize better than a person
coding in assembler by hand can, but I just can't believe that, at least for
me, because when I code in assembler, I feel like I can see the best way to
do it and I just can't imagine AI would even be smart enough to do it that
way...

Perhaps. Conventional wisdom says that you shouldn't optimize until
you need to though. That's one of the benefits of the way Python
works. Here's how I would do it.

1. Write the code (call it a prototype) in pure Python. Make sure that
everything is modularized based on functionality. Try to get it split
into nice, bite size chunks. Make sure that you have unit tests for
everything that you write.

2. Once the code is functioning, benchmark it and find the
bottlenecks. Replace the problem methods with a C extension. Refactor
(and check your unit tests again) if needed to break out the problem
areas into as small a piece as possible.

3. If it is still slow, embed some assembler where it is slowing down.

One advantage of this is that you always know if your optimizations are
useful. You may be surprised to find that you hardly ever need to go
beyond step 1, leaving you with the most portable and easily maintained
code that you can have.
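
For step 3, a minimal sketch of what embedding assembler in C can look like,
assuming gcc's extended inline-asm syntax on x86 (the function itself is a
made-up, deliberately trivial example):

/* Adds two ints with an explicit ADD instruction (gcc, x86).
   In practice you would only bother for a real hot spot; the
   compiler handles something this simple perfectly well. */
static inline int add_asm(int a, int b)
{
    int result;
    __asm__ ("addl %2, %0"
             : "=r" (result)        /* output operand */
             : "0" (a), "r" (b));   /* a shares the output register, b in any register */
    return result;
}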
inhahe said:
For portability, I'd simply write different asm routines for different
systems. How wide a variety of systems I'd support I don't know. As a bare
minimum, 32-bit x86, 64-bit x86, and one or more of their available forms of
SIMD.

Even on the same processor you may have different assemblers depending
on the OS.
 

inhahe

I like to learn what I need, but I have done assembly before; I wrote a
terminal program in assembly, for example, with ANSI and Avatar support. I'm
just not fluent in much beyond the language itself, per se.

Perhaps C would be as fast as my asm would be, but C would not allow me to use
SIMD, which seems like it would improve my speed a lot; I think my goals are
pretty much what SIMD was made for.
 

Diez B. Roggisch

inhahe said:
I like to learn what I need, but I have done assembly before; I wrote a
terminal program in assembly, for example, with ANSI and Avatar support. I'm
just not fluent in much beyond the language itself, per se.

Perhaps C would be as fast as my asm would be, but C would not allow me to use
SIMD, which seems like it would improve my speed a lot; I think my goals are
pretty much what SIMD was made for.


That is not true. I've used the AltiVec extensions myself, on OS X and
from inside C.
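
On x86 the same thing is available from C through compiler intrinsics; a small
sketch assuming SSE support and gcc or a similar compiler (the function name is
made up):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Adds four pairs of floats per iteration. Assumes n is a multiple
   of 4 and that the pointers are 16-byte aligned. */
void add_floats(const float *a, const float *b, float *out, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
}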

Besides, the parts of your program that are really *worth* optimizing
are astonishingly few. Don't bother using assembler until you need to.

Diez
 

inhahe

D'Arcy J.M. Cain said:
2. Once the code is functioning, benchmark it and find the
bottlenecks. Replace the problem methods with a C extension. Refactor
(and check your unit tests again) if needed to break out the problem
areas into as small a piece as possible.

There are probably only 2 or 3 basic algorithms that will need all
that speed.
D'Arcy J.M. Cain said:
3. If it is still slow, embed some assembler where it is slowing down.

I won't know if the assembler is faster until I embed it, and if I'm going
to do that I might as well use it.
Although it's true I'd only have to embed it for one system to see (more or
less).
D'Arcy J.M. Cain said:
Even on the same processor you may have different assemblers depending
on the OS.

Yeah, I don't know much about that. I was figuring perhaps I could limit the
assembler parts / methodology to something I could write generically
enough, and if all else fails write for the other OSes or only support
Windows. Also, I think I should be using SIMD of some sort, and I'm not
sure, but I highly doubt C++ compilers support SIMD.
 

Henrique Dante de Almeida

inhahe said:
Yeah, I don't know much about that. I was figuring perhaps I could limit the
assembler parts / methodology to something I could write generically
enough, and if all else fails write for the other OSes or only support
Windows. Also, I think I should be using SIMD of some sort, and I'm not
sure, but I highly doubt C++ compilers support SIMD.

You're wrong.

Maybe we could help you better if you told us what task you are
trying to achieve (or which algorithms you think need optimization).
 

Mensanator

inhahe said:
There are probably only 2 or 3 basic algorithms that will need all
that speed.

I won't know if the assembler is faster until I embed it, and if I'm going
to do that I might as well use it.
Although it's true I'd only have to embed it for one system to see (more or
less).

Yeah, I don't know much about that. I was figuring perhaps I could limit the
assembler parts / methodology to something I could write generically
enough, and if all else fails write for the other OSes or only support
Windows. Also, I think I should be using SIMD of some sort, and I'm not
sure, but I highly doubt C++ compilers support SIMD.

The Society for Inherited Metabolic Disorders?

Why wouldn't the compilers support it? It's part of the x86
architecture, isn't it?
 

Dan Upton

D'Arcy J.M. Cain said:
3. If it is still slow, embed some assembler where it is slowing down.

inhahe said:
I won't know if the assembler is faster until I embed it, and if I'm going
to do that I might as well use it.
Although it's true I'd only have to embed it for one system to see (more or
less).

Regardless of whether it's faster, I thought you indicated that really
it's most important that it's fast enough.

That said, it's not true that you won't know if it's faster until you
embed it--that's what unit testing would be for. Write your loop(s)
in Python, C, ASM, <insert language here> and run them on actual
inputs (or synthetic ones, if necessary, I suppose). That's how you'll be
able to tell whether it's even worth the effort to get the assembly
callable from Python.
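
As a rough sketch of that approach, a standalone C timing harness; hot_loop
here is a made-up stand-in for whatever kernel is being measured:

#include <stdio.h>
#include <time.h>

#define N    (1 << 20)
#define REPS 100

/* Stand-in for the routine under test; replace with the real kernel. */
static void hot_loop(float *dst, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += a[i] * b[i];
}

static float a[N], b[N], dst[N];

int main(void)
{
    int r;
    clock_t t0 = clock();
    for (r = 0; r < REPS; r++)
        hot_loop(dst, a, b, N);
    printf("%.3f s for %d runs\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, REPS);
    return 0;
}

Timing the candidate kernels like this, outside Python, shows whether a
rewrite is even worth wiring into an extension.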

Mensanator said:
Why wouldn't the compilers support it? It's part of the x86
architecture, isn't it?

Yeah, but I don't know if it uses it by default, and my guess is it
depends on how the compiler back end goes about optimizing the code
for whether it will see data access/computation patterns amenable to
SIMD.
 

sjdevnull

inhahe said:
There are probably only 2 or 3 basic algorithms that will need all
that speed.

I won't know if the assembler is faster until I embed it, and if I'm going
to do that I might as well use it.

You won't know if the C is faster than the assembly until you write
it, and if you're going to do that you might as well use it...

If the C is fast enough, there's no point in wasting time writing the
assembly.

(Also FWIW C and C++ are different languages; you seem to conflate the
two a few times upthread).
 

inhahe

Dan Upton said:
Yeah, but I don't know if it uses it by default, and my guess is it
depends on how the compiler back end goes about optimizing the code
for whether it will see data access/computation patterns amenable to
SIMD.

Perhaps you explicitly use them with some extended syntax or something?
 

Dan Upton

inhahe said:
Perhaps you explicitly use them with some extended syntax or something?

Hey, I learned something today.

http://www.tuleriit.ee/progs/rexample.php

Also, from the gcc manpage, apparently 387 is the default when
compiling for 32-bit architectures, and using SSE instructions is the
default on x86-64 architectures, but you can use -march=(some
architecture with SIMD instructions), -msse, -msse2, -msse3, or
-mfpmath=(one of 387, sse, or sse,387) to get the compiler to use
them.
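
As a sketch of what that means in practice (exact behaviour depends on the gcc
version and target), a loop like the following may be auto-vectorized to packed
SSE instructions when those flags are enabled:

/* Compile with e.g.:  gcc -O3 -msse2 -mfpmath=sse -c vecadd.c */
void vecadd(float *dst, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}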

As long as we're talking about compilers and such... anybody want to
chip in how this works in Python bytecode or what the bytecode
interpreter does? Okay, wait, before anybody says that's
implementation-dependent: does anybody want to chip in what the
CPython implementation does? (or any other implementation they're
familiar with, I guess)
 

Ivan Illarionov

inhahe said:
Can anyone give me pointers/instructions/a template for writing a Python
extension in assembly (or better, HLA)?

Look up the pygame sources. They have some hot inline MMX stuff.
I experimented with this recently and I must admit that it's extremely
hard to beat the C compiler. My first asm code was actually slower than C;
only after reading the Intel docs and figuring out what makes 'movq' and
'movntq' different was I able to write something that was several times
faster than C.
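
(For reference: the movq/movntq difference is about non-temporal stores that
bypass the cache; a rough SSE-level equivalent from C, assuming xmmintrin.h,
looks like this:)

#include <xmmintrin.h>

/* Streaming (non-temporal) copy. _mm_stream_ps compiles to MOVNTPS,
   which writes around the cache -- the same idea as MOVNTQ. Assumes
   16-byte-aligned pointers and n a multiple of 4. Useful when the
   destination won't be read again soon. */
void copy_stream(float *dst, const float *src, int n)
{
    int i;
    for (i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence();   /* make the streaming stores globally visible */
}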

The D language's inline asm and its tools for making Python extensions look
very promising, although I haven't tried them yet.

-- Ivan
 

Diez B. Roggisch

Dan Upton said:
Also, from the gcc manpage, apparently 387 is the default when
compiling for 32-bit architectures, and using SSE instructions is the
default on x86-64 architectures, but you can use -march=(some
architecture with SIMD instructions), -msse, -msse2, -msse3, or
-mfpmath=(one of 387, sse, or sse,387) to get the compiler to use
them.

As long as we're talking about compilers and such... anybody want to
chip in how this works in Python bytecode or what the bytecode
interpreter does? Okay, wait, before anybody says that's
implementation-dependent: does anybody want to chip in what the
CPython implementation does? (or any other implementation they're
familiar with, I guess)

There isn't anything in (C)Python that is aware of these architecture
extensions, unless third-party libs utilize them. The bytecode interpreter
is machine- and OS-independent, so it's above that level anyway. And AFAIK
all the mathematical functionality is what's exposed by the OS math libs.

Having said that, there are of course libs like NumPy that do take
advantage of these architectures, through the use of e.g. ATLAS.

Diez
 
