Python 3.11 Performance Benchmarks Show Huge Improvement


  • #21
    If what you want is to add some fast equivalent of eval, just embedding LuaJIT in your native, static application will probably give far better performance than any interpreter calling eval (even if you call compile first).



    • #22
      Originally posted by atomsymbol

      There exists:

      Code:
      $ python
      >>> help(compile)
      This only helps with startup: Python compiles .py files into .pyc equivalents and looks for those cached files first on subsequent runs. Distributions typically do this for you. However, it doesn't help runtime performance after loading at all.
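For reference, that ahead-of-time step can be done explicitly with the standard `py_compile` module. A minimal sketch (the module name and file here are made up for the example):

```python
import pathlib
import py_compile

# Write a tiny throwaway module to disk (hypothetical example file).
src = pathlib.Path("example_mod.py")
src.write_text("def triple(x):\n    return 3 * x\n")

# Compile it to a .pyc, just as CPython would on first import.
pyc_path = py_compile.compile(str(src), doraise=True)
print(pyc_path)  # e.g. __pycache__/example_mod.cpython-311.pyc
```

The resulting .pyc only skips the parse/compile step on import; the bytecode inside it still runs through the same interpreter loop afterwards.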



      • #23
        Originally posted by atomsymbol



        After reading the following example, I hope that it will be clear what I mean by generating (and then using) Python code at run-time:

        Code:
        $ python
        >>> a=1
        >>> code=compile('a+2', '<generated-code>', 'eval')
        >>> eval(code)
        3
        >>> a=-10
        >>> eval(code)
        -8
        >>> print(code)
        <code object <module> at 0x7fa164ab1c60, file "<generated-code>", line 1>
        >>> print(code.co_code)
        b'e\x00d\x00\x17\x00S\x00'
        >>> import dis
        >>> dis.dis(code.co_code)
        0 LOAD_NAME 0 (0)
        2 LOAD_CONST 0 (0)
        4 BINARY_ADD
        6 RETURN_VALUE
        >>> print(code.co_consts)
        (2,)
        While that can be seen as some form of JIT, the code it generates isn't native code that the CPU can execute directly.
        Instead, it is CPython's internal bytecode representation, which in turn has to be interpreted by the CPython virtual machine.

        And the biggest problem of CPython is not interpretation overhead, but rather the GIL, which makes it impossible to benefit from multithreading in Python except when performing I/O or calling external FFI functions that release the GIL.

        That makes Python code hard to scale.
        Even if you have 32 cores, your Python code effectively executes on a single thread at a time.
        Creating multiple threads would not only fail to speed it up, it could even slow it down due to contention on the GIL...
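This is easy to see for yourself. A sketch (exact timings will vary by machine): a pure-Python CPU-bound function run in two threads takes about as long as running it twice serially, because only one thread holds the GIL at a time.

```python
import threading
import time

def burn(n=2_000_000):
    # Pure-Python CPU-bound loop; holds the GIL almost continuously.
    total = 0
    for i in range(n):
        total += i
    return total

# Serial: two calls back to back.
t0 = time.perf_counter()
burn(); burn()
serial = time.perf_counter() - t0

# "Parallel": two threads, still serialized by the GIL.
threads = [threading.Thread(target=burn) for _ in range(2)]
t0 = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - t0

print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
# Expect the threaded time to be roughly the same as the serial time
# (or worse, due to GIL switching) rather than ~2x faster.
```

For CPU-bound work, `multiprocessing` or `concurrent.futures.ProcessPoolExecutor` sidesteps the GIL by using separate interpreter processes instead of threads.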



        • #24
          It would be interesting to see whether this speedup affects AMD and Intel equally.



          • #25
            Originally posted by NobodyXu View Post
            And the biggest problem of CPython is not about interpreting, but rather the GIL which makes it impossible to benefit from multithreading in python except for performing I/O or calling external FFI functions that release GIL.
            Note CPython's serial performance is also quite bad (compared to other languages, usage dictates whether it is good enough), so it is not just the GIL.
            There are many places to look at for causes because high level languages hide a lot of complexity below them (that's why we love them!). Lots of allocations, reference counting, dictionary accesses for most fields, in some cases implicit dictionary creation when passing arguments, etc...
            Removing the GIL would (mostly) fix the parallelization problem, but the serial speed is bad and can only get worse with the GIL removal alone.
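Some of that hidden per-object cost is easy to poke at from Python itself. A small sketch (the class names are made up; the details are CPython implementation behaviour):

```python
import sys

class Plain:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)  # fixed attribute layout, no per-instance __dict__
    def __init__(self):
        self.x = 1

p = Plain()
# Every attribute access on a Plain instance goes through this dict:
print(p.__dict__)                      # {'x': 1}
# Reference counts are maintained on every object, all the time:
print(sys.getrefcount(p))              # count includes the temporary argument reference
# __slots__ trades flexibility for skipping the per-instance dict:
print(hasattr(Slotted(), "__dict__"))  # False
```

`__slots__` is one of the few knobs pure-Python code has against the dictionary-per-instance overhead; the allocations and reference counting remain regardless.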



            • #26
              Originally posted by atomsymbol

              Please re-read my 1st post in this forum thread. When I wrote "Dynamic programming languages", I meant dynamic programming languages. If I was to mean JIT then I would have written "JIT". I mentioned Java because there exists software which is generating Java bytecode at run-time (see for example https://asm.ow2.io/ and articles on Google Scholar).
              Sorry, I lost the context here.
              Regarding the original post, what you're talking about requires a JIT.
              Without a JIT, there will be no performance benefit.

              While Python can generate code at runtime and compile it to its internal bytecode, there's currently no JIT, so it cannot run as fast as an AOT-compiled language.



              • #27
                Originally posted by sinepgib View Post

                Note CPython's serial performance is also quite bad (compared to other languages, usage dictates whether it is good enough), so it is not just the GIL.
                There are many places to look at for causes because high level languages hide a lot of complexity below them (that's why we love them!). Lots of allocations, reference counting, dictionary accesses for most fields, in some cases implicit dictionary creation when passing arguments, etc...
                Removing the GIL would (mostly) fix the parallelization problem, but the serial speed is bad and can only get worse with the GIL removal alone.
                Yeah, though removing the GIL would at least make it possible to scale up an application, and sometimes that can hide other inefficiencies.
                Sometimes that might be good enough.

                And yes, the GIL simplifies the Python interpreter and supports multithreading without hurting single-threaded speed.
                IMO removing it probably requires a JIT or some new language constructs.



                • #28
                  Originally posted by atomsymbol

                  The bottleneck is that CPython isn't analysing/tracing the object graph while the Python program has multiple threads.
                  The bottleneck is definitely the GIL.
                  It means only one thread can interpret and run Python bytecode at a time: it is essentially a global mutex, executing the code as if on a single CPU while the threads take turns, simulating multiple cores.
                  To make it even worse, this is built on top of OS-scheduled threads, so it is even less efficient.
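The "threads take turns" behaviour is even observable and tunable from Python: CPython exposes the interval at which the running thread is asked to drop the GIL. A sketch (the default value may differ between versions):

```python
import sys

# How often (in seconds) the running thread is asked to release the GIL
# so another OS-scheduled thread gets a turn at the bytecode loop.
print(sys.getswitchinterval())  # 0.005 (5 ms) by default in CPython

# It is tunable, e.g. to reduce switching overhead for CPU-bound code,
# but no setting ever lets two threads run bytecode simultaneously.
sys.setswitchinterval(0.05)
print(sys.getswitchinterval())
```

Tuning the interval only trades latency for switching overhead; the serialization itself remains.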



                  • #29
                    Originally posted by atomsymbol

                    It is obvious that there do exist cases in which specialized interpreted code runs faster than AOT-compiled code. (I am only claiming that such cases do exist - not claiming how many such cases there are. Computing whether specialized interpreted code would run faster than AOT is impossible if the specialization is using domain-specific knowledge.)
                    Well, such cases might exist, though I would argue that in the cases where you know a more specialized code path can help, you can also add that specialization to the AOT-compiled code.
                    After all, the handling of the specialized cases needs to be present in the code when you write it.
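For illustration, here is the kind of run-time specialization being discussed, sketched in Python (the helper names are made up): a generic polynomial evaluator versus code generated and compiled to bytecode once the coefficients are known. The same specialization could equally be written ahead of time whenever the coefficients are known then.

```python
def eval_poly_generic(coeffs, x):
    # Generic "interpreter-style" evaluation (Horner's method).
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

def specialize_poly(coeffs):
    # Generate source text for these specific coefficients and
    # compile it to bytecode exactly once.
    expr = "0"
    for c in coeffs:
        expr = f"({expr}) * x + {c}"
    code = compile(expr, "<generated-code>", "eval")
    return lambda x: eval(code, {"x": x})

coeffs = [2, 0, 1]                    # represents 2*x**2 + 1
fast = specialize_poly(coeffs)
print(eval_poly_generic(coeffs, 3))   # 19
print(fast(3))                        # 19
```

Both versions still run as CPython bytecode, so without a JIT the win is modest; an AOT compiler given the same domain knowledge could constant-fold the whole expression into native code.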



                    • #30
                      Originally posted by atomsymbol

                      GIL's bottleneck isn't in code/bytecode, because Python bytecode (similarly to binary code stored in CPU's L1I cache) is mostly immutable. GIL's bottleneck is in data (Python objects).

                      https://en.wikipedia.org/wiki/Component_(graph_theory)
                      A well-designed locking mechanism should be fine-grained, protecting specific pieces of data with minimized critical sections, but the GIL instead serializes the execution of all Python bytecode, emulating a single-core CPU for the Python code.
                      External FFI code can certainly release the GIL and use as many threads as it likes without contention, but for Python code, only a single thread can run at any time.
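The I/O exception mentioned above is easy to demonstrate. A sketch: blocking calls like `time.sleep` release the GIL while waiting, so sleeping threads genuinely overlap even though CPU-bound ones do not.

```python
import threading
import time

def wait(seconds=0.5):
    # time.sleep releases the GIL while blocking, like most I/O calls
    # and like well-behaved C extensions around long native sections.
    time.sleep(seconds)

threads = [threading.Thread(target=wait) for _ in range(4)]
t0 = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - t0

# Four 0.5 s sleeps overlap: total is ~0.5 s, not ~2 s.
print(f"elapsed: {elapsed:.2f}s")
```

This is exactly why threaded Python works fine for I/O-heavy servers while failing to scale CPU-bound work.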

