**TobiSGD** · 08 July 2012, 09:10 PM

Originally posted by maldorordiscord

If you read the article carefully, the 3A has two 128-bit vector SIMD units per core.
This means the 3A can handle 2 SSE (64-bit) instructions at the same time per core.
But the 3A has 4 cores: 4*2=8.
The 3B has 2*256-bit vector SIMD units per core, and the 3C has 2*512-bit vector SIMD units.
The 3A can already handle 8 SSE instructions at a time with all cores together.
The 3C can handle 256 64-bit SSE instructions at a time...
But you will still claim that the Loongson 3C cannot handle 8 64-bit SSE instructions at the same time...

The one here claiming things is you. Here, for example, you claim that the Loongson can handle as many SSE instructions at a time as physically fit into the registers. Do you have any proof for that? The article does not say anything like that at all, so we are still at 33%, since the article doesn't deliver proof for your claim. And I will not claim that it can't handle that; claiming things that are unknown is your part here. Maybe it can do that, but you have given no proof for it here.

And the last one "It does this with 70% of native speed." has no relevance right now because to validate this we need real hardware and more real benchmarks.

It doesn't matter at all whether you have already run benchmarks on that hardware or not. You have claimed that it will run with 70% performance. Prove it or admit that you made it up without having any evidence, simple as that. This is not a church, this is a technical forum. If you can't prove your claims, then they have no worth at all.

**maldorordiscord** · 08 July 2012, 10:31 PM

Originally posted by TobiSGD

The one here claiming things is you. Here, for example, you claim that the Loongson can handle as many SSE instructions at a time as physically fit into the registers. Do you have any proof for that? The article does not say anything like that at all, so we are still at 33%, since the article doesn't deliver proof for your claim. And I will not claim that it can't handle that; claiming things that are unknown is your part here. Maybe it can do that, but you have given no proof for it here.

You really have zero understanding of what a vector space is and what a SIMD unit is.
First of all, this is what a vector array data structure is: http://en.wikipedia.org/wiki/Array_data_structure
Now you get it: you can fill a big vector array data structure with small vector array data structures.
For the CPU it doesn't matter; it can calculate it as one single vector array,
because SIMD means calculating all the data at the same time with loop-level parallelism.
http://en.wikipedia.org/wiki/Data_parallelism
This calculating of many small vector arrays is also called a parallel array: http://en.wikipedia.org/wiki/Parallel_array

Maybe now you get it. It's the same with x86 CPUs: they also calculate 2 SSE instructions in a 128-bit vector unit and 4 in a 256-bit vector unit.

Originally posted by TobiSGD

It doesn't matter at all whether you have already run benchmarks on that hardware or not. You have claimed that it will run with 70% performance. Prove it or admit that you made it up without having any evidence, simple as that. This is not a church, this is a technical forum. If you can't prove your claims, then they have no worth at all.

Time will show the truth.

We will know the truth after Michael does some real hardware benchmarks.

**TobiSGD** · 08 July 2012, 10:38 PM

Originally posted by maldorordiscord

Time will show the truth.

We will know the truth after Michael does some real hardware benchmarks.

So you finally admit that you have no clue whether your statement is true. It has taken a long time.

**TobiSGD** · 08 July 2012, 10:56 PM

Originally posted by maldorordiscord

You really have zero understanding of what a vector space is and what a SIMD unit is.
First of all, this is what a vector array data structure is: http://en.wikipedia.org/wiki/Array_data_structure
Now you get it: you can fill a big vector array data structure with small vector array data structures.
For the CPU it doesn't matter; it can calculate it as one single vector array,
because SIMD means calculating all the data at the same time with loop-level parallelism.
http://en.wikipedia.org/wiki/Data_parallelism
This calculating of many small vector arrays is also called a parallel array: http://en.wikipedia.org/wiki/Parallel_array

Maybe now you get it. It's the same with x86 CPUs: they also calculate 2 SSE instructions in a 128-bit vector unit and 4 in a 256-bit vector unit.

x86 CPUs that have SSE registers don't need to emulate SSE, they can execute it natively. So you are absolutely sure that these emulation units can handle the same amount of data without having losses due to the emulation?

**maldorordiscord** · 09 July 2012, 07:59 AM

Originally posted by TobiSGD

x86 CPUs that have SSE registers don't need to emulate SSE, they can execute it natively. So you are absolutely sure that these emulation units can handle the same amount of data without having losses due to the emulation?

The Loongson doesn't need to emulate SSE because the Loongson can execute it natively!
The Loongson loses speed, but not at the moment of executing the vector stuff.
The Loongson loses the speed by translating the format, but that work comes before the execution.
A native x86 core just loads the stuff and executes it, which means 2 steps; the Loongson has 3 steps:
load the stuff, then translate the format, and then execute it.
I hope it's clear now.

On the other side, the Loongson 3A doesn't have SSE3, SSE4.2, AVX, or FMA. That, together with the loss from translating the format, makes it clear that it doesn't have any chance to win any benchmark, but that is only because of Intel's 100% monopoly abuse. Also, the lack of PCI passthrough in the chipset makes it complicated to use 3D acceleration in the QEMU Windows emulation, and QEMU cannot handle Wine-lib OpenGL pass-through like VirtualBox. The result is that you cannot play games with that CPU in the emulation right now... But it's not a CPU problem; it's a software and hardware (mainboard chipset) problem.

**maldorordiscord** · 09 July 2012, 08:02 AM

Originally posted by TobiSGD

So you finally admit that you have no clue whether your statement is true. It has taken a long time.

The future will show it. What if I'm proved right in the future? There is a possibility of that!

Then what?

**TobiSGD** · 09 July 2012, 08:13 AM

Originally posted by maldorordiscord

The Loongson doesn't need to emulate SSE because the Loongson can execute it natively!

...

The Loongson loses the speed by translating the format, but that work comes before the execution.

Why do they have to translate something that they can execute natively? Doesn't make much sense.

The future will show it. What if I'm proved right in the future? There is a possibility of that!

Then you are a lucky boy. For now, you are making up claims without having any proof for them.

**maldorordiscord** · 09 July 2012, 09:30 AM

Originally posted by TobiSGD

Why do they have to translate something that they can execute natively? Doesn't make much sense.

Did you read the article? The vector instructions are not everything. They can accelerate single instructions but not ALL. Some instructions are emulated and some are executed natively; in the end they need a format which makes this possible.
In other words: if everything were executed natively they would get 100% speed and not 70% speed.
They can execute SSE natively but not the other instructions, and this causes the speed loss of 30%.
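
To make the argument above concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under assumptions that are not in the article: the function name `relative_speed`, the 90% native share, and the 5x slowdown for emulated instructions are hypothetical values picked for illustration, not measurements.

```python
def relative_speed(native_fraction, emulated_slowdown):
    """Overall speed relative to fully native execution (1.0 = 100%).

    native_fraction:   share of the work that runs at native speed (0..1)
    emulated_slowdown: how many times slower the remaining work runs
    """
    if not 0.0 <= native_fraction <= 1.0:
        raise ValueError("native_fraction must be between 0 and 1")
    if emulated_slowdown < 1.0:
        raise ValueError("emulated_slowdown must be at least 1")
    # Time per unit of work: the native part costs 1, the rest costs more.
    total_time = native_fraction + (1.0 - native_fraction) * emulated_slowdown
    return 1.0 / total_time


# Hypothetical inputs, chosen only for illustration.
print(relative_speed(1.0, 5.0))            # everything runs natively
print(round(relative_speed(0.9, 5.0), 2))  # 90% native, the rest 5x slower
```

With these made-up inputs the second result is about 0.71, which shows how a small share of slow instructions could pull the total down to roughly 70%. It does not show that the real chip behaves this way.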

Originally posted by TobiSGD

Then you are a lucky boy. For now, you are making up claims without having any proof for them.

In your words I'm 66% correct = (it can handle SSE instructions natively) + (many SSE instructions at the same time) right now. And 100% in the future if I'm a lucky boy, in your words.
Doesn't sound so bad.

And the "70%" number is not my idea; those are the official numbers from the Chinese Academy of Sciences.
Which means there is a high probability that I'm 100% correct.

**TobiSGD** · 09 July 2012, 09:42 AM

Originally posted by maldorordiscord

Did you read the article?

Yes, I did. They have to translate SSE commands to their native commands, because their units are SSE-like, but not native SSE. Sorry, no native SSE to see here.

In your words I'm 66% correct = (it can handle SSE instructions natively) + (many SSE instructions at the same time) right now. And 100% in the future if I'm a lucky boy, in your words.
Doesn't sound so bad.

Nope, you are not. It can't handle SSE natively (but that doesn't matter, that wasn't your claim). Also, your claim was that it can handle exactly 8 SSE instructions at a time. We still see no proof for that. So you are still at 33%.

And the "70%" number is not my idea; those are the official numbers from the Chinese Academy of Sciences.
Which means there is a high probability that I'm 100% correct.

The only performance comparison I can see in the documentation you have given us until now is for x86 code, not SSE code, so it doesn't change from 33% until you provide proof for your claims.
Again, this is a technical forum; no one cares about your beliefs. Provide proof or your claims are irrelevant.

**maldorordiscord** · 09 July 2012, 10:34 AM

Originally posted by TobiSGD

Yes, I did. They have to translate SSE commands to their native commands, because their units are SSE-like, but not native SSE. Sorry, no native SSE to see here.

My effort to simplify it for you doesn't mean that acting dumb gains you anything.
I wrote: the Loongson needs to translate the format from the x86 one into the Loongson one.
So now you confirm what I said, of course, only to seemingly contradict me.
Any reasonably intelligent reader will notice this rotten trick.

Originally posted by TobiSGD

Nope, you are not. It can't handle SSE natively (but that doesn't matter, that wasn't your claim).

You lack any focus, because an x86 core also cannot handle CISC code natively, and this turns your argument into bullshit. A native x86 core also accelerates the CISC code only after translating it, via the microcode, into the internal logic of the CPU architecture, for example VLIW-like (all modern CPUs are internally RISC or VLIW).
In the end only one fact counts: emulated or accelerated, and the Loongson can accelerate it.

Originally posted by TobiSGD

Also, your claim was that it can handle exactly 8 SSE instructions at a time. We still see no proof for that.

I cannot do anything about your limited intelligence to recognize the truth.
But one thing is for sure: you can put two 64-bit vector data areas into one 128-bit vector data area.
And you can put four 64-bit vector data areas into one 256-bit vector data area.
And you can put eight 64-bit vector data areas into one 512-bit vector data area.
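
The packing described above can be sketched in a few lines of Python. This is only an illustration of the data layout: the helper names `lanes_per_vector`, `pack_lanes`, and `unpack_lanes` are made up for this example, and the sketch says nothing about how many instructions a Loongson core actually issues at a time.

```python
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1


def lanes_per_vector(vector_bits):
    """Number of 64-bit data areas that fit into one vector of vector_bits."""
    return vector_bits // LANE_BITS


def pack_lanes(values, vector_bits):
    """Pack 64-bit values into one wide integer, lane 0 in the lowest bits."""
    if len(values) != lanes_per_vector(vector_bits):
        raise ValueError("number of values must match the number of lanes")
    packed = 0
    for index, value in enumerate(values):
        packed |= (value & LANE_MASK) << (index * LANE_BITS)
    return packed


def unpack_lanes(packed, vector_bits):
    """Split a wide integer back into its 64-bit lanes."""
    return [
        (packed >> (index * LANE_BITS)) & LANE_MASK
        for index in range(lanes_per_vector(vector_bits))
    ]


for width in (128, 256, 512):
    print(width, lanes_per_vector(width))  # lanes available at each width

# Round trip: four 64-bit values in one 256-bit vector.
vector = pack_lanes([1, 2, 3, 4], 256)
print(unpack_lanes(vector, 256))
```

The lane counts come out as 2, 4, and 8 for the three widths, matching the numbers in the post. Whether the hardware can fill and process all lanes with translated SSE code is a separate question that this sketch does not answer.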

Originally posted by TobiSGD

So you are still at 33%.

You are too limited to understand simple facts: you can calculate many small vector areas in a big vector area space. The only thing you need to do is to put them together, and after the calculation you need to decompose the result.

Originally posted by TobiSGD

The only performance comparison I can see in the documentation you have given us until now is for x86 code, not SSE code, so it doesn't change from 33% until you provide proof for your claims.
Again, this is a technical forum; no one cares about your beliefs. Provide proof or your claims are irrelevant.

Your reading skill is questionable, because there is no comparison between native speed and x86 format speed. There is only a comparison between fully emulated x86 format speed and accelerated x86 format speed. The only logical conclusion from this is the speed-up factor, not the absolute speed.
After the format translation there is no speed loss in executing SSE code. Because of this, the speed-up factor after format translation is 100% of the speed-up factor of a native x86 core.
And about "technical forum": I have contributed 1000 times more technical information about this topic to the forum than you, because you have not contributed any information.
So please improve your personal performance and be a useful member.
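
The difference between a speed-up factor and an absolute speed can be shown with a small Python sketch. All timings below are hypothetical numbers invented for the example, and the function names `speedup` and `fraction_of_native` are not taken from any documentation.

```python
def speedup(emulated_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than full emulation."""
    return emulated_seconds / accelerated_seconds


def fraction_of_native(native_seconds, accelerated_seconds):
    """Accelerated speed as a fraction of native x86 speed (1.0 = 100%)."""
    return native_seconds / accelerated_seconds


# Hypothetical timings in seconds for the same workload.
emulated = 10.0
accelerated = 2.0

print(speedup(emulated, accelerated))  # identical in both scenarios below

# Two different made-up native timings give very different fractions.
for native in (1.4, 0.5):
    print(native, fraction_of_native(native, accelerated))
```

The same speed-up over full emulation is compatible with very different fractions of native speed, so a comparison of emulated against accelerated runs cannot by itself give a percentage of native x86 performance.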

MIPS Loongson 3A Benchmarks On Debian
