Imagination Tech Posts Updated PowerVR Linux DRM Driver


  • #11
    Originally posted by qarium View Post

    you wrongfully think i am silent. you wrongfully think i did drop a topic. ...

    i had a medical emergency (not my person had the emergency) and important work to do outside of forum and computers.
    a person related to me nearly died and i had to drive 8 hours in the car today to help.
    I'm very sorry that happened and I hope they're okay.

    Originally posted by qarium View Post
    you myfriend you really have some real problems i can assure you that i had a Kyro1/2 card 22 years ago and i told you why i did give the card back i was not able to use more than 512mb ram. just remember it was the time of windows 98/windows ME and this windows version could not perform ram swaping on the harddrive with more than 512mb ram you could hack this by disabling the ram swapping on the harddrive. then 756mb ram or even 3 or even 3,5GB ram on 32bit cpus where no problem. more than 3,5gb was impossible because 512mb of the 32bit space where reserved for fixed spaces means hardwar.
    with a configuration like this you had to use tools who every second or so did occupy ram space and then release the ramspace as emty because the DOS kernel of win98/me had no ram managment means i used a tool who did this ram managment. at that time people told me in Emule forums you could no do server for emule or otehr P2P tasks with win98/me because of no ram managment but if you used tools who automatically occupied ram and give it free emty this worked well they claimed you need to use windows 2000 at that time.

    that could be the reason why these Kyro1/2 cards assumed you only run 512mb ram with win98/ME but i can also say these cards where also incompatible with windows2000 and XP at that time. so you als0 could not fix it with useing windows 2000 all the gamers who did buy this card had win98 or windows ME at that time because the games did not run in windows 2000 or NT 4.0
    Okay. Here's a word of advice for making your posts better: check your spelling and use proper capitalization. This reads like one long run-on sentence.

    So onto what you said. I looked it up and it looks like there were issues with Windows 98 and ME using over 512MB. People who run either OS on newer hardware usually use an unofficial patch to get around those issues. Fair enough. Maybe that would lead to issues with the Kyro or Kyro II. However you said:

    "i did buy such a Imagination Technology​ PowerVR Kyro graphics card i am nur is was the Kyro1 or Kyro 2,,, but this card was total shit and i did give it back after only 1-2 days. the reason was my system at that time had 756MB RAM but this card only supported 512MB ram. to use this card i would had downgrade to 512mb ram so i decided to just give the card bag."

    So you're making a judgement about the quality of the card based on its driver assuming a limitation of the OS, one that people would need a workaround or extra tools to get around. How does that make the card itself shit? That doesn't reflect on its architecture; it would have to do with its driver.

    Originally posted by qarium View Post
    about your favored topic
    Not my favorite topic. It's just something I mentioned and you took issue with it.

    Originally posted by qarium View Post
    about your favored topic chiplet design with multible gpu/comute dies and explicit not the amd way with cache dies.
    I don't know if you're trying to add information or just introduce a new section of your post. I'm not against AMD's cache die chiplets. They're an effective way to increase yields and reduce fab costs for the GPU. Pushing blocks that don't shrink well onto an older process while reducing the size of the GCD was smart. It's just that obviously they would get even higher yields and lower costs if they could just connect smaller GCDs.

    AMD was attempting to do that in Navi 4C with up to 9 SEDs (Shader Engine Dies), an active interposer, a MID (Multimedia and I/O die), and 6 GDDR PHY dies, but it's been canceled since AMD isn't looking to produce high-end cards next gen.

    Originally posted by qarium View Post
    i do not believe that Tile-Based Rendering is the solution to chiplet design in GPUs and the reason i think so is the patent situation
    You just said in your last post that their patents on tile-based rendering should have run out by now, and I already mentioned like four other GPU architectures that are tile-based.

    Regardless, even if there are patent issues, that really doesn't prevent it from being a solution for spreading a GPU design across chiplets. It would just mean there are some roadblocks for those who would need to pay to use those patents.

    I'm also not suggesting that Nvidia and AMD must go tile-based in order to use chiplets. I was stating why I feel tile-based designs are naturally more easily suited to spanning across chiplets. That doesn't mean typical IMRs can't also span across chiplets, it's just way more complicated.

    Originally posted by qarium View Post
    in patents you only have 3-4 outcomes and this is all the time: one outcome is patent costs to high you do not use it
    ond outcome is patent costs is cheaper than the benefit then you use it. and then if patent costs to high but after 20 years the patent time runs out companies as soon as possible use this patent. and so one and so one.
    Okay, but you just said in the previous post that their patents should have run out by now because it's been over 20 years. In that post you suggested that TBR and TBDR must not be so great because the patents ran out and other people aren't making tile-based GPUs. But there are people making tile-based GPUs, and there have been for at least 10 years now.

    Originally posted by qarium View Post
    with Tile-Based Rendering this is different the patent run out and as you say AMD/Nvidia/intel still do not use it. why ? you already explained it. they only used similar techniques to mimic the effect without really use it.
    Well, no, I said they use similar in-hardware techniques to get some of the benefits. Those techniques still don't get them the same level of bandwidth savings, and more importantly they don't make it any easier for them to scale their design across chips. The methods I mentioned that get the most bandwidth savings and performance improvements come from software developers implementing ideas from TBRs and TBDRs on IMRs using shaders. They still don't get the full benefits of a TBR or TBDR, but using compute shaders to implement TBR-like rendering would be one way to divide up and distribute work between chips in an IMR. The issue is that you can't split an IMR up into chiplets and expect developers to implement tile-based rendering on them in order to use all of your GPU. It would also effectively prevent legacy software from ever being able to use more than one chiplet. Implementing tiling in hardware would allow all software to use all the chiplets, while software that is aware it's running on a TBDR would get even more performance out of it.

    The reason I mentioned those techniques is to demonstrate the effectiveness of TBDR techniques and to explain why I felt TBRs are uniquely well-suited to working with chiplets. I feel I demonstrated that well.

    Originally posted by qarium View Post
    and as you say the only big player with relevant marketshare in the desktop business is APPLE but apple does not have chiplet design yet instead they use very very large chip dies... means the complete opposit of what you claim.
    Apple's M2 Ultra is two chips. The M2 Ultra is the name of the package, not the die. It's two M2 Maxes (two perfect, fully enabled dies) connected together with an interposer called UltraFusion that allows the two chips to communicate at 2.5TB/s.

    I know 2.5TB/s of bandwidth sounds like a lot, but the cumulative bandwidth that the MCDs provide to the GCD in the 7900XTX is 5.3TB/s. That's just for the compute die to access memory that's off-die. The fact that the GCD itself can't be split up means its internal bandwidth must be much higher than 5.3TB/s.

    The fact that the M2 Ultra connects two TBDRs (compute portions and all) together and has them working as one with less than half of the bandwidth that an IMR graphics compute chiplet needs to communicate with its last-level cache and external memory shows that I was right: you need far less inter-chiplet bandwidth for two TBDR chiplets to work as one GPU than you would for an IMR.

    That also means that the amount of bandwidth that the MCDs provide the 7900XTX's one GCD is more than enough to cover four M2 chips being connected together. But remember that the M2 Maxes aren't GPU chips, they're SOCs. So putting four together gets you one SOC with 48 CPU cores, 152 GPU cores (that's 19,456 ALUs), 64 Neural Engine cores, four Image Signal Processors, four video accelerators, a bunch of USB4, and four PCI storage controllers.

    From that we can derive that if the chips were just TBDR GPUs and lacked everything else that makes them SOCs, they could get away with even less bandwidth still.
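
    Just to lay the comparison out as plain arithmetic (the 5.3TB/s and 2.5TB/s figures are the published peak numbers quoted above, and the per-die counts are the M2 Max specs those totals are derived from), here's a quick sanity-check sketch:

    Code:
    mcd_to_gcd = 5.3   # TB/s, cumulative MCD<->GCD bandwidth inside one 7900XTX
    ultrafusion = 2.5  # TB/s, die-to-die UltraFusion link inside the M2 Ultra

    print(f"UltraFusion is {ultrafusion / mcd_to_gcd:.0%} of the 7900XTX's off-die GCD bandwidth")

    # What stacking four M2 Max dies adds up to, spec-wise:
    m2_max = {"CPU cores": 12, "GPU cores": 38, "Neural Engine cores": 16}
    four_dies = {k: v * 4 for k, v in m2_max.items()}
    print(four_dies)                                  # 48 CPU cores, 152 GPU cores, 64 NE cores
    print("GPU ALUs:", four_dies["GPU cores"] * 128)  # 128 ALUs per Apple GPU core -> 19,456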

    Originally posted by qarium View Post
    instead i think FSR3 upscalling with or without temporal data with Frame-generation is the solution for chiplet designs.
    I didn't read anything beyond this point yet, but I'm gonna tell you already that that doesn't work. That's a software solution, not a hardware solution, so any software that doesn't use FSR3 would only use one chiplet. That's exactly the same limitation as SLI and Crossfire but now applied to a single card. I also don't see any way that FSR3 would help divide work between chiplets at all. The way that FSR3 and DLSS3 reduce work relies on inherently serial techniques. For example:

    You must first render the low resolution image before you can upscale it.
    In order to get motion vectors for the current frame, you must first compute the previous frame.
    In order to interpolate the in-between frame, the GPU must first compute and upscale the last two frames.

    If we assume a four-chiplet GPU, I suppose it could try to render two of the frames at the same time on different chiplets, just like Alternate Frame Rendering. But if frame 1 takes longer to render than frame 2, it needs to wait for frame 1 to be done in order to calculate motion vectors for frame 2. Now the middle frame can be interpolated, but that's a less intensive process than rendering a single low-res image, so it wouldn't make full use of a third chiplet. More importantly, the first two chiplets are done with their work, so there's no reason one of them couldn't do it. The only use I can think of for the last two chiplets is to compute two more future frames or have them do split-frame rendering, but this is all really just taking ancient methods of multi-GPU rendering on IMRs, applying them to chiplets, and adding temporal upscaling and interpolation.
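
    To make the serial dependency concrete, here's a rough toy scheduler for that four-chiplet AFR idea. The frame times are made up; the point is the dependency chain, not the numbers.

    Code:
    # Chiplet 0 renders frame 1, chiplet 1 renders frame 2, in parallel.
    render_ms = {1: 9.0, 2: 6.0}   # assumed low-res render times
    upscale_ms = 1.0
    interp_ms = 2.0

    f1_done = render_ms[1] + upscale_ms
    # Frame 2 can't be finished off before frame 1 is done, so chiplet 1 waits.
    f2_done = max(render_ms[1], render_ms[2]) + upscale_ms
    # The interpolated in-between frame needs both finished frames first.
    interp_done = max(f1_done, f2_done) + interp_ms

    print(f"frame 1 ready at {f1_done} ms, frame 2 ready at {f2_done} ms")
    print(f"interpolated frame ready at {interp_done} ms")
    print(f"chiplet 1 sat idle for {render_ms[1] - render_ms[2]} ms; chiplets 2 and 3 did nothing")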

    Look up benchmarks of Quad SLI and Quad Crossfire rendering. You might see a decent boost across two cards, but it's never 2x, and sometimes you get lower frame rates than with one card. As you add more cards, the returns diminish further and the chances of getting a negative benefit over a single card increase.

    Now I'll read how you think it will work.

    Originally posted by qarium View Post
    Frame-Generation in DLSS3 and FSR3 are the proof that you can use old data to calculate with deep learning or by smart algorytms what the frame should look like without really calculate it.
    Yes, you need temporal information. Time is linear, not parallel. This sounds like you're leaning toward dividing work with Alternate Frame Rendering.

    Originally posted by qarium View Post
    now if you have FSR and frame generation and you use multible GPUs to render some parts of the screen then the parts who where NOT calculated can be restored from old data by the FSR3 frame generation engine.
    You're saying some parts of the screen. That's dividing work spatially. Did you not read what I said a few posts ago? IMRs take in un-transformed triangles and spit out pixels. An IMR has no idea where a triangle is going to land on screen before it's submitted, because its coordinates haven't been transformed to screen-space coordinates via vertex shading. You can't just have different chiplets take different triangles, transform them, and then write them wherever they need to be on the screen, because the order in which the depth and color buffers are tested and updated matters. The depth buffer's two jobs are to help prevent fragment shading on pixels that won't contribute to the final image and to solve the visibility problem, AKA make sure that far triangles aren't rendered over nearer triangles. Only triangles that don't overlap can check and update the depth and color buffers in parallel. One way to ensure that two chiplets aren't going to render over each other's output is to assign each chiplet a fourth of the screen and send each chiplet all the geometry. The chiplets would then apply vertex shading to each triangle and then render or reject triangles depending on whether they're in that chiplet's fourth or not. This is how split-frame rendering works, and it has the disadvantage of not scaling up vertex shading at all.
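
    Here's a rough sketch of that split-frame approach with hypothetical triangle data (nothing from a real driver), just to show where the duplication sits: every chiplet runs the vertex transform on every triangle, and only the rasterization and fragment work actually gets divided.

    Code:
    import random

    random.seed(0)
    NUM_CHIPLETS = 4
    triangles = [[(random.random(), random.random()) for _ in range(3)] for _ in range(1000)]

    def transform(tri):
        # Stand-in for vertex shading: pretend we projected the triangle to screen space.
        return tri

    def in_region(tri, chiplet):
        # Screen split into vertical quarters; a triangle "belongs" to the chiplet
        # whose quarter contains its centroid.
        cx = sum(v[0] for v in tri) / 3.0
        return min(int(cx * NUM_CHIPLETS), NUM_CHIPLETS - 1) == chiplet

    vertex_work = 0
    raster_work = 0
    for chiplet in range(NUM_CHIPLETS):
        for tri in triangles:
            screen_tri = transform(tri)  # every chiplet transforms every triangle
            vertex_work += 1
            if in_region(screen_tri, chiplet):
                raster_work += 1         # only "its own" triangles get rasterized and shaded

    print(f"vertex shading runs: {vertex_work} for {len(triangles)} triangles")  # 4x duplication
    print(f"rasterized triangles across all chiplets: {raster_work}")            # divided once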

    Originally posted by qarium View Post
    chiplet designs and multigpu solutions had the problems that if some parts run on one gpu and some parts on another gpu that if you put it together it maybe does not fit together and you have artefacts or differences what makes it look ugly...
    What? No. That's never been the issue. The issue with multi-GPU scaling has always been a combination of a high bandwidth requirement for inter-chip communication and/or finding ways to evenly divide work between chips in a way that gets linear scaling. The only artifacts were frame hitching from alternate frame rendering.

    FSR3 and DLSS3+ create artifacts. They're interpolating data all over the place including whole frames where a character might just be missing a leg or something.

    Originally posted by qarium View Post
    but with FSR3 with frame generation or DLSS3+ the deep learning algorythmen run on shaders or custom DLSS like hardware could just fill the gab between what is really there and what is not there...
    They fill in the gaps, but none of this is applicable to dividing the work among chiplets. You don't understand what the issue is and are hoping that deep learning and AI are a miracle technology that's going to fix it somehow. It's not.

    Originally posted by qarium View Post
    and this could be made in a way that humans can not see a quality loss ... maybe if you make screenshots you see something but these technologies are not made for screenshot these technologies are made for real movement and in movement means real gaming it could have such small differences that the human eye could not see a different.
    I don't even know what you're trying to say here.
    Last edited by Myownfriend; 19 August 2023, 10:37 PM.



    • #12
      Originally posted by qarium View Post

      "potential to shake up the market"

      looks like you life in a alternative reality... for many years Imagination Technology/PowerVR hat the worst opensource driver support from all GPU companies. the situation was even worst than Nvidia because Nvidia had as some people claim working closed source drivers and Imagination Technology had never good working GPU drivers on linux.
      of course they say they changed and now do open-source drivers... but they are not known to have "Good" opensource drivers.

      the release of PowerVR cards like LTT where a full disaster with only bad reviews... there was not a single review who said it is good.

      you say "potential to shake up the market" but it only shows you life in an alternative reality.

      you claim Imagination Technology/PowerVR has chiplet design and others have no chiplet design thats wrong.
      I used the Kyro I and II happily under Linux back in 2000. It smoked Nvidia/Windows under Counter-Strike, AFAIR.



      • #13
        Originally posted by Myownfriend View Post
        I'm very sorry that happened and I hope they're okay.
        Sadly not. The bacteria are resistant to the antibiotic, the necrosis of the infected area is progressing, and she is now getting a second emergency operation.

        Originally posted by Myownfriend View Post
        Okay. Here's a word of advice for making your posts better. Check your spelling and use proper capitalization. This almost read like a run-on sentence.
        I can do this, but most of the time it's not worth the effort.

        Originally posted by Myownfriend View Post
        So onto what you said. I looked it up and it looks like there were issues with Windows 98 and ME using over 512MB. People who run either OS on newer hardware usually use an unofficial patch to get around those issues. Fair enough. Maybe that would lead to issues with the Kyro or Kyro II. However you said:
        "i did buy such a Imagination Technology​ PowerVR Kyro graphics card i am nur is was the Kyro1 or Kyro 2,,, but this card was total shit and i did give it back after only 1-2 days. the reason was my system at that time had 756MB RAM but this card only supported 512MB ram. to use this card i would had downgrade to 512mb ram so i decided to just give the card bag."
        So you're making a judgement on the quality of the card based on the driver assuming a limitation of the OS that people would need to use a workaround or extra tools to get around. How does that make the card itself shit? That doesn't reflect on it's architecture, it would have to do with it's driver.

        So now you believe the story about over 512MB of RAM with Kyro GPUs, LOL...
        "this card was total shit"
        Now you say I'm not allowed to say this if that 512MB problem was the only problem. Wrong.
        It was not the only problem. Because they were only on the market for a very short period of time, they also dropped driver support and did not develop drivers for future operating systems like Windows 2000 and XP.
        That's why all the Kyro GPU customers were doomed and forced to stay on the old OS or else buy new hardware.
        OK, now you claim this is not true either... but the fact is they did give up, and that's why there were no new drivers for new operating systems. That's a fact.

        "How does that make the card itself shit? That doesn't reflect on it's architecture, it would have to do with it's driver."

        See my second argument: drivers for future operating systems also have nothing to do with the card's architecture.
        I already said they failed for other reasons, and the tile-based rendering architecture was not the reason they failed.

        But today, because of the patents, we can say for sure: if this technology were any good, people would use it, because there are no patents to stop them, yet AMD/Nvidia/Intel only use technology to mimic the effect of tile-based rendering.

        Originally posted by Myownfriend View Post
        Not my favorite topic. it's just something I mentioned. You took issue with it because I didn't say that AMD is the bestest, and you blew up about it. But lets continue...
        I don't know if you're trying to add information or just introduce a new section of your post.
        You just said in your last post that their patents on tile-based rendering should have run out by now and I already mention like four other GPU architectures that are tile-based.
        No, you explained to me that these "others" outside of Apple do not really use tile-based rendering; instead they use other technologies to mimic the effect.

        That's the point: the patent has run out and AMD and Nvidia still do not use it. They only mimic the effect.

        Originally posted by Myownfriend View Post
        Regardless, if there are patent issues that really doesn't prevent it from being a solution to spreading a GPU design across chiplets. It would just mean there's some roadblocks for those who would need to pay to use these patents.
        This is the way Apple goes with their Apple M2 Ultra SOC/chiplet design, yes...

        But as soon as you make FSR3/DLSS mandatory, this opens up the possibility to implement a different style of chiplet design.

        An FSR3/DLSS upscaling/temporal/frame-generation engine can easily be modified to accept different render results from different chiplets into one output.

        You already said such a design might be incompatible with older games and engines, because you would need FSR3 then, but honestly, sometimes such a technology becomes the standard even if it's incompatible with older games, because the benefits could be great.

        You also talk about the bandwidth these two Apple M2 SOCs need to realize a chiplet design, and this bandwidth is high. If you would use such an FSR3+ chiplet design, the bandwidth you need to combine chiplets could be ZERO.

        This means Apple can do 2 chiplets right now, and in the near future maybe 4 chiplets, but this design does not scale to 100 chiplets.

        Such an FSR3+chiplet design with ZERO bandwidth needed could scale to a much higher number of chiplets.

        Originally posted by Myownfriend View Post
        I'm also not suggesting that Nvidia and AMD must go tile-based in order to use chiplets. I was stating why I feel tile-based designs are naturally more easily suited to spanning across chiplets. That doesn't mean typical IMRs can't also span across chiplets, it's just way more complicated.
        Maybe they go tile-based for their chiplet design, maybe. But making FSR3/DLSS mandatory would be the smarter move, I think. The bandwidth needed for such a solution would be much lower.

        Originally posted by Myownfriend View Post
        Okay but you just said in the previous post that their patents should have run up by now because it's been over 20 years. In that post you suggested that TBR and TBDR must not be so great because the patents ran out and other people aren't making tile-based GPUs. But there are people making tile-based GPUs and have been for at least 10 years now.
        The Apple M2 Ultra was the proof that you cannot get TOP performance with it.
        The Apple M2 Ultra SOC loses against an Nvidia RTX 4090, and it also loses against an AMD 7900XTX.
        Also, the Apple SOCs are not known for good ray tracing performance.

        Please wake me up if any tile-based rendering product gives us TOP performance instead of second-class performance.

        An Apple M2 Ultra costs you 7000-8000€; if you build even a 3000€ AMD system, it beats Apple in any benchmark.

        And this is only because the Apple M2 Ultra GPU cannot compete.

        Originally posted by Myownfriend View Post
        Well, no, I said they use similar in-hardware techniques to get some of the benefits. Those techniques still don't get them the same level of bandwidth savings and more importantly they don't make it any easier for them to scale their design across chips. The methods that I mentioned that get the most bandwidth savings and performance improvements are from software developers implementing ideas from TBR and TBDRs on IMRs using shaders. They still don't get the the full benefits of a TBR or TBDR but using compute shaders implement TBR-like rendering would be an way to divide up and distribute work between chips in an IMR. The issue is that you can't split an IMR up into chiplets and expect to developers to implement tile-based rendering on them in order to use all of your GPU. It would also effectively prevent legacy software from ever being able to use more than one chiplet. Implementing tiling in-hardware would allow all software to use all the chiplets but software that is aware it's running on a TBDR would get more performance out of it.
        The reason I mentioned those techniques is to demonstrate the effectiveness of TBDR techniques and to explain why I felt TBRs are uniquely well-suited to working with chiplets. I feel I demonstrated that well.
        Apples M2 Ultra is two chips. The M2 Ultra is the name of the package not the die. It's two M2 Pros (perfect M2s) connected together with an interposer called UltraFusion that allows the two chips to communicate at 2.5TB/s.
        Thank you for the reminder. I did already know that the M2 Ultra is 2 chips, but I honestly did not remember or think about this.
        But on the other hand it has near-zero relevance, because the M2 Ultra loses against the 4090 and also loses against a 7900XTX.

        So you can build an AMD-based computer with a Ryzen 7950X3D and a 7900XTX, and this is much cheaper and beats the highest-end Apple product.

        So please wake me up the moment Apple presents a tile-based rendering chiplet design that has higher performance.

        Originally posted by Myownfriend View Post
        I know 2.5TB/s of bandwidth sounds like a lot but the cumulative bandwidth that the MCDs provide to the GCD in the 7900XTX is 5.3TB/s. That's just for the compute die to access memory that's off-die. The fact that the GCD itself can't be split up means that would require it's internal bandwith is much higher than 5.3TB/s.
        The fact that the M2 Ultra connects two TBDRs (compute portions and all) together and has them working as one with less than half of the bandwidth that an IMR graphics compute chiplet needs to communicate with it's last-level cache and external memory, shows that I was right: you need far less inter-chiplet bandwidth for two TBDR chiplets to work as one GPU than you would for an IMR.
        That also means that the amount of bandwidth that the MCDs provide the 7900XTX's one GCD is more than enough to cover four M2 chips being connected together. But remember that the M2 Pros aren't GPU chips, they're SOCs. So putting four together gets you one SOC with 48 CPU cores, 152 GPU core (that's 19,456 ALUs), 64 Neural Engine cores, four Image Signal Processors, four video accelerators, a bunch of USB4, and four PCI storage controllers.
        From that we can derive that if the chips were just TBDR GPUs and lacked everything else that makes them SOCs, they could get away with an even less bandwidth still.
        You act like the Apple M2 Ultra gives us top-notch performance, but this is not the case. Such a system costs you 7000-8000€,

        and any 2000-3000€ AMD system beats it easily in any benchmark.

        And to beat it in games you do not even need a Ryzen 7950X3D; a 7800X3D does the job for much cheaper.
        A 7900XTX beats the Apple M2 Ultra easily in any benchmark.

        Originally posted by Myownfriend View Post
        I didn't read anything beyond this point yet but I'm gonna tell you already that that doesn't work. That's software solution not a hardware solution so any software that doesn't use FSR3 would only use one chiplet. That's exactly the same limitation as SLI and Crossfire but now applied to a single card. I also don't see any way that FSR3 would help divide work between chiplets at all. The way that FSR3 and DLSS3 reduce work requires inherently serial techniques. For example:
        You do not see how it divides the work? FSR already upscales from 3 sources: one is the lower-resolution image, one is the temporal component, and one is the frame-generation AI. The chiplet design would plainly and simply be FSR 4.0 with a 4th source.
        You could just render different parts of the screen on different GPUs or chiplets, and the FSR engine would glue it together to make it look like one result.

        Originally posted by Myownfriend View Post
        You must first render the low resolution image before you can upscale it.
        In order to get motion vectors for the current frame, you must first compute the previous frame.
        In order to interpolate the in-between frame, the GPU first computer and upscale the last two frames to interpolate a frame between them.
        You could render different parts of the screen on different GPUs or chiplets and have the FSR4 engine glue them together.

        Originally posted by Myownfriend View Post
        If we assume a four chiplet GPU, I supposed it could try to render two of the frames at the same time on different chiplets, just like Alternate Frame Rendering. But if frame 1 takes longer to render than frame 2, it needs to wait on the frame 1 to be done in order to calculate motion vectors for frame 2. Now the middle frame can be interpolated but that's a less intensive process than rendering a single low-res image so it wouldn't make full use out of a third chiplet. More importantly, the first two chiplets are done with their work so there's not reason one of them couldn't do it. The only use I can think of for the last two chiplets is to compute two more future frames or have them do split frame rendering, but this is all really just taking ancient methods of multi-GPU rendering on IMRs, apply it to chiplets and adding temporal upscaling and interpolation.
        Look up benchmarks of Quad SLI and Quad Crossfire rendering. You might see decent boost across to cards but it's never 2x and sometimes you get lower frame rates than one card. As you add more cards, you get more diminishing returns and the chances getting a negative benefit over a single card increase.
        Now I'll read how you think it will work.
        Yes, you need temporal information. Time is linear, not parallel. This sounds like you're leaning toward dividing work with Alternate Frame Rendering.
        You're saying some parts of the screen. That's dividing work spatially. Did you not read what I said a few posts ago. IMRs take in un-transformed triangles and spit out pixels. An IMR has no idea where a triangle is going to render on screen before it's submitted because it's coordinates haven't been transformed to screen space coordinates via vertex shading. You can't just have different chiplets take different triangles, transform them, and then write them wherever they need to be on the screen because the order that depth buffer and color buffers are tested and updated matters. The depth buffer's two jobs are to help prevent fragment shading on pixels that won't contribute to the final image and solve the visibility problem AKA make sure that far triangles aren't rendering over nearer triangles. Only triangles that don't overlap can check and update the depth and color buffers in parallel. One way to insure that two chiplets aren't going to render over each other's output is to assign each chiplet a fourth of the screen and send each chiplet all the geometry. The chiplets would then apply vertex shading to each triangle and then render or reject triangles depending on whether they're in that chiplets fourth or not. This is how split frame rendering works and it has the disadvantage of not scaling up vertex shading at all.
        I admit that you know much about this topic and I do not know enough to even talk with you in a meaningful way.
        At least I can say I did read everything you wrote, and I try to learn.
        But I can also say that your reference hardware for the ultimate solution, the Apple M2 Ultra dual-chiplet GPU, can be outgunned by 50% cheaper systems, just a 7800X3D + 7900XTX. This means to me that they did not find the right solution yet.
        Wake me up if an Apple M3 Ultra system beats any other system by a factor of 2...


        Originally posted by Myownfriend View Post
        What? No. That's never been the issue. The issue with multi-GPU scaling has always been a combination of a high bandwidth requirement for interchip communication and/or finding ways to evenly divide work between chips in a way to get linear scaling. The only artifacts were frame hitching from alternate frame rendering.
        To my knowledge, they fixed the problem that the outputs of different GPUs did not fit together by simply disabling any graphics effect that caused this problem...
        If you say "That's never been the issue", this is true, but only because they simply disabled any graphics effect that had this problem.
        And the end result was never the same. You have this problem if you want to use any graphics effect you would normally use on a single traditional GPU.

        Originally posted by Myownfriend View Post

        FSR3 and DLSS3+ create artifacts. They're interpolating data all over the place including whole frames where a character might just be missing a leg or something.
        They fill in the gaps but none of this is applicable to dividing the work among chiplets. You're don't understand what the issue is and hoping that deep-learning and AI is a miracle technology that's going to fix it somehow. It's not.
        I don't even know what you're trying to say here.
        I just want to say that I am sure future GPU chip designs will make FSR/DLSS technology mandatory,
        because people use it anyway to get better performance, so in the end this means you can make a chip design that cannot reach full performance without it.




        • #14
          Originally posted by qarium View Post

          sadly not the bacteria is resistant against the antibiotic and the necrosis of the infected area is going on and she now gets a second emergency operation.
          That sounds extremely scary. I hope things start taking a turn for the better soon.

          Originally posted by qarium View Post
          i can do this but most of the time its not worth the effort.
          It's always worth the effort to make your posts easier to understand. If you're spending a lot of time writing a post already, then it's better to make sure that others understand it, so that follow-up posts don't need to discuss misunderstandings and stuff.

          Originally posted by qarium View Post
          so now you believe the story about over 512mb ram with Kyro GPUs LOL...
          "this card was total shit​"
          now you say i am not allowed to say this if that 512mb problem was the only problem. wrong.
          it was not the only problem. because they where only on the market a very short period of time they did also drop the driver support and did not develop drivers for future operating systems like windows 2000 and XP.
          thats why all the Kyro GPU customers where doomed and forced to stay on the old OS or else buy new hardware.
          ok now you claim this is not true either... but fact is they did give up and thats why there where no new drivers for new Operating systems. thats a fact.

          "How does that make the card itself shit? That doesn't reflect on it's architecture, it would have to do with it's driver."

          see my second argument about drivers for future operating systems has also nothing to do with the cards architecture.
          i already said they failed for other reasons and the tile based rendering architecture was not the reason they failed.​
          You're doing that thing where you're playing with time again. You said you had the card for two days and you got rid of it because of an issue with its driver not supporting your configuration. I'm assuming this was when the card was new, and I don't think you reverted to Windows 98 to use the card. In other words, the state of its drivers in the future was not part of your decision to return the card.

          Also, I looked and there appear to be drivers for Windows 2000 and XP, so I don't see how they didn't support newer operating systems. As Yboy360 pointed out, it had Linux drivers, too.

          But also, considering the discussion that we're having, the architecture of the card is the most important part of it.

          Originally posted by qarium View Post
          but today because of the patents we can say for sure if this technology would be of any good people would use it because there are no patents to stop people to use it but AMD/Nvidia/Intel only use technology to mimic the effect of tiled based rendering.
          You're misunderstanding; they're not mimicking the effect. In hardware they're making use of data locality and caching to reduce bandwidth to some degree, but that's not getting the effect of tiled rendering. The same applies to the methods of rendering that I mentioned. Early Z takes an idea from PowerVR's rendering pipeline specifically, but it has nothing to do with tile-based rendering, and the fragment reads and writes still happen over the memory controller.

          Also, Intel literally used/uses tile-based rendering. I said that before.

          Originally posted by qarium View Post
          no you did explain to me that these "others" outside of Apple they do not really use Tiled based rendering instead the use other technologies to mimic the effect.

          thats the point the Patent is run out and AMD and Nvidia still do not use it. they only mimic the effect.
          I have a feeling you're skipping over large parts of my posts. Adreno and Mali do actually use tile-based rendering. They're not mimicking the effect.

          Whether or not AMD and Nvidia use it doesn't speak to its potential.

          Originally posted by qarium View Post
          ​this is the way Apple goes with their Apple M2 Ultra SOC/chiplet design yes...
          I just want to point out that you said in your last post that they don't use chiplets. You said that, if I was right that tile-based renderers can be split into chiplets well, then Apple would use chiplets. Now that you know that they are using chiplets, you're refusing to concede the point and acting like you knew they used chiplets already.

          Originally posted by qarium View Post
          ​but as soon as you make FSR3/DLSS mandatory​ this open up the possibility to implement a different style of chiplet design.
          I already explained what issues that would have.

          Originally posted by qarium View Post
          ​a FSR3/DLSS upscalling/Temporal/frame-generation engine can easily be modified​ to accept different render results from different chiplets into one output.
          FSR3 and DLSS already can theoretically accept different rendering results from different chiplets/GPUs. I even explained how it could divide that work up, but it wouldn't be doing anything new. It would just use old methods of multi-GPU scaling that already don't work well. There's nothing about the technology that fixes the issues with building an IMR-based GPU out of chiplets. The issue is that there isn't a very good way of dividing up the work that an IMR does.

          You're talking around the issue and you're not explaining anything. How would FSR3 and DLSS help break IMRs into chiplets?

          Originally posted by qarium View Post
          ​you already said such a design maybe would be incompatible with older games and engines because you need FSR3 then but honestly somethimes such a technology becomes the standard even if incompatible with older games because the benefits could be great.
          Sure. That's an option but you still haven't provided any real fix for how IMRs would scale up to using chiplets efficiently.

          Originally posted by qarium View Post
          you also talk about the bandwidth​ tese 2 apple m2 ultra socs need to realize a chiplet design and this bandwidth is high if you would use such a FSR3+ chiplet design the bandwidth you need to combine chiplets could be ZERO.
          How?

          Originally posted by qarium View Post
          ​this means apple can do 2 chiplets right now and in near future mayber 4 chiplets but this design does no scale to 100 chiplets
          Why?

          Originally posted by qarium View Post
          such a FSR3+chiplet design with ZERO bandwidth needed could scale to much higher ammount of chiplets.
          It would be impossible for any architecture with multiple chips to have zero inter-chiplet bandwidth unless they aren't communicating at all. That's just not how computers work.

          Originally posted by qarium View Post
          they maybe go tile-based for their chiplet design. maybe maybe but making FSR3/DLSS mandatory would be the smarter move i think. the bandwidth​ needed for such a solution would be much lower.
          FSR3 works on tile-based renderers, too.

          Again, FSR and DLSS are just upscaling algorithms using spatial and temporal data. They reduce bandwidth by rendering at a lower internal resolution and then re-using data from the previous frame to reconstruct a higher quality image. It's essentially what super resolution technologies have done for years, but in a more advanced way.

          Rendering the initial low-resolution image is still the most expensive part of the process and that's not being done any differently from how a GPU would render an image at native resolution.
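
          As a rough illustration of that last point (the resolutions and the assumption that a reconstruction tap is much cheaper than a shaded fragment are mine, not measured numbers), here's the pixel math for an upscaler rendering internally at 1440p and outputting 4K:

          Code:
          internal = (2560, 1440)  # assumed internal render resolution
          output = (3840, 2160)    # assumed output resolution

          internal_px = internal[0] * internal[1]
          output_px = output[0] * output[1]

          print(f"pixels shaded per frame:       {internal_px:,}")
          print(f"pixels in the output frame:    {output_px:,}")
          print(f"fraction of native pixel work: {internal_px / output_px:.0%}")
          # The upscale pass touches every output pixel, but each reconstruction tap is far
          # cheaper than fragment shading, so the low-res render still dominates the frame time.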

          Originally posted by qarium View Post
          the apple m2 ultra was the proof that you can not do TOP performance with it.
          the apple m2 ultra soc lose agaist a Nvidia RTX 4090 and also apple m2 ultra lose agaist a AMD 7900XTX.
          also the apple socs are not known for good raytracing performance.
          The M2 does not live in the same ecosystem as the RTX 4090 and 7900XTX. Any comparison between the two would be with different CPUs using different operating systems and different APIs with different degrees of maturity in terms of support. In other words, they're hard to really compare. While I do believe that Apple has overhyped the performance of their chips, if you were to compare performance between, let's say, the M2 Ultra and the 4090 in Blender, for example, we wouldn't just be comparing hardware, we'd also be comparing Blender's CUDA backend with the much newer Metal backend.

          Originally posted by qarium View Post
          please wake me up if any tile-based rendering product gives us TOP performance instead of second class performance.
          You're going off on a tangent again. You're getting back into your fanboy mindset. The conversation is about scaling across chips, and you're trying to make it about which company's GPU benchmarks the absolute fastest. In 3DMark, the M2 Ultra gets about 80% of the performance of the 7900XTX. The inverse of that is that the 7900XTX is 25% faster than the M2 Ultra. Are you really going to act like that's enough of a difference to explain away the difference in chiplet bandwidth requirements between the two?

          I'll repeat: the 7900XTX has a peak bandwidth of 5.3TB/s between its MCDs and GCD. That's just between the compute portion and its LLC + external memory. The M2 Ultra has an interconnect between both of its SOCs that's just 2.5TB/s. That's less than half the bandwidth for way more than half of the performance of the 7900XTX.

          The M2 Ultra's LPDDR5 memory provides it with a max bandwidth of 800GB/s. That's 16% less external bandwidth than the 7900XTX. We know each M2 chiplet provides half the memory controllers, so 400GB/s of that 2.5TB/s is just for scaling up the memory architecture. The remaining 2.1TB/s is for data exchange for everything else: 24 CPU cores, 76 GPU cores, caches, 32 Neural Engine cores, 2 Secure Enclaves, a few encoders, four display engines, a few image signal processors, 32 PCIe Gen 4 lanes, USB4, and maybe some other stuff.

          To put it another way, if you were to cut the 7900XTX GCD in half and got rid of half the MCDs, the peak bandwidth between the MCDs and the GCD would still be 6% greater than the peak bandwidth between both SOCs of the M2 Ultra, but it would perform way slower.
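
          Here is the arithmetic behind those figures laid out explicitly (the bandwidth numbers are the published peak figures already quoted above):

          Code:
          xtx_mcd_to_gcd = 5.3   # TB/s, cumulative MCD<->GCD bandwidth on the 7900XTX
          xtx_memory = 0.960     # TB/s, 7900XTX GDDR6 bandwidth
          m2u_link = 2.5         # TB/s, UltraFusion link between the two M2 Max dies
          m2u_memory = 0.800     # TB/s, M2 Ultra LPDDR5 bandwidth

          print(f"M2 Ultra external bandwidth deficit: {1 - m2u_memory / xtx_memory:.1%}")
          # Half the memory controllers sit on the far die, so that traffic crosses the link.
          print(f"UltraFusion left for everything else: {m2u_link - m2u_memory / 2:.1f} TB/s")
          print(f"Half the 7900XTX's MCD bandwidth vs the whole UltraFusion link: "
                f"{(xtx_mcd_to_gcd / 2) / m2u_link - 1:.0%} greater")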

          Originally posted by qarium View Post
          a apple m2 ultra costs you 7000-8000€ if you only build a 3000€ AMD system it beats apple in any benchmark.

          and this only because the apple m2 ultra GPU can not compete.
          Now you're talking about the price that Apple chooses to charge for their hardware instead of the topic at hand. Apple has always charged more for its hardware, even when it was using lower-end AMD GPUs. That's not representative of the cost to produce their GPUs, so it's not relevant to what we're discussing.

          Originally posted by qarium View Post
          thank you for your reminder i did already know that M2 ultra is 2 chips but i did honestly not remember or thought about this.
          So you're saying you knew the M2 Ultra was two chips, but in a topic about scaling between chiplets, you "forgot" and said it was one larger chip.

          Originally posted by qarium View Post
          ​but on the other hand it has near zero relevance because M2 ultra lose agaist 4090 and also lose agaist a 7900XTX
          The fact that it's two chiplets is the most relevant part. This is a discussion about GPUs being able to scale with chiplets. If you remember, the entire conversation started because I said that I thought Imagination's GPUs had a lot of potential to shake up the market, because their design can scale with chiplets, which allows for producing both large and small GPUs with very high yields and low cost.

          Now you're like "Well the M2 Ultra doesn't matter because it's not as powerful as the 4090 and 7900XTX", but that wasn't the point, because I'm not trying to talk about specific GPUs. The only reason the M2 Ultra came up is that it's an example of chiplet scaling with a TBR and because you're incapable of talking about TBR and IMR in any detail at all.

          My whole point is that tile-based GPUs can divide their work among chiplets more easily than IMRs. So if a chiplet-based TBR GPU were created that's the same size as a monolithic IMR GPU (or even a chiplet-based IMR GPU with a very large compute die), the chiplet-based TBR GPU would be much cheaper to produce.

          So let's say we're comparing a 600mm² monolithic GPU versus one that's the same total size but composed of four chiplets. I picked 600mm² because the 4090's die is 608mm².

          On a 300mm wafer with a defect density of 0.1/sq.cm, you'd be able to produce 84 chips per wafer but only 47 good dies. With chiplets, you can produce 378 chips per wafer with 326 good dies. With those good dies, you could create 81 (technically 81.5) GPUs per wafer. If we say the wafers are $10,000 each, then each monolithic GPU die would cost $212.77 to produce while each chiplet-based GPU would cost only $122.70. That means the TBR GPU would be about 42% cheaper to produce than the IMR. Because the TBR would also rely less on external memory bandwidth, it could also get away with slower, cheaper VRAM. If the price of the chip translates directly to the price of the graphics card, then the IMR would have to be roughly 73% faster than the TBR to compete with its price/performance ratio.

          You can do the math yourself with a die yield calculator.
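
          For anyone who wants to reproduce that without a web calculator, here's a minimal sketch of the math. The gross die counts (84 and 378 per wafer) are taken from the calculator output above, and the yield uses Murphy's model, which is one common choice; other yield models will shift the numbers slightly.

          Code:
          from math import exp

          def murphy_yield(die_area_mm2, defects_per_cm2):
              # Murphy's yield model: Y = ((1 - e^(-A*D)) / (A*D))^2
              ad = (die_area_mm2 / 100.0) * defects_per_cm2
              return ((1.0 - exp(-ad)) / ad) ** 2

          WAFER_COST = 10_000.0
          D0 = 0.1  # defects per square centimeter

          # Monolithic 600mm2 die: 84 gross dies fit on a 300mm wafer.
          mono_good = round(84 * murphy_yield(600, D0))      # -> 47 good dies
          mono_cost = WAFER_COST / mono_good                  # -> ~$212.77 per GPU

          # Four 150mm2 chiplets per GPU: 378 gross dies fit per wafer.
          chiplet_good = round(378 * murphy_yield(150, D0))   # -> 326 good dies
          gpus_per_wafer = chiplet_good / 4                   # -> 81.5 GPUs
          chiplet_cost = WAFER_COST / gpus_per_wafer          # -> ~$122.70 per GPU

          print(f"monolithic: {mono_good} good dies, ${mono_cost:.2f} per GPU")
          print(f"chiplets: {chiplet_good} good dies, {gpus_per_wafer} GPUs, ${chiplet_cost:.2f} per GPU")
          print(f"the chiplet GPU is {1 - chiplet_cost / mono_cost:.1%} cheaper to fab")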

          Originally posted by qarium View Post
          so you can build a AMD based computer with a Ryzen 7950X3D and a 7900XTX and this is much cheaper and beats the highest apple product.

          so please wake me up at the moment when apple present a tile based rendering chiplet design who has higher performance.

          you act like apple m2 ultra gives us top notch performance but this is not the case such a system costs you 7000-8000€

          and any 2000-3000€ AMD system beats this system easily in any benchmarks.

          and to beat it in games you do not even need a rxzen 7950X3D a 7800X3D does the job for much cheaper.
          a 7900XTX beats the apple m2 ultra easily in any benchmark.​
          We're not talking about PCs versus Macs. You're acting like I'm trying to convince you to switch to a Mac. I'm not; I'm just using one of their chips as an example of what I'm talking about. Can you please drop the fanboy war arguments for a little bit? Stop spinning out into these arguments when you don't feel comfortable in your ability to engage with the actual topic.

          Also, Apple's TBR GPU IS high performance, it's just not the world's fastest GPU. It is the world's fastest consumer GPU that scales its compute with chiplets, though.

          Originally posted by qarium View Post
          you do not see how it divide work ? FSR already upscale from 3 sources one is the lower resolution one source is the Temporal component and one source is the Frame generation AI the chiplet design would plain and simple FSR4.0 with now 4. source
          These aren't three sources. Those are three steps. The temporal component is the motion vector buffer, which is created when rendering the low-resolution image. It can be computed at the same time as the low-resolution image, but that would require each GPU to transform the triangles yet again, and it doesn't require anywhere near the same level of computation as rendering the rest of the buffers does. The frame generation step can't be done until the low-resolution image and the motion vectors are computed, and if there isn't data for the previous frame, then the last two steps can't be done at all. What you're describing are sequential steps, not ways to break the work up in parallel.

          Originally posted by qarium View Post
          ​you could just render different part of the screen on different gpus or chiplets and the FSR engine would glue it together to make it look like 1 result.

          you could render different parts of the screen on different gpus or chiplets and make the FSR4 engine to glue it together.
          I already explained this. Splitting the frame up is exactly how SLI and Crossfire have worked for years and it doesn't scale well. FSR4 isn't needed to "glue" anything together and that's not what it does anyway. FSR and DLSS are spatio-temporal upscaling and interpolation methods. They don't "stitch" anything together nor would they need to.

          The issue with multi-GPU/multi-chiplet scaling with IMRs isn't that they need to stitch together the end result. That takes up very little bandwidth and requires almost no computation.

          The issue, which I mentioned already, are that:

          1. IMRs get sent untransformed triangles and spit out pixels. These triangles need to be transformed into screen space before the GPU knows what part of the screen they go to, so the chiplets can't just be sent the triangles for a specific portion of the frame. The solution to this in split-frame rendering has been for each GPU to transform all the triangles but only produce pixels for the triangles that are in its segment of the screen. There isn't any scaling in the vertex shading in this method; each GPU is repeating work that the others are doing. The only scaling happens in the fragment shading stage.

          2. You can't just have each GPU render an evenly sized slice of the image and get linear scaling, because different sections of the screen take different amounts of time to render. The GPUs can't know ahead of time how to evenly distribute the work because they haven't rendered the current frame yet. All they can do is use frame times from the last frame to guess how the work should be distributed. If there's a lot going on on-screen, then the information from the last frame will frequently apply very poorly to the current frame and scaling will be bad. This guess frequently gets worse as you add more GPUs/chiplets. You can get far better distribution if you use smaller slices and give more slices to each GPU. That's the idea behind Nvidia's Checkered Frame Rendering SLI mode and is similar to tile-based rendering.

          3. All of the triangles in the frame could be transformed in parallel and spread evenly across the GPUs, but in order for the depth buffer to properly discard unneeded work and ensure that pixels aren't drawn in front of pixels they are supposed to be behind, overlapping triangles need to be drawn in order from front to back. That's a sequential process, not a parallel one, and it requires per-triangle communication between the chips to convey that order. Triangles can only be rendered in parallel if it's known ahead of time that they won't overlap.

          The entire reason why I said that tile-based renderers can scale well to chiplets is that they don't have any of these limitations. As I said before, the way that a TBR works internally is pretty much the same as a bunch of separate mini GPUs working on a bunch of separate little frames in parallel. It wouldn't need to work any differently if it were split into chiplets.
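
          A toy sketch of that binning idea with made-up screen-space triangles (real TBRs do this in hardware after vertex shading): once the triangles are sorted into per-tile lists, each tile is a self-contained job that any chiplet could pick up without talking to the others.

          Code:
          import random

          random.seed(1)
          TILE = 32                  # tile size in pixels
          WIDTH, HEIGHT = 256, 256   # tiny framebuffer for the example
          tiles_x, tiles_y = WIDTH // TILE, HEIGHT // TILE

          # Fake screen-space triangles represented by their bounding boxes (x0, y0, x1, y1).
          def rand_bbox():
              x, y = random.uniform(0, WIDTH - 40), random.uniform(0, HEIGHT - 40)
              return (x, y, x + random.uniform(4, 40), y + random.uniform(4, 40))

          triangles = [rand_bbox() for _ in range(500)]

          # Binning pass: append each triangle to every tile its bounding box touches.
          bins = {(tx, ty): [] for tx in range(tiles_x) for ty in range(tiles_y)}
          for tri_id, (x0, y0, x1, y1) in enumerate(triangles):
              for ty in range(int(y0) // TILE, min(int(y1) // TILE, tiles_y - 1) + 1):
                  for tx in range(int(x0) // TILE, min(int(x1) // TILE, tiles_x - 1) + 1):
                      bins[(tx, ty)].append(tri_id)

          # Hand the tiles out round-robin: each tile is shaded entirely on one chiplet.
          NUM_CHIPLETS = 4
          work = [0] * NUM_CHIPLETS
          for i, tri_list in enumerate(bins.values()):
              work[i % NUM_CHIPLETS] += len(tri_list)

          print(f"{len(triangles)} triangles binned into {len(bins)} tiles")
          print("per-tile triangle work handed to each chiplet:", work)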

          Originally posted by qarium View Post

          I admit that you know much about this topic and i do not know enough to even talk with you in a meaningfull way.
          at last i can say i did read anything you write and i try to learn.
          I can tell. That's why I don't understand why you're so insistent on arguing with me when you don't know what I'm saying. You could have just asked me questions instead of trying to argue something you don't know. Why suggest that DLSS and FSR could help IMRs scale between chiplets when you don't know about that either? You could have asked "Could FSR and DLSS help distribute work between GPU chiplets?"

          Originally posted by qarium View Post
          ​but i can also say that your reference hardware as the ultimative solution the apple m2 ultra dual-chiplet gpu can be outgunned by 50% cheaper systems just a 7800X3D+7900XTX this means to me they did not find the right solution yet.
          wake me up if apple M3 ultra system beats any other system by factor 2....
          Again, you're acting like the price that Apple charges for their hardware is in some way related to their GPUs being tile-based. As I said before, Apple was overcharging for their computers long before they started shipping laptops and desktops with Apple silicon. The actual chip is a fraction of the cost of Apple's Macs.

          Originally posted by qarium View Post
          to my knowlege they fixed the problem that the different versions of different gpus did not fit together by simple disable any graphic effect who caused this problem...
          if you say "That's never been the issue." this is true but only because they simple did disable any graphic effect who had this problem.
          and the end-result was never the same. you have this problem if you want to use any graphic effect you normaly use on a single traditional gpu.
          That's not how modern GPUs work. They don't just have a bunch of graphics effects that they toggle on and off. Modern GPUs are many-core processors meant to process large data sets in parallel, with some fixed-function hardware. All graphics "effects" are done through shaders.
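
          To illustrate what "effects are just shaders" means, here's a trivial CPU-side stand-in for a fragment shader (a real one would be GLSL/HLSL running across thousands of GPU lanes; this is just the same idea in Python). The "effect" is nothing more than a function evaluated independently for every pixel, not a fixed block of hardware you switch on or off.

          Code:
          # A toy "sepia" post-process effect written as a per-pixel function.
          def sepia(pixel):
              r, g, b = pixel
              return (min(255, int(0.393 * r + 0.769 * g + 0.189 * b)),
                      min(255, int(0.349 * r + 0.686 * g + 0.168 * b)),
                      min(255, int(0.272 * r + 0.534 * g + 0.131 * b)))

          # A tiny 4x2 "framebuffer" of RGB pixels. On a GPU, every pixel runs in parallel.
          framebuffer = [[(200, 120, 80)] * 4 for _ in range(2)]
          output = [[sepia(p) for p in row] for row in framebuffer]
          print(output[0][0])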

          Originally posted by qarium View Post
          i just want to say i am sure that future gpu chip design will make FSR/DLSS technology mandatory​.
          because the people use it anyway to get better performance so in the end this means you can make chipdesign who can not run without it to get full performance.
          That's not going to happen. There's no interest in making those features mandatory, there's no real way to make them mandatory, and there's no reason to make them mandatory. It's very clear that you don't know how FSR, DLSS, or GPUs work.
          Last edited by Myownfriend; 22 August 2023, 10:04 PM.



          • #15
            Originally posted by Myownfriend View Post
            That sounds extremely scary. I hope things start taking a turn for the better soon.
            It's always worth the effect to make your posts easier to understand. If you're spending a lot of time writing a post already then it's better to make sure that others understand it so that follow up posts don't need to discuss misunderstandings and stuff.
            You're doing that thing where you're playing with time again. You said you had the card for two days and you got rid of it because of an issue with it's driver not supporting your configuration. I'm assuming this was when the card was new and I don't think you reverted to Windows 98 to use the card. in other words, the state of it's drivers in the future was not part of your decision to return the card.
            Also I looked and there appear to be drivers for Windows 2000 and XP, so I don't see how they didn't support newer operating systems. As Yboy360 pointed out, it also had Linux drivers, too.
            This broken link (I fixed it) is from 2004, which means it was years too late. Windows 2000 came out in the year 2000,
            so the Windows 2000 driver was 4 years too late.
            Windows XP came out in 2001, so that driver was 3 years too late.
            And it is also not a proper driver, because "these drivers have not been WHQL certified".

            That's exactly my point: Kyro and Kyro 2 hardware owners were forced to stay on Windows 98 and Windows ME for 3-4 years, and even then the driver they got was not a real Windows driver because it was never WHQL certified...

            Originally posted by Myownfriend View Post
            But also, considering the discussion that we're having, the architecture of the card is the most important part of it.

            Right. Outside of the driver problems and memory problems it was an interesting architecture, but the patents have run out and people still don't use it.

            Originally posted by Myownfriend View Post

            You're misunderstanding, they're not mimicking the effect. In hardware they're making use of data locality and caching to reduce bandwidth to some degree, but it's not getting the effect of tiled rendering. The same applies to the methods of rendering that I mentioned. Early Z takes an idea from PowerVR's rendering pipeline specifically, but it has nothing to do with tile-based rendering, and the fragment reads and writes are still happening over the memory controller.
            Then it's a mystery that the reference hardware for tile-based rendering does not outperform AMD/Nvidia...

            The joke about the Apple M2 Ultra is that the GPU is not faster, it's slower, even with a GPU-chiplet design.

            Originally posted by Myownfriend View Post

            Also, Intel literally used/uses tile-based rendering. I said that before.
            Same problem here: Intel performs like shit.

            Originally posted by Myownfriend View Post

            I have a feeling you're skipping over large parts of my posts. Adreno and Mali do actually use tile-based rendering. They're not mimicking the effect.
            Whether or not AMD and Nvidia use it doesn't speak to its potential.
            I do not intentionally skip parts, I plain and simple do not have the time to go into every detail.
            OK, Adreno and Mali use the magic bullet, but Samsung, who use RDNA1/2 in their SOCs, outperform any Mali and Adreno GPU.
            For example, the Samsung Exynos 2200 uses RDNA2:
            "Samsung has announced an extension of its cooperation with the graphics specialist AMD. This clears the way to keep using AMD's RDNA graphics units in Samsung's Exynos chips for smartphones in the future. What exactly Samsung plans to do with the AMD GPUs, however, is still open."


            So what is the point of tile-based rendering in Mali and Adreno if RDNA2 is better anyway?

            Originally posted by Myownfriend View Post

            I just want to point out that you said in your last post that they don't use chiplets. You said that, if I was right that tile-based renderers can be split into chiplets well, then Apple would use chiplets. Now that you know that they are using chiplets, you're refusing to concede the point and acting like you knew they used chiplets already.
            Right, you are right, I just forgot that the M2 Ultra is a chiplet design, but this example is irrelevant because they lose all benchmarks.

            "and acting like you knew they used chiplets already"

            No, I am not fooling you, I really did know it but I just forgot it, honestly... but it has zero relevance because this design does not give you high performance, and I am not interested in low-performance computing.
            Just get a 7900XTX and enjoy high performance at a low price.

            Originally posted by Myownfriend View Post

            I already explained what issues that would have.
            FSR3 and DLSS already can theoretically accept different rendering results from different chiplets/GPUs. I even explained how it could divide that work up, but it wouldn't be doing anything new. It would just use old methods of multi-GPU scaling that already don't work well. There's nothing about the technology that fixes the issues with creating an IMR-based GPU with chiplets. The issue is that there isn't a very good way of dividing up the work that an IMR does.
            You're talking around the issue and you're not explaining anything. How would FSR3 and DLSS help break IMRs into chiplets?
            I honestly don't know. I will try to find out about it.

            Originally posted by Myownfriend View Post

            Sure. That's an option but you still haven't provided any real fix for how IMRs would scale up to using chiplets efficiently.
            How?
            Why?
            It would be impossible for any architecture with multiple chips to have zero inter-chiplet bandwidth unless they aren't communicating at all. That's just not how computers work.
            FSR3 works on tile-based renderers, too.
            Again, FSR and DLSS are just upscaling algorithms using spatial and temporal data. They reduce bandwidth by rendering at a lower internal resolution and then re-using data from the previous frame to reconstruct a higher-quality image. It's essentially what super-resolution technologies have done for years, but in a more advanced way.
            Rendering the initial low-resolution image is still the most expensive part of the process and that's not being done any differently from how a GPU would render an image at native resolution.
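            (For a rough sense of where those savings come from, here's a minimal sketch in Python. The 1.5x-per-axis scale factor is just an assumed example of a typical "quality" upscaling mode, not an exact FSR/DLSS specification.)

# Rough sketch: how many pixels get fully shaded when upscaling from a lower
# internal resolution to a 4K output. The 1.5x per-axis factor is an assumed
# example, not an exact FSR/DLSS spec.
output_w, output_h = 3840, 2160
scale = 1.5  # assumed per-axis upscaling factor

internal_w, internal_h = int(output_w / scale), int(output_h / scale)
native_pixels = output_w * output_h
internal_pixels = internal_w * internal_h

print(f"native:   {native_pixels:,} pixels")
print(f"internal: {internal_pixels:,} pixels "
      f"({internal_pixels / native_pixels:.0%} of native)")
# The expensive rendering still happens exactly like normal rendering, just on
# fewer pixels; the upscale/reconstruction pass is comparatively cheap.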
            I have a basic understanding of how upscaling works.

            "How?
            Why?"

            I honestly don't know.

            Originally posted by Myownfriend View Post

            The M2 does not live in the same ecosystem as the RTX 4090 and 7900XTX. Any comparisons between the two would be with different CPUs using different operating systems and different APIs with different degrees of maturity in terms of support. In other words, they're hard to really compare. While I do believe that Apple have overhyped the performance of their chips, if you were to compare performance between, let's say, the M2 Ultra and the 4090 in Blender, for example, we wouldn't just be comparing hardware, we'd also be comparing Blender's CUDA backend with the much newer Metal backend.
            Right, but who cares? The last system I built as a Blender workstation was with an AMD PRO W7900.

            The M2 Ultra is expensive and gives you a slower system, and you wonder why tile-based rendering ran out of patent time and people don't care.

            Originally posted by Myownfriend View Post
            ​​
            You're going off on a tangent again. You're getting back into your fanboy mindset. The conversation is about scaling across chips and you're trying to make it about which company's GPU benchmarks the absolute fastest. In 3DMark, the M2 Ultra gets about 80% of the performance of the 7900XTX. The inverse of that is that the 7900XTX is 25% faster than the M2 Ultra. Are you really going to act like that's enough of a difference to explain away the difference in chiplet bandwidth requirements between the two?
            "The conversation is about scaling across chips"
            this does only matter of the performance is as fast (or much cheaper) or faster.
            apple is slower and more expensive no one with a brain buy this.

            do you really think people are interested in scaling across chips if the result is slower and more expensive ?

            "the M2 Ultra gets about 80% of the performance of the 7900XTX."

            then you add that the 7900XTX is 1000€ and the M2 Ultra is 8000€ and as soon as you say so no one wants apple anymore.

            "the 7900XTX is 25% faster than M2 Ultra."

            you get 25% higher performance with much cheaper price.

            "Are you really going to act like that's enough of a difference to explain away the difference in chiplet bandwidth requirements between the two?"

            i say chiplet design only matters if the performance is high. if the performance is low no one cares.

            Originally posted by Myownfriend View Post
            ​​
            I'll repeat: the 7900XTX has a peak bandwidth of 5.3TB/s between its MCDs and GCD. That's just between the compute portion and its LLC + external memory. The M2 Ultra has an interconnect between both of its SOCs that's just 2.5TB/s. That's less than half the bandwidth for way more than half of the performance of the 7900XTX.
            The M2 Ultra's LPDDR5 memory provides it with a max bandwidth of 800GB/s. That's 16% less external bandwidth than the 7900XTX. We know each M2 chiplet provides half the memory controllers, so 400GB/s of that 2.5TB/s is just for scaling up the memory architecture. The remaining 2.1TB/s is for data exchange for everything else: 24 CPU cores, 76 GPU cores, caches, 32 neural engine cores, two Secure Enclaves, a few encoders, four display engines, a few image signal processors, 32 PCIe Gen 4 lanes, USB4, and maybe some other stuff.
            To put it another way, if you were to cut the 7900XTX GCD in half and got rid of half the MCDs, the peak bandwidth between the MCDs and the GCD would still be 6% greater than the peak bandwidth between the M2 Ultra's two SOCs, but it would perform way slower.
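            (A quick sketch of the arithmetic behind those figures. The 2.5TB/s, 800GB/s, and 5.3TB/s numbers are the ones quoted above; the 960GB/s GDDR6 figure is the 7900XTX's standard memory bandwidth spec.)

# Quick arithmetic on the bandwidth figures above (a sketch; 960 GB/s is the
# 7900XTX's GDDR6 spec, everything else comes from the numbers in the thread).
m2_interconnect = 2.5e12   # UltraFusion die-to-die bandwidth, bytes/s
m2_external     = 800e9    # M2 Ultra LPDDR5 bandwidth, bytes/s
xtx_external    = 960e9    # 7900XTX GDDR6 bandwidth, bytes/s
xtx_gcd_mcd     = 5.3e12   # 7900XTX GCD<->MCD peak bandwidth, bytes/s

print(f"M2 Ultra external vs 7900XTX: {1 - m2_external / xtx_external:.1%} less")

# Half of the external memory controllers sit on each M2 die, so roughly half
# of the external bandwidth crosses the interconnect just to scale the memory
# system; the remainder is left for everything else.
crossing_for_memory = m2_external / 2
print(f"interconnect left over: {(m2_interconnect - crossing_for_memory) / 1e12:.1f} TB/s")
print(f"half of the 7900XTX fabric: {xtx_gcd_mcd / 2 / 1e12:.2f} TB/s")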
            OK, I get it, tile-based rendering is the future and Apple is awesome: you only have to pay quadruple the price and accept 25% less performance, but you get the good feeling of Apple advertisement.

            Originally posted by Myownfriend View Post
            ​​​
            Now you're talking about the price that Apple chooses to charge for their hardware instead of the topic at hand. Apple has always charged more for its hardware, even when it was using lower-end AMD GPUs. That's not representative of the cost to produce their GPUs, so it's not relevant to what we're discussing.

            You can't separate the topics. If the price is high and the performance is low, then the result is people don't want it and only stupid people buy it.
            Of course it is relevant. But yes, I get it, tile-based rendering is the future, even if everyone who uses it is completely slow shit, like Intel who loses every benchmark, or Apple who is still slower than cheap hardware.

            Originally posted by Myownfriend View Post
            ​​​
            So you're saying you knew the M2 Ultra was two chips, but in a topic about scaling between chiplets, you "forgot" and said it was one larger chip.
            The fact that it's two chiplets is the most relevant part. This is a discussion about GPUs being able to scale with chiplets. If you remember, the entire conversation started because I said that I thought Imagination's GPUs had a lot of potential to shake up the market because their design can scale with chiplets, which allows for producing both large and smaller GPUs with very high yields and low cost.
            You say it's low cost, but the result for the consumer is a super high price... yes, yes, of course.

            "This is a discussion about GPUs being able to scale with chiplets"
            Who cares if this is the topic if the result is irrelevant? Some stupid people buy overpriced Apple stuff, I get it.

            Like the Adreno and Mali example, where they lose the benchmarks against the RDNA2 Samsung chips.

            AMD and Nvidia would do it if the patents ran out and the result was better.

            Originally posted by Myownfriend View Post
            ​​​​
            Now you're like "Well the M2 Ultra doesn't matter because it's not as powerful as the 4090 and 7900XTX", but that wasn't the point, because I'm not trying to talk about specific GPUs. The only reason the M2 Ultra came up is because it's an example of chiplet scaling with a TBR and because you're incapable of talking about TBR and IMR in any detail at all.
            OK, fine, I'll wait until tile-based rendering is faster and cheaper.

            Originally posted by Myownfriend View Post
            ​​​​
            My whole point is that tile-based GPUs can more easily divide their work among chiplets than IMRs. So if a chiplet-based TBR GPU were created that's the same size as a monolithic IMR GPU (or even a chiplet-based IMR GPU with a very large compute die), the chiplet-based TBR GPU would be much cheaper to produce.
            Nice, I'll wait for that, I'll wait for a cheaper and faster product... but right now that is not the case.

            Originally posted by Myownfriend View Post
            ​​​​
            So let's say we're comparing a 600mm² monolithic GPU versus one that's the same size but is composed of four chiplets. I picked 600mm² because the 4090's die is 608mm².
            On a 300mm wafer with a defect density of 0.1/sq.cm, you'd be able to produce 84 chips per wafer but only 47 good dies. With chiplets, you can produce 378 chips per wafer with 326 good dies. With those good dies, you could create 81 (technically 81.5) GPUs per wafer. If we say the wafers are $10,000 each, then each monolithic GPU die would cost $212.77 to produce while each chiplet-based GPU would cost only $122.69. That means the TBR GPU would be 42.5% cheaper to produce than the IMR. Because the TBR would also rely less on external memory bandwidth, it could also get away with using slower VRAM. If the price of the chip translates directly to the price of the graphics card, then the IMR would have to be 75% faster than the TBR to compete with its price/performance ratio.
            You can do the math yourself with a die yield calculator.
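            (For anyone who wants to redo that math, here's a minimal sketch. It takes the gross die counts per wafer from the post above as given and applies Murphy's yield model, which many die-yield calculators use, so the results land very close to, but not exactly on, the quoted figures.)

# Sketch of the yield/cost math above. The gross die counts per 300mm wafer
# (84 and 378) are taken from the post; the yield model is Murphy's model.
from math import exp

def murphy_yield(area_mm2, defects_per_cm2):
    ad = (area_mm2 / 100.0) * defects_per_cm2   # expected defects per die
    return ((1.0 - exp(-ad)) / ad) ** 2         # Murphy's yield model

wafer_cost = 10_000.0
defect_density = 0.1                            # defects per sq. cm

# Monolithic 600mm^2 die vs four 150mm^2 chiplets
mono_good    = round(84  * murphy_yield(600, defect_density))
chiplet_good = round(378 * murphy_yield(150, defect_density))

mono_cost    = wafer_cost / mono_good           # one die per GPU
chiplet_cost = wafer_cost / (chiplet_good / 4)  # four chiplets per GPU

print(f"good monolithic dies per wafer: {mono_good}, cost per GPU: ${mono_cost:.2f}")
print(f"good chiplets per wafer: {chiplet_good}, cost per GPU: ${chiplet_cost:.2f}")
print(f"chiplet GPU is {1 - chiplet_cost / mono_cost:.1%} cheaper to produce")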
            Yes, honestly, this sounds really good. Honestly. But can I buy it cheaper and faster? No, I can't, so this is really only hypothetical.

            If I can choose between a Mali or Adreno or RDNA2 ARM SOC? I would choose the RDNA2 one...
            This means tile-based GPUs lose the real competition...

            Originally posted by Myownfriend View Post
            ​​​​​
            We're not talking about PCs versus Macs. You're acting like I'm trying to convince you to switch to a Mac. I'm not, I'm just using one of their chips as an example of what I'm talking about. Can you please drop the fanboy war arguments for a little bit? Stop spinning out into these arguments when you don't feel comfortable in your ability to engage with the actual topic.
            Also, Apple's TBR GPU IS high performance, it's just not the world's fastest GPU. It is the world's fastest consumer GPU that scales its compute with chiplets, though.
            It has nothing to do with fanboy war arguments, I just try to measure your arguments against the real world.
            Then I see high prices and lower performance and I really wonder what's going on.

            Originally posted by Myownfriend View Post
            ​​​​​
            These aren't three sources. Those are three steps. The temporal component is the motion vector buffer, which is created when rendering the low-resolution image. It can be computed at the same time as the low-resolution image, but that would require each GPU to transform the triangles yet again, and it doesn't require anywhere near the same level of computation as rendering the rest of the buffers. The frame generation step can't be done until the low-resolution image and the motion vectors are computed and, if there isn't data for the previous frame, then the last two steps can't be done at all. What you're describing are sequential steps, not ways to break the work up in parallel.
            OK, I've read what you wrote.

            Originally posted by Myownfriend View Post
            ​​​​​
            I already explained this. Splitting the frame up is exactly how SLI and Crossfire have worked for years and it doesn't scale well. FSR4 isn't needed to "glue" anything together and that's not what it does anyway. FSR and DLSS are spatio-temporal upscaling and interpolation methods. They don't "stitch" anything together nor would they need to.
            The issue with multi-GPU/multi-chiplet scaling with IMRs isn't that they need to stitch together the end result. That takes up very little bandwidth and requires almost no computation.
            The issues, which I mentioned already, are that:
            1. IMRs get sent pre-transformed triangles and spit out pixels. These triangles need to be transformed into screen space before the GPU knows what part of the screen they go to, so the chiplets can't just be sent the triangles for a specific portion of the frame. The solution to this in split-frame rendering has been for each GPU to transform all the triangles but only produce pixels for the triangles that are in its segment of the screen. There isn't any scaling in the vertex shading in this method. Each GPU is repeating work that the others are doing. The only scaling happens in the fragment shading stage.
            2. You can't just have each GPU render an evenly sized slice of the image and get linear scaling, because different sections of the screen take a different amount of time to render. The GPUs can't know ahead of time how to evenly distribute the work because they haven't rendered the current frame yet. All they can do is use frame times from the last frame to guess how the work should be distributed. If there's a lot going on on-screen, then the information from the last frame will frequently apply very poorly to the current frame and scaling will be bad. This guess frequently gets worse as you add more GPUs/chiplets. You can get far better distribution if you use smaller slices and give more slices to each GPU. That's the idea behind Nvidia's Checkered Frame Rendering SLI mode and is similar to tile-based rendering (there's a toy sketch of this below).
            3. All of the triangles in the frame could be transformed in parallel and spread evenly across the GPUs, but in order for the depth buffer to properly discard unneeded work and ensure that pixels aren't drawn in front of pixels that they are supposed to be behind, overlapping triangles need to be drawn in order from front to back. That's a sequential process, not a parallel one, and it requires communication between the chips per triangle in order to convey that order. Triangles can only be rendered in parallel if it's known ahead of time that they won't overlap.
            The entire reason why I said that tile-based renderers can scale well to chiplets is because they don't have any of these limitations. As I said before, the way that a TBR works internally is pretty much the same as a bunch of separate mini GPUs working on a bunch of separate little frames in parallel. It wouldn't need to work any differently if it were split into chiplets.
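            (A toy illustration of points 1 and 2 above, with made-up workload numbers; it is not a model of any real driver, just a way to see why split-frame rendering scales poorly.)

# Toy illustration of split-frame rendering across two GPUs. The millisecond
# costs and the 30/70 split are made-up example numbers.
vertex_work   = 2.0   # ms to transform ALL triangles (repeated on every GPU)
fragment_work = 10.0  # ms of fragment shading for the whole frame on one GPU

# A single GPU renders everything itself.
single_gpu = vertex_work + fragment_work

# Two GPUs split the frame in half, but the work isn't evenly distributed
# across the screen: say the bottom half holds 70% of the fragment cost.
split = [0.30, 0.70]
two_gpus = max(vertex_work + fragment_work * s for s in split)

print(f"1 GPU:  {single_gpu:.1f} ms")
print(f"2 GPUs: {two_gpus:.1f} ms  (speedup {single_gpu / two_gpus:.2f}x, ideal 2x)")
# Vertex work doesn't scale at all, and the slower slice sets the frame time,
# so the speedup sits well below 2x even before any inter-GPU communication.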
            OK, I get it, that's the future, but we have yet to wait for any useful product from this future.

            Originally posted by Myownfriend View Post
            ​​​​​
            I can tell. That's why I don't understand why you're so insistent on arguing with me when you don't know what I'm saying. You could have just asked me questions instead of trying to argue something you don't know. Why suggest that DLSS and FSR could help IMRs scale between chiplets when you don't know about that either? You could have asked "Could FSR and DLSS help distribute work between GPU chiplets?"
            Right, but does it make any difference? At least I try to figure out what you are talking about,
            https://en.wikipedia.org/wiki/Immedi...puter_graphics)
            reading on Wikipedia for example.

            Originally posted by Myownfriend View Post
            ​​​​​
            Again, you're acting like the price that Apple charges for their hardware is in some way related to their GPUs being tile-based. As I said before, Apple was overcharging for their computers long before they started shipping laptops and desktops with Apple silicon. The actual chip is a fraction of the cost of Apple's Macs.
            Right. The problem is Apple is shit as hell... tile-based rendering is maybe not shit, but as soon as Apple uses it, it becomes shit.

            Originally posted by Myownfriend View Post
            ​​​​​​
            That's not how modern GPUs work. They don't just have a bunch of graphics effects that they toggle on and off. Modern GPUs are many-core processors meant to process large data sets in parallel with some fixed-function hardware. All graphics "effects" are done through shaders.
            That's not going to happen. There's no interest in making those features mandatory, there's no real way to make them mandatory, and there's no reason to make them mandatory. It's very clear that you don't know how FSR, DLSS, or GPUs work.
            OK, I'll try to read more about it.
            Phantom circuit Sequence Reducer Dyslexia

            Comment


            • #16
              Originally posted by Myownfriend View Post
              IMR-based GPU
              I tried to read the Wikipedia page about IMR... and it did not help.

              Can you suggest websites or links or books about this topic?
              Phantom circuit Sequence Reducer Dyslexia

              Comment


              • #17
                The comment section is now two bots posting progressively longer replies with a progressively lower SNR.

                Comment


                • #18
                  Originally posted by qarium View Post
                  Then it's a mystery that the reference hardware for tile-based rendering does not outperform AMD/Nvidia...

                  The joke about the Apple M2 Ultra is that the GPU is not faster, it's slower, even with a GPU-chiplet design.

                  Same problem here: Intel performs like shit.
                  Not all IMRs and TBRs are the same speed. I could compare Apple's M2 Ultra to a lower-end IMR and say that IMRs are slow, but that would be disingenuous. You don't seem to know how to properly compare things. Even when you compared Intel's and AMD's GPUs, you decided to compare the A770 to the 7900XTX and disregarded the fact that the A770 is priced more similarly to a 7600. In your mind, it was a fair comparison to compare the A770 and the 7900XTX because they're each their company's respective flagship, but that doesn't make much sense considering Intel's GPUs weren't trying to compete on the high end.

                  AMD is intending to bow out of the high-end market for RDNA4, so Nvidia's high-end cards will be much faster than AMD's highest-end RDNA4 cards. Does that mean that RDNA4 is awful? No, because peak performance isn't the only metric that people judge chips on. Price to performance and performance per watt are far more important metrics, and those are the kinds of things that should be compared.

                  Hell, the RTX 4090 generally outperforms the 7900XTX, but the latter costs $600 less, so there are still people who buy them.

                  Originally posted by qarium View Post
                  I do not intentionally skip parts, I plain and simple do not have the time to go into every detail.
                  OK, Adreno and Mali use the magic bullet, but Samsung, who use RDNA1/2 in their SOCs, outperform any Mali and Adreno GPU.
                  For example, the Samsung Exynos 2200 uses RDNA2:
                  "Samsung has announced an extension of its cooperation with the graphics specialist AMD. This clears the way to keep using AMD's RDNA graphics units in Samsung's Exynos chips for smartphones in the future. What exactly Samsung plans to do with the AMD GPUs, however, is still open."


                  So what is the point of tile-based rendering in Mali and Adreno if RDNA2 is better anyway?
                  The most recent Samsung phone that ships with the Xclipse 920 (RDNA2) GPU is the Galaxy S22, which also had a version with a Qualcomm chip.

                  https://gfxbench.com/compare.jsp?ben...044584&os1=Android&api1=gl&hwtype1=GPU&hwname1=Qualcomm+Adreno+%28TM%29+730&did2=107047252&os2=Android&api2=gl&hwtype2=GPU&hwname2=Samsung+Electronics+Co.%2C+Ltd.+ANGLE+%28Samsung+Xclipse+920%29+on+Vulkan+1.1.179

                  This benchmark shows Adreno 730 beating the Xclipse 920 by decent margins.

                  Here's how that Adreno compares to Apple's GPU in the iPhone 14 Pro Max.

                  https://gfxbench.com/compare.jsp?ben...044584&os1=Android&api1=gl&hwtype1=GPU&hwname1=Qualcomm+Adreno+%28TM%29+730&did2=109254608&os2=iOS&api2=metal&hwtype2=iGPU&hwname2=Apple+A16+GPU

                  This review also shows Adreno beating the Xclipse 920 in 3DMark.
                  Which Samsung Galaxy S22 chipset is the most powerful? Here's the Snapdragon 8 Gen 1 vs Exynos 2200 benchmarked.


                  Also the Galaxy S23 has come out since then and it's shipping with a Qualcomm chip world-wide.

                  Originally posted by qarium View Post
                  So what is the point of tile-based rendering in Mali and Adreno if RDNA2 is better anyway?


                  Why do you keep on assuming that, if a TBR underperforms an IMR, it must be because it's a TBR? You've already established that you really don't understand how an IMR or TBR works, so why do you insist on drawing conclusions like this when you know you're not informed on these things?

                  In this case you weren't even correct that the RDNA2 GPU outperformed the Adreno.

                  Originally posted by qarium View Post
                  Right, you are right, I just forgot that the M2 Ultra is a chiplet design, but this example is irrelevant because they lose all benchmarks.

                  "and acting like you knew they used chiplets already"

                  No, I am not fooling you, I really did know it but I just forgot it, honestly... but it has zero relevance because this design does not give you high performance, and I am not interested in low-performance computing.
                  Whether or not you have any interest in low-performance computing doesn't matter. The entire reason that we've talked so much about the M2 Ultra is because it's a tile-based GPU that scales its compute across chiplets. I said that TBDRs should scale pretty well with chiplets because they try to keep the majority of their bandwidth on chip and minimize off-chip bandwidth. You claimed that that must not be true, because otherwise Apple would have done that instead of making one large SOC for the M2 Ultra.

                  As you pointed out, you "forgot" that they did actually use two chiplets instead of one large chip. After I brought that up, you chose not to drop it. Instead you doubled down on TBRs needing a lot of bandwidth, so I had to go into greater detail explaining why that wasn't true. Now you're saying that doesn't matter because you can only get those GPUs in expensive Macs and they're not as powerful as AMD's GPUs, etc.

                  I'm not trying to claim that Apple's Macs have the best price-to-performance ratio. This whole conversation started because I said that I believe Imagination's GPUs have the potential to stir up the market if they enter the desktop space again. I don't have any hope that Apple is going to make dedicated GPUs that work on PCs.

                  You keep trying to change the argument and forgetting why you brought stuff up in the first place. If you had just remembered/known that Apple does use chiplets in the M2 Ultra, you wouldn't have made the claim that TBRs don't save bandwidth, I wouldn't have had to correct you, you wouldn't have doubled down, and so on and so forth.

                  Did you forget what the discussion was about? You keep trying to "win" this argument by putting a lot of weight on inconsequential things. For example, your first response brought up GPUs from over 20 years ago.


                  Originally posted by qarium View Post
                  Just get a 7900XTX and enjoy high performance at a low price.
                  They aren't low-performance.


                  Originally posted by qarium View Post
                  I honestly don't know. I will try to find out about it.

                  I have a basic understanding of how upscaling works.

                  "How?
                  Why?"

                  I honestly don't know.
                  If you don't know about these things then why bring them up like you do?

                  Originally posted by qarium View Post
                  Right, but who cares? The last system I built as a Blender workstation was with an AMD PRO W7900.
                  You missed the point of what I was saying in the first place and you're acting like this discussion is about you. The reason I mentioned Blender and the differences in ecosystems isn't because I thought you would personally care about Blender rendering performance. I mentioned it because the differences in ecosystems create a lot of variables outside of the hardware that can affect benchmarks.

                  Originally posted by qarium View Post
                  The M2 Ultra is expensive and gives you a slower system, and you wonder why tile-based rendering ran out of patent time and people don't care.

                  "The conversation is about scaling across chips"
                  This only matters if the performance is as fast (or much cheaper) or faster.
                  Apple is slower and more expensive; no one with a brain buys this.

                  Do you really think people are interested in scaling across chips if the result is slower and more expensive?
                  Again, you keep suggesting that the reason it's slower or more expensive is because it's tile-based or chiplet-based. That's not the case. You can't buy an M2 Ultra, you can only buy a Mac with an M2 Ultra. You're not comparing the cost of the chip, you're comparing the cost of the computer it comes in, and you're ignoring that Apple has always charged insane amounts for their Mac workstations. Even when they were using off-the-shelf Intel Xeons and AMD GPUs, they would charge 4 or 5 times more than it would cost to buy a PC with the same parts. You're drawing conclusions based on a lack of information, and that's never something you should do.

                  Originally posted by qarium View Post
                  "the M2 Ultra gets about 80% of the performance of the 7900XTX."

                  then you add that the 7900XTX is 1000€ and the M2 Ultra is 8000€ and as soon as you say so no one wants apple anymore.

                  "the 7900XTX is 25% faster than M2 Ultra."

                  you get 25% higher performance with much cheaper price.

                  "Are you really going to act like that's enough of a difference to explain away the difference in chiplet bandwidth requirements between the two?"

                  i say chiplet design only matters if the performance is high. if the performance is low no one cares.

                  ok i get it tiled based renering is the future and apple is awesome you only have to pay quadruple price and accept 25% less performance but you get the good feeling of Apple advertisement.
                  ​Qarium, the name of the topic is "Imagination Tech Posts Updated PowerVR Linux DRM Driver". I'm not trying to sell people on Apple's hardware. How are you this bad at understanding the point? How is something low-performance just because it's not as fast as the 7900XTX? It literally outperforms the 7900XT which is just one card below it. By you own metric of high-performance, the 7900XTX wouldn't be high-performance because the 4090 outperforms it. But now that I mentioned that, you'll just say "Yea, but I'd had to use close-source drivers if I used the 4090" so that you can change the topic from performance to company stances on open source.

                  Originally posted by qarium View Post
                  You can't separate the topics. If the price is high and the performance is low, then the result is people don't want it and only stupid people buy it.
                  Of course it is relevant. But yes, I get it, tile-based rendering is the future, even if everyone who uses it is completely slow shit, like Intel who loses every benchmark, or Apple who is still slower than cheap hardware.

                  You say it's low cost, but the result for the consumer is a super high price... yes, yes, of course.

                  "This is a discussion about GPUs being able to scale with chiplets"
                  Who cares if this is the topic if the result is irrelevant? Some stupid people buy overpriced Apple stuff, I get it.

                  Like the Adreno and Mali example, where they lose the benchmarks against the RDNA2 Samsung chips.

                  AMD and Nvidia would do it if the patents ran out and the result was better.
                  I'm not separating price and performance at all, and it's weird that you're claiming I am, considering you've compared the performance of a $250 GPU to a $1000 GPU. I also already showed you the math on the effect of chiplets on chip cost, yet you're ignoring that.

                  You've repeatedly claimed that chiplets and tile-based rendering lead to low performance and high costs. What you've provided as proof has been:
                  • That Apple charges a lot for their computers even though they've been doing that long before they used chiplets or TBDR GPUs.
                  • The fact that the 7900XTX outperforms it, even though it's also chiplet-based, just not in the same way. That's also a huge reason why it costs $600 less than the 4090.
                  • Because Samsung used RDNA-based GPUs in their phones. You claimed it outperformed Adreno GPUs, and the only proof you provided was an article saying that Samsung decided to continue their partnership with AMD. You provided no benchmarks, because the actual benchmarks would have shown that not only were you completely wrong, but that Apple's GPU in the iPhone 14 actually nearly doubles its performance while having better battery life.


                  Originally posted by qarium View Post
                  OK, fine, I'll wait until tile-based rendering is faster and cheaper.

                  Nice, I'll wait for that, I'll wait for a cheaper and faster product... but right now that is not the case.

                  If I can choose between a Mali or Adreno or RDNA2 ARM SOC? I would choose the RDNA2 one...
                  This means tile-based GPUs lose the real competition...

                  It has nothing to do with fanboy war arguments, I just try to measure your arguments against the real world.
                  Then I see high prices and lower performance and I really wonder what's going on.
                  And you're repeating that RDNA2's mobile GPUs are the best around, despite that not being the case, while claiming that you're not a fanboy. You couldn't even bring yourself to call the SOC the Samsung Exynos 2200 or the GPU the Xclipse 920. You just called it the "RDNA2 ARM SOC". This whole argument is about something you're only pretending to have any interest in. It's clear you didn't look it up and were just trying to find ways to put over AMD even when the facts aren't on your side.

                  Originally posted by qarium View Post
                  OK, I get it, that's the future, but we have yet to wait for any useful product from this future.

                  Yes, honestly, this sounds really good. Honestly. But can I buy it cheaper and faster? No, I can't, so this is really only hypothetical.
                  Do I have to remind you that my original post said that I felt like Imagination's future GPUs have the potential to shake up the market if they enter the desktop space again?
                  And yeah, I was comparing a hypothetical chiplet-based TBR vs a hypothetical monolithic IMR with everything else being the same. The point was to compare how much chiplets could reduce the cost of a larger GPU.

                  And could a chiplet-based GPU be cheaper and faster? Yes. In that example, I even said that the monolithic IMR would have to be 75% faster in order to offer better price to performance than the chiplet-based TBR. That means the chiplet-based TBR could still scale itself up quite a bit while remaining cheaper than the monolithic IMR. In that example, each of those TBR chiplets is about $30. The price gap between the chips is $90, so you could add two more chiplets to get 50% more performance and still be cheaper to produce than the monolithic IMR. That example also doesn't take into account that using chiplets would allow for better binning, so the chiplets could scale up higher in clock speed, too, which has the benefit of increasing tile bandwidth.
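                  To make that concrete, here's a small sketch using the per-GPU costs from the earlier wafer math. It assumes cost scales linearly with chiplet count, ignores packaging, and treats the performance scaling as ideal, so it's an upper bound rather than a prediction.

# Sketch of the "add two more chiplets" comparison, using the per-GPU costs
# from the earlier wafer math. Assumes linear cost scaling and no packaging
# overhead; performance scaling is assumed to be ideal.
mono_cost      = 212.77           # 600mm^2 monolithic die, cost per GPU
chiplet_cost_4 = 122.69           # four 150mm^2 chiplets, cost per GPU
per_chiplet    = chiplet_cost_4 / 4

chiplet_cost_6 = per_chiplet * 6  # six chiplets: ~50% more compute
print(f"4-chiplet GPU: ${chiplet_cost_4:.2f}")
print(f"6-chiplet GPU: ${chiplet_cost_6:.2f}  "
      f"(still ${mono_cost - chiplet_cost_6:.2f} below the monolithic die)")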

                  Originally posted by qarium View Post
                  Right. The problem is Apple is shit as hell... tile-based rendering is maybe not shit, but as soon as Apple uses it, it becomes shit.
                  I don't like Apple either, but that doesn't mean their hardware sucks, and that's a bad attitude to have. Apple only shipped AMD or Intel GPUs in their hardware for years.

                  Originally posted by qarium View Post
                  Right, but does it make any difference? At least I try to figure out what you are talking about,
                  https://en.wikipedia.org/wiki/Immedi...puter_graphics)
                  reading on Wikipedia for example.
                  I already explained this in detail multiple times.

                  Originally posted by qarium View Post
                  I tried to read the Wikipedia page about IMR... and it did not help.

                  Can you suggest websites or links or books about this topic?
                  Absolutely!

                  To get started, I recommend watching this.
                  [embedded YouTube video]


                  That will help you understand what a vertex shader does and why a GPU doesn't know where a vertex is going to be in the frame before it's been transformed. If you want more detail about the process, you can read this.
                  LearnOpenGL.com provides good and clear modern 3.3+ OpenGL tutorials with clear examples. A great resource to learn modern OpenGL aimed at beginners.


                  Here's a description of the graphics pipeline.


                  These articles from Imagination's blog explain Immediate Mode Rendering, Tile-Based Deferred Rendering, and Tile-Based Immediate Mode Rendering pretty well. I never mentioned Tile-Based Immediate Mode Rendering because the distinction wasn't really relevant and I didn't want to over-complicate anything. Anyway, these articles provide some useful diagrams that show the pipelines for each. They're 8 to 10 years old, so they don't mention chiplets, but they're still useful and they do have a brief bit on parallelism, which is relevant to chiplets and multi-GPU.

                  Tile Based Deferred Rendering (TBDR) is a rendering approach unique to PowerVR which aims to ‘delay’ all texturing and shading operations until their visibility is known.


                  The deferred rendering technique in PowerVR GPUs takes the information generated by the tiler to defer the rendering of subsequently generated pixels.


                  Arm also has a little bit of documentation on both, but I think it's actually easier to understand than Imagination's blog posts.



                  Lastly, I just want you to see some timelapses of a frame being rendered so you get an idea of the order in which a game might render things to the screen, how many different kinds of buffers are needed to create one pixel on screen, and how different the process can be from game to game. I really wish I could find ones that show how things are rendered per triangle instead of per draw call, but these are the best I could find.
                  [embedded YouTube videos: frame-rendering timelapses]


                  Remember that a tile-based renderer tries to keep as much of the bandwidth for this process in on-chip memory as possible, and that memory is very low latency and high bandwidth. AMD's Infinity Cache, or really any kind of cache, tries to do the same, but it's difficult to fully keep that bandwidth on-chip because the working set is the entire frame. When rendering a game with a 160-bit G-buffer at 3840x2160, the G-buffer is 158MB, so it's too much to fit within the 96-128MB of Infinity Cache, and a decent amount of data is still being evicted to and read back from VRAM. With a fatter G-buffer and higher resolutions, the issue becomes worse.
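                  That G-buffer figure is easy to reproduce; a quick sketch, using the 160-bit-per-pixel layout from the paragraph above as the example:

# Reproducing the G-buffer size estimate above: a 160-bit-per-pixel G-buffer
# at 3840x2160 versus the 96-128MB of Infinity Cache.
width, height = 3840, 2160
bits_per_pixel = 160                      # example G-buffer layout from above

gbuffer_bytes = width * height * bits_per_pixel // 8
print(f"G-buffer size: {gbuffer_bytes / 2**20:.0f} MB")           # ~158 MB
print(f"fits in 128MB of Infinity Cache: {gbuffer_bytes <= 128 * 2**20}")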

                  I don't know the current max tile size on Imagination's GPU chiplets, but I know each chiplet can render four tiles at a time. With a 64x64px tile size, a 256-bit G-buffer would require only 128KB per tile, so that's just a total of 512KB per chiplet and 2MB for four chiplets. It would need to keep 2040 tiles in external memory for a UHD G-buffer, but most and potentially all of the actual rendering bandwidth will stay in those 2MB of on-chip memory. I know their most recent chips have dynamic tile sizes, but I don't have any details on how that works; I assume they use larger tiles depending on the target G-buffer size. So if they have enough tile memory for 256-bit 64x64px tiles but the game only uses a 128-bit buffer, then it might use 64x128px tiles, or 128x256px tiles if it's just a depth pass for shadows.
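                  And the per-tile numbers, under the same assumptions as above (64x64 tiles, a 256-bit G-buffer, four tiles in flight per chiplet, four chiplets):

# Per-tile on-chip memory for the tile sizes discussed above.
from math import ceil

tile_w, tile_h = 64, 64
bits_per_pixel = 256                       # example G-buffer layout from above
tiles_in_flight_per_chiplet = 4
chiplets = 4

tile_bytes = tile_w * tile_h * bits_per_pixel // 8
print(f"per tile:    {tile_bytes // 1024} KB")                                 # 128 KB
print(f"per chiplet: {tile_bytes * tiles_in_flight_per_chiplet // 1024} KB")   # 512 KB
print(f"total:       {tile_bytes * tiles_in_flight_per_chiplet * chiplets / 2**20:.0f} MB")

# Number of tiles needed to cover a UHD frame
tiles = ceil(3840 / tile_w) * ceil(2160 / tile_h)
print(f"tiles per 3840x2160 frame: {tiles}")                                   # 2040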

                  Also remember that any time you see an opaque object get drawn in front of something that was already drawn, that's called overdraw, and it means the GPU wasted bandwidth and processing time rendering the pixels behind it. Games try to render from front to back as best they can, because an IMR requires that in order to discard as much extra work as possible. Submission order isn't important to a TBDR like Imagination's GPUs, because it sorts the geometry from front to back within a tile before creating the depth buffer and drawing the image. So in games where there's a lot of overdraw, a TBDR is going to do way less work than an IMR.
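                  As a toy illustration of what overdraw costs an IMR in the worst case (the 2.5x overdraw factor is a made-up example; real scenes, and real IMRs with early-Z, land somewhere in between):

# Toy illustration of overdraw cost. The overdraw factor is a made-up example.
pixels_on_screen = 3840 * 2160
avg_overdraw = 2.5          # assumed: each pixel covered by 2.5 opaque surfaces

# Worst case for an IMR drawn back-to-front: every covering surface is shaded.
imr_fragments  = pixels_on_screen * avg_overdraw
# A TBDR resolves visibility per tile first, so only the visible surface is shaded.
tbdr_fragments = pixels_on_screen

print(f"IMR fragments shaded:  {imr_fragments:,.0f}")
print(f"TBDR fragments shaded: {tbdr_fragments:,}")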

                  Comment


                  • #19
                    Originally posted by sbivol View Post
                    The comment section is now two bots posting progressively longer replies with a progressively lower SNR.
                    Do you think I am a bot?
                    Phantom circuit Sequence Reducer Dyslexia

                    Comment


                    • #20
                      Originally posted by qarium View Post

                      Do you think I am a bot?
                      He was insulting us lol

                      Comment

                      Working...
                      X