Fastest Absolute Function

Started by Twister, August 23, 2010, 09:31:34 PM


dioxin

Redskull,
CALL 1234567 is unconditional and goes to a fixed address. The CPU knows this fixed address, so it can start fetching subsequent instructions with 100% certainty. Such branches are 100% predicted by the CPU.
I'm NOT saying the same about JMP ecx; you'd better ask Lingo why he thinks that one is predictable.

Paul.

dioxin

Redskull,
Quote: If you have anything that says different, I'd like to see it.
See the Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number 248966-016
Section 2.1.2.1 Branch Prediction Unit

I'd copy the section here but I can't cut and paste from the document.

Paul.

redskull

3.4.1.3  Static Prediction
Branches that do not have a history in the BTB (see Section 3.4.1, "Branch Prediction Optimization") are predicted using a static prediction algorithm. Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms that:
• predict unconditional branches to be taken
• predict indirect branches to be NOT taken

The first time the CPU encounters JMP 12345, there is no BTB history, and it predicts (correctly) that the branch will be taken to the target.  When the jump is actually retired, the BTB is updated with the target address (12345).  Before the same jump is executed again, more jumps fill up the BTB and bump out the original entry, and a new one is unlucky enough to map to the same set value.  The next time the first JMP comes around, the CPU sees what it thinks is a BTB entry for it, and uses the stored target address from the other jump to start fetching.  Would this not cause a misprediction for an unconditional direct jump?

99.9% is not 100%

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

dioxin

Redskull,
Quote: and uses the stored target address from the other jump to start fetching.
Why would it do that? The other jump is another jump in another place in code. If there is history of this jump then that history would be used. If there is no history of this jump then the rules are followed from the beginning again.


Quote from the above reference:
Quote: The Branch Prediction Unit makes the following types of predictions:
..
Direct calls and jumps. Targets are read as a target array without regarding the taken or not taken prediction.

Indirect calls and jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.
Paul.

redskull

Quote from: dioxin on August 24, 2010, 06:26:48 PM
Why would it do that? The other jump is another jump in another place in code. If there is history of this jump then that history would be used. If there is no history of this jump then the rules are followed from the beginning again.

Because it doesn't use the entire 32 bits of the branch instruction address to identify the BTB entry.  Other jumps at certain set multiples will have identical entries.

r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

jj2007

I wonder whether the "penalty" is worth arguing about. Here is a testbed with three different calling variants: C style, a hybrid vararg (i.e. no add esp, nnn after the call) using a jmp dword ptr [], and procs with a fixed number of args:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
6       cycles for CcallP  3
10      cycles for CcallP  6
6       cycles for VarArgP 3
10      cycles for VarArgP 6
5       cycles for FixArgP 3
8       cycles for FixArgP 6

6       cycles for CcallP  3
10      cycles for CcallP  6
6       cycles for VarArgP 3
10      cycles for VarArgP 6
5       cycles for FixArgP 3
8       cycles for FixArgP 6


clive

Quote from: redskull
Because it doesn't use the entire 32 bits of the branch instruction address to identify the BTB entry.  Other jumps at certain set multiples will have identical entries.

Quote from: redskull
The first time the CPU encounters JMP 12345, there is no BTB history, and it predicts (correctly) the branch be taken to the target.  When the jump is actually retired, the BTB is updated with the address (12345).

Isn't it tracking the SOURCE instruction location (with some granularity), not the TARGET? Knowing the target address is fun and all, but isn't that irrelevant to the taken/not-taken metrics? Hopefully the code at the TARGET is cached, but either way it's the filling of the pipeline with the correct execution path that saves you the 20-30 cycle stall, and the speculative prefetch down both paths that gets the cache lines populated.

Quote from: jj2007
I wonder whether the "penalty" is worth arguing.

Depends. I've seen the JMP ECX construct expose the pipeline stall on an Atom; something along the lines of the return address being at, or within a couple of instructions of, another branch. If you get repeatedly nailed by this it's going to be a bigger problem.
It could be a random act of randomness. Those happen a lot as well.

redskull

Quote from: clive on August 24, 2010, 07:06:06 PM
Isn't it tracking the SOURCE instruction location (with some granularity), not the TARGET?

It is my understanding it tracks both.  It uses the source to track the entries themselves (and to predict sets larger than 1 jump), but also the target to help with indirect jumps.  That way, it can assign a unique BTB entry for each target as it's encountered, not just each source, and it can hazard a guess as to where it will go.  I have no firm evidence to back that up, however.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

Rockoon

Quote from: lingo on August 24, 2010, 05:17:33 PM
"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol

You seem to have a real problem answering my Yes or No questions. Do they make you uncomfortable?
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

clive

Quote from: redskullIt is my understanding it tracks both.

It strikes me that doing that for run-of-the-mill Jxx branches would add a hideous amount of silicon for little appreciable benefit; it always knows implicitly where the drop-through case and the target are. I'm not even sure it would help for jump-table-like constructs (switch/case), where it's always going to branch somewhere, but with significantly more than 2 possible outcomes.
It could be a random act of randomness. Those happen a lot as well.

clive

Quote from: Rockoon
Quote from: lingo
"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol

You seem to have a real problem answering my Yes or No questions. Do they make you uncomfortable?

Actually, I thought lingo's answer was funnier, and more accurate, than the Yes/No response you were looking for. Something along the lines of the Other/Don't Know/Don't Care options that all the stupid political polls and customer survey forms have. A bit like answering the race question on the census with Nascar or Formula 1. Do Hispanics like Nascar? And how should that have any impact on the apportionment of federal tax dollars?
It could be a random act of randomness. Those happen a lot as well.

redskull

Quote from: clive on August 24, 2010, 09:52:42 PM
It strikes me that doing that for run of the mill Jxx branches would add a hideous amount of silicon for little appreciable benefit, it always knows where the drop through case and target are implicitly.

True, but the tide these days is turning toward abstract base classes and virtual functions.  If compilers are turning every call into an indirect one, then all the other branch prediction silicon is for naught.  Of course, it's all conjecture based on my limited knowledge of undocumented features, so who knows what's really going on.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

MichaelW

Quote from: NightWare on August 24, 2010, 01:54:12 PM
MichaelW, with a loop you can't see the mispredictions; they're absorbed by the loop... the cache reads it once, or twice (here the mispredictions could exist), and once it's in the cache you can't see them again; the cost is divided by the number of loop iterations, so it disappears far after the decimal point...

That is why I used a loop. I reasoned that if there were mispredictions for the jmp ecx code in the repeated calls, I could eliminate some of them by doing the calls from a loop where the jump destination was the same for every call. For small repeat and loop counts it worked the way I expected. But as I increased the counts I expected the loop cycles for the jmp ecx code to eventually drop below the loop cycles for the ret code, and for some reason that I don't understand, this did not happen.
eschew obfuscation

daydreamer

Quote from: Rockoon on August 24, 2010, 05:01:47 PM
Quote from: daydreamer on August 24, 2010, 04:23:22 PM
what about a SSE parallel ABS function?

Why can't you just clear the high bit of the floats with an ANDPS on the value 7FFFFFFFh?

Am I missing something to do with IEEE encodings of special values (NaN, etc.)?
I'll test that and see if it works; I'm just not sure the IEEE encodings are that simple, i.e. whether it's enough to just clear the high bit
it sucks to see this thread sidetrack into branch prediction debate

jj2007

Quote from: daydreamer on August 25, 2010, 10:26:29 AM
I'll test that and see if it works; I'm just not sure the IEEE encodings are that simple, i.e. whether it's enough to just clear the high bit
it sucks to see this thread sidetrack into branch prediction debate

If the float is a valid number, the clearing should work. But I thought the OP wanted dwords...?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
14      cycles for AbsLingo, pos
9       cycles for AbsJJp proc, pos
1       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

9       cycles for AbsLingo, neg
9       cycles for AbsJJp proc, neg
3       cycles for mov arg, AbsJJ(arg), neg
3       cycles for SetAbsJJ arg, neg

11      cycles for AbsLingo, pos
8       cycles for AbsJJp proc, pos
1       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

9       cycles for AbsLingo, neg
9       cycles for AbsJJp proc, neg
1       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


Lingo's code is a bit bloated but for once it does not throw exceptions :bg

AbsJJ MACRO arg  ; use as mov MyNewDword, AbsJJ(OldDword)
  mov eax, arg
  test eax, eax
  .if Sign?
    neg eax
  .endif
  EXITM <eax>
ENDM

SetAbsJJ MACRO arg  ; use as SetAbsJJ MyDword
  ifdifi <eax>, <arg>
    mov eax, arg
  endif
  test eax, eax
  .if Sign?
    neg arg
  .endif
ENDM