Need Byte Calculation Code (No CRC)

Started by hrxxx, February 20, 2010, 12:58:26 AM

Previous topic - Next topic

hrxxx

I'm PB Coder, some day i write function
CRC calculation, by adopting Export Function from ntdll.dll (RtlComputeCrc32)
The calculation is really fast,
but i need other byte (data) calculation code,
any one can help me?
I need calculation faster than CRC, written ini PowerBasci with ASM inline. :bg :bg

dedndave

welcome to the forum, hrxxx   :U
sounds to me like you have been bitten by the ASM bug   :bg
pretty soon, you will forget all about PB

hrxxx

Quote from: dedndave on February 20, 2010, 01:11:57 AM
welcome to the forum, hrxxx   :U
sounds to me like you have been bitten by the ASM bug   :bg
pretty soon, you will forget all about PB
Yeah  :bg
I choose PB beacause i can fix basic with ASM, really nice, fast anda power  :cheekygreen: :cheekygreen:

MichaelW

You could try an inline ASM implementation of Adler-32. A good one to start with would be the version included with Zlib.
eschew obfuscation

hrxxx

Quote from: MichaelW on February 20, 2010, 05:28:07 AM
You could try an inline ASM implementation of Adler-32. A good one to start with would be the version included with Zlib.
Yeah, i must download zlib 1st. I hope Adler-32 faster that CRC........  :U

MichaelW

Now that I test, on a P3, the Zlib crc-32 (4-table) version is 2 cycles faster than the adler-32 version (157 cycles versus 159 cycles, for the string "my other brother darryl"). I started out trying to compare the adler-32 cycle count to that for RtlComputeCrc32, but the function is not included for Windows 2000 and my XP system is down.
eschew obfuscation

hrxxx

Quote from: MichaelW on February 20, 2010, 06:38:15 AM
Now that I test, on a P3, the Zlib crc-32 (4-table) version is 2 cycles faster than the adler-32 version (157 cycles versus 159 cycles, for the string "my other brother darryl"). I started out trying to compare the adler-32 cycle count to that for RtlComputeCrc32, but the function is not included for Windows 2000 and my XP system is down.

How to perform test?
u call function from DLL, or u create ur own code in your lang.

This code is adopted from ntdll.dll XP, and write in PowerBasic with inline ASM


FUNCTION SysRtlComputeCrc32(BYVAL dwInitialize AS LONG, BYVAL pBufferData AS DWORD, BYVAL pBufferDataLen AS DWORD) AS LONG
'---adopsi fungsi RtlComputeCrc32 punya xp_ntdll.dll :) *---buanter banget,ngebutzabiz,wuzzwuzzwuzz...
!mov    eax, dwInitialize
!xor    ecx, ecx ;---reset:0.
!cmp    pBufferDataLen, ecx
!not    eax
!jbe    local_7C961521
local_7C961502:
!mov    edx, pBufferData
!movzx  edx, Byte[ecx+edx]
!xor    edx, eax
!and    edx, &H000000FF
!shr    eax, &H08
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*edx]
!inc    ecx
!cmp    ecx, pBufferDataLen
!jb     local_7C961502
local_7C961521:
!not    eax
!mov    function, eax
EXIT FUNCTION
ASM_CRC32_TABLE_CODE:
!dd     &H00000000 ;---ini yang pertama (crc32_table[0]).
!dd     &H77073096, &HEE0E612C, &H990951BA, &H076DC419
!dd     &H706AF48F, &HE963A535, &H9E6495A3, &H0EDB8832
!dd     &H79DCB8A4, &HE0D5E91E, &H97D2D988, &H09B64C2B
!dd     &H7EB17CBD, &HE7B82D07, &H90BF1D91, &H1DB71064
!dd     &H6AB020F2, &HF3B97148, &H84BE41DE, &H1ADAD47D
!dd     &H6DDDE4EB, &HF4D4B551, &H83D385C7, &H136C9856
!dd     &H646BA8C0, &HFD62F97A, &H8A65C9EC, &H14015C4F
!dd     &H63066CD9, &HFA0F3D63, &H8D080DF5, &H3B6E20C8
!dd     &H4C69105E, &HD56041E4, &HA2677172, &H3C03E4D1
!dd     &H4B04D447, &HD20D85FD, &HA50AB56B, &H35B5A8FA
!dd     &H42B2986C, &HDBBBC9D6, &HACBCF940, &H32D86CE3
!dd     &H45DF5C75, &HDCD60DCF, &HABD13D59, &H26D930AC
!dd     &H51DE003A, &HC8D75180, &HBFD06116, &H21B4F4B5
!dd     &H56B3C423, &HCFBA9599, &HB8BDA50F, &H2802B89E
!dd     &H5F058808, &HC60CD9B2, &HB10BE924, &H2F6F7C87
!dd     &H58684C11, &HC1611DAB, &HB6662D3D, &H76DC4190
!dd     &H01DB7106, &H98D220BC, &HEFD5102A, &H71B18589
!dd     &H06B6B51F, &H9FBFE4A5, &HE8B8D433, &H7807C9A2
!dd     &H0F00F934, &H9609A88E, &HE10E9818, &H7F6A0DBB
!dd     &H086D3D2D, &H91646C97, &HE6635C01, &H6B6B51F4
!dd     &H1C6C6162, &H856530D8, &HF262004E, &H6C0695ED
!dd     &H1B01A57B, &H8208F4C1, &HF50FC457, &H65B0D9C6
!dd     &H12B7E950, &H8BBEB8EA, &HFCB9887C, &H62DD1DDF
!dd     &H15DA2D49, &H8CD37CF3, &HFBD44C65, &H4DB26158
!dd     &H3AB551CE, &HA3BC0074, &HD4BB30E2, &H4ADFA541
!dd     &H3DD895D7, &HA4D1C46D, &HD3D6F4FB, &H4369E96A
!dd     &H346ED9FC, &HAD678846, &HDA60B8D0, &H44042D73
!dd     &H33031DE5, &HAA0A4C5F, &HDD0D7CC9, &H5005713C
!dd     &H270241AA, &HBE0B1010, &HC90C2086, &H5768B525
!dd     &H206F85B3, &HB966D409, &HCE61E49F, &H5EDEF90E
!dd     &H29D9C998, &HB0D09822, &HC7D7A8B4, &H59B33D17
!dd     &H2EB40D81, &HB7BD5C3B, &HC0BA6CAD, &HEDB88320
!dd     &H9ABFB3B6, &H03B6E20C, &H74B1D29A, &HEAD54739
!dd     &H9DD277AF, &H04DB2615, &H73DC1683, &HE3630B12
!dd     &H94643B84, &H0D6D6A3E, &H7A6A5AA8, &HE40ECF0B
!dd     &H9309FF9D, &H0A00AE27, &H7D079EB1, &HF00F9344
!dd     &H8708A3D2, &H1E01F268, &H6906C2FE, &HF762575D
!dd     &H806567CB, &H196C3671, &H6E6B06E7, &HFED41B76
!dd     &H89D32BE0, &H10DA7A5A, &H67DD4ACC, &HF9B9DF6F
!dd     &H8EBEEFF9, &H17B7BE43, &H60B08ED5, &HD6D6A3E8
!dd     &HA1D1937E, &H38D8C2C4, &H4FDFF252, &HD1BB67F1
!dd     &HA6BC5767, &H3FB506DD, &H48B2364B, &HD80D2BDA
!dd     &HAF0A1B4C, &H36034AF6, &H41047A60, &HDF60EFC3
!dd     &HA867DF55, &H316E8EEF, &H4669BE79, &HCB61B38C
!dd     &HBC66831A, &H256FD2A0, &H5268E236, &HCC0C7795
!dd     &HBB0B4703, &H220216B9, &H5505262F, &HC5BA3BBE
!dd     &HB2BD0B28, &H2BB45A92, &H5CB36A04, &HC2D7FFA7
!dd     &HB5D0CF31, &H2CD99E8B, &H5BDEAE1D, &H9B64C2B0
!dd     &HEC63F226, &H756AA39C, &H026D930A, &H9C0906A9
!dd     &HEB0E363F, &H72076785, &H05005713, &H95BF4A82
!dd     &HE2B87A14, &H7BB12BAE, &H0CB61B38, &H92D28E9B
!dd     &HE5D5BE0D, &H7CDCEFB7, &H0BDBDF21, &H86D3D2D4
!dd     &HF1D4E242, &H68DDB3F8, &H1FDA836E, &H81BE16CD
!dd     &HF6B9265B, &H6FB077E1, &H18B74777, &H88085AE6
!dd     &HFF0F6A70, &H66063BCA, &H11010B5C, &H8F659EFF
!dd     &HF862AE69, &H616BFFD3, &H166CCF45, &HA00AE278
!dd     &HD70DD2EE, &H4E048354, &H3903B3C2, &HA7672661
!dd     &HD06016F7, &H4969474D, &H3E6E77DB, &HAED16A4A
!dd     &HD9D65ADC, &H40DF0B66, &H37D83BF0, &HA9BCAE53
!dd     &HDEBB9EC5, &H47B2CF7F, &H30B5FFE9, &HBDBDF21C
!dd     &HCABAC28A, &H53B39330, &H24B4A3A6, &HBAD03605
!dd     &HCDD70693, &H54DE5729, &H23D967BF, &HB3667A2E
!dd     &HC4614AB8, &H5D681B02, &H2A6F2B94, &HB40BBE37
!dd     &HC30C8EA1, &H5A05DF1B, &H2D02EF8D, &HFFFFFFFF

' adopted by Pamzlogiz
END FUNCTION


Try it, it is very fast...

and compare with adler calculation write in ASM
which the faster one ? im sure is affected by efficienct code that we write

Normally adler really faster than CRC32, but i'm still adopting how zlib do it (calculate adler). May be u can help me? :green :green

dedndave


MichaelW

#8
QuoteNormally adler really faster than CRC32

It turns out that the problem was the size of my test string. With a 324-byte string crc32 runs in 973 cycles and adler32 in 550 cycles. This is for compiler-optimized versions of the Zlib functions, using the Microsoft Visual C++ Toolkit 2003 compiler with /O2 /G6. For comparison, a 4-table procedure done in MASM ran in 887 cycles, a single-table procedure in 1895 cycles, a bit-by-bit procedure in 33900 cycles, and a MASM version of your inline assembly function in 2580 cycles. And in case you're thinking that my implementation must be wrong, it returned the correct CRC value. All tests were run on a P3. No assembly version of adler32 yet.

Edit:

I forgot to align the table for your inline assembly function, and after I did so the cycle count dropped to 2205.
eschew obfuscation

clive

I'm not a PowerBasic guy, so sorry if the syntax isn't perfect. Also not sure if ebx needs preserving, so stacked it. Here is a slightly more efficient CRC32 implementation which folds BYTEs into DWORDs. On a Pentium P4 (Prescott) using RDTSC to time cycles, we have about 262 cycles for the original code, and 157 cycles for this one, using the "my other brother darryl" string. I'd have to blow the dust off the P3 (Katmai) as it's got to be 11 year since I was hacking around with it.

Basically it will perform as many 32-bit operations as possible and then finish with 0 to 3 8-bit operations.

-Clive

!mov    eax, dwInitialize
!mov    ecx, pBufferDataLen
!jecxz  local_500
!not    eax
!push   ebx
!mov    edx, pBufferData
!shr ecx, 2
!jz local_200
local_100:
!xor eax, Dword[edx]
!movzx ebx, al
!shr    eax, &H08
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*ebx]
!movzx ebx, al
!shr    eax, &H08
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*ebx]
!add edx, 4
!movzx ebx, al
!shr    eax, &H08
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*ebx]
!movzx ebx, al
!shr    eax, &H08
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*ebx]
!dec ecx
!jnz    local_100
local_200:
!mov    ecx, pBufferDataLen
!and ecx,3
!jz local_400
local_300:
!movzx  ebx, Byte[edx]
!xor    eax, ebx
!movzx ebx, al
!shr    eax, &H08
!inc edx
!xor    eax, ASM_CRC32_TABLE_CODE[&H04*ebx]
!dec ecx
!jnz    local_300
local_400:
!pop ebx
!not    eax
local_500:
!mov    function, eax
It could be a random act of randomness. Those happen a lot as well.

dedndave

nice code, Clive - welcome to the forum   :U

clive

Quote from: dedndave on February 22, 2010, 01:34:17 AM
nice code, Clive - welcome to the forum

Thanks, known hutch-- forever, kind of been off doing stuff with ARM and MIPS code.

Not sure this is even optimum, seem to think the shifts will kill you on a P4. If a 16-bit CRC was workable, an 8-bit xchg rather than a shift might favour more platforms. Personally I would recommend using CRCs for any integrity application. The size, shift direction and the polynomial are debatable. I've built hardware CRC engines that can clock entire words at a time, the logic cascades rather cleverly.

The SSE 4.2 (Nehalem) has a CRC32 instruction, but the polynomial might not suit everyone, its the one used by iSCSI (CASTAGNOLI), not the one used by PKZIP. I've seen people on Intel's site complain about not being able to set the polynomial, but their implementation is probably a single set of 32 flip flops, and a bunch of random logic that can operate very fast, not a ROM or table driven construct. For simple integrity purposes the polynomial probably isn't too important in the scale of things*, though it won't help if your software has to use a specific one. * Most common ones have been picked for specific performance characteristics, and may have known limitations/expectations/sensitivities.

I don't have a SSE 4.2 box to hand, but the opcodes should look like this.
00401B7C F20F38F00B             crc32   ecx,byte ptr [ebx]
00401B81 F20F38F0CA             crc32   ecx,dl
00401B86 F20F38F10B             crc32   ecx,dword ptr [ebx]
00401B8B F20F38F1CA             crc32   ecx,edx
00401B90 66F20F38F10B           crc32   ecx,word ptr [ebx]
00401B96 66F20F38F1CA           crc32   ecx,dx


-Clive
It could be a random act of randomness. Those happen a lot as well.

hrxxx