WSTR SIZEOF quirk

Started by jj2007, November 09, 2008, 09:43:27 PM


jj2007

Since I didn't get an answer in the lab, I try my luck here:
According to Masm, the length of...
.data
WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
...is:
  LENGTHOF:     40
  SIZEOF:       80
However, the true size of the string is 100 characters, or 200 bytes. I have checked this even in Olly, with...
.data
MyHi1 db "Ciao - ", 0
WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
MyHi2 db "greetings from jj2007", 13, 10, 0

...and the three strings sit in memory one after the other. Brother Darryl is 100 chars long, so how is it possible that LENGTHOF reports 40 instead of 100?

Jimg

I would guess it has something to do with the WSTR macro breaking the string at 40 characters.
      % FORC arg, <nustr>
          if cnt lt 1
            addstr1 CATSTR addstr1,<">,<arg>,<">
          elseif cnt lt 40
            addstr1 CATSTR addstr1,<,">,<arg>,<">
           
          elseif cnt lt 41


= 00000028      2           cnt = cnt + 1
     2           if cnt lt 1
     2             addstr1 CATSTR addstr1,<">,< >,<">
     2           elseif cnt lt 40
     2             addstr1 CATSTR addstr1,<,">,< >,<">
     2            
     2           elseif cnt lt 41
= " "      2             addstr2 CATSTR addstr2,<">,< >,<">
     2           elseif cnt lt 80
     2             addstr2 CATSTR addstr2,<,">,< >,<">
     2
     2           elseif cnt lt 81
     2             addstr3 CATSTR addstr3,<">,< >,<">
     2           elseif cnt lt 120
     2             addstr3 CATSTR addstr3,<,">,< >,<">
     2           endif
= 00000029      2           cnt = cnt + 1
     2           if cnt lt 1
     2             addstr1 CATSTR addstr1,<">,<d>,<">
     2           elseif cnt lt 40
     2             addstr1 CATSTR addstr1,<,">,<d>,<">
     2            
     2           elseif cnt lt 41
     2             addstr2 CATSTR addstr2,<">,<d>,<">
     2           elseif cnt lt 80
= " ","d"      2             addstr2 CATSTR addstr2,<,">,<d>,<">
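
That would fit how LENGTHOF and SIZEOF behave: they only look at the statement that carries the label. A minimal sketch, assuming WSTR ends up emitting each 40-character chunk as its own dw statement (the names below are made up for illustration):

```asm
.data
; Two separate statements: only the first one belongs to the label.
split   dw "a", "b", "c"
        dw "d", "e", "f"
splitBytes EQU $ - split        ; 12, the bytes really emitted

; One statement, continued with a trailing comma.
joined  dw "a", "b", "c",
           "d", "e", "f"

splitLen  EQU LENGTHOF split    ; 3
joinedLen EQU LENGTHOF joined   ; 6
```

With 40 characters per chunk, the labelled statement holds 40 words, which is where LENGTHOF 40 and SIZEOF 80 would come from.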

jj2007

Quote from: Jimg on November 09, 2008, 10:24:10 PM
I would guess it has something to do with the WSTR macro breaking the string at 40 characters.

Bingo, thanks a lot!
Here is a version that returns a correct SIZEOF and LENGTHOF:

wstr MACRO var:REQ, args:VARARG
LOCAL argByte, argText, ct
  ct=0
  FOR var, <args>
if ct
  argText CATSTR argText,<, >, <var>
else
  argText equ <var>
endif
ct=ct+1
  ENDM
.data
  argByte db argText
.data?
align 2
  var dw SIZEOF argByte dup(?)
.code
  invoke MultiByteToWideChar, CP_ACP, MB_PRECOMPOSED,
  offset argByte, -1, offset var, LENGTHOF var
ENDM


The tradeoff: It must be placed in the code section...
.code
start:
wstr MyText, "This is a Unicode text,", 13, 10, "simple but effective", 0
wstr MyTitle, "Hello", 0
invoke MessageBoxW, 0, addr MyText, addr MyTitle, MB_OK

The wide string is placed in the uninitialised section, so that looks like a gain in size; however, the call to MultiByteToWideChar costs 26 bytes, therefore the macro bloats the code for strings below 26 bytes.

GregL

It looks to me like WSTR is working fine, wprintf prints the string OK. lstrlenW reports the correct number of characters. If you look at the string in memory it looks correct. The problem seems to be LENGTHOF and SIZEOF don't work correctly with Unicode strings.


.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE

INCLUDE \masm32\include\windows.inc

INCLUDE \masm32\include\kernel32.inc
INCLUDE \masm32\include\user32.inc
INCLUDE \masm32\include\msvcrt.inc
INCLUDE \masm32\include\masm32.inc

INCLUDE \masm32\macros\macros.asm
INCLUDE \masm32\macros\ucmacros.asm

INCLUDELIB \masm32\lib\kernel32.lib
INCLUDELIB \masm32\lib\user32.lib
INCLUDELIB \masm32\lib\msvcrt.lib
INCLUDELIB \masm32\lib\masm32.lib

.DATA

    WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
    wslen EQU $-ws
    szWSCrLf WORD 13,10,0

.CODE

  start:

    INVOKE crt_wprintf, ADDR ws
    INVOKE crt_wprintf, ADDR szWSCrLf

    print "wslen =    "
    mov eax, wslen
    print udword$(eax)," bytes",13,10

    print "lstrlenW = "
    lea eax, ws
    INVOKE lstrlenW, eax
    print udword$(eax)," WCHARs (not including the terminating null character)",13,10

    print "LENGTHOF = "
    mov eax, LENGTHOF ws
    print udword$(eax),13,10

    print "SIZEOF =   "
    mov eax, SIZEOF ws
    print udword$(eax),13,10

    print "SIZE =     "
    mov eax, SIZE ws
    print udword$(eax),13,10

    inkey chr$(13,10,"Press any key to exit ... ")
    print chr$(13,10)

    INVOKE ExitProcess, 0

END start






jj2007

Quote from: Greg on November 10, 2008, 12:05:43 AM
It looks to me like WSTR is working fine, wprintf prints the string OK. lstrlenW reports the correct number of characters. If you look at the string in memory it looks correct.

Yes, I can confirm that.

Quote
The problem seems to be LENGTHOF and SIZEOF don't work correctly with Unicode strings.

They work correctly. It is the WSTR macro, as Jimg rightly observed.

Here is a macro that works correctly.


Comment ~
Usage:
  wstr MyText, "This is a Unicode text,", 13, 10, "simple but effective", 13, 10, 0
  wstr MyTitle, "Hello", 0
  invoke MessageBoxW, 0, addr MyText, addr MyTitle, MB_OK
~
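wstr MACRO varWide:REQ, args:VARARG
LOCAL varByte
.data
  varByte db args                   ; ANSI source text, stored in the initialised data section
.data?
align 2
  varWide dw SIZEOF varByte dup(?)  ; one WORD per ANSI byte, filled in at runtime
.code
  ; convert at runtime; -1 tells the API that the source is zero-terminated
  invoke MultiByteToWideChar, CP_ACP, MB_PRECOMPOSED,
  offset varByte, -1, offset varWide, LENGTHOF varWide
ENDM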

hutch--

Both LENGTHOF and SIZEOF return BYTE values where UNICODE is always WORD values so they cannot work on unicode strings.

The macro was written around limitations imposed by the masm macro engine and the string length limits it imposes. If you want to do unicode the recommended way, use resource strings; that's how Microsoft designed the system.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on November 10, 2008, 12:34:25 PM
Both LENGTHOF and SIZEOF return BYTE values where UNICODE is always WORD values so they cannot work on unicode strings.

Well, if you say so then it's probably true (although my wstr macro says the contrary :wink).

.code
   wstr ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx", 0

Output:
  LENGTHOF:     101     (including zero delimiter)
  SIZEOF:       202     (including zero delimiter)

  lstrlenW return value:  100
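
The same numbers can be reproduced without any macro. A small sketch, assuming a buffer named wbuf (hypothetical, not part of the code above) declared as 101 words:

```asm
include \masm32\include\masm32rt.inc

.data?
wbuf dw 101 dup(?)              ; hypothetical buffer: 100 WCHARs plus terminator

.code
start:
    print ustr$(LENGTHOF wbuf), 13, 10   ; 101 elements
    print ustr$(SIZEOF wbuf), 13, 10     ; 202 bytes
    print ustr$(TYPE wbuf), 13, 10       ; 2 bytes per element
    exit
end start
```

LENGTHOF counts elements of the declared type and SIZEOF multiplies that by the element size, so both operators cope with WORD data as long as everything sits in one labelled statement.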

On another related topic: I have tried to use the "Agner Fog loop" (named so by Ratch in this thread):

align 2
uclenAF proc pStr         ; "Agner Fog algo"
   mov eax, [esp+4]
   mov ecx, 80008000h
   .Repeat
      mov edx, [eax]
      and edx, 0ff7fff7fh      ; mask out bit 8
      add eax, 4
      sub edx, 00010001h      ; get zero bytes
      and edx, ecx         ; sieve them out
   .Until !Zero?
   sub eax, [esp+4]
   sub eax, 4
   shr eax, 1
   dec dx
   jns @F
   inc eax
@@:   retn 4
uclenAF endp

Timings are among the best (code attached). However, I wonder if it is correct to assume that the end of a Unicode string can be tested by looking at the lower half of the WORD? It seems to work, and it is difficult to imagine that a combination of non-zero HIWORDHIBYTE and zero LOWORDLOBYTE would be legal, but... is there a clearly documented rule?

(EDIT: Byte, not Word)
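
On the low-byte question: a wide string ends at a 16-bit zero, and there are legitimate characters whose low byte is zero, for example U+0100, which is stored as the bytes 00 01. So a test that looks only at the low byte is not safe in general. A plain word-by-word routine is handy as a reference to validate a fast loop against; the sketch below is hypothetical (uclenRef and wsTest are made-up names) and is not the attached code:

```asm
include \masm32\include\masm32rt.inc

.data
wsTest dw 0100h, "A", 0         ; first character has a zero low byte

.code
; Returns in eax the number of WCHARs before the first all-zero WORD.
uclenRef proc pStr:DWORD
    mov eax, pStr
@@: cmp word ptr [eax], 0       ; the terminator is a full zero WORD
    lea eax, [eax+2]            ; advance; LEA leaves the flags untouched
    jne @B
    sub eax, pStr               ; bytes scanned, terminator included
    shr eax, 1                  ; bytes -> WCHARs
    dec eax                     ; do not count the terminator
    ret
uclenRef endp

start:
    invoke uclenRef, addr wsTest
    print ustr$(eax), 13, 10    ; 2
    exit
end start
```

Feeding the same test strings to both routines, including ones with characters like U+0100, would show quickly whether the fast loop stops too early.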

[attachment deleted by admin]

hutch--

Straight from the MASM 6.1 manual.

Quote
     The LENGTHOF operator returns the number of data items allocated
     for <variable>. The SIZEOF operator returns the total number of
     bytes allocated for <variable> or the size of <type> in bytes. For
     variables, SIZEOF is equal to the value of LENGTHOF times the
     number of bytes in each element.

Your macro is nothing more than a wrapper for a Windows API that converts an ANSI string to unicode on the fly, so it does not do the same thing. The Masm32 "WSTR" macro WRITES unicode data to the .DATA section, while your macro stores the data as ANSI and converts it.

The "right" way to store unicode string data is in the resource section as a string resource. The format is exclusively unicode.
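
For reference, a minimal sketch of the resource route, assuming a resource script with a single string under the made-up ID 100 (IDS_HELLO and wbuf are hypothetical names); LoadStringW copies the string into a WORD buffer:

```asm
; rsrc.rc (resource script, compiled and linked separately):
;   STRINGTABLE
;   BEGIN
;     100, "Hello from a string resource"
;   END

include \masm32\include\masm32rt.inc

IDS_HELLO equ 100               ; must match the ID used in the .rc file

.data?
wbuf dw 128 dup(?)              ; receives the wide string

.code
start:
    invoke GetModuleHandle, NULL
    ; the last argument is the buffer size in characters, not bytes
    invoke LoadStringW, eax, IDS_HELLO, addr wbuf, LENGTHOF wbuf
    invoke MessageBoxW, 0, addr wbuf, addr wbuf, MB_OK
    exit
end start
```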

jj2007

.nolist ; ******************** CONSOLE assembly ********************
include \masm32\include\masm32rt.inc
include \masm32\macros\ucmacros.asm

wstr MACRO var:REQ, args:VARARG
LOCAL argByte
.data
  argByte db args
.data?
align 2
  var dw SIZEOF argByte dup(?)
.code
  invoke MultiByteToWideChar, CP_ACP, MB_PRECOMPOSED,
  offset argByte, -1, offset var, LENGTHOF var
ENDM

.data
  ; fails miserably:
  ; WSTR wssh1, " Wouldn't it be", 13, 10, " nice if macros simply",13,10," worked as they should,", 13, 10, " instead of producing wrong sizes?"
  ; works, but produces wrong SIZEOF and LENGTHOF:
  WSTR wssh, " WSTR (SH): Wouldn't it beLF nice if macros simplyLF worked as they should,LF instead of producing wrong sizes?"

.code
start:
wstr AppName, "Testing the wstr macro:", 0
wstr wsjj, " wstr (JJ): Wouldn't it be", 13, 10, " nice if macros simply",13,10," worked as they should,", 13, 10, " instead of producing wrong sizes?", 0

invoke crt_printf, chr$("%S%c"), addr wsjj

print chr$(13,10, 10,"  LENGTHOF wsJJ: ", 9)

; REMOVED TRAILING COMMA, accepted by ML but rejected by JWasm:
; print ustr$(LENGTHOF wsjj),9, "(including zero delimiter)", 13,10,
print ustr$(LENGTHOF wsjj),9, "(including zero delimiter)", 13,10
print "  SIZEOF wsJJ:    ", 9
print ustr$(SIZEOF wsjj),9, "(including zero delimiter)", 13, 10, 10, 10, 10

invoke crt_printf, chr$("%S%c"), addr wssh

print chr$(13,10, 10,"  LENGTHOF wsSH: ", 9)
print ustr$(LENGTHOF wssh), 13,10
print "  SIZEOF wsSH:    ", 9
print ustr$(SIZEOF wssh), 13, 10, 10

invoke MessageBoxW, 0, addr wsjj, addr AppName, MB_OK

    exit

end start


By the way, for both versions, crt_printf adds a strange character at the end.

EDIT: Shorter macro version. It works fine with ml614 and ml9 but JWasm chokes on ustr$.
EDIT(2): Before Japheth gets angry: See above, TRAILING COMMA. Should work only in strict compatibility mode :wink

MichaelW

Quote
By the way, for both versions, crt_printf adds a strange character at the end.

You have two format specifications, the first specifying a wide-character string and the second a single-byte character, and only one argument. Did you perhaps intend to use a LF as a second argument, to serve the same purpose as a trailing CRLF for the print macro?
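
A minimal sketch of the two-argument variant, assuming a small hand-written wide string (wsDemo is a made-up name):

```asm
include \masm32\include\masm32rt.inc

.data
wsDemo dw "H", "i", 0           ; hypothetical wide string

.code
start:
    ; %S consumes the wide string, %c consumes the line feed (10)
    invoke crt_printf, chr$("%S%c"), addr wsDemo, 10
    exit
end start
```

With only one argument, %c picks up whatever happens to follow on the stack, which would explain the stray character.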
eschew obfuscation

hutch--

Here is the right way to do it. The MASM macro is a simple string constructor from ANSI to unicode that actually stores the data as UNICODE in the .DATA section. The second technique is the approved Microsoft method of using resource strings, which can be written in unicode and are stored as unicode in the resource section of the file.

If you want to make a macro that actually writes unicode rather than just converting it on the fly, parse C style escapes embedded in the string and convert them to characters before writing the data to the .DATA section as WORD sized characters.
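
As a rough illustration of the target, this is what such a macro would have to emit for the text Hi followed by a CRLF, written by hand here (wsHi is a made-up name, and the escape parsing itself is not shown):

```asm
include \masm32\include\masm32rt.inc

.data
; one WORD per character, control codes as plain numbers, zero terminator
wsHi dw "H", "i", 13, 10, 0
wsHiChars EQU LENGTHOF wsHi     ; 5, terminator included
wsHiBytes EQU SIZEOF wsHi       ; 10

.code
start:
    invoke MessageBoxW, 0, addr wsHi, addr wsHi, MB_OK
    exit
end start
```

Because everything is one labelled statement, LENGTHOF and SIZEOF cover the whole string.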

[attachment deleted by admin]

jj2007

Quote from: MichaelW on November 11, 2008, 12:31:32 AM
Quote
By the way, for both versions, crt_printf adds a strange character at the end.

You have two format specifications, the first specifying a wide-character string and the second a single-byte character, and only one argument. Did you perhaps intend to use a LF as a second argument, to serve the same purpose as a trailing CRLF for the print macro?


Thanks, Michael - that kind of stupid error happens when copying code from other threads.

   invoke crt_printf, chr$("%S"), addr wssh   ; one unicode string
   ; invoke crt_printf, chr$("%S%c"), addr wssh   ; one unicode, one ansi single byte character

EDIT: Field specs at MSDN

jj2007

Quote from: hutch-- on November 10, 2008, 08:54:57 PM
Your macro is nothing more than a wrapper for a Windows API

Sorry, it was not my intention to promote Windows :red

I take your criticism seriously. Therefore you'll find attached a version that doesn't use MultiByteToWideChar. It also happens to be shorter: Only 20 bytes per call. So every time you use the wstr macro for a unicode string with more than 20 characters, your code size will be smaller than doing the same with the Masm32Lib WSTR macro. In addition, it also returns the correct SIZEOF and LENGTHOF.

It comes in two flavours:
  wstr My$, "Testing the wstr macro:", 0  ; use exactly like AppName db "...", but in .code section
  mov eax, wchr$("Hello",13,10,"World")  ; use exactly like chr$

Quote
The "right" way to store unicode string data is in the resource section as a string resource.

Hey, you are becoming dogmatic! No longer "everything goes as long as it suits you"? By the way: Why did you put a WSTR macro into \masm32\macros\ucmacros.asm if that is politically incorrect?

My intention was to create a macro that makes using Unicode as easy as doing the same with Ansi:

   invoke MessageBoxW,0,wchr$("It would be great if",13,10,"macros simply did their job"),
   wchr$("Testing the wstr macro:"), MB_OK

   invoke MessageBox,0,chr$("It would be great if",13,10,"macros simply did their job"),
   chr$("Testing the wstr macro:"), MB_OK

:bg

[attachment deleted by admin]

hutch--

JJ,

This is a much better procedure and the conversion looks reasonably efficient; unfortunately, you store BOTH the ANSI and the UNICODE string. Just open the exe in a hex editor to see what I mean.

I still think the best way to do it is to write the data directly to the .DATA section at assembly time. You can use C style escapes such as "test\n" to place the CRLF and any other non-printable character (or, in MASM's case, reserved characters) and then write it directly in hex to the .DATA section as WORD size characters.

You can get the extended length by splitting the lines in the .DATA section on the normal basis,


jjstring dw 0000,0000,0000,0000
          dw 0000,0000,0000,0000  etc ....


I think this macro can be done under existing MASM capacity but the line length limit will still be a pain on the input text. I am just playing with JWASM in this area and it seems to be able to handle a 16k string by tweaking the source which is a vast improvement on the 256 char limit in MASM.

PS: Anything DOES go if it works, but if you are a TROO BLOO purist, the resource method is the bona fide, registered trademark Microsoft[tm] way of doing it and it is built into the system in the resource file format.  :bg

jj2007

Quote from: hutch-- on November 11, 2008, 11:44:38 AM
This is a much better procedure and the conversion looks reasonably efficient; unfortunately, you store BOTH the ANSI and the UNICODE string. Just open the exe in a hex editor to see what I mean.

My wstr stores the string to .data and writes it at runtime to .data? - the unicode you see in the hex editor is written by Masm32Lib WSTR (watch for the LF's that I inserted since WSTR doesn't accept the usual "string", 13, 10 syntax).

As to efficiency, there might be purists complaining that lodsb and stosw are slow, but have you ever used a chr$ in an innermost loop? The proggie is 42 bytes, each call to the macro is 20 bytes plus the ansi size.
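
For readers without the attachment, a widening loop of that kind could look roughly like this. It is a hypothetical helper (a2w, szAnsi and wsWide are made-up names), not the routine from the attachment, and it simply zero-extends each byte, so it is only right for characters whose ANSI code equals their Unicode value, such as plain ASCII:

```asm
include \masm32\include\masm32rt.inc

.data
szAnsi db "Test:", 0
.data?
wsWide dw SIZEOF szAnsi dup(?)  ; one WORD per ANSI byte

.code
; Copies a zero-terminated ANSI string into a WORD buffer.
a2w proc uses esi edi pAnsi:DWORD, pWide:DWORD
    mov esi, pAnsi
    mov edi, pWide
    xor eax, eax                ; keep AH = 0 so AX is the zero-extended byte
@@: lodsb                       ; AL = next ANSI byte
    stosw                       ; store it as a 16-bit character
    test al, al
    jnz @B                      ; the terminator has been copied as well
    ret
a2w endp

start:
    invoke a2w, addr szAnsi, addr wsWide
    invoke MessageBoxW, 0, addr wsWide, addr wsWide, MB_OK
    exit
end start
```

The loop relies on the direction flag being clear, which is the normal state in Windows code.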

Having said that, I agree that resource files have their use: If you really need Chinese, then wstr will not help you, since it's based on ansi. But that holds for Masm32Lib WSTR, too. Anyway, you can mix them:

  invoke MessageBoxW, 0, addr MyChineseStringFromRsrc, wchr$("Test:"), MB_OK


Quote
PS: Anything DOES go if it works, but if you are a TROO BLOO purist, the resource method is the bona fide, registered trademark Microsoft[tm] way of doing it and it is built into the system in the resource file format.  :bg

That applies to ansi strings, too, and yet I frequently see people, even in this forum :bg, using MyString db "Hello", 0 instead of resource files.