Ensure only ASCII character set used by davegill · Pull Request #12 · wrf-model/WRF

davegill · 2016-10-10T18:43:54Z

No description provided.

mkavulich · 2016-10-10T19:58:58Z

tools/finder.F

+   ELSE
+      PRINT *,' '
+      PRINT *,'Troubles, with ',problem_line_count,' lines.'
+      PRINT *,'File uses only ISO-8859 character codes, outside the standard ASCII range of ',FIRST_VALID,' to ',LAST_VALID


Instead of "File uses only ISO-8859 character codes, outside the standard ASCII range of", perhaps something like "File uses character codes outside the standard ASCII range of"

Mike,
The new print statement:
PRINT *,'File uses character codes outside the standard ASCII range of ',FIRST_VALID,' to ',LAST_VALID

mkavulich · 2016-10-10T20:00:54Z

tools/finder.F

+         EXIT big_read_loop
+      END IF
+
+      DO ind = 1 , MAX_LENGTH


The \tab character is fairly prominent in our code, and has ASCII code 9 (which is outside of the 32-127 range). Should this loop include that exception?

Coincidentally, the 9 files that I tested, none of them do have a tab character. I have now fixed the logic in the code to ignore all tab character (ASCII code #9).

mkavulich · 2016-10-10T20:01:46Z

tools/finder.F

+! usage:
+! a.out < file.F
+
+PROGRAM finder


Could we name this something more self-descriptive? Like "nonasciifinder"?

Mike,
The source code now lists the main program as "non_ascii_finder"

mgduda · 2016-10-10T22:09:33Z

tools/finder.F

+   line_count = 1
+   problem_line_count = 0
+
+   !  Loop over eah line of the input file.


"eah" -> "each"

Mike,
"each" now has a "c", also added "ubiquitously" to get right back up on that spelling horse.

davegill · 2016-10-11T17:35:36Z

chem/module_mozcart_wetscav.F

-!       Output:  GA --- â(x)
+!       Purpose: Compute the gamma function Ahat(x)
+!       Input :  x  --- Argument of Ahat(x)
+!                       ( x is not equal to 0,-1,-2,WHAT GOES HERE )


@ravanah
What should be in that original u'u'u' string?

Ravan,
It was pointed out, maybe this in supposed to be "..."?
Dave

I agree that this was probably the indication for etcetera,

jamiebresch · 2016-10-14T15:39:46Z

Aren't you going to commit updated var/convertor/wave2grid_kma/pvchkdv.F as well?

jamiebresch · 2016-10-14T17:55:00Z

tools/finder.F

+! line number and column count (for subsequent editing).
+
+! usage:
+! a.out file.F


How about adding the following (from your commit message) in the program itself.

build the finder program: gfortran -ffree-form finder.F
a.out some-file-name.F

…excluded TYPE: bug fix KEYWORDS: ISO, ASCII, sed, byte SOURCE: internal DESCRIPTION OF CHANGES: Authors of a few physics schemes likely used a "cut-and-paste" technique for including references and for units. The offending references used quite a few different characters for an intended dash (minus sign). The offending units all used a superscript numeral 2 to mean "squared", as in W/m^2. I changed some to m^2 and some to m2, as both are used in the modified schemes. There were a few other single modifications (an "a" with a carat hat, etc). All of the changes were to commented lines. The change are necessary to allow the use of sed to process the source code. Outside of the physics directory, a number of files also had characters outside of the Fortran character set (32-127). These were all in comments, but are still being removed. LIST OF MODIFIED FILES: chem/module_cam_mam_newnuc.F chem/module_gocart_dmsemis.F chem/module_gocart_seasalt.F chem/module_mozcart_wetscav.F chem/module_sea_salt_emis.F dyn_em/module_sfs_driver.F dyn_em/module_sfs_nba.F frame/module_cpl.F hydro/Routing/module_gw_gw2d.F phys/module_bl_mfshconvpbl.F phys/module_gocart_seasalt.F phys/module_ltng_cpmpr92z.F phys/module_ltng_crmpr92.F phys/module_ltng_iccg.F phys/module_mp_nssl_2mom.F phys/module_mp_wdm6.F phys/module_sf_bem.F phys/module_sf_bep.F phys/module_sf_bep_bem.F var/convertor/wave2grid_kma/pvchkdv.F (Thanks Jamie!) TESTS CONDUCTED: The sed program works on the modified files, and does not work on the original files.

davegill · 2016-10-17T21:19:09Z

@jamiebresch and @mkavulich
I have added the file that Jamie fixed up for the var directory (it was in Japanese). I added a short blurb at the beginning of the new main program on how to build it, I swapped the default behavior to work with the "find" command. I also swapped out the "WHAT IS THIS SUPPOSED TO BE" with the ellipsis "..." in one of the chem files.

Can you guys review again to see if my pull request may now proceed?

Thanks
Dave

jamiebresch · 2016-10-17T21:52:14Z

@davegill We would like to have tools/find.F renamed to tools/non_ascii_finder.F

mgduda · 2016-10-17T22:36:40Z

@davegill Would it be hard to allow the verbosity level to be specified on the command-line, and to let the user give a list of files to be scanned as command-line arguments? I'm imagining something like this:

verbose = 0
if (command_argument_count() == 0) then
    write(0,*) 'Usage: non_ascii_finder [verbosity option] <filenames>'
    stop 1
end if
do iarg = 1, command_argument_count()
    call get_command_argument(iarg, argval)
    if (trim(argval) == '-v') then
        verbose = 1
    else if (trim(argval) == '-V') then
        verbose = 2
    else
        ! Assume this is a filename... do the normal scanning code
    end if
end do

davegill · 2016-10-17T23:48:09Z

Michael,
Good suggestion. I implemented this, and here are the quick test results. I added a superscript 2 to the test program, so that it can be run on its own source code to detect a problem. I run the executable on other Fortran codes, and got the "nothing wrong here" behavior. I ran on a non existing file, and that was detected also.

a.out
Usage:
./a.out
where is -v when using this program with "find", and
is -V when processing a single file
where is a WRF Fortran source file
STOP 1

a.out -v
Usage:
./a.out
where is -v when using this program with "find", and
is -V when processing a single file
where is a WRF Fortran source file
STOP 1

a.out -v non_ascii_finder.F
non_ascii_finder.F

a.out -V non_ascii_finder.F
non_ascii_finder.F
Found something on line # 25
! --> this line has a problem with the superscript numeral 2: [W/m²]
Character # 69 is a ?, which is character code 194
non_ascii_finder.F
Found something on line # 25
! --> this line has a problem with the superscript numeral 2: [W/m²]
Character # 70 is a ?, which is character code 178
Troubles, with 2 lines.
File uses character codes outside the standard ASCII range of 32 to 127

a.out -VV fortran_2003_fflush_test.G
Hmmm, troubles trying to open fortran_2003_fflush_test.G for READ.
STOP 4

a.out -v fortran_2003_fflush_test.F

a.out -V fortran_2003_fflush_test.F

a.out -VV fortran_2003_fflush_test.F
fortran_2003_fflush_test.F, End of file after attempting to read line # 15
OK, File uses only ASCII character codes from 32 through 127

Dave

mgduda · 2016-10-18T00:13:48Z

tools/non_ascii_finder.F

+      PRINT *,'where <verbose level> is -v when using this program with "find", and'
+      PRINT *,'      <verbose level> is -V when processing a single file'
+!     PRINT *,'      <verbose level> is -VV is for developers and debugging'
+      PRINT *,'where <filename> is a WRF Fortran source file'


@davegill I suppose I may be nit-picking at this point, but why does the input have to be a WRF Fortran source file? Taking a more general view, this utility tells whether there are any characters outside the set of printable ASCII characters (or characters not acceptable to cpp, or whatever). Also, stating that -v is used when using the program with find doesn't really help anyone to understand what the effect of using -v actually is. Also, stating that -V is used when processing a single file can suggest that the program can process multiple files.

Generally, we could reconsider the printout produced by this program with a broader view of what the program could potentially be used for.

mgduda · 2016-10-18T01:27:49Z

@davegill It's purely academic at this point, but it might be interesting to try to detect UTF-8 multi-byte encodings. For example, in the module_sf_bep_bem.F file, the non_ascii_finder reports:

 ./WRF/phys/module_sf_bep_bem.F
 Found something on line #        4204
       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²] 
 Character #           68  is a ?, which is character code          194
 ./WRF/phys/module_sf_bep_bem.F
 Found something on line #        4204
       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²]  
 Character #           69  is a ?, which is character code          178

Really, though, the 68th and 69th characters together form a UTF-8 character; their binary encoding is 1100 0010 1011 0010, which from this table suggests the two-byte character U+00B2: https://en.wikipedia.org/wiki/UTF-8#Description .

This explains why the line

       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²]

correctly shows the superscript 2 as a single character, but the lines

 Character #           68  is a ?, which is character code          194
 Character #           69  is a ?, which is character code          178

can't show any character.

jamiebresch · 2016-10-18T15:15:42Z

@mgduda There are tons of languages and encoding methods. For the source code purpose, I think ASCII-only is a good rule.

mgduda · 2016-10-18T17:31:27Z

@jamiebresch Agreed. To be clear, I was definitely not suggesting that we allow anything but printable ASCII characters in the source code (I think this may even be part of the Fortran standard); rather, I was only saying that, because UTF-8 can be used to encode more or less every language, and it is by far the largest encoding used on e.g., the web, that the checker program could be more clever and recognize multi-byte UTF-8 encodings for what they are, rather than printing two, three, or four error messages for the same multi-byte character.

My previous comment only came about because I noticed that some of the messages from the checker referenced two characters, when I could only find one in the source code (e.g., the superscript 2), and I started looking further and thought the UTF-8 encoding bit was pretty cool.

cmattocks · 2016-10-19T03:06:17Z

Below is a script version of a UTF-8 -> ASCII character converter.

Craig

#!/bin/sh

@(#) utf2ascii Convert UTF-8 to ASCII text

################################################

Convert file encoded in UTF-8 to ASCII text.

Usage: utf2ascii filename

################################################

###############################

Set trap to abort on signal

###############################
trap 'echo “\nOuch. Dude! Ya fragged me." 1>&2; exit' 1 2 3 15

#####################################

Process command-line argument(s):

#####################################
case $# in
0) # User forgot to enter a filename:
[ -t 0 ] && echo "USAGE: basename $0 filename" >&2 && exit 1 ;;
) # Process input file:
for file in $
do
if [ ! -r $file ]; then
echo "Cannot read input file "$file"" >&2 && exit 2
else
echo "Converting file "$file" ..."
\rm -f "$file.orig"
mv -f "$file" "$file.orig"
iconv -sc -f UTF8 -t ASCII//TRANSLIT "$file.orig" > "$file"
fi
done
esac

exit

On Oct 18, 2016, at 1:31 PM, Michael Duda notifications@github.com wrote:

@jamiebresch Agreed. To be clear, I was definitely not suggesting that we allow anything but printable ASCII characters in the source code (I think this may even be part of the Fortran standard); rather, I was only saying that, because UTF-8 can be used to encode more or less every language, and it is by far the largest encoding used on e.g., the web, that the checker program could be more clever and recognize multi-byte UTF-8 encodings for what they are, rather than printing two, three, or four error messages for the same multi-byte character.

My previous comment only came about because I noticed that some of the messages from the checker referenced two characters, when I could only find one in the source code (e.g., the superscript 2), and I started looking further and thought the UTF-8 encoding bit was pretty cool.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

…rt_registry minor inconsequential removal of extra quote on memetum preturbations…

Synching up namelist templates

mkavulich suggested changes Oct 10, 2016

View reviewed changes

mgduda reviewed Oct 10, 2016

View reviewed changes

davegill commented Oct 11, 2016

View reviewed changes

davegill changed the title ~~Insure only ASCII character set used~~ Ensure only ASCII character set used Oct 11, 2016

jamiebresch reviewed Oct 14, 2016

View reviewed changes

mgduda reviewed Oct 18, 2016

View reviewed changes

mkavulich approved these changes Oct 28, 2016

View reviewed changes

davegill merged commit 4529d9a into wrf-model:master Oct 28, 2016

davegill deleted the ASCII_CODES_ONLY branch March 27, 2017 16:25

syalavarthi mentioned this pull request Jul 10, 2020

add configure.defaults entry for Aarch64 #1249

Closed

Tuanph1970 mentioned this pull request Aug 18, 2020

Could not open new netcdf file structure #1274

Open

twjuliano pushed a commit to twjuliano/WRF that referenced this pull request Jun 15, 2021

Merge pull request wrf-model#12 from domingom/feature/minorfix_mom_pe…

23dfe67

…rt_registry minor inconsequential removal of extra quote on memetum preturbations…

htan2013 mentioned this pull request Sep 6, 2021

Received signal SIGFPE(8) when initializing WRF4.3 with ERA5 and modis data #1549

Open

jinjiuping mentioned this pull request Dec 27, 2021

Program received signal SIGABRT: Process abort signal. #1620

Closed

janmandel added a commit to wrf-sfire/WRF-SFIRE that referenced this pull request Feb 5, 2022

Making check for too small patch only a warning. Fixes wrf-model#12

ef36e0a

twjuliano pushed a commit to twjuliano/WRF that referenced this pull request Jun 13, 2022

Merge pull request wrf-model#12 from phawbeck/feature/bathymetry

b0f784b

Synching up namelist templates

Conversation

davegill commented Oct 10, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davegill Oct 10, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jamiebresch commented Oct 14, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davegill commented Oct 17, 2016

Uh oh!

jamiebresch commented Oct 17, 2016

Uh oh!

mgduda commented Oct 17, 2016

Uh oh!

davegill commented Oct 17, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgduda commented Oct 18, 2016

Uh oh!

jamiebresch commented Oct 18, 2016

Uh oh!

mgduda commented Oct 18, 2016

Uh oh!

cmattocks commented Oct 19, 2016

@(#) utf2ascii Convert UTF-8 to ASCII text

Convert file encoded in UTF-8 to ASCII text.

Usage: utf2ascii filename

Set trap to abort on signal

Process command-line argument(s):

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

davegill Oct 10, 2016 •

edited

Loading