The BBC and Master Computer Public Domain Library

8DB2 Tokenise Command Line

Submitted by Steve Fewell

Description:

Variables:
   * &3B = Zero if at the start of a Statement, or #&FF if in the middle of a Statement.
   * &3C = #&FF if a Line Number is expected at the current location.
On entry, &3B is set to 0 (start of statement) and &3C is set to #&FF (Line Number expected).
As soon as a non-line-number character is found, location &3C is set to 0, as a Line Number is no longer expected.

During tokenisation, the characters on the command line are checked in a specific order to ensure
that the line is tokenised correctly.

Each character is compared against the following cases, in the order listed, and is handled by the first case that matches:
   * <cr> (carriage return) - End of line
   * <sp> (space) - separator
   * & - Hex Number literal
   * " - String literal
   * : - Statement separator
   * , - Comma (field separator)
   * * - Operating System call (when at the start of a statement)
   * * - Multiplication Operator (when in the middle of a statement)
   * '.' (decimal point) - Numeric literal
   * Digit - Line Number (when a Line Number is expected)
   * Digit or '.' (decimal point) - Numeric literal
   * Characters less than 'A' - Other symbol characters (skipped)
   * Characters greater than or equal to "X" - non Keyword start characters (skipped)
   * Characters between "A" and "W" - Possible Keywords
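The order of checks can be sketched in Python as a simple classifier. This is only an illustration of the ordering described above, not a translation of the ROM code; the function name, the argument names and the returned labels are all invented for this example.

```python
def classify(ch: str, mid_statement: int, expect_line_number: int) -> str:
    """Classify one character in the same order as the tokeniser's checks.

    mid_statement models location &3B (0 = start of statement, 0xFF = middle).
    expect_line_number models location &3C (0xFF = line number expected).
    The returned labels are hypothetical names used only in this sketch.
    """
    if ch == "\r":
        return "end of line"
    if ch == " ":
        return "space"
    if ch == "&":
        return "hex literal"
    if ch == '"':
        return "string literal"
    if ch == ":":
        return "statement separator"
    if ch == ",":
        return "comma"
    if ch == "*":
        # '*' is an OS command only at the start of a statement.
        return "OS command" if mid_statement == 0 else "multiplication"
    if ch == ".":
        return "numeric literal"
    if "0" <= ch <= "9":
        return "line number" if expect_line_number != 0 else "numeric literal"
    if ch < "A":
        return "other symbol"
    if ch >= "X":
        return "not a keyword start"
    return "possible keyword"
```

For example, `classify("*", 0, 0xFF)` gives "OS command", whereas the same character in the middle of a statement is treated as a multiplication operator.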

The following is a description of the routine in more detail.
[&8DAF] Increment the (&37, &38) pointer.
[&8DB2] Get the next character pointed to by (&37, &38).
If the character is '<cr>' [carriage-return] then the end of the line has been reached, so exit as tokenisation is complete.

If the character is a space then skip all spaces by jumping back to &8DAF until we find a non-space character.

If the character is '&', then the Hex number is skipped (as it doesn't need to be tokenised) as follows:
   * Keep incrementing and reading the next character from (&37, &38) until the next character is not '0'-'9'
     or 'A' to 'F' (i.e. not a Hex digit).
   * If the first character after the Hex number is less than 'A' then jump back to &8DB2 to check this
     character (as it may be '<cr>', a space or '&').
   * If the first character after the Hex number is greater than 'F' then continue to &8DD0 (as it cannot
     be '<cr>', a space or '&').
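The hex-skipping loop above can be sketched as follows. This is a minimal Python model, assuming the line is held in a string; `skip_hex_literal` and `HEX_DIGITS` are names invented for the example.

```python
HEX_DIGITS = "0123456789ABCDEF"  # only upper-case 'A'-'F' count, as in the routine


def skip_hex_literal(line: str, pos: int) -> int:
    """Given line[pos] == '&', return the index of the first character
    that follows the hex digits."""
    pos += 1  # step over the '&'
    while pos < len(line) and line[pos] in HEX_DIGITS:
        pos += 1
    return pos
```

The caller then resumes the normal checks with the character at the returned index.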

[&8DD0] If the character is a quote ("), then check the next character. If the next character is also a quote then
the string is complete, so jump to &8DAF to continue with the next character after the "" (this method also works
when quote characters are doubled-up within a String literal, as we are only skipping the string literal - and not bothered
about obtaining its actual value).
Skip any non quote (") characters in the string literal but if a <cr> (carriage return) is found then exit as the line has
been tokenised. When a quote character is found, jump back to &8DAF to tokenise the rest of the line.
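A small Python sketch of the string-skipping step, assuming the line is a string terminated by a carriage return. The function name and the use of `None` to signal "end of line reached" are choices made for this example only.

```python
from typing import Optional


def skip_string_literal(line: str, pos: int) -> Optional[int]:
    """Given line[pos] == '"', return the index just after the next quote,
    or None if a carriage return is reached first (tokenising then stops)."""
    while True:
        pos += 1
        ch = line[pos]
        if ch == '"':
            return pos + 1  # continue with the character after the quote
        if ch == "\r":
            return None     # unterminated string: nothing more to tokenise
```

With a doubled-up quote such as `"A""B"`, the first call stops after the second quote; the main loop then sees the third quote and simply skips a second string, so the whole literal is passed over without its value ever being worked out.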

[&8DE0] (the character is not a quote). If the character is ':' then increment the pointer to the next character
(in order to skip the ':' character), zero location &3B to indicate that we are at the start of a statement,
zero location &3C to indicate that a Line Number is no longer expected, and jump to &8DB2 to check the next character.

[&8DED] If the character is a comma then jump to &8DAF (skip comma as it doesn't need to be tokenised).

[&8DF1] If the character is '*' (star), then check location &3B. If location &3B is 0 then we are at the start
of the statement, so the '*' is an Operating System call (not a multiplication operator). As the whole command line after
the '*' will be passed to the Operating System on execution, the rest of the line should not be tokenised - so exit from
the tokenisation routine.
Otherwise, the '*' is a multiplication operator, so store #&FF in location &3B (to indicate that we are no longer
at the start of a statement - note: this should already be the case anyway), zero location &3C as we are not
expecting a line number, and then jump to &8DAF to check the next character (no further tokenisation of the '*' required).

[&8E01] If the character is a '.' [Decimal point] then a numeric literal has been found, so [&8E13] keep checking the character
pointed to by (&37, &38). If the character at (&37, &38) is "." or a digit('0' to '9') then increment the
(&37, &38) pointer and jump back to &8E13 to check the next character. This ensures that the numeric literal is
ignored, as it does not need to be tokenised.
When a non-digit (and non-'.') character is found, Store #&FF in location &3B (meaning that we are not at the start of
a statement), Zero &3C (as a line number is not expected) and jump back to &8DB2 to check the new character for
tokenisation.

[&8E05] If the character is a digit ['0' to '9'] and location &3C is not 0 (i.e. &FF) then a line number
is expected and we have found the start of a line number, so call routine &8D04 to tokenise the line number.
If carry is set on return from &8D04, then there was an error and the numeric value was not a valid Line Number,
so continue to &8E13 to treat the value as a numeric literal instead of a line number. Otherwise, if carry is clear,
the Line Number was tokenised correctly, so jump back to &8DAF to check the next character on the command line.

[&8E05] If the character is a digit ['0' to '9'] and location &3C is 0 then we have found a numeric literal
(it cannot be a line number as a line number is not expected at this time), so [&8E13] keep checking the character
pointed to by (&37, &38). If the character at (&37, &38) is "." or a digit('0' to '9') then increment the
(&37, &38) pointer and jump back to &8E13 to check the next character. This ensures that the numeric literal is
ignored, as it does not need to be tokenised.
When a non-digit (and non-'.') character is found, Store #&FF in location &3B (meaning that we are not at the start of
a statement), Zero &3C (as a line number is not expected) and jump back to &8DB2 to check the new character for
tokenisation.
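The numeric-literal loop at &8E13 (together with the helper at &8D9B that it calls) can be modelled roughly as below. This is a Python sketch with invented names, not the ROM code itself.

```python
def is_dot_or_digit(ch: str) -> bool:
    """Models the check made by &8D9B: true for '.' or '0'-'9'."""
    return ch == "." or "0" <= ch <= "9"


def skip_numeric_literal(line: str, pos: int) -> int:
    """Return the index of the first character that is neither '.' nor a digit."""
    while pos < len(line) and is_dot_or_digit(line[pos]):
        pos += 1
    return pos
```

Note that the loop does not validate the number: a sequence such as `1.2.3` is skipped in one go, which is fine here because the tokeniser only needs to step over the literal.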

[&8E36] If the character is less than 'A' then the character does not need to be tokenised, so set location &3B
to #&FF (to indicate that we are not at the start of a statement), zero location &3C (to indicate that a Line
Number is not expected at this point, as we are no longer at the start of the line) and jump to &8DAF to check the next
character on the command line.

[&8E3A] If the character is greater than or equal to 'X' then it cannot be the start of a Keyword (as no BASIC keywords
begin with 'X', 'Y' or 'Z') so jump to &8E25 to check for a variable name.

[&8E3E] The character is between 'A' and 'W', so check whether it is the start of a BASIC Keyword, as follows:
   * Set pointer (&39, &3A) to point to &8456, which is the address of the beginning of the BASIC Keyword
     table within the BASIC ROM.
   * [&8E46] Compare the character (in A) with the first character of the next BASIC Keyword at (&39, &3A).
   * If the character is less than the first character of the next BASIC Keyword then no tokenisation is required,
     as the character belongs to a variable name and not a BASIC Keyword (the Keyword table is ordered by first
     letter, so no later Keyword can match). Jump to &8E2A to skip the rest of the
     characters in the variable name, store #&FF in location &3B (meaning that we are not at the start
     of a statement), zero location &3C (as a line number is not expected) and jump back to &8DB2 to check the new
     character (the next character after the variable name) for tokenisation.
   * If the character (in A) is not equal to the first character of the next Keyword then jump to &8E5D to
     advance to the next Keyword in the Keyword table.
   * [&8E4E] Otherwise, increment the character index (Y - the position of the current character within the Keyword).
   * If the next byte in the Keyword table is negative (>= #&80 - i.e. a Token value) then all the characters of the
     current word on the command line (pointed to by (&37, &38)) match the Keyword, so go to &8E84 to tokenise the keyword.
     Otherwise, compare the next character of the Keyword with the next character on the command line. If the
     characters match then jump to &8E4E to check the next character.
   * If the characters do not match and the command line character is a dot '.' (an abbreviated Keyword), then jump
     to &8E68 to advance (&39, &3A) to the token value and jump to &8E84 to tokenise (replace the Keyword on the
     command line with that token).
   * [&8E5D] Otherwise, advance Y until (&39, &3A) + Y points to the token value (the byte that is >= #&80 and
     located directly after the keyword).
   * If the token value is &FE (WIDTH, the last token in the BASIC Keyword table) then we have reached the end of
     the token table, so jump to &8E2A to skip the rest of the characters in the variable name, store #&FF in
     location &3B (meaning that we are not at the start of a statement), zero location &3C (as a line number is not
     expected) and jump back to &8DB2 to check the new character (the next character after the variable name) for
     tokenisation.
   * Otherwise, jump to &8E75 to advance (&39, &3A) to the next Keyword in the Keyword table and then
     jump to &8E46 to compare the character with the next BASIC Keyword.
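The keyword search can be sketched in Python as below. The three-entry table is a hypothetical stand-in for the real table at &8456 (which also holds a flag byte per Keyword and is not reproduced here); the token values and the names `KEYWORDS` and `match_keyword` are illustrative assumptions, and the line is assumed to end with a carriage return.

```python
# Hypothetical mini keyword table, ordered by first letter: (keyword, token).
KEYWORDS = [("GOTO", 0xE5), ("PRINT", 0xF1), ("REM", 0xF4)]


def match_keyword(line: str, pos: int):
    """Try to match a keyword at line[pos].

    Returns (token, characters_consumed), or None if the text is not a
    keyword and should be treated as a variable name instead.
    """
    first = line[pos]
    for word, token in KEYWORDS:
        if first < word[0]:
            return None              # table is ordered, so nothing later can match
        if first != word[0]:
            continue                 # try the next keyword
        i = 1
        while i < len(word) and line[pos + i] == word[i]:
            i += 1
        if i == len(word):
            return token, i          # every character matched
        if line[pos + i] == ".":
            return token, i + 1      # abbreviated keyword, e.g. "P."
    return None                      # end of table reached
```

With this table, `match_keyword("P.1\r", 0)` matches PRINT through the '.' abbreviation, while `match_keyword("GOSH\r", 0)` fails on GOTO and then stops at PRINT because 'G' is less than 'P'.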

[&8E25] Check for a variable name:
Call routine &8D84 to check whether character is valid within a Variable name (letter, '_', or digit).
If the character is not a valid variable name character then the character does not need to be tokenised, so set location
&3B to #&FF (to indicate that we are not at the start of a statement), zero location &3C (to indicate that a
Line Number is not expected at this point, as we are no longer at the start of the line) and jump to &8DAF to check the
next character on the command line.
Otherwise (a valid variable name character was found), [&8E2A] keep checking the character pointed to by (&37, &38).
If the character at (&37, &38) is a valid variable name character ('A'-'Z', '_' or '0'-'9') then increment the
(&37, &38) pointer and jump back to &8E2A to check the next character. This ensures that the entire variable name is
ignored, as it does not need to be tokenised.
When a non-variable-name character is found (we have reached the end of the variable name), store #&FF in location &3B
(meaning that we are not at the start of a statement), zero &3C (as a line number is not expected) and jump back to &8DB2
to check the new character for tokenisation.

[&8E84] Tokenise the Keyword:
Set X to the value in A. Now, X = the token value of the Keyword that was matched against the text at BASIC Text Pointer A,
and Y is the offset of the token value in the BASIC Keyword table (pointed to by (&39, &3A)).
Store the flag for the BASIC Keyword (from the BASIC Keyword table - (&39, &3A) + Y + 1) in location &3D.
This flag specifies certain attributes of that particular Keyword.

If bit 0 of the flag (meaning 'Don't tokenise if Keyword is followed by an alphabetic character') is set then:
   * Load the next character from the command line location (&37, &38) and call &8D84 to check whether the character
     is valid for a variable name (i.e. it's a digit, a letter or '_'). If it is a valid variable name character then do not
     tokenise the Keyword (as, in this context, it is not a Keyword, but part of a variable name) and jump to &8E2A to skip the rest
     of the characters in the variable name, store #&FF in location &3B (meaning that we are not at the start
     of a statement), zero location &3C (as a line number is not expected) and jump back to &8DB2 to check the new
     character (the next character after the variable name) for tokenisation.

[&8E95] Set A to the Token Value (in X).

If bit 6 of the &3D flag (meaning 'Pseudo Variable - where the keyword can be on either side of an assignment, i.e. PAGE= and =PAGE') is set then:
   * If location &3B is 0 (we are at the start of a statement, i.e. 'PAGE=') add #&40 to the token value, as Pseudo
     variable Keywords at the start of a statement are being assigned to, and so have a different token value.

[&8EA0] Decrement Y (to point to the last character of the Keyword).
Call routine &8CEB to replace the ASCII Keyword with the (1-byte) token value.

If bit 1 of the Keyword flag (meaning 'Go into middle of statement mode' - i.e. Keywords IF & LET) is set then:
   * Set location &3B to value #&FF and zero byte &3C. This sets the tokenise routine to middle of statement
     mode, and clears the 'Line Number expected' byte.

If bit 2 of the Keyword flag (meaning 'Go into Start of Statement mode' - i.e. Keywords THEN & FOR) is set then:
   * Clear location &3B (to tell the tokenise routine to go into start of statement mode) and clear location &3C
     (as a line number is no longer expected).

If bit 3 of the Keyword flag (meaning 'The Keyword is FN or PROC' - so don't tokenise the subroutine name) is set then:
   * Push A (the partly shifted Keyword flag) to the stack. Skip any variable name characters (letters, digits and '_'
     characters) on the program line after the 'FN' or 'PROC' token and then, after the name has been skipped (a
     non-variable name character is found), pull A (the Keyword flag) back from the stack.

If bit 4 of the Keyword flag (meaning 'Tokenise a Line Number next' - i.e. Keywords GOTO, GOSUB, ELSE, THEN) is set then:
   * Set location &3C to #&FF (i.e. tell the tokenise routine to expect a Line Number next - however, if no line
     number is found next then this flag is ignored).

If bit 5 of the Keyword flag (meaning 'Don't tokenise the rest of the line' - i.e. Keywords REM and DATA) is set then:
   * Exit from the tokenise line routine, so that the rest of the line is left untokenised (no more keywords are
     valid on this line after a REM or DATA keyword).
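The flag handling described above can be summarised with a short Python sketch. It models &3B and &3C as fields of a small state object; the class, the function names and the flag and token values used in the usage note are assumptions made for illustration, and the FN/PROC name skipping (bit 3) and the bit 0 check are left out.

```python
from dataclasses import dataclass


@dataclass
class State:
    mid_statement: int = 0x00        # models &3B (0 = start of statement)
    expect_line_number: int = 0xFF   # models &3C (0xFF = line number expected)


def adjust_token(token: int, flags: int, state: State) -> int:
    """Bit 6: a pseudo-variable at the start of a statement gets token + &40."""
    if flags & 0x40 and state.mid_statement == 0:
        return token + 0x40
    return token


def apply_flags(flags: int, state: State) -> bool:
    """Update the state from the keyword flag byte, testing bits in the
    same order as the routine. Returns True if tokenising should stop."""
    if flags & 0x02:                 # bit 1: go into middle-of-statement mode
        state.mid_statement = 0xFF
        state.expect_line_number = 0x00
    if flags & 0x04:                 # bit 2: go into start-of-statement mode
        state.mid_statement = 0x00
        state.expect_line_number = 0x00
    # bit 3 (skip the FN/PROC name) is not modelled in this sketch
    if flags & 0x10:                 # bit 4: a line number may follow
        state.expect_line_number = 0xFF
    return bool(flags & 0x20)        # bit 5: leave the rest of the line alone
```

For instance, with an illustrative token value of &90 and only bit 6 set, `adjust_token(0x90, 0x40, State())` returns &D0 at the start of a statement and &90 once `mid_statement` is #&FF. Because bit 4 is tested after bits 1 and 2, a flag byte with bits 2 and 4 both set leaves the state at "start of statement, line number expected".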

Disassembly for the Tokenise Command Line routine

8DAF		032 162 141	20 A2 8D	JSR &8DA2 Increment (&37, &38) pointer
8DB2	7	178 055	B2 37	LDA (&37)
8DB4		201 013	C9 0D	CMP#&0D
8DB6	'	240 039	F0 27	BEQ 39 --> &8DDF [RTS (exit when &0D char found)]
8DB8		201 032	C9 20	CMP#&20
8DBA		240 243	F0 F3	BEQ -13 --> &8DAF
8DBC	&	201 038	C9 26	CMP#&26
8DBE		208 016	D0 10	BNE 16 --> &8DD0
8DC0		032 169 141	20 A9 8D	JSR &8DA9 Increment and read character at (&37, &38) pointer
8DC3		032 148 141	20 94 8D	JSR &8D94 Check for numeric digit [Line Number]
8DC6		176 248	B0 F8	BCS -8 --> &8DC0
8DC8	A	201 065	C9 41	CMP#&41
8DCA		144 230	90 E6	BCC -26 --> &8DB2 Continue to Tokenise
8DCC	G	201 071	C9 47	CMP#&47
8DCE		144 240	90 F0	BCC -16 --> &8DC0
8DD0	"	201 034	C9 22	CMP#&22
8DD2		208 012	D0 0C	BNE 12 --> &8DE0
8DD4		032 169 141	20 A9 8D	JSR &8DA9 Increment and read character at (&37, &38) pointer
8DD7	"	201 034	C9 22	CMP#&22
8DD9		240 212	F0 D4	BEQ -44 --> &8DAF
8DDB		201 013	C9 0D	CMP#&0D
8DDD		208 245	D0 F5	BNE -11 --> &8DD4
8DDF	`	096	60	RTS
8DE0	:	201 058	C9 3A	CMP#&3A
8DE2		208 009	D0 09	BNE 9 --> &8DED
8DE4		032 162 141	20 A2 8D	JSR &8DA2 Increment (&37, &38) pointer
8DE7	d;	100 059	64 3B	STZ &3B
8DE9	d<	100 060	64 3C	STZ &3C
8DEB		128 197	80 C5	BRA -59 --> &8DB2 Continue to Tokenise
8DED	,	201 044	C9 2C	CMP#&2C
8DEF		240 190	F0 BE	BEQ -66 --> &8DAF
8DF1	*	201 042	C9 2A	CMP#&2A
8DF3		208 012	D0 0C	BNE 12 --> &8E01
8DF5	;	165 059	A5 3B	LDA &3B
8DF7		240 230	F0 E6	BEQ -26 --> &8DDF [RTS (as '*' Star command, don't tokenise line)]
8DF9		162 255	A2 FF	LDX#&FF
8DFB	;	134 059	86 3B	STX &3B
8DFD	d<	100 060	64 3C	STZ &3C
8DFF		128 174	80 AE	BRA -82 --> &8DAF
8E01	.	201 046	C9 2E	CMP#&2E
8E03		240 014	F0 0E	BEQ 14 --> &8E13
8E05		032 148 141	20 94 8D	JSR &8D94 Check for numeric digit [Line Number]
8E08	,	144 044	90 2C	BCC 44 --> &8E36
8E0A	<	166 060	A6 3C	LDX &3C
8E0C		240 005	F0 05	BEQ 5 --> &8E13
8E0E		032 004 141	20 04 8D	JSR &8D04 Tokenise Line Number
8E11		144 156	90 9C	BCC -100 --> &8DAF
8E13	7	178 055	B2 37	LDA (&37)
8E15		032 155 141	20 9B 8D	JSR &8D9B If character is not "." then check for Digit (Carry is set if found)
8E18		144 005	90 05	BCC 5 --> &8E1F
8E1A		032 162 141	20 A2 8D	JSR &8DA2 Increment (&37, &38) pointer
8E1D		128 244	80 F4	BRA -12 --> &8E13
8E1F		162 255	A2 FF	LDX#&FF
8E21	;	134 059	86 3B	STX &3B
8E23		128 196	80 C4	BRA -60 --> &8DE9
8E25		032 132 141	20 84 8D	JSR &8D84 Check whether character is valid within a Variable name (letter, '_', or digit)
8E28		144 207	90 CF	BCC -49 --> &8DF9
8E2A	7	178 055	B2 37	LDA (&37)
8E2C		032 132 141	20 84 8D	JSR &8D84 Check whether character is valid within a Variable name (letter, '_', or digit)
8E2F		144 238	90 EE	BCC -18 --> &8E1F
8E31		032 162 141	20 A2 8D	JSR &8DA2 Increment (&37, &38) pointer
8E34		128 244	80 F4	BRA -12 --> &8E2A
8E36	A	201 065	C9 41	CMP#&41
8E38		144 191	90 BF	BCC -65 --> &8DF9
8E3A	X	201 088	C9 58	CMP#&58
8E3C		176 231	B0 E7	BCS -25 --> &8E25
8E3E	V	162 086	A2 56	LDX#&56
8E40	9	134 057	86 39	STX &39
8E42		162 132	A2 84	LDX#&84
8E44	:	134 058	86 3A	STX &3A
8E46		160 000	A0 00	LDY#&00
8E48	9	210 057	D2 39	CMP (&39)
8E4A		144 222	90 DE	BCC -34 --> &8E2A
8E4C		208 015	D0 0F	BNE 15 --> &8E5D
8E4E		200	C8	INY
8E4F	9	177 057	B1 39	LDA (&39),Y
8E51	01	048 049	30 31	BMI 49 --> &8E84
8E53	7	209 055	D1 37	CMP (&37),Y
8E55		240 247	F0 F7	BEQ -9 --> &8E4E
8E57	7	177 055	B1 37	LDA (&37),Y
8E59	.	201 046	C9 2E	CMP#&2E
8E5B		240 011	F0 0B	BEQ 11 --> &8E68
8E5D		200	C8	INY
8E5E	9	177 057	B1 39	LDA (&39),Y
8E60		016 251	10 FB	BPL -5 --> &8E5D
8E62		201 254	C9 FE	CMP#&FE
8E64		208 015	D0 0F	BNE 15 --> &8E75
8E66		176 194	B0 C2	BCS -62 --> &8E2A
8E68		200	C8	INY
8E69	9	177 057	B1 39	LDA (&39),Y
8E6B	0	048 023	30 17	BMI 23 --> &8E84
8E6D	9	230 057	E6 39	INC &39
8E6F		208 248	D0 F8	BNE -8 --> &8E69
8E71	:	230 058	E6 3A	INC &3A
8E73		128 244	80 F4	BRA -12 --> &8E69
8E75	8	056	38	SEC
8E76		200	C8	INY
8E77		152	98	TYA
8E78	e9	101 057	65 39	ADC &39
8E7A	9	133 057	85 39	STA &39
8E7C		144 002	90 02	BCC 2 --> &8E80
8E7E	:	230 058	E6 3A	INC &3A
8E80	7	178 055	B2 37	LDA (&37)
8E82		128 194	80 C2	BRA -62 --> &8E46
8E84		170	AA	TAX
8E85		200	C8	INY
8E86	9	177 057	B1 39	LDA (&39),Y
8E88	=	133 061	85 3D	STA &3D
8E8A		136	88	DEY
8E8B	J	074	4A	LSR A
8E8C		144 007	90 07	BCC 7 --> &8E95
8E8E	7	177 055	B1 37	LDA (&37),Y
8E90		032 132 141	20 84 8D	JSR &8D84 Check whether character is valid within a Variable name (letter, '_', or digit)
8E93		176 149	B0 95	BCS -107 --> &8E2A
8E95		138	8A	TXA
8E96	$=	036 061	24 3D	BIT &3D
8E98	P	080 006	50 06	BVC 6 --> &8EA0
8E9A	;	166 059	A6 3B	LDX &3B
8E9C		208 002	D0 02	BNE 2 --> &8EA0
8E9E	i@	105 064	69 40	ADC#&40
8EA0		136	88	DEY
8EA1		032 235 140	20 EB 8C	JSR &8CEB Replace untokenised value with token
8EA4		162 255	A2 FF	LDX#&FF
8EA6	=	165 061	A5 3D	LDA &3D
8EA8	J	074	4A	LSR A
8EA9	J	074	4A	LSR A
8EAA		144 004	90 04	BCC 4 --> &8EB0
8EAC	;	134 059	86 3B	STX &3B
8EAE	d<	100 060	64 3C	STZ &3C
8EB0	J	074	4A	LSR A
8EB1		144 004	90 04	BCC 4 --> &8EB7
8EB3	d;	100 059	64 3B	STZ &3B
8EB5	d<	100 060	64 3C	STZ &3C
8EB7	J	074	4A	LSR A
8EB8		144 016	90 10	BCC 16 --> &8ECA
8EBA	H	072	48	PHA
8EBB		160 001	A0 01	LDY#&01
8EBD	7	177 055	B1 37	LDA (&37),Y
8EBF		032 132 141	20 84 8D	JSR &8D84 Check whether character is valid within a Variable name (letter, '_', or digit)
8EC2		144 005	90 05	BCC 5 --> &8EC9
8EC4		032 162 141	20 A2 8D	JSR &8DA2 Increment (&37, &38) pointer
8EC7		128 244	80 F4	BRA -12 --> &8EBD
8EC9	h	104	68	PLA
8ECA	J	074	4A	LSR A
8ECB		144 002	90 02	BCC 2 --> &8ECF
8ECD	<	134 060	86 3C	STX &3C
8ECF	J	074	4A	LSR A
8ED0		176 013	B0 0D	BCS 13 --> &8EDF
8ED2	L	076 175 141	4C AF 8D	JMP &8DAF Keep tokenising until end of line found

If character is not "." then check for digit (Line Number) (carry set if found)

8D9B	.	201 046	C9 2E	CMP#&2E
8D9D		208 245	D0 F5	BNE -11 --> &8D94 Check for numeric digit [Line Number]
8D9F	`	096	60	RTS