Do You Actually Know What A String In JavaScript Is? Here's What I Found.
We tend to think of a string in JavaScript as an array of characters.
const name = 'Nick'
console.log(name.length) // 4
The variable name has 4 characters (N, i, c, k), and its length is also 4.
Everything seems logical.
Let's go further and add an emoji to my name.
const name = 'Nick 🐃'
console.log(name.length) // 7
Hmm, strange.
The variable name should have 6 characters: N, i, c, k, a space, and 🐃.
But it has 7.
It seems the bull counts as 2 characters.
const emoji = '🐃'
console.log(emoji.length) // 2
Interesting
Let's figure out why.
We go to the official ECMAScript specification (ECMAScript is the language standard that JavaScript implements).
Scroll to 6.1.4 The String Type.
And find this:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values ("elements") up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.
So a string in JavaScript is a sequence of UTF-16 code unit values, and .length counts those code units, not characters.
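A quick way to see the gap between code units and characters is to compare .length with the number of items the string iterator yields. This is only a minimal sketch: the variable names are illustrative, and the emoji is written as escape sequences so the snippet does not depend on the file's encoding.

```javascript
// '\ud83d\udc03' is the emoji from the example (U+1F403) as two UTF-16 code units.
const text = 'Nick \ud83d\udc03'

// .length counts UTF-16 code units.
console.log(text.length) // 7

// The string iterator (used by spread and Array.from) walks code points instead.
const codePoints = [...text]
console.log(codePoints.length) // 6

// Indexing works on code units, so positions 5 and 6 each hold half of the emoji.
console.log(text.charCodeAt(5).toString(16)) // d83d
console.log(text.charCodeAt(6).toString(16)) // dc03
```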
What is UTF-16?
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
One UTF-16 code unit value is a number from 0x0000 to 0xFFFF.
What are 0x0000 and 0xFFFF?
The 0x prefix marks a number written in the hexadecimal numeral system. Hexadecimal, often shortened to "hex", is a numeral system made up of 16 symbols (base 16). The standard numeral system is called decimal (base 10) and uses ten symbols: 0,1,2,3,4,5,6,7,8,9. Hexadecimal uses the decimal digits plus six extra symbols: A, B, C, D, E, F. So 0x0000 is 0 and 0xFFFF is 65535.
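The hex notation can be checked directly in JavaScript, since hex literals are ordinary numbers. The values below are just illustrative picks from this article (78 is the code unit for N).

```javascript
// Hexadecimal literals are ordinary numbers written in base 16.
console.log(0x0000) // 0
console.log(0xffff) // 65535
console.log(0xffff === 16 ** 4 - 1) // true

// Converting between decimal and hex notation.
console.log((78).toString(16)) // 4e
console.log(parseInt('4e', 16)) // 78
```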
If we convert my name Nick to UTF-16 (the way JavaScript sees it), we get 0x004e 0x0069 0x0063 0x006b.
0x004e = N
0x0069 = i
0x0063 = c
0x006b = k
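The mapping above can be reproduced with charCodeAt, which returns the code unit at a given index. The helper name toCodeUnits is hypothetical and exists only for this sketch.

```javascript
// Hypothetical helper: lists each UTF-16 code unit of a string as a 4-digit hex value.
function toCodeUnits(str) {
  const units = []
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).padStart(4, '0')
    units.push('0x' + hex)
  }
  return units
}

console.log(toCodeUnits('Nick').join(' '))
// 0x004e 0x0069 0x0063 0x006b
```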
But how does JavaScript treat emojis?
In UTF-16, Unicode characters from the Basic Multilingual Plane (BMP, which contains characters for almost all modern languages) are encoded with one code unit.
Characters outside the BMP, in the supplementary planes (emojis, musical notation, playing cards, hieroglyphs, etc.), require two code units, known as a surrogate pair.
So UTF-16 represents the bull emoji with two code units (0xd83d 0xdc03).
That's why .length gives 2.
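Where do 0xd83d and 0xdc03 come from? The emoji's code point is U+1F403, and UTF-16 splits any code point above 0xFFFF into two 10-bit halves. The sketch below redoes that arithmetic by hand; toSurrogatePair is a hypothetical helper written for illustration, while String.fromCodePoint, String.fromCharCode and codePointAt are built in.

```javascript
// Hypothetical helper: splits a supplementary code point (above 0xFFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000 // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10) // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff) // bottom 10 bits
  return [high, low]
}

const [high, low] = toSurrogatePair(0x1f403)
console.log(high.toString(16), low.toString(16)) // d83d dc03

// The built-in methods agree with the manual calculation.
const emoji = String.fromCodePoint(0x1f403)
console.log(emoji.length) // 2
console.log(emoji.codePointAt(0).toString(16)) // 1f403
console.log(emoji === String.fromCharCode(high, low)) // true
```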
To consolidate everything we have learned, let's play a little with Unicode and JavaScript.
const name = 'Nick'
const nameInUnicode = '\u004e\u0069\u0063\u006b'
console.log(name === nameInUnicode) // true
console.log(nameInUnicode.length) // 4

const fullName = 'Nick 🐃'
const fullNameInUnicode = '\u004e\u0069\u0063\u006b\u0020\ud83d\udc03'
console.log(fullName === fullNameInUnicode) // true
console.log(fullNameInUnicode.length) // 7
What is \u?
In a JavaScript string, \u followed by exactly four hexadecimal digits is a Unicode escape sequence that represents the single UTF-16 code unit with that value. That is why the emoji above needs two escapes, \ud83d and \udc03.
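JavaScript also has a code point escape, \u{...}, added in ES2015, which accepts a full code point and lets the engine produce the two code units. A small sketch comparing the two forms (the variable names are illustrative):

```javascript
// \uXXXX takes exactly four hex digits, so each escape names one code unit.
const viaSurrogates = '\ud83d\udc03'

// \u{...} takes a whole code point, including ones above 0xFFFF.
const viaCodePoint = '\u{1f403}'

console.log(viaSurrogates === viaCodePoint) // true
console.log(viaCodePoint.length) // 2
```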
In the end
Knowing that a string in JavaScript is a sequence of UTF-16 code unit values can save you from surprising bugs when you work with characters outside the BMP, such as emojis.
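One such bug is truncating a string in the middle of an emoji. The sketch below shows slice() cutting a surrogate pair in half, along with one possible workaround; truncateByCodePoint is a hypothetical helper, not a built-in.

```javascript
const fullName = 'Nick \ud83d\udc03'

// slice() cuts by code unit, so it can leave a lone surrogate behind.
const broken = fullName.slice(0, 6)
console.log(broken.length) // 6
console.log(broken.charCodeAt(5).toString(16)) // d83d (only half of the emoji)

// Hypothetical helper: truncates by code point instead of by code unit.
function truncateByCodePoint(str, maxCodePoints) {
  return Array.from(str).slice(0, maxCodePoints).join('')
}

console.log(truncateByCodePoint(fullName, 6) === fullName) // true
console.log(truncateByCodePoint(fullName, 5) === 'Nick ') // true
```

Counting by code point handles a single emoji like this one, but some visible characters are built from several code points, so it is not a complete answer for every string.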
Original Link: https://dev.to/nickbulljs/do-you-actually-know-what-string-in-javascript-is-here-s-what-i-found-23l7