April 15, 2022 11:13 pm GMT

Handling text in programming, where to start from as a student

Ever since the general agreement that printing Hello, World! should be your first program in every programming language, and until you become a regex enthusiast, you'll always be handling text one way or another. You'll notice that this is a subject of its own once you start using a somewhat lower-level language like Java, as I did when I replaced JavaScript with it.

This article is intended for beginners who want to know which basic subjects of text handling in programming to study.

Terminology

First of all, we need to know what each part of a text/character is called. In other words, we need to know some terms that surround this subject.

  1. Diacritic marks: Marks placed above or below (or sometimes next to) a letter in a word to indicate a particular pronunciation in regard to accent, tone, or stress as well as meaning, especially when a homograph exists without the marked letter or letters;

  2. Scripts: A writing system or an orthography, consisting of a set of visible marks, forms, or structures called characters or graphs that are related to some structure in the linguistic system;

  3. Character: Single visual object used to represent text, numbers, or symbols;

  4. String: Data values that are made up of ordered sequences of characters, such as "hello world". A string can contain any sequence of characters, visible or invisible, and characters may be repeated;

  5. Character set: Collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, while the Greek character set is used only by the Greek language;

  6. Coded character set: Character set in which each character corresponds to a unique number;

  7. Code point: Any allowed value in the character set or code space;

  8. Code space: Range of integers whose values are code points. Some sources say this range of integers is directly related to the number of possible characters of a given encoding system;

  9. Code unit: Word size of the character encoding scheme, such as 7-bit, 8-bit, 16-bit. In some schemes, some characters are encoded using multiple code units, resulting in a variable-length encoding. A code unit is referred to as a code value in some documents;

  10. Octet: Eight-bit byte (Bytes are not eight bits in all computer systems).
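The difference between code units and code points can be seen directly in Java, whose strings use 16-bit code units (UTF-16): a character outside the Basic Multilingual Plane, like an emoji, takes two code units (a surrogate pair) but is still a single code point. A minimal sketch:

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        String s = "A😀"; // 'A' (U+0041) plus an emoji outside the BMP (U+1F600)
        // length() counts 16-bit code units: 1 for 'A', 2 for the emoji's surrogate pair
        System.out.println(s.length()); // 3
        // codePointCount() counts actual characters (code points)
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```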

Standards and Encoding

In the mid-1960s, the US settled on the ASCII (American Standard Code for Information Interchange) standard to define the characters and their encoding for any teleprinter made to write in the English language. ASCII has a 7-bit code unit; in other words, it has a range of 128 possible characters (0–127) covering letters, digits, punctuation, and some control characters like backspace or new line.
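Because ASCII assigns contiguous numbers to letters and digits (and Unicode keeps those same values), you can see the codes directly by casting a char to an int, as in this quick sketch:

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // In ASCII (and in Unicode, which kept ASCII's values),
        // 'A' is 65, 'a' is 97 and '0' is 48
        System.out.println((int) 'A');  // 65
        System.out.println((int) 'a');  // 97
        System.out.println((int) '0');  // 48
        // Control characters sit below 32, e.g. the new line character:
        System.out.println((int) '\n'); // 10
    }
}
```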

ASCII wasn't the only standard for character encoding, though: some countries made their own standards based on ASCII, and other countries, whose alphabets had nothing to do with English's, made their standards from scratch. Then computers happened, and soon (though it was not common at all) people had the opportunity to send documents across different countries. This was so much of a mess that Japan, which had 4 different encoding systems completely incompatible with each other, even invented a word for when you try to read a document written in a different encoding system and the characters get messy: Mojibake (文字化け). For this purpose, it was often just better to send a fax across the world.

Then the Internet happened, and suddenly it was easy to have a live conversation with someone in another country, so the encoding system that ran on your computer became an important subject of discussion. At this point, we needed a standard and an encoding system that were compatible with every language, and that could also meet some computers' specific criteria, as some of them at the time interpreted 8 zero bits in a row (a null byte) as the end of a string. For that, Unicode was created.

Unicode is a standard (and only a standard, as opposed to ASCII) that defines hundreds of thousands of characters (for now), covering 159 modern and historic scripts, as well as symbols, emojis, and non-visual control and formatting codes. To encode all these characters, the UTF-8 encoding system was created, which was not the first, but is the most popular to this day, accounting for 98% of all web pages, and up to 100.0% for some languages (as of 2022).

As opposed to ASCII, which simply assigned a character to every natural number that can be represented by 7 bits, UTF-8 does it differently, and it is beautifully thought out. To begin with, each character is assigned a natural number, like upper case A to 65 and lower case a to 97 (just like ASCII, for compatibility purposes). After that, each character's encoding (up to 4 bytes) is divided into sections. The leading bits of the first byte define how many bytes the character takes, so if the first byte starts with 110 (the zero meaning "stop counting"), the character takes 2 bytes. Every byte besides the first one must start with 10, marking it as a continuation of the previous one. The remaining bits are concatenated into a number, usually written in hexadecimal, that forms the code point, which, alongside the prefix U+, identifies the character, like so:

Character   Unicode code point   Binary UTF-8
:           U+003A               00111010
狈          U+72C8               11100111 10001011 10001000
Æ           U+00C6               11000011 10000110
§           U+00A7               11000010 10100111
∑           U+2211               11100010 10001000 10010001
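You can verify this byte layout yourself in Java by encoding a character as UTF-8 and printing each byte in binary; a small sketch using the section sign from the table:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // Encode the section sign (§, U+00A7) and print each byte in binary
        for (byte b : "§".getBytes(StandardCharsets.UTF_8)) {
            // b & 0xFF yields the unsigned byte value; pad to 8 binary digits
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF))
                    .replace(' ', '0');
            System.out.print(bits + " ");
        }
        // Prints: 11000010 10100111
        // Two bytes: the first starts with 110, the continuation with 10
    }
}
```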

There are multiple encoding systems, such as UTF-16, UTF-32, UCS-2, etc. Each one has its advantages and disadvantages, so it's worth reading about them once you have to choose or work with one. I'm not covering all of them here because that's not my objective.

Recommended reading: Unicode at Wikipedia, UTF-8 at Wikipedia, UTF-16 at Wikipedia, UTF-32 at Wikipedia.

Unicode APIs

Character set

Most languages have APIs to handle code points, and that's all we need. As you'll probably be using UTF-8 or UTF-16 anyway, there are not many use cases for detecting the encoding system of a string (or character set, as some languages call it), but if you want to, and you are using Java 11, you can use a library like Apache Tika to detect the encoding of a String:

import org.apache.tika.parser.txt.CharsetDetector;
import java.nio.charset.Charset;

public class CharsetHandler {
  // Get default text encoding for this JVM, if you wish
  public String defaultCharset = Charset.defaultCharset().name();

  public static void main(String[] args) {
     CharsetDetector detector = new CharsetDetector();
     String ascii = "Test";
     String unicode = "日本語"; // any non-ASCII text works here
     detector.setText(ascii.getBytes());
     System.out.println(detector.detect().getName()); // Output: "ISO-8859-2"
     detector.setText(unicode.getBytes());
     System.out.println(detector.detect().getName()); // Output: "UTF-8"
  }
}

If you take some time to test it yourself, you'll notice that this is not consistent at all. The string "English" gives you the output UTF-8, the string "Test" gives you ISO-8859-2, and the string "Testenglish" gives you ISO-8859-1. This is an inherent issue of this kind of operation. Tika's documentation explicitly says:

Character set detection is at best an imprecise operation. The detection process will attempt to identify the charset that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results can not be guaranteed to always be correct.

If your application depends on operating with a specific text encoding, you can set it on your database, JVM (in case of using Java), toolkit (like GTK, in case of GUI applications), web browser (by specifying it in your HTML file), etc. If your language does not support custom encodings at runtime, you can take text in a foreign encoding, convert it to your language's encoding, and export it back to the original encoding if you ever need to. As your language's default encoding will almost certainly be UTF-8 or UTF-16 (depending on your operating system), there's no need to be afraid of incompatibility between encodings. As I've already mentioned, Unicode has more than a hundred thousand characters available, and this number is far away from UTF-8's and UTF-16's limits.
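That convert-and-export round trip can be sketched in Java with the standard charset APIs (the choice of ISO-8859-1 below is only an example of a "foreign" encoding):

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        String original = "café"; // é exists in both Unicode and ISO-8859-1
        // Export the text in a foreign encoding
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        // Convert it back: decoding with the same charset is lossless
        String decoded = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals(original)); // true
        // Decoding the same bytes as UTF-8 instead would give you mojibake
        System.out.println(new String(latin1, StandardCharsets.UTF_8));
    }
}
```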

Code unit

Knowing that detecting a character set is at best an imprecise operation, and that you probably already know the encoding if you are working with a database or a GUI framework (thus knowing its code unit), there are not many reasons to get this information through code. If you want it, you can use a method from your language or check its documentation. There's no reason to do something like this:

public static long[] minAndMaxCodeUnits(String input) {
  char lowerCodePointChar =
        Character.toChars(input.codePoints().min().getAsInt())[0];
  char higherCodePointChar =
        Character.toChars(input.codePoints().max().getAsInt())[0];
  long[] result = {
        InstrumentationAgent.getObjectSize(lowerCodePointChar),
        InstrumentationAgent.getObjectSize(higherCodePointChar)
  };
  return result;
}

First of all, because your language probably can't handle a per-string character set, so every string will have the same one, and thus the same code unit; and also because some languages, like Java, treat some or all values as objects, so most values will have different sizes in memory.

Code point and Code space

Though a code point is conventionally written as a hexadecimal number, the number of possible characters in UTF-8 and UTF-16 is small enough to be represented by a 4-byte integer, which is the data type used by code point operations. For example, we can write a method that applies the Caesar cipher to a string:

public static String caesarCipher(String input, int shift) {
  IntStream stream = input.codePoints().map((codepoint) -> {
     boolean isLetter = Character.isLetter(codepoint);
     int newcodepoint;
     if (!isLetter)
        return codepoint;
     else
        newcodepoint = codepoint + shift;
     if (!Character.isLetter(newcodepoint))
        newcodepoint -= 26; // wrap back into the letter range
     return newcodepoint;
  });
  int[] result = stream.toArray();
  // Use the array's length: input.length() counts UTF-16 code units,
  // which may differ from the number of code points
  return new String(result, 0, result.length);
}

Or a method that reverts the case of every letter:

public static String revertCase(String input) {
  IntStream stream = input.codePoints().map((codepoint) -> {
     if (!Character.isLetter(codepoint))
        return codepoint;
     else if (codepoint > 96) // lower case ASCII letters start at 97
        return codepoint - 32;
     else
        return codepoint + 32;
  });
  int[] result = stream.toArray();
  return new String(result, 0, result.length);
}

The latter example could be done in other ways, such as this one. Like every single mathematical operation in programming, the limit is your creativity (check out the Fast Inverse Square Root from Quake III to see what I'm talking about).
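As a sketch of one such alternative, java.lang.Character's built-in case tests and mappings can replace the magic numbers 96 and 32, and they also behave better outside plain ASCII:

```java
import java.util.stream.IntStream;

public class CaseInverter {
    public static String invertCase(String input) {
        IntStream stream = input.codePoints().map(codepoint -> {
            // Let the Unicode tables decide what "upper" and "lower" mean
            if (Character.isUpperCase(codepoint))
                return Character.toLowerCase(codepoint);
            if (Character.isLowerCase(codepoint))
                return Character.toUpperCase(codepoint);
            return codepoint; // digits, punctuation, etc. pass through
        });
        int[] result = stream.toArray();
        return new String(result, 0, result.length);
    }

    public static void main(String[] args) {
        System.out.println(invertCase("Hello, World!")); // hELLO, wORLD!
    }
}
```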

Both examples are also good samples of controlling a string's code space. In them, what we needed to do was keep each character's code point in the ranges where letters lie (65 <= x <= 90 for upper case, and 97 <= x <= 122 for lower case).

Regular Expressions

A regular expression is like a language built into another language. It works by interpreting a string of specific characters in a specific order as a complex operation that returns matches for a given pattern in a given string. You can then choose what to do with these substrings.
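As a minimal sketch of that idea in Java, the Pattern and Matcher classes compile a pattern once and then walk a string for its matches:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBasics {
    public static void main(String[] args) {
        // \d+ matches one or more digits; each match is a substring we can use
        Pattern digits = Pattern.compile("\\d+");
        Matcher m = digits.matcher("item 42, item 7");
        while (m.find()) {
            System.out.println(m.group()); // prints 42, then 7
        }
    }
}
```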

Here's an operation with regular expressions that removes all emojis from a string:

public String removeEmoji(String input) {
  // Keep letters (L), numbers (N), punctuation (P) and separators (Z);
  // strip everything else, which includes emojis
  String regex = "[^\\p{L}\\p{N}\\p{P}\\p{Z}]";
  return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
        .matcher(input)
        .replaceAll("");
}

Then an operation that replaces every whitespace character with an underscore:

public String replaceWhitespaceByUnderline(String input) {
  String regex = "[\\s]"; // any whitespace character
  return input.replaceAll(regex, "_");
}

And a more complex operation that takes any kebab-case words and turns them into camel-case words:

public String kebabCase2CamelCase(String input) {
  String regex = "(?:([\\p{IsAlphabetic}]*)(-[\\p{IsAlphabetic}]+))+";
  Matcher m = Pattern.compile(regex).matcher(input);
  boolean hasSubSequence = m.find();
  if (hasSubSequence) {
     Matcher kebabCaseMatches = Pattern.compile(regex).matcher(input);
     while (kebabCaseMatches.find()) {
        String currentOccurence = kebabCaseMatches.group();
        while (currentOccurence.contains("-")) {
           // Indented for better understanding
           currentOccurence = currentOccurence.replaceFirst(
                 "-[\\p{IsAlphabetic}]",
                 Character.toString(
                       Character.toUpperCase(
                             currentOccurence.charAt(currentOccurence.indexOf("-") + 1)
                       )
                 )
           );
        }
        input = input.replaceFirst(regex, currentOccurence);
     }
  }
  return input;
}

Recommended reading: Regular-Expressions.info, or regexr.com if you want to practice.

Conclusion

Handling text in programming is a complex task: it is easy to overcomplicate it from a design standpoint, and even the tools created to make it easier are not simple themselves.

Depending on what you want to do, you can get away with a very simple regular expression, and to some extent (a really low one) it's not so complicated. Code point operations, though, require some creativity, and even more to be done efficiently.

I'd like to quote u/blablahblah on Reddit:

Keep in mind this all becomes way more complicated if you deal with non-English text.

For example, the German letter ß was historically capitalized as "SS", not exactly the same thing as subtracting 32 from the code point. They have a capital version of that letter now (ẞ), but it's not 32 before the lower case version.

For a Caesar cypher, what do you do if you get Spanish text with an á or ñ in it? Does it differ if the á is represented as one code point (U+00E1) or two (U+0301 U+0061)? Unicode normalization, identifying that those mean the same thing, is a huge and complicated thing.

Note that I'm not trying to make it look easy. If you ever need to choose or work directly with an encoding system or standard, or make complex operations with text, it is highly recommended that you study this subject, as it is for every subject in programming.


Original Link: https://dev.to/marcosdly/handling-text-in-programming-where-to-start-from-53j6
