String Tokenizer in C#

Muhammad Maisam Abbas Feb 16, 2024
This tutorial will discuss tokenizing a string into multiple sub-strings in C#.

String Tokenizer Using the String.Split() Function in C#

In natural language processing, string tokenization is the process of splitting a sentence into its individual words. These individual words are called tokens.

We have the StringTokenizer class in Java for similar purposes. In C#, we don’t directly have an implementation of the StringTokenizer class, but we can achieve similar results using the String.Split() function available in C#.

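The String.Split() function divides a given string into an array of sub-strings based on a separator or delimiter. The separator is passed as a character, a string, or an array of characters or strings, and the function returns an array of sub-strings. It does not interpret the separator as a regular expression; pattern-based splitting is handled by the separate Regex.Split() method.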

To tokenize a given string, we can divide it into substrings using a blank space as a separator or delimiter.

The following code snippet shows how we can use the String.Split() function to tokenize a string in C#.

string inputString = "This is some input String";
string[] tokens = inputString.Split(' ');
foreach (string token in tokens) {
  Console.WriteLine(token);
}

Output:

This
is
some
input
String

The output shows the original string This is some input String divided into individual words with the String.Split() method in C#.
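Splitting on a single space produces empty tokens whenever the input contains consecutive spaces, and it does not treat tabs or newlines as separators. The following is a minimal sketch, assuming the goal is plain whitespace tokenization: when the separator array is null, String.Split() falls back to white-space characters as delimiters, and StringSplitOptions.RemoveEmptyEntries drops the empty tokens. The sample string is made up for illustration, and the snippet assumes C# 9 or later top-level statements.

```csharp
using System;

// Hypothetical input with a double space, a tab, and a triple space.
string messyInput = "This  is\tsome   input String";

// A null separator array tells Split to use white-space characters as delimiters.
// RemoveEmptyEntries discards the empty tokens produced by adjacent delimiters.
string[] words = messyInput.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

foreach (string word in words) {
  Console.WriteLine(word);
}
```

Output:

```text
This
is
some
input
String
```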

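The String.Split() function is also flexible about delimiters. Besides a single character, it accepts an array of characters or strings, so the input string can be split on several delimiters at once. Java's StringTokenizer offers something comparable by treating every character of its delimiter string as a separate delimiter.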

The following code snippet shows an example to demonstrate the power of the String.Split() function.

string inputString =
    "This is some input String, but, is it actually a good string? The answer is upto you.";
string[] tokens = inputString.Split(new char[] { ' ', ',', '?' });
foreach (string token in tokens) {
  Console.WriteLine(token);
}

Output:

This
is
some
input
String

but

is
it
actually
a
good
string

The
answer
is
upto
you.

The above code snippet takes the input string:

This is some input String, but, is it actually a good string? The answer is upto you.

The code splits it into tokens based on multiple delimiters. The empty entries in the output can be removed by specifying StringSplitOptions.RemoveEmptyEntries as a second parameter to the String.Split() function.
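The call with StringSplitOptions.RemoveEmptyEntries can be sketched as follows. It reuses the sample sentence from the snippet above, and the variable names are only illustrative. The snippet assumes C# 9 or later top-level statements.

```csharp
using System;

string sentence =
    "This is some input String, but, is it actually a good string? The answer is upto you.";
char[] delimiters = { ' ', ',', '?' };

// Default behavior: adjacent delimiters such as ", " yield empty strings.
string[] withEmpty = sentence.Split(delimiters);

// RemoveEmptyEntries as the second argument filters those empty strings out.
string[] cleaned = sentence.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(withEmpty.Length);  // 20 (includes 3 empty entries)
Console.WriteLine(cleaned.Length);    // 17
```

The period in "you." remains part of the last token because '.' is not in the delimiter array.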

One advantage of Java's StringTokenizer class over this method is that it can optionally return the delimiters themselves as tokens, through the returnDelims flag of its constructor, whereas the String.Split() function always discards the delimiters.
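If the delimiters need to be kept in C#, one possible workaround is Regex.Split() from System.Text.RegularExpressions. This swaps in a different method from the String.Split() approach used in this tutorial: when the pattern contains a capturing group, the captured text is included in the resulting array. The sketch below uses a short made-up input and assumes C# 9 or later top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input for illustration.
string text = "a,b c";

// The parentheses form a capturing group, so each matched delimiter
// is returned as its own element between the surrounding tokens.
string[] parts = Regex.Split(text, "([ ,])");

foreach (string part in parts) {
  Console.WriteLine($"[{part}]");
}
```

Output:

```text
[a]
[,]
[b]
[ ]
[c]
```

Adjacent delimiters still produce empty strings between them, so the result may need filtering in the same way as with String.Split().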
