How to Remove HTML Tags From a String in C#

Saad Aslam Feb 02, 2024
  1. Use Regex to Remove HTML Tags From a String in C#
  2. Use HTML Agility Pack to Remove HTML Tags From a String in C#

In this post, we’ll demonstrate how to remove all HTML tags from a string without knowing which tags are included inside it.

There are several ways to do this, but none is guaranteed to remove every tag from every input. We'll look at two of them.

Use Regex to Remove HTML Tags From a String in C#

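// Requires: using System.Text.RegularExpressions;
public static string StripHTML(string input) {
  // Replace everything that looks like an opening or closing tag with nothing.
  return Regex.Replace(input, "<[a-zA-Z/].*?>", string.Empty);
}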

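This function takes a string and calls Regex.Replace() to replace every match of the pattern with an empty string. The pattern matches a <, followed by a letter or a slash, followed by as few characters as possible up to the next >, which is the general shape of an opening or closing tag.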

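This works for most simple markup, but not for every case. For example, the pattern skips comments such as <!-- ... -->, misses tags that span several lines (because . does not match a newline by default), and cuts a tag short when an attribute value contains a >. For input like that, you will need a real HTML parser or your own algorithm.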
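To make the behavior of the pattern concrete, here is a small sketch that runs the same regex on three inputs. The class name StripDemo and the sample strings are made up for illustration; only Regex.Replace and the pattern come from the function above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Same pattern as StripHTML above.
    public static string StripHTML(string input) =>
        Regex.Replace(input, "<[a-zA-Z/].*?>", string.Empty);

    public static void Main()
    {
        // Simple markup: every tag is removed.
        Console.WriteLine(StripHTML("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // A comment starts with "<!", which the pattern does not match.
        Console.WriteLine(StripHTML("a <!-- note --> b"));
        // prints: a <!-- note --> b

        // A ">" inside an attribute value ends the match too early.
        Console.WriteLine(StripHTML("<a title=\"x > y\">link</a>"));
        // prints:  y">link
    }
}
```

The last two cases show why the regex approach is only suitable for markup you control or already know to be simple.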

Use HTML Agility Pack to Remove HTML Tags From a String in C#

Another solution is to use the HTML Agility Pack, a third-party HTML parser that is available as the HtmlAgilityPack NuGet package.

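// Requires the HtmlAgilityPack package, plus:
// using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RmvTags(string d) {
  if (string.IsNullOrEmpty(d))
    return string.Empty;

  var doc = new HtmlDocument();
  doc.LoadHtml(d);

  // Tags that are allowed to stay in the output.
  var accTags = new String[] { "strong", "em", "u" };
  // Start with the top-level elements and text nodes.
  var n = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
  while (n.Count > 0) {
    var no = n.Dequeue();
    var pNo = no.ParentNode;

    if (!accTags.Contains(no.Name) && no.Name != "#text") {
      var cNo = no.SelectNodes("./*|./text()");

      // Move the children up in front of the node, and queue them for checking.
      if (cNo != null) {
        foreach (var c in cNo) {
          n.Enqueue(c);
          pNo.InsertBefore(c, no);
        }
      }
      // Drop the disallowed element itself.
      pNo.RemoveChild(no);
    }
  }
  return doc.DocumentNode.InnerHtml;
}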

This function removes every tag except strong, em, and u, and it keeps the raw text nodes. It takes the HTML string as its parameter d.

The if (string.IsNullOrEmpty(d)) line checks whether the string is null or empty and, if so, returns an empty string.

var doc = new HtmlDocument();
doc.LoadHtml(d);

These statements create a new HtmlDocument and parse the string into a tree of nodes, so the rest of the function works on elements and text nodes instead of raw characters.

The var accTags = new String[] { "strong", "em", "u" }; line lists the tags that are allowed to remain. You can add, change, or remove tags as your requirements dictate.

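The queue is first filled with the top-level elements and text nodes of the document. The while loop then dequeues one node at a time. If the node is an element that is not in accTags, its children are moved up in front of it and added to the queue, and the element itself is removed.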
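The same queue-based unwrapping can be tried without installing anything. The sketch below is not the HTML Agility Pack version: it swaps in System.Xml.Linq from the standard library, so it only accepts well-formed markup (every tag closed and properly nested). The names UnwrapDemo and Unwrap, and the wrapper element called root, are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class UnwrapDemo
{
    // Hypothetical helper: keeps the allowed elements and unwraps all others.
    public static string Unwrap(string xhtml, params string[] allowed)
    {
        // Wrap the fragment so that every node has a parent element.
        var root = XElement.Parse("<root>" + xhtml + "</root>",
                                  LoadOptions.PreserveWhitespace);

        // Seed the queue with the top-level nodes.
        var queue = new Queue<XNode>(root.Nodes());
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            if (node is XElement el && !allowed.Contains(el.Name.LocalName))
            {
                // Snapshot the children, then move each one in front of
                // the element and queue it for its own check.
                foreach (var child in el.Nodes().ToList())
                {
                    child.Remove();
                    el.AddBeforeSelf(child);
                    queue.Enqueue(child);
                }
                // Drop the disallowed element itself.
                el.Remove();
            }
        }

        // Serialize what is left inside the wrapper.
        return string.Concat(
            root.Nodes().Select(x => x.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        Console.WriteLine(
            Unwrap("<p>Hello <strong>big</strong> <i>world</i></p>", "strong", "em", "u"));
        // prints: Hello <strong>big</strong> world
    }
}
```

Real-world HTML is often not well-formed, which is the reason to prefer a tolerant HTML parser over an XML parser outside of a demonstration like this one.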

In RmvTags, the loop continues until the queue is empty. The function then returns the inner HTML of the document, which now contains only text and the allowed tags.
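When no tags at all should survive, a shorter route with the same library may be enough. The sketch below assumes the HtmlAgilityPack package is installed; the names PlainTextDemo and ToPlainText are made up for illustration. It reads the InnerText property of the document node and decodes entities with HtmlEntity.DeEntitize. Note that InnerText can also include the text found inside script and style elements, so treat this as a starting point and not as a sanitizer.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class PlainTextDemo
{
    // Hypothetical helper: returns the text of the document with all tags dropped.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of the descendant nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p>Tom &amp; <b>Jerry</b></p>"));
    }
}
```

Unlike RmvTags, this version has no list of allowed tags: the result is plain text only.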

As said earlier, there is no hard and fast rule for this task. There are multiple ways to do it, and none is completely reliable.

This code has only been tested on a small data set. Because user input can never be trusted, do not rely on it alone to sanitize untrusted HTML.

Author: Saad Aslam

I'm a Flutter application developer with 1 year of professional experience in the field. I've created applications for both Android and iOS, using AWS and Firebase as the backend. I've written articles relating to the theoretical and problem-solving aspects of C, C++, and C#. I'm currently enrolled in an undergraduate program for Information Technology.
