Swift - Capturing URLs, Hashtags and Twitter handles in a String

There are times when you want to find out whether a sentence contains any URLs, hashtags or Twitter handles. There can be many reasons for this, including parsing them out of JSON you have downloaded.

Parsing Strings

The first challenge is parsing the string; after all, you need to find out what it contains. The first thing many developers would suggest is regex (regular expressions). Regex is not an easy start for most: there is a syntax of special characters to learn, and then the captures have to be handled accordingly. So many shy away from it.

Alternatives

There are several ways to achieve something. While regex might be the more "I am a superior developer" way, there is a simpler, more logical way to achieve this. It might not be as fast or efficient, but it works, and you would not be using it in real-time apps. Lastly, if your data is coming in from REST APIs or as JSON, you have to process that anyway, so a couple of additional seconds on top of the time the user has already waited would not make much of a difference.

Step 1 - Split the sentence into words

This is the first step: get the words in the sentence. The simplest way is to split the sentence into components at each space.
import Foundation  // needed for components(separatedBy:)

let sentence = "This is a test sentence with a couple of words in it to split."
let words = sentence.components(separatedBy: " ")

print("There are \(words.count) words in this sentence")
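
Splitting at a single space leaves empty strings behind when the sentence contains double spaces, and it does not split at tabs or newlines. A minimal sketch of one way around that, assuming a hypothetical helper called splitIntoWords (the name is made up for this example), is to split on any whitespace character instead:

```swift
// Hypothetical helper, named here only for illustration.
// Splits on any whitespace (spaces, tabs, newlines) and drops empty pieces,
// which split(whereSeparator:) omits by default.
func splitIntoWords(_ sentence: String) -> [String] {
    return sentence
        .split(whereSeparator: { $0.isWhitespace })
        .map { String($0) }
}
```

Calling splitIntoWords("one  two\nthree") would give back three words, with no empty entries in between.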

Step 2 - Check if the word is special

The next check is to see whether the word is a special word, i.e. it starts with a special character or pattern. URLs mainly start with 'http://' or 'https://', Twitter handles start with '@' and hashtags with '#'.

So parsing this is easy; you can simply try:
for word in words {
    if word.hasPrefix("http://") || word.hasPrefix("https://") {
        print("This is a url \(word)")
    }
}
You wouldn't get any results, as there are no URLs in the sentence. Change the sentence to contain some URLs, hashtags and Twitter handles, like so:
let sentence = "This is a sentence to test if http://www.oz-apps.com is displayed as a #url and you can follow more articles on #swift at @LearnSwift or @OZApps."

To check for hashtags, test whether the word has a prefix of '#'; for Twitter handles, test for a prefix of '@'.
for word in words {
    if word.hasPrefix("http://") || word.hasPrefix("https://") {
        print("This is a url '\(word)'")
    } else if word.hasPrefix("@") {
        print("This word is a Twitter handle '\(word)'")
    } else if word.hasPrefix("#") {
        print("This word is a Twitter hashtag '\(word)'")
    }
}
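
Printing is fine for a quick test, but you will usually want to keep what you found. Here is a rough sketch of the same prefix checks that collects the results instead. The Token enum and the tokens(in:) function are names invented for this example, not part of any library:

```swift
// Hypothetical type for illustration: one case per kind of special word.
enum Token: Equatable {
    case url(String)
    case handle(String)
    case hashtag(String)
}

// Applies the same prefix checks as the loop above and returns the matches
// in the order they appear. Ordinary words are skipped.
func tokens(in sentence: String) -> [Token] {
    return sentence.split(separator: " ").compactMap { (piece: Substring) -> Token? in
        let word = String(piece)
        if word.hasPrefix("http://") || word.hasPrefix("https://") {
            return .url(word)
        } else if word.hasPrefix("@") {
            return .handle(word)
        } else if word.hasPrefix("#") {
            return .hashtag(word)
        }
        return nil
    }
}
```

The words are stored with their prefix intact, so a handle comes back as "@LearnSwift" rather than "LearnSwift"; drop the first character if you only need the name.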


Summary

It is as easy and simple as that. You can also try to delve into regex, which would be more accurate for parsing and validating a string. Here we are not really validating whether a web address follows the 'http://' or whether it is well formed; we are assuming that it is valid. Here's an example of a regex that matches email addresses, so you get an idea and can compare.

let emailRegEx: String = "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"

That is the regex pattern for matching an email address. You can use whatever works for you: if accuracy is important, use regex; otherwise simply use string functions to pick out your words.
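
For comparison, here is a rough sketch of what the regex route could look like for the same URLs, hashtags and handles, using Foundation's NSRegularExpression. The function name specialTokens(in:) and the pattern itself are assumptions made for this example; the pattern is deliberately loose and is not a validator.

```swift
import Foundation

// Hypothetical helper: returns every URL, @handle and #hashtag found in the text.
func specialTokens(in text: String) -> [String] {
    // Either an http(s) URL running up to the next whitespace,
    // or an '@' or '#' followed by one or more word characters.
    let pattern = "https?://\\S+|[@#]\\w+"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }

    // NSRegularExpression works with NSRange, so convert in both directions.
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}
```

Because the handle part only accepts word characters, a trailing full stop is left out of '@OZApps.' automatically. On the other hand, this loose pattern would also pick up the '@oz' inside an email address, and a URL followed directly by punctuation would keep that punctuation, so treat it as a starting point only.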

Some words, like '@OZApps.' and 'dev@oz-apps.com.', would be treated as invalid because of the trailing full stop, even though they are valid in the sentence "My email address is dev@oz-apps.com. The twitter handle to use is @OZApps."
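
One simple way to deal with this, sketched below, is to strip sentence punctuation from the end of each word before checking it. The helper name and the list of characters to strip are assumptions for this example. Only the end of the word is trimmed, because '@' and '#' at the start are exactly what we are looking for.

```swift
// Hypothetical helper: removes sentence punctuation from the END of a word only.
// The set of characters to strip is an assumption; adjust it to your data.
func trimmingTrailingPunctuation(_ word: String) -> String {
    let trailing: Set<Character> = [".", ",", ";", ":", "!", "?"]
    var result = Substring(word)
    // Keep dropping the last character while it is one of the listed marks.
    while let last = result.last, trailing.contains(last) {
        result.removeLast()
    }
    return String(result)
}
```

With this, '@OZApps.' becomes '@OZApps' and 'dev@oz-apps.com.' becomes 'dev@oz-apps.com', while the dots inside the address are left alone.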


Special Filtering / Functionality

While at it, you can also get a list of unique words by simply creating a Set from the array. This removes all duplicates, and you have to do nothing else, literally nothing.

let uniqueWords = Set(words)
print("\(uniqueWords.count)")

uniqueWords now contains the unique words in the sentence, and the printed value is how many there are.
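
Keep in mind that a Set has no defined order, so the words will not come back in the order they appeared in the sentence. If the order matters, a small sketch like the following keeps the first occurrence of each word in place. The function name is made up for this example.

```swift
// Hypothetical helper: removes duplicates but keeps the original word order.
func uniquePreservingOrder(_ words: [String]) -> [String] {
    var seen = Set<String>()
    // insert(_:) reports whether the element was newly inserted,
    // so only the first occurrence of each word passes the filter.
    return words.filter { seen.insert($0).inserted }
}
```

The comparison is case sensitive, so "Swift" and "swift" would count as two different words here, just as they do with a plain Set.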

Finding words that are longer than x characters

You can also get a list of the words that are longer than a particular length using the filter function:
let charLen = 5
let longWords = uniqueWords.filter { $0.count > charLen }
print("There are \(longWords.count) words longer than \(charLen) characters")

Sorting based on character lengths

The uniqueWords can then be sorted by length using this easy one-liner:
let sortedArray = uniqueWords.sorted { $0.count < $1.count }
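
Since a Set has no fixed order, words of the same length can come out in a different order from one run to the next. If you want a stable result, one option is to break ties alphabetically. This is only a sketch, and sortedByLength is a name invented for the example.

```swift
// Hypothetical helper: shortest words first, ties broken alphabetically
// so that the result is the same on every run.
func sortedByLength(_ words: Set<String>) -> [String] {
    return words.sorted { lhs, rhs in
        lhs.count != rhs.count ? lhs.count < rhs.count : lhs < rhs
    }
}
```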

Getting the length of the longest word

You can also get the length of the longest word. There are a couple of alternatives to achieve this.
1. Calculate manually by iterating through the collection
var longestWordLength = 0
for eachItem in uniqueWords {
    if eachItem.count > longestWordLength { longestWordLength = eachItem.count }
}

print("The longest word length is \(longestWordLength)")
2. Calculate by sorting instead of writing the loop yourself (I feel this can be slower, since sorting does more work than a single pass)
// Note: first! crashes if uniqueWords is empty.
let _temp_ = uniqueWords.sorted { $0.count > $1.count }
let longestWordLengthSort = _temp_.first!.count
print("The longest word length is \(longestWordLengthSort)")
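
A third option, sketched here, is to map each word to its length and ask for the maximum. It makes a single pass like the manual loop and does not crash on an empty collection. The function name is an assumption for the example.

```swift
// Hypothetical helper: length of the longest word, or 0 when there are no words.
func longestWordLength(in words: Set<String>) -> Int {
    // max() returns nil for an empty collection, so fall back to 0.
    return words.map { $0.count }.max() ?? 0
}
```

Returning 0 for an empty set is a design choice made for this sketch; you could just as well return an optional and let the caller decide.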

More string functions to get Left/Right/Mid etc in subsequent articles. All feedback is welcome.
