R – How to Trim Leading and Trailing White Space


I am having some trouble with leading and trailing white space in a data.frame.

For example, I look at a specific row in a data.frame based on a certain condition:

> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 



[1] codeHelper     country        dummyLI    dummyLMI       dummyUMI       
[6] dummyHInonOECD dummyHIOECD    dummyOECD      
<0 rows> (or 0-length row.names)

I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame. After looking through my code history and trying to figure out what went wrong I tried:

> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1

All I have changed in the command is an additional white space after Austria.

This obviously causes further annoying problems, for example when I want to merge two data frames on the country column: one data.frame uses "Austria " while the other has "Austria", so the matching doesn't work.

Is there a nice way to 'show' the white space on my screen so that I am aware of the problem?
And can I remove the leading and trailing white space in R?

So far I have used a simple Perl script to remove the white space, but it would be nice if I could do it inside R.

Best Answer

As of R 3.2.0 a new function was introduced for removing leading/trailing white spaces:

trimws()

See the help page ?trimws (Remove Leading/Trailing Whitespace).
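Applied to the situation in the question, trimws can be run on the key column before filtering or merging. The following is a minimal sketch only: the two small data frames are made-up stand-ins for the ones described in the question, and the names other, has_padding, n_before, austria and merged are hypothetical.

```r
# Hypothetical stand-ins for the two data frames described in the question.
myDummy <- data.frame(
  codeHelper = c("AUT", "BEL"),
  country    = c("Austria ", " Belgium"),  # stray trailing / leading space
  stringsAsFactors = FALSE
)
other <- data.frame(
  country   = c("Austria", "Belgium"),
  dummyOECD = c(1, 1),
  stringsAsFactors = FALSE
)

# Make the problem visible: flag values that start or end with white space.
has_padding <- grepl("^\\s|\\s$", myDummy$country)

# Before trimming, the exact comparison finds no rows.
n_before <- nrow(myDummy[myDummy$country == "Austria", ])

# Trim both ends (which = "both" is the default; "left" and "right" also exist).
myDummy$country <- trimws(myDummy$country)

# After trimming, the comparison and the merge both line up.
austria <- myDummy[myDummy$country == "Austria", ]
merged  <- merge(myDummy, other, by = "country")
```

Printing a character vector directly, or calling print on the data frame with quote = TRUE, is another way to make padded values stand out, since the surrounding quotes then show where each string really ends.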

R – How to Remove All Whitespace from String

The base R approach: `gsub`
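The snippets in this answer all operate on a character vector x whose definition is not shown here. A definition consistent with the printed outputs might look like the sketch below; the exact construction, including the helper name whitespace, is an assumption.

```r
# Assumed test input, reconstructed from the outputs printed in this answer.
whitespace <- " \t\n\r\v\f"  # space, tab, newline, carriage return,
                             # vertical tab, form feed
x <- c(
  " x y ",                   # spaces before, after and in between
  " \u2190 \u2192 ",         # contains non-ASCII characters (arrows)
  paste0(whitespace, "x", whitespace, "y", whitespace),  # varied white space
  NA                         # missing value
)
```

With this input, removing only the literal space character leaves the tabs and line breaks in the third element untouched, while the character-class versions reduce it to "xy".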

gsub replaces all instances of a string (fixed = TRUE) or regular expression (fixed = FALSE, the default) with another string. To remove all spaces, use:

gsub(" ", "", x, fixed = TRUE)
## [1] "xy"                            "←→"             
## [3] "\t\n\r\v\fx\t\n\r\v\fy\t\n\r\v\f" NA

As DWin noted, in this case fixed = TRUE isn't necessary but provides slightly better performance since matching a fixed string is faster than matching a regular expression.

If you want to remove all types of whitespace, use:

gsub("[[:space:]]", "", x) # note the double square brackets
## [1] "xy" "←→" "xy" NA 

gsub("\\s", "", x)         # same; note the double backslash

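library(regex)             # third-party package (not base R) that supplies space()
gsub(space(), "", x)       # same

"[:space:]" is a POSIX character class matching all space characters; it is only valid inside a bracket expression, hence the double square brackets. \s is a shorthand understood by most regular-expression engines that does the same thing.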

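Because the character class can be anchored to the start and end of the string, the same base R tools also cover the trim-only case, for instance on R versions older than 3.2.0 where trimws is not available. A minimal sketch, in which trim_base is a hypothetical helper name and the input vector is made up for illustration:

```r
# Hypothetical helper: strip white space at the ends only, keep inner white space.
trim_base <- function(s) {
  # ^[[:space:]]+ matches a leading run, [[:space:]]+$ a trailing run.
  gsub("^[[:space:]]+|[[:space:]]+$", "", s)
}

padded  <- c(" x y ", "\t\nx \ty\r\n", NA)
trimmed <- trim_base(padded)
```

Here trimmed keeps the inner space and tab of each element and leaves the missing value as NA.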
The `stringr` approach: `str_replace_all` and `str_trim`

stringr provides more human-readable wrappers around the base R functions (though as of Dec 2014, the development version has a branch built on top of stringi, mentioned below). The equivalents of the above commands, using str_replace_all, are:

library(stringr)
str_replace_all(x, fixed(" "), "")
str_replace_all(x, space(), "")

stringr also has a str_trim function which removes only leading and trailing whitespace.

str_trim(x) 
## [1] "x y"          "← →"          "x \t\n\r\v\fy" NA    
str_trim(x, "left")    
## [1] "x y "                   "← → "    
## [3] "x \t\n\r\v\fy \t\n\r\v\f" NA     
str_trim(x, "right")    
## [1] " x y"                   " ← →"    
## [3] " \t\n\r\v\fx \t\n\r\v\fy" NA

The `stringi` approach: `stri_replace_all_charclass` and `stri_trim`

stringi is built upon the platform-independent ICU library, and has an extensive set of string manipulation functions. The equivalents of the above are:

library(stringi)
stri_replace_all_fixed(x, " ", "")
stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")

Here "\\p{WHITE_SPACE}" is an alternate syntax for the set of Unicode code points considered to be whitespace, equivalent to "[[:space:]]", "\\s" and space(). For more complex regular expression replacements, there is also stri_replace_all_regex.

stringi also has trim functions.

stri_trim(x)
stri_trim_both(x)    # same
stri_trim(x, "left")
stri_trim_left(x)    # same
stri_trim(x, "right")  
stri_trim_right(x)   # same

R – Fixing trimws Bug for Leading Whitespace

0xa0 encodes a different kind of space, the non-breaking space, whereas 0x20 is the ordinary space.
By default trimws only looks for spaces, tabs, line feeds and carriage returns (the pattern [ \t\r\n]+), not non-breaking spaces, which is why it has no effect here.
You can use sub (to strip either leading or trailing spaces) or gsub (to strip both at once) with \\s to remove leading or trailing space of any kind, including the one encoded as 0xa0, though whether \\s matches the non-breaking space can depend on the platform and regex engine:

sub("^\\s+", "", x)
[1] "11.132592"

And for removing leading and trailing spaces:

gsub("(^\\s+)|(\\s+$)", "", x)
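If a recent R is available, trimws itself can be told what counts as white space. The sketch below assumes R 3.6.0 or later, where trimws accepts a whitespace argument; the input value is a made-up stand-in for the string in the question.

```r
# Hypothetical input: a number stored as text with a leading non-breaking space.
x_nbsp <- "\u00a011.132592"

# The default white-space class [ \t\r\n] does not include U+00A0.
default_trim <- trimws(x_nbsp)

# "[\\h\\v]" matches horizontal and vertical white space, including U+00A0.
unicode_trim <- trimws(x_nbsp, whitespace = "[\\h\\v]")

as.numeric(unicode_trim)
```

This keeps the intent readable (it is still a trim) while widening the set of characters that get removed.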
