R String Algorithm – How to Determine an Overlapping Sequence of Words Between Two Texts

Tags: algorithm, lcs, performance, r, string

In one of our digital assignments, I asked my students to read an article and write a few things they learned from it. Students were told to write in their own words, and I had reason to expect that copying and pasting a block of text, or all of it, was disabled. But I was wrong. I received over 9000 text entries, many of which looked like they were copied and pasted directly from the article. Some had differences in punctuation and capitalization, but I cannot imagine that students literally sat there and typed most of the article out.

I have read through many of the students' assignments and tried to identify features that distinguish a copied-and-pasted entry from an honest one, hoping that some R function would help me detect them. However, I have not been successful. To demonstrate, here is an example that I made up. The passages are often long, between 300 and 800 words, and I wonder if there is a relatively easy way to identify the common block of words that overlaps between two texts.

text_1 <- "She grew up in the United States. Her father was..."
text_2 <- "I learned that she grew up in the united states.Her father was ..."

Desired Outcome: "she grew up in the united states. Her father was …"

The desired outcome should print the sequence of words that overlaps between the two texts; capitalization and whitespace differences should not matter.

Thank you for reading and for any expertise you can share.

Best Answer

Using the data from @Bastián Olea Herrera:

library(tm)
library(slam)

text <- list("she grew up in the united states.Her father was",
             "She grew up in the United States. Her father was",
             "I learned that she grew up in the united states.Her father was",
             "The main character was born in the USA, his father being",
             "My favourite animals are raccoons, they are so silly and cute",
             "I didn't understand this assignment so I'm just answering gibberish",
             "she grew up in the united states.Her father was"
)

# Replace periods with spaces so "states.Her" splits into two words,
# then build a term-document matrix (lowercased, punctuation and stopwords removed)
tdm <- VectorSource(sapply(text, \(x) gsub(".", " ", x, fixed = TRUE), USE.NAMES = FALSE)) |>
  SimpleCorpus() |>
  TermDocumentMatrix(
    control = list(tolower = TRUE,
                   removePunctuation = TRUE,
                   stopwords = TRUE))

# Cosine similarity between documents: dot products of the document columns,
# scaled by the product of their Euclidean norms
cs <- crossprod_simple_triplet_matrix(tdm) / sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2)))
cs
#     Docs
# Docs         1         2         3         4 5 6         7
#    1 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000
#    2 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000
#    3 0.8944272 0.8944272 1.0000000 0.2000000 0 0 0.8944272
#    4 0.2236068 0.2236068 0.2000000 1.0000000 0 0 0.2236068
#    5 0.0000000 0.0000000 0.0000000 0.0000000 1 0 0.0000000
#    6 0.0000000 0.0000000 0.0000000 0.0000000 0 1 0.0000000
#    7 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000

This is just an example that you could build on -- tm has a lot more functionality. The idea here is that you build a term-document matrix and use it to compute a similarity score between documents. The score computed here is cosine similarity, but there are many others.
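If it helps to see what that one-liner is doing, here is a small check (my own addition, not part of the answer itself) that recomputes the cosine similarity between documents 1 and 3 from the dense term-document matrix:

# Dense version of the term-document matrix: rows are terms, columns are documents
m <- as.matrix(tdm)

# Cosine similarity of documents 1 and 3 "by hand":
# dot product of the two document vectors divided by the product of their norms
sum(m[, 1] * m[, 3]) / (sqrt(sum(m[, 1]^2)) * sqrt(sum(m[, 3]^2)))
# Should match cs[1, 3] above (0.8944272)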

If you read the documentation (?TermDocumentMatrix), you will see that you can also do things like inverse document frequency weighting, which gives more weight to uncommon words.
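As a rough sketch of what that might look like, you could pass tm's weightTfIdf function through the control list (the name tdm_tfidf is just illustrative; you would then recompute the similarity matrix from it the same way as above):

# Same pipeline, but with tf-idf weighting so very common words contribute less
tdm_tfidf <- VectorSource(sapply(text, \(x) gsub(".", " ", x, fixed = TRUE), USE.NAMES = FALSE)) |>
  SimpleCorpus() |>
  TermDocumentMatrix(
    control = list(tolower = TRUE,
                   removePunctuation = TRUE,
                   stopwords = TRUE,
                   weighting = weightTfIdf))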

The first column of the output compares the first text to all of the texts, the second column compares the second text to all of them, and so forth. The diagonal is always one because each text is being compared to itself. As you can see from the first column, the second, third, and seventh texts are all quite similar to the first.
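Building on that, one way to pull out the suspicious pairs is to pick a cutoff and look for off-diagonal entries above it. A sketch (the 0.9 threshold is an arbitrary choice you would want to tune on your own data):

# Flag document pairs whose cosine similarity exceeds an (arbitrary) cutoff,
# using the upper triangle only so each pair is reported once.
# With the cs matrix shown above, this flags pairs (1,2), (1,7), and (2,7).
cutoff <- 0.9
which(cs > cutoff & upper.tri(cs), arr.ind = TRUE)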


Alternatively, you can extract the longest common substring like so (using the text list from above). This compares the first element (your base text/digital assignment) to the remaining texts (the student submissions):

library(PTXQC)
library(textclean)

standardize <- function(x) {
  x |>
    tolower() |>                                     # case differences should not matter
    replace_contraction() |>                         # e.g. "didn't" -> "did not"
    gsub("[[:punct:]]", " ", x = _, perl = TRUE) |>  # replace punctuation with spaces
    replace_white()                                  # collapse the extra whitespace
}

std_text <- standardize(text)

lapply(std_text[-1], \(x) LCSn(c(std_text[[1]], x)))
# [[1]]
# [1] "she grew up in the united states her father was"
# 
# [[2]]
# [1] "she grew up in the united states her father was"
# 
# [[3]]
# [1] " in the u"
# 
# [[4]]
# [1] " the"
# 
# [[5]]
# [1] " un"
# 
# [[6]]
# [1] "she grew up in the united states her father was"

First, a little text cleaning is done to standardize the text. Punctuation is replaced with a space to handle the missing space after the period in your text_2. This may introduce excess whitespace, but that is resolved with replace_white().

LCSn() has a min_LCS_length argument you can specify to ignore minimally overlapping text.
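For instance, a short sketch of how you might use that, together with a rough "how much of the submission does the overlap cover" measure (the 25-character minimum and the coverage ratio are my own illustrative choices, not part of the answer above):

# Ignore overlaps shorter than 25 characters (tiny overlaps like " the" are noise)
lcs <- sapply(std_text[-1], \(x) LCSn(c(std_text[[1]], x), min_LCS_length = 25))

# Rough measure of how much of each submission is covered by the overlap
nchar(lcs) / nchar(std_text[-1])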

Note: PTXQC and textclean have a fair number of dependencies.