# This fork...

I'm maintaining this fork because the original author was not replying to issues or pull requests. For now I plan on maintaining this fork as necessary.

## Status

[![Build Status](https://travis-ci.org/blevesearch/go-porterstemmer.svg?branch=master)](https://travis-ci.org/blevesearch/go-porterstemmer)

[![Coverage Status](https://coveralls.io/repos/blevesearch/go-porterstemmer/badge.png?branch=HEAD)](https://coveralls.io/r/blevesearch/go-porterstemmer?branch=HEAD)
# Go Porter Stemmer

A native Go clean room implementation of the Porter Stemming Algorithm.

This algorithm is of interest to people doing Machine Learning or
Natural Language Processing (NLP).

This is NOT a port. This is a native Go implementation from the human-readable
description of the algorithm.

I've tried to make it (more) efficient by NOT internally using strings, but
instead internally using []rune slices, and by reusing the same (array) buffer behind
the []rune slice (and its sub-slices) at all steps of the algorithm.
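
To illustrate the buffer-reuse idea, here is a minimal sketch of how reslicing a []rune can remove a suffix without allocating. The helper `dropSuffix` is hypothetical and written only for this illustration; it is not part of this library and does not reflect its actual internals.

```go
package main

import "fmt"

// dropSuffix returns word without its last n runes by reslicing.
// No new array is allocated: the result shares word's backing array.
// (Hypothetical helper for illustration; not part of this library.)
func dropSuffix(word []rune, n int) []rune {
	if n < 0 {
		n = 0
	}
	if n > len(word) {
		n = len(word)
	}
	return word[:len(word)-n]
}

func main() {
	word := []rune("waxes")
	stem := dropSuffix(word, 2)

	fmt.Println(string(stem))         // wax
	fmt.Println(&word[0] == &stem[0]) // true: same backing array
}
```

Because the result aliases the input, any write through one slice is visible through the other. That is the price of avoiding allocations.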

For the Porter Stemmer algorithm, see:

http://tartarus.org/martin/PorterStemmer/def.txt (URL #1)

http://tartarus.org/martin/PorterStemmer/ (URL #2)

# Departures

When I initially implemented it, it failed the tests at...

http://tartarus.org/martin/PorterStemmer/voc.txt (URL #3)

http://tartarus.org/martin/PorterStemmer/output.txt (URL #4)

... after reading the human-readable text over and over again to try to figure out
what error I had made (and doing all sorts of things to debug it), I came to the
conclusion that some of these tests were wrong according to the human-readable
description of the algorithm.

This led me to wonder if maybe other people's code that was passing these tests had
rules that were not in the human-readable description, which led me to look at the source
code here...

http://tartarus.org/martin/PorterStemmer/c.txt (URL #5)

... When I looked there I noticed that there are some items marked as a "DEPARTURE",
which differ from the original algorithm. (There are 2 of these.)

I implemented these departures, and the tests at URL #3 and URL #4 all passed.

## Usage

To use this Go library, write something like:

```go
package main

import (
	"fmt"

	"github.com/reiver/go-porterstemmer"
)

func main() {
	word := "Waxes"

	// StemString takes a string and returns its stem as a string.
	stem := porterstemmer.StemString(word)

	fmt.Printf("The word [%s] has the stem [%s].\n", word, stem)
}
```

Alternatively, if you want to be a bit more efficient, use []rune slices instead, with code like:

```go
package main

import (
	"fmt"

	"github.com/reiver/go-porterstemmer"
)

func main() {
	word := []rune("Waxes")

	// Stem works on the []rune slice directly and may modify it.
	stem := porterstemmer.Stem(word)

	fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}
```

NOTE that the above code may modify the original slice (named "word" in the example) as a side
effect, for efficiency reasons, and that the slice named "stem" in the example above may be a
sub-slice of the slice named "word".
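
If you need the original word untouched, one option is to stem a copy. The sketch below shows the pattern with a hypothetical stand-in, `stemInPlace`, which lowercases in place and returns a sub-slice; it is not the Porter algorithm and not this library's code. With the real library you would call `porterstemmer.Stem(scratch)` at the marked line.

```go
package main

import (
	"fmt"
	"unicode"
)

// stemInPlace is a hypothetical stand-in for a stemmer that works in place:
// it lowercases its argument and returns a sub-slice of it.
// It is NOT the Porter algorithm; it only drops a trailing "es".
func stemInPlace(word []rune) []rune {
	for i, r := range word {
		word[i] = unicode.ToLower(r)
	}
	n := len(word)
	if n >= 2 && word[n-2] == 'e' && word[n-1] == 's' {
		return word[:n-2]
	}
	return word
}

func main() {
	original := []rune("Waxes")

	// Copy first so the in-place stemmer cannot touch the original.
	scratch := make([]rune, len(original))
	copy(scratch, original)

	stem := stemInPlace(scratch) // with the library: porterstemmer.Stem(scratch)

	fmt.Printf("The word [%s] has the stem [%s].\n", string(original), string(stem))
}
```

The copy costs one allocation per word, so it only makes sense when the original spelling is still needed after stemming.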

Also alternatively, if you already know that your word is lowercase (and you don't need
this library to lowercase your word for you), you can instead use code like:

```go
package main

import (
	"fmt"

	"github.com/reiver/go-porterstemmer"
)

func main() {
	word := []rune("waxes")

	// Skips the lowercasing step; the input must already be lowercase.
	stem := porterstemmer.StemWithoutLowerCasing(word)

	fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}
```

Again NOTE (like with the previous example) that the above code may modify the original slice (named
"word" in the example) as a side effect, for efficiency reasons, and that the slice named "stem"
in the example above may be a sub-slice of the slice named "word".
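
If you are not sure whether your input is already lowercase, a small check can decide which entry point is safe. The helper `alreadyLower` below is hypothetical and assumes that "lowercase" means `unicode.ToLower` leaves every rune unchanged; whether that matches this library's own lowercasing exactly is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// alreadyLower reports whether lowercasing would leave word unchanged.
// (Hypothetical helper for illustration; not part of this library.)
func alreadyLower(word []rune) bool {
	for _, r := range word {
		if unicode.ToLower(r) != r {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(alreadyLower([]rune("waxes"))) // true
	fmt.Println(alreadyLower([]rune("Waxes"))) // false
}
```

Note that this check scans the whole word, so it is only worthwhile when lowercase status is known from elsewhere or checked once for a batch of input.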