
How to build kagome v2 for the web? #263

Closed
jomin398 opened this issue Nov 21, 2021 · 3 comments

Comments

@jomin398

I want to host a kagome v2 tokenizer for Korean and Japanese on GitHub Pages.

I know the main file is in sample/demo.html...
Looking at the example, the dictionary ("dic") appears to cover Japanese and Chinese.
I want to change the example to cover Japanese and Korean.
How can I do that?

@ikawaha
Owner

ikawaha commented Nov 21, 2021

The sample in the sample/wasm directory uses a Japanese dictionary (it does not support Chinese). To analyze Korean, you need to replace the dictionary used in this sample with a Korean one.

Here are the steps to follow:

1. Change main.go to use a Korean dictionary.

The working directory is sample/wasm.

./sample/wasm/main.go

package main

import (
	"strings"
	"syscall/js"

	ko "github.com/ikawaha/kagome-dict-ko"      // ← ※
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func igOK(s string, _ bool) string {
	return s
}

func tokenize(_ js.Value, args []js.Value) interface{} {
	if len(args) == 0 {
		return nil
	}
	t, err := tokenizer.New(ko.Dict(), tokenizer.OmitBosEos())  // ← ※
	if err != nil {
		return nil
	}
	var ret []interface{}
	tokens := t.Tokenize(args[0].String())
	for _, v := range tokens {
		//fmt.Printf("%s\t%+v%v\n", v.Surface, v.POS(), strings.Join(v.Features(), ","))
		ret = append(ret, map[string]interface{}{
			"word_id":       v.ID,
			"word_type":     v.Class.String(),
			"word_position": v.Start,
			"surface_form":  v.Surface,
			"pos":           strings.Join(v.POS(), ","),
			"base_form":     igOK(v.BaseForm()),
			"reading":       igOK(v.Reading()),
			"pronunciation": igOK(v.Pronunciation()),
		})
	}
	return ret
}

func registerCallbacks() {
	_ = ko.Dict()  // ← ※
	js.Global().Set("kagome_tokenize", js.FuncOf(tokenize))
}

func main() {
	// Nothing is ever sent on this channel, so main blocks forever and the callback stays registered.
	c := make(chan struct{})
	registerCallbacks()
	println("Kagome Web Assembly Ready")
	<-c
}

The same change as a diff:

diff --git a/sample/wasm/go.mod b/sample/wasm/go.mod
index 89d4416..7fea152 100644
--- a/sample/wasm/go.mod
+++ b/sample/wasm/go.mod
@@ -1,3 +1,8 @@
 module sample

 go 1.16
+
+require (
+	github.com/ikawaha/kagome-dict-ko v1.1.0
+	github.com/ikawaha/kagome/v2 v2.7.0
+)
diff --git a/sample/wasm/main.go b/sample/wasm/main.go
index 6d42af1..1379a06 100644
--- a/sample/wasm/main.go
+++ b/sample/wasm/main.go
@@ -4,7 +4,7 @@ import (
 	"strings"
 	"syscall/js"

-	"github.com/ikawaha/kagome-dict/ipa"
+	ko "github.com/ikawaha/kagome-dict-ko"
 	"github.com/ikawaha/kagome/v2/tokenizer"
 )

@@ -16,7 +16,7 @@ func tokenize(_ js.Value, args []js.Value) interface{} {
 	if len(args) == 0 {
 		return nil
 	}
-	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
+	t, err := tokenizer.New(ko.Dict(), tokenizer.OmitBosEos())
 	if err != nil {
 		return nil
 	}
@@ -39,7 +39,7 @@ func tokenize(_ js.Value, args []js.Value) interface{} {
 }

 func registerCallbacks() {
-	_ = ipa.Dict()
+	_ = ko.Dict()
 	js.Global().Set("kagome_tokenize", js.FuncOf(tokenize))
 }

2. Build WASM and prepare WASM libs.

Build main.go

GOOS=js GOARCH=wasm go build -trimpath -o kagome.wasm main.go

Copy wasm_exec.js from your Go installation to the current directory.

cp "$(go env GOROOT)/misc/wasm/wasm_exec.js" .

3. Serve HTTP server and access it.

Prepare a simple script to set up an HTTP server on your local host.

server.py

# -*- coding: utf-8 -*-
"""Serve the current directory over HTTP, with a MIME type for .wasm files."""
import http.server
import socketserver

PORT = 8080

Handler = http.server.SimpleHTTPRequestHandler

# WebAssembly.instantiateStreaming requires .wasm to be served as application/wasm.
Handler.extensions_map = {
    '.manifest': 'text/cache-manifest',
    '.html': 'text/html',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.svg': 'image/svg+xml',
    '.css': 'text/css',
    '.js': 'text/javascript',
    '.wasm': 'application/wasm',
    '': 'application/octet-stream',  # Default
}

# The context manager closes the listening socket when the server stops.
with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

Start the HTTP server.

python3 server.py

Then open http://localhost:8080/ in your browser.

[Screenshot: Kagome WebAssembly Demo - Japanese morphological analyzer]

This sample only supports Korean analysis. To support both Japanese and Korean, you would need to prepare Japanese and Korean tokenizers and switch between them. However, loading both dictionaries may require too much memory to run in a web browser.

I hope this will help.

P.S.
see also

@jomin398
Author

The build was successful and most functions are working.
Additionally, how can I add a custom user dictionary, like the one for the Japanese dictionary? (I want to edit it for Korean.)

@ikawaha
Owner

ikawaha commented Nov 23, 2021

A sample file of the user dictionary can be found at sample/userdict.txt. This sample is in Japanese, but the same applies to Korean.

A simple morpheme specification is in the following format:
<text>,<text>,<reading>,<part-of-speech>

For example:

냄비,냄비,냄비,NNBC

(I can't read or write Korean, so the Korean part above may not be appropriate. :p )

In your program, you can use your user dictionary as follows:

package main

import (
	"fmt"

	ko "github.com/ikawaha/kagome-dict-ko"
	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// load a user dictionary.
	udic, err := dict.NewUserDict("userdict.txt")
	if err != nil {
		panic(err)
	}
	// specify the user dictionary.
	t, err := tokenizer.New(ko.Dict(), tokenizer.UserDict(udic), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	tokens := t.Tokenize("두부냄비")
	for _, token := range tokens {
		fmt.Printf("%s\t%v\n", token.Surface, token.Features())
	}
}

Output:

두부	[NNG * F 두부 * * * *]
냄비	[NNBC 냄비 냄비]            // ← this token derived from the user dictionary.

See also: https://zenn.dev/ikawaha/books/kagome-v2-japanese-tokenizer/viewer/user_dictionary
