This repository was archived by the owner on Dec 10, 2021. It is now read-only.
Merged
1 change: 1 addition & 0 deletions CHANGES.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

### Added

- Add Wikipedia example #35
- Support cznicb and leveldb #34
- Add CHANGES.md #29
- Add error handling for server startup #28.
43 changes: 43 additions & 0 deletions README.md
@@ -578,3 +578,46 @@ You can execute the command in docker container as follows:
```bash
$ docker exec -it blast-index1 blast-index node --grpc-addr=:5050
```


## Wikipedia example

This section explains how to index a Wikipedia dump into Blast.


### Download the Wikipedia dump

```bash
$ mkdir -p ~/tmp
$ curl -o ~/tmp/enwiki-20190101-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20190101/enwiki-20190101-pages-articles.xml.bz2
```


### Install wikiextractor

```bash
$ cd ${HOME}
$ git clone git@github.com:attardi/wikiextractor.git
```


### Parse the Wikipedia dump

```bash
$ cd wikiextractor
$ ./WikiExtractor.py -o ~/tmp/enwiki --json ~/tmp/enwiki-20190101-pages-articles.xml.bz2
```


### Index the Wikipedia dump

```bash
$ for FILE in $(find ~/tmp/enwiki -type f | sort)
  do
    # Each line of an extracted file is one JSON document.
    while read -r LINE; do
      TIMESTAMP=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
      ID=$(printf '%s\n' "${LINE}" | jq -r .id)
      # Pass the timestamp with --arg so jq handles the quoting.
      FIELDS=$(printf '%s\n' "${LINE}" | jq -c --arg ts "${TIMESTAMP}" '{url: .url, title_en: .title, text_en: .text, timestamp: $ts}')
      curl -X PUT "http://127.0.0.1:8080/documents/${ID}" -d "${FIELDS}"
    done < "${FILE}"
  done
```
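
To check the field mapping before indexing the whole dump, it can help to run the transformation on a single line. The following is a minimal sketch: the `to_fields` helper and the sample line are made up for illustration (real lines come from the files under `~/tmp/enwiki`), and it only prints the request instead of sending it to Blast.

```shell
#!/usr/bin/env bash
# Sketch: build the Blast field object for one extracted line.
# to_fields is a hypothetical helper, not part of Blast or wikiextractor.

to_fields() {
  # $1: one JSON line with url, title, and text keys
  # $2: timestamp string to attach to the document
  printf '%s\n' "$1" | jq -c --arg ts "$2" \
    '{url: .url, title_en: .title, text_en: .text, timestamp: $ts}'
}

# Made-up sample line in the shape used by the indexing loop (id, url, title, text).
LINE='{"id": "12", "url": "https://en.wikipedia.org/wiki?curid=12", "title": "Anarchism", "text": "Anarchism is a political philosophy."}'
# Fixed timestamp so the output is reproducible.
TIMESTAMP='2019-01-01T00:00:00Z'

ID=$(printf '%s\n' "${LINE}" | jq -r .id)
FIELDS=$(to_fields "${LINE}" "${TIMESTAMP}")

# Print what would be sent, without contacting the server.
echo "PUT /documents/${ID}"
echo "${FIELDS}"
```

If the printed object has the expected `url`, `title_en`, `text_en`, and `timestamp` keys, the same `jq` filter should be safe to use in the loop.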