这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@return42
Copy link
Member

To reduce the memory footprint, this patch no longer loads the JSON data completely into memory. Instead, there is an SQL database based on ExpireCacheSQLite.

The class CurrenciesDB is a simple DB application that encapsulates the DB (queries and initialization) and provides convenient methods like name_to_iso4217 and iso4217_to_name.

Related:

@return42 return42 requested a review from dalf May 22, 2025 17:49
@dalf
Copy link
Contributor

dalf commented May 22, 2025

If I understand correctly the workflow is:

  • keep the .json on the disk
  • when searxng starts:
    • create CurrenciesDB() without any data
  • when the first currency info is requested:
    • load the data from the .json to save them as SQLite temporary file, for all currencies --> the data are seen in memory and then garbage collected.
    • read one entry from the temporary SQLite file
  • when further currency infos are requested:
    • read one entry from the temporary SQLite file

Why not store directly the SQLite files in the git repository, and then read directly these files:

  • no issue with temporary files
  • the DB files can be opened in read only mode
    ?

To be able to read the automatic updates, the update scripts can write a .csv files in addition to the SQLite file.

See #3458 for example implementation. This PR 3458 missed a check in production to make sure the memory footprint does not grow over time, then it went into limbo.

@return42
Copy link
Member Author

return42 commented May 23, 2025

If I understand correctly the workflow is:

👍 yes, this is the "current workflow".

Why not store directly the SQLite files in the git repository,

Thinkable .. why not ..

  • drawback: SQLite's .db file is a binary --> we no longer have a diff which I often read before merge the PR.

On the long term I had a different solution in mind ..

🤔 .. the SQLite's .db file is from a DB application which was intended to be a key/value cache with expire date .. the name of the DB app is "ExpireCache" --> from SearXNG's POV it is cache(!)

  • separate all data (the json files) from SearXNG core (repo searxng/searxng) into its own repository searxng/data .. include the CI (data-update.yml) that updates theses JSON files

  • In SearXNG core, instead of read from a file we have a read from URL e.g.:
    https://raw.githubusercontent.com/searxng/data/refs/heads/master/<schema-version>/currencies.json

We need to versioning the DB schema (<schema-version>) and the workflow will be slightly different to the "current workflow": the tables in the DB will be maintained when the expire date has been reached (ExpireCache).


OT but to give an example: I implemented similar for the SVG weather icons in PR #4663 (comment) ..

@dalf
Copy link
Contributor

dalf commented May 23, 2025

drawback: SQLite's .db file is a binary --> we no longer have a diff which I often read before merge the #4740.

Look at #3458 which creates .csv files in addition to the .db files.
For example currencies.csv.

separate all data (the json files) from SearXNG core (repo searxng/searxng) into its own repository searxng/data

What are the benefits of a different repository?
What are the problem(s) with the current implementation which embed the data?

The data size is 8.6M, pretty much nothing if we look at the any web page.
No need for database versioning: code and data are always synchronized.
Also no error trying to load external data: there are already here locally.

from SearXNG's POV it is cache(!)

Is there an urge to merge this PR?

@return42
Copy link
Member Author

Look at #3458 which creates .csv files in addition to the .db files.

Above you asked ..

Why not store directly the SQLite file

When we replace .json by .cvs .. what have we won? / except a csv diff is hard to read, compared to json file with sorted keys ..

What are the benefits of a different repository?

The JSON files are dumps of databases; in my opinion, maintaining these databases shouldn't be part of the code.

For example, take a look at the https://github.com/ClearURLs project, whose rules are also made available to the application via a database (a JSON dump https://github.com/ClearURLs/Rules ).

By separating them, the database could be maintained independently of the code.

But we don't have to separate it, if you prefer to maintain the DB (the JSON file) here in the code, then we'll leave it here (for now / we can decide it later).

No need for database versioning: code and data are always synchronized.

As long we hold the DB here .. but when we separate DB from code, we need a versioning of the DB schema.

@dalf
Copy link
Contributor

dalf commented May 23, 2025

except a csv diff is hard to read

Because this is simplest way to make a human readable SQL dump : mapping from one SQL row to one CSV line using SELECT * FROM ... ORDER BY ... so things don't jump around at each updates.

But why not .json, however without foreign keys, everything is flatten to rows.

For example, take a look at the https://github.com/ClearURLs project

ClearURL is a browser extension where the users can pick different rules from different URLs.
There is no point in SearXNG to allow different URLs, as far I know?

In addition, all SearXNG instances are going to ping one more URL when they starts.

... maintaining these databases shouldn't be part of the code.
By separating them, the database could be maintained independently of the code.

I read again the messages, and feel sorry to insist: why we need to do that?

Currently if we change the format of a data file, it's simple:

  • update the searxng_extra/update script
  • run the script to update searx/data
  • update the code in searx/* and tests/*
  • one commit
  • done

Everyone is going to pick the updates [1], no connectivity issue, no GH token, no CloudFlare protection, it's a simple life.
[1] one git pull or container pull, and everything is here, up to date.

Most probably I miss your point.

@return42
Copy link
Member Author

I probably don't understand your point.

... we can stick with the current procedure, which is the simplest solution for us developers, as you already said.

As I already said, we can consider separation later if we ever have compelling reasons to do so. There's no need to decide that now.

Because this is the simplest way to make a human-readable SQL dump:

A dump, yes, but the diff is hard to read, and a diff is what we have to read with every PR review.

Anyway, I'd like to stick with JSON files for now... we can decide later whether it might be better to switch to a CVS file for one or another data source.

however without foreign keys, everything is flattened to rows.

For most cases, that's sufficient, and the ExpireCache we currently have in our code is sufficient for that. Regarding external bangs, I need to reconsider whether we need a different DB schema than a simple key/value cache (a DB table with more than just one primary key)... but I can't judge that at this point... perhaps this data is unsuitable for a relational DB... we'll see.

@dalf
Copy link
Contributor

dalf commented May 23, 2025

perhaps this data is unsuitable for a relational DB... we'll see.

See #4093 (comment)

Can we close this PR?

@return42
Copy link
Member Author

return42 commented May 24, 2025

perhaps this data is unsuitable for a relational DB... we'll see.

See #4093 (comment)

Can we close this PR?

No! ..why? and how is it related to a bug in the pipy engine .. I'm confused.

More PRs will follow for the other data objects. This is just one to demonstrate.

To reduce the memory footprint, this patch no longer loads the JSON data
completely into memory.  Instead, there is an SQL database based on
`ExpireCacheSQLite`.

The class CurrenciesDB is a simple DB application that encapsulates the
DB (queries and initialization) and provides convenient methods like
`name_to_iso4217` and `iso4217_to_name`.

Related:

- searxng#1892
- searxng#3458 (comment)
- searxng#4650

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
@return42
Copy link
Member Author

More PRs will follow for the other data objects. This is just one to demonstrate.

I need this patch for further development, since no technical defects could be found I will now merge this PR.

@return42 return42 merged commit 848c8d0 into searxng:master May 25, 2025
7 checks passed
@return42 return42 deleted the data-to-sql branch May 25, 2025 08:40
return42 added a commit to return42/searxng that referenced this pull request May 29, 2025
return42 added a commit that referenced this pull request May 29, 2025
Bnyro pushed a commit to Bnyro/searxng that referenced this pull request Jun 25, 2025
To reduce the memory footprint, this patch no longer loads the JSON data
completely into memory.  Instead, there is an SQL database based on
`ExpireCacheSQLite`.

The class CurrenciesDB is a simple DB application that encapsulates the
DB (queries and initialization) and provides convenient methods like
`name_to_iso4217` and `iso4217_to_name`.

Related:

- searxng#1892
- searxng#3458 (comment)
- searxng#4650

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Bnyro pushed a commit to Bnyro/searxng that referenced this pull request Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants