Support multiple storage implementations #168
Case study: SQLAlchemy `create_engine()`. The main entity in SQLAlchemy is the engine. An engine maintains a pool of DBAPI connections, on which it executes statements of a specific dialect. Connections are created lazily. `create_engine()` is responsible for wiring them all up; it takes a connection string, and a wide variety of keyword arguments which are routed to the appropriate components:

```python
create_engine(
    "dialect[+driver]://user:password@host/dbname[?key=value...]",
    key=value,
)
```

It roughly works like this:
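A simplified model of that wiring (illustrative only; `registry` and `Engine` below are stand-ins, not SQLAlchemy's actual internals):

```python
# Heavily simplified model of what create_engine() does; not SQLAlchemy's code.
from urllib.parse import urlsplit

registry = {}  # scheme -> dialect class, populated by plugins / entry points

class Engine:
    def __init__(self, url, dialect, **kwargs):
        self.url, self.dialect, self.kwargs = url, dialect, kwargs
        self._pool = []  # DBAPI connections, created lazily on first use

def create_engine(url, **kwargs):
    scheme = urlsplit(url).scheme      # "dialect" or "dialect+driver"
    dialect_cls = registry[scheme]     # plugin loader lookup by scheme
    return Engine(url, dialect_cls(), **kwargs)
```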
I omitted some details; notably, dialects/drivers are imported by a plugin loader instance, along these lines:
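For reference, SQLAlchemy exposes this loader as `sqlalchemy.dialects.registry`; third-party dialects can register themselves at runtime (shown below, with made-up module and class names) or via the `sqlalchemy.dialects` entry point:

```python
# Runtime registration of a third-party SQLAlchemy dialect; the module path
# and class name ("mypackage.dialect", "MyDialect") are hypothetical.
from sqlalchemy.dialects import registry

registry.register("mydialect", "mypackage.dialect", "MyDialect")
```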
OK, here goes. At the moment, we don't actually need to do much (see "Incremental implementation" at the end).

**Requirements**

It should be possible to use an already instantiated Storage, and it should be easy to do so.
Additionally, it should be possible to use the same solution for alternate search providers. The current `make_reader()` signature must continue to work, and correspond to the default storage:

```python
make_reader('path/to/database')
make_reader(url='path/to/database')
```

**Basics**

It seems natural to follow a model similar to that of `sqlalchemy.create_engine`: a connection string (URL) plus additional parameters as keyword arguments. As such, `make_reader()` will take a storage connection string, followed by additional keyword arguments:

```python
make_reader("storage://...", storage_arg='value', ...)
```

Based on the URL scheme, a storage class should be loaded from a registry of well-known storage implementations. This class will then be instantiated with the URL and the keyword arguments, with the `storage_` prefix stripped:

```python
Storage("storage://...", arg='value')
```

**Query string and keyword arguments**

By default, the URL and keyword arguments will be passed to the Storage as-is; it is the storage's responsibility to make sense of them. This is both to simplify the initial implementation, and to allow integration with other similar APIs; e.g., a SQLAlchemy-based storage could delegate most argument handling to `sqlalchemy.create_engine`:

```python
make_reader('sqlalchemy+postgresql://user:pass@host/db', storage_option=True)
# would end up calling
sqlalchemy.create_engine('postgresql://user:pass@host/db', option=True)
```

If needed, in the future there may be an opt-in mode in which `make_reader()` merges the URL components, query parameters, and keyword arguments into a single keyword argument dict to be passed to the storage class, with the values converted from strings to specific types declared by the storage. This would allow loading all the options from a configuration file; e.g. this config:

```ini
[reader]
url = storage://host:1000/path?timeout=10
storage_option = value
```

would result in a storage invocation like this:

```python
Storage(host='host', port=1000, path='path', timeout=10, option='value')
```

The precedence would be: URL components > query parameters > keyword arguments.

**Storage discovery**

It should be possible to register storage implementations corresponding to a specific URL scheme into a global registry, both via a Setuptools entry point and at runtime, in a way similar to how SQLAlchemy does it for dialects. Discovery should have the following precedence: runtime > entry point > shipped with reader. In order to support integration with other URL-based APIs, the scheme should be split on `+` (e.g. `sqlalchemy+postgresql`), with the part after the reader-level scheme passed on to the storage.
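A minimal sketch of what this could look like, assuming the prefix stripping and `+` splitting described above (all names here, including `register_storage` and `_STORAGE_REGISTRY`, are hypothetical; none of this exists in reader):

```python
# Hypothetical sketch of scheme-based storage discovery for make_reader().
from urllib.parse import urlsplit

_STORAGE_REGISTRY = {}  # scheme -> storage class; runtime > entry point > built-in

def register_storage(scheme, cls):
    # runtime registration, highest precedence
    _STORAGE_REGISTRY[scheme] = cls

def make_reader(url, **kwargs):
    scheme, plus, _ = urlsplit(url).scheme.partition('+')
    storage_cls = _STORAGE_REGISTRY[scheme]  # entry-point fallback not shown
    if plus:
        # "sqlalchemy+postgresql://..." -> the storage sees "postgresql://..."
        url = url[len(scheme) + 1:]
    # route storage_-prefixed kwargs to the storage, prefix stripped
    storage_kwargs = {k[len('storage_'):]: v
                      for k, v in kwargs.items() if k.startswith('storage_')}
    return storage_cls(url, **storage_kwargs)  # real code would build a Reader
```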
**Storage instances**

Alternative storage instances can be passed explicitly using the `storage` argument:

```python
make_reader(storage=UnixSocketStorage(...))
```

This bypasses the URL and the storage keyword arguments entirely.

**Storage classes**

It should also be possible to pass alternative storage classes using the `storage_cls` argument:

```python
make_reader('whatever://1.2.3.4:5678', storage_cls=TCPStorage)
```

This bypasses storage discovery; the URL and other storage keyword arguments will be passed to the given class. This would allow overriding the storage implementation for a specific scheme locally. In order for this to work with the configuration file, `make_reader()` may try to import `storage_cls` when it is given as a string (a dotted path).

**Search**

Search instantiation works the same as the storage one, except the arguments are prefixed by `search_`:

```python
make_reader(
    "storage://...", storage_arg='value', ...,
search_url="search://...", search_arg='value', ...,
# alternatively,
search=Search(...),
# alternatively,
search_cls=Search, search_arg='value', ...,
# for convenience, allow search to be interpreted as search_url
search="search://...",
)
```

In order to support the current "parasitic" / tightly-coupled search implementation, search classes may have a way of being constructed from the storage instance they are bound to. Also, storage classes may have a `default_search_cls` attribute pointing to their preferred search class. (A sketch of this wiring appears at the end of this comment.)

TBD: We should have a value/URL for "no search" and one for "default search" (the storage's `default_search_cls`).

**Backwards compatibility**

The scheme-less path invocation will continue to be supported, and will be equivalent to the `sqlite:` scheme. The default search will remain the SQLite "parasitic" one.

**Examples**

Default storage and search, implicit:

```python
make_reader('path/to/database')
```

This is equivalent to:

```python
make_reader(
    'sqlite:path/to/database',
    # we should probably find a better name for this
    search_url='sqlite-parasitic:',
)
```

The slashes should probably work in the way they work for SQLite URI filenames.

Assuming a remote ("SQLite server") storage registered for the `unix` and `tcp` schemes:

```python
make_reader('unix://:authkey@/path/to/unix/socket?path=/path/to/db')
make_reader('tcp://:authkey@1.2.3.4:5678?path=/path/to/db')
```

Assuming a SQLAlchemy-based storage:

```python
make_reader('sqlalchemy+postgresql://user:pass@host/db')
```

**Incremental implementation**

Most likely, YAGNI (most of this). (But it was a "learning experience" to write.) For private use, moving the Storage and Search instantiation from the Reader constructor to `make_reader()` should be enough to allow using different implementations for them. If anyone else wants to use a different storage/search, we would probably get away with exposing the private `_storage` argument. We can probably live without allowing any keyword arguments until they're actually needed. If they receive heavy use and have types different from strings, it makes sense to implement the type conversion thing.
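A minimal sketch of the storage/search wiring described in this comment; only `default_search_cls` comes from the thread itself, everything else (`_SEARCH_REGISTRY`, `_make_search`) is hypothetical:

```python
# Hypothetical wiring of storage and search inside make_reader().
_SEARCH_REGISTRY = {}  # scheme -> search class, mirroring storage discovery

def _make_search(storage, search_url=None, search=None, search_cls=None, **kwargs):
    if search is not None:             # an already instantiated Search: use as-is
        return search
    if search_cls is None:
        if search_url is None:         # "default search" for this storage
            search_cls = storage.default_search_cls
        else:                          # look up by scheme, like storage discovery
            search_cls = _SEARCH_REGISTRY[search_url.partition(':')[0]]
    # the "parasitic" pattern: search keeps a reference to the storage
    return search_cls(storage, **kwargs)
```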
I managed to use the diff below with bench.py to see what the difference between the vanilla storage and the "sqlite3 server" storage is; results below the diff (it ranges from 1 to 5x).

```diff
diff --git a/scripts/bench.py b/scripts/bench.py
index 2c574ce..fbb97ea 100644
--- a/scripts/bench.py
+++ b/scripts/bench.py
@@ -29,6 +29,35 @@ from reader import make_reader
from reader._app import create_app, get_reader
+# ---
+
+# sqliteserver is https://gist.github.com/lemon24/1f35deed9da79dc20d9c538aa0f5e0cc#file-08-manager-db-everything-py
+import sqliteserver
+import reader._storage
+
+class RemoteStorage(reader._storage.Storage):
+
+ def __init__(self, address, authkey, path, timeout=None):
+ self.address = address
+ self.authkey = authkey
+ super().__init__(path, timeout)
+
+ def connect(self, *args, **kwargs):
+ return sqliteserver.connect(self.address, self.authkey, *args, **kwargs)
+
+
+ADDRESS = '/tmp/db.sqlite.sock'
+
+def make_reader(url):
+ from reader import make_reader
+ return make_reader(None, _storage=RemoteStorage(ADDRESS, b'', url))
+
+manager = sqliteserver.make_manager_cls()(ADDRESS, b'')
+manager.start()
+
+# ---
+
+
def get_params(fn):
rv = []
 for param in inspect.signature(fn).parameters.values():
```

```console
$ python scripts/bench.py time get_entries_all get_entries_read get_entries_feed update_feeds '*search*' > before.txt
$ : ...
$ python scripts/bench.py time get_entries_all get_entries_read get_entries_feed update_feeds '*search*' > after.txt
$ python scripts/bench.py diff local.txt remote.txt --format times
number num_entries get_entries_all get_entries_feed get_entries_read search_entries_all search_entries_read update_feeds update_search
4 32 2.09 4.00 4.50 2.11 3.00 5.14 5.00
4 64 1.42 3.67 4.50 1.71 3.00 4.76 1.49
4 128 1.19 3.00 4.50 1.38 3.00 4.63 1.64
4 256 1.47 2.82 3.33 1.30 3.20 8.68 0.98
4 512 0.99 1.81 2.50 1.17 2.00 3.02 1.33
4 1024 2.22 2.42 4.67 1.15 1.50 2.24 1.08
4 2048 1.32 1.55 3.33 1.05 2.45 2.09 1.36
```
I guess reader now supports multiple storage implementations (at least privately), and what it doesn't support, we know how to implement. Resolving.
```python
reader = make_reader(
r'sqlalchemy+postgresql://'
r'postgres:password@'
r'flscrapper-egslava.c9xhhmfmayou.us-east-1.rds.'
r'amazonaws.com/flscrapper_egslava_db'
)
```

```
reader.exceptions.StorageError: error while opening database: sqlite3.OperationalError: unable to open database file
```

I can connect to this DB using pgAdmin/Postico; meanwhile reader tells me that it can't open the SQLite DB, though it isn't one. I haven't found any documentation section about PostgreSQL, except posts in this thread. What am I doing wrong?
Hi @egslava,

As mentioned in passing here, there's currently only one storage implementation, SQLite. That said, it is possible to write an alternative one (this issue is about what `make_reader()` should look like when that happens).

If you are willing/planning to implement a Postgres storage (not necessarily as part of reader itself), please let me know, so I can make the minimum necessary changes to unblock you. I would update `make_reader()` to allow at least passing in a different storage instance.

You would need to implement at least the following Storage and Search methods:

```
Storage.__enter__()
Storage.__exit__()
Storage.close()
Storage.add_feed()
Storage.delete_feed()
Storage.change_feed_url()
Storage.get_feed_last()
Storage.get_feeds_page()
Storage.get_feed_counts()
Storage.get_feeds_for_update()
Storage.get_entries_for_update()
Storage.set_feed_user_title()
Storage.set_feed_updates_enabled()
Storage.mark_as_stale()
Storage.mark_as_read()
Storage.mark_as_important()
Storage.get_entry_recent_sort()
Storage.set_entry_recent_sort()
Storage.update_feed()
Storage.add_or_update_entries()
Storage.add_entry()
Storage.delete_entries()
Storage.get_entry_last()
Storage.get_entries_page()
Storage.get_entry_counts()
Storage.get_tags_page()
Storage.set_tag()
Storage.delete_tag()
Storage.default_search_cls (class attribute pointing to Search class)
Search.enable()
Search.disable()
Search.is_enabled()
Search.update()
Search.search_entry_last()
Search.search_entries_page()
Search.search_entry_counts()
```

I can also extract these into a protocol / base class for you to build on.
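A rough sketch of what such a protocol could look like; the method names follow the list above, but the signatures are guesses, since the internal API isn't documented:

```python
# Sketch of Storage/Search protocols; signatures are illustrative guesses.
from typing import Any, Iterable, Protocol

class StorageType(Protocol):
    def close(self) -> None: ...
    def add_feed(self, url: str) -> None: ...
    def delete_feed(self, url: str) -> None: ...
    def get_feeds_for_update(self) -> Iterable[Any]: ...
    # ... and so on for the rest of the Storage methods listed above

class SearchType(Protocol):
    def enable(self) -> None: ...
    def disable(self) -> None: ...
    def is_enabled(self) -> bool: ...
    def update(self) -> None: ...
    # ... plus the search_entry* methods listed above
```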
A note regarding search: the SQLite search implementation is tightly bound to the SQLite storage (that is, search keeps a reference to the storage and uses its internals, e.g. database connections). Since Postgres also has support for full-text search AFAIK, it makes sense for it to follow the same pattern. (Having a search implementation backed by something other than the main storage is possible, but additional hooks would need to be exposed, so that search can be notified of changes at the Python level.)

Do let me know.
Thanks for the info. It took me a while to find this in the docs, but later I found it and just hosted my ten-line script on EC2, so it's fine for now, since I still have my free tier. To avoid ambiguity, I feel I should say that I'm not planning to implement PostgreSQL support, since I just don't need it that much right now and the current version of reader already fits my needs. Thanks for the great library!
No worries, it looks like we're in a similar situation here. Glad to hear reader is still useful as-is, thank you for reaching out. :D
reader is intended to support multiple storage implementations.
Making make_reader() the main constructor for Reader instances should serve as the base for this. IIRC the main inspiration for it was sqlalchemy.create_engine; it should be possible to expand make_reader() to a similar connection string + additional parameters model.
I recently wrote a "SQLite server" proof of concept using the stdlib multiprocessing; using it in a subclass of the existing storage should provide a different enough implementation to drive the design (that doesn't actually require implementing a whole new storage).
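For illustration, a minimal version of that idea using `multiprocessing.managers` (a sketch, not the actual proof of concept; `SQLiteManager` and `open_db` are made up):

```python
# Minimal "SQLite server" sketch using the stdlib multiprocessing.managers.
import sqlite3
from multiprocessing.managers import BaseManager

class SQLiteManager(BaseManager):
    pass

def _open_db(path, **kwargs):
    # Runs in the server process; clients get a proxy to the connection,
    # so all statements actually execute here, in a single process.
    kwargs.setdefault('check_same_thread', False)  # server threads share it
    return sqlite3.connect(path, **kwargs)

SQLiteManager.register('open_db', _open_db)

if __name__ == '__main__':
    manager = SQLiteManager(address='/tmp/db.sqlite.sock', authkey=b'secret')
    server = manager.get_server()
    server.serve_forever()
```

A client would register the same typeid without a callable, call `manager.connect()`, and then `manager.open_db(path)` to get a connection proxy.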
The client side of the prototype needs at least the following arguments: an address (a host/port pair, or a UNIX socket path), an authkey, and the SQLite database path.
These could be mapped to a connection string like scheme://:authkey@host:port or scheme://:authkey@/path, with the authkey encoded in some way (not sure how it would work for Windows pipes, though).
Additionally, in the future it may be possible to pass a SQLite database path, or maybe a full URI filename; I'm not sure how this would work with the connection string (e.g., the UNIX socket path may clash with the DB path).
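For what it's worth, the standard library already extracts these components from such URLs; a quick check with `urllib.parse` (not part of any proposed API):

```python
# How urllib.parse handles the connection strings above; the authkey rides
# in the password field, and the DB path can go in the query string.
from urllib.parse import parse_qs, urlsplit

u = urlsplit('tcp://:authkey@1.2.3.4:5678?path=/path/to/db')
print(u.scheme, u.hostname, u.port, u.password)  # tcp 1.2.3.4 5678 authkey
print(parse_qs(u.query)['path'])                 # ['/path/to/db']

u = urlsplit('unix://:authkey@/path/to/unix/socket')
print(u.password, u.path)                        # authkey /path/to/unix/socket
```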