收录日期:2019/02/17 00:06:57 时间:2010-06-16 02:11:27 标签:python,google-app-engine,indexing,scalability

I have these entity kinds:

  • Molecule
  • Atom
  • MoleculeAtom

Given a list(molecule_ids) whose lengths is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hunreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.

Both of these bulk lookups need to happen really fast. Right now I'm doing something like:

atom_ids_by_molecule_id = {}

for molecule_id in molecule_ids:
    moleculeatoms = MoleculeAtom.all().filter('molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
    atom_ids_by_molecule_id[molecule_id] = [
        MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
    ]

Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.

Ideas:

  • Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.

  • Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.

  • How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?

Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.

Of course, you'll need to denormalize the other way around too (entity with key name atom ID, value a list of molecule IDs) and will need to update everything carefully when atoms or molecules are altered, added, or deleted -- if you need that to be transactional (multiple such modifications being potentially in play at the same time) you need to arrange ancestor relationships.. but I don't see how to do it for both molecules and atoms at the same time, so maybe that could be a problem. Maybe, if modifications are rare enough (and depending on other aspects of your application), you could serialize the modifications in queued tasks.