Associating user accounts with publications
For the purposes of reporting to NSF and producing a public bibliography of work that has used our testbed facilities, it would be helpful to be able to cross-reference our userbase with publication databases. Options for publication databases include:
- DBLP
- Microsoft Academic Search
- Google Scholar
At this point, I am leaning towards DBLP: it's a very large database that we can download and process ourselves as we see fit, as opposed to the other two, which are web services we'd have to interact with via searches. My understanding is that MS Academic Search has an API, which Google Scholar does not, but Google Scholar will search through the contents of papers, which MS Academic Search will not, making Google Scholar much better for finding papers that mention our name or URLs. DBLP also has the advantage of being focused on CS, and seems to contain more structured data about papers and people that we can use to generate bibliographies.
Here is what I imagine the workflow looking like (to be repeated periodically):
- Download the latest DBLP database
- Match users from our own DB to people in the DBLP database. This is necessarily going to be a bit fuzzy, likely based on email address, name, and institution. We should probably record this matching in our own DB, so that if we ever have to do any matching manually, that work gets saved. We should also make this a one-to-many mapping, as these kinds of databases do tend to inadvertently split people's records (e.g. maybe they have separate records for Kobus and Jacobus)
- Find all papers in the DB by those people: on the initial run, look for papers published since the testbed was "opened"; on subsequent runs, just look at ones published since we last checked
- Optional: automatically download the paper, if available, run text extraction, and look for keywords like the name of the testbed, names of hardware types, etc.
- Check each paper to see if it used the facility. We can do this ourselves (possibly assisted by automatic keyword search) and/or ask the users themselves. In my experience, doing this manually goes pretty fast, assuming you have a link to the paper: you open it up, Ctrl-F for the name of the facility, and it's usually immediately obvious whether the paper describes it in the context of evaluation. To ask users, we'd provide them with a list of the papers we know about and ask them, yes/no, did you use our testbed to evaluate this, and maybe, if they are in more than one project, which project it relates to. We'd need to do something to avoid bothering too many people with duplicate questions about the same papers.
- Record all of this, and create list views for the whole testbed, per user, per project, and maybe per publication venue, institution, etc. If the database we use somehow marks research areas, that would be an interesting criterion too.
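The matching step above could be sketched roughly like this. The `User` fields, the `difflib`-based similarity measure, and the 0.85 cutoff are all illustrative assumptions, not an existing schema:

```python
# Hedged sketch of the user-to-DBLP matching step. The User fields, the
# difflib similarity measure, and the 0.85 cutoff are all illustrative
# assumptions, not an existing schema.
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class User:
    user_id: str
    name: str
    dblp_names: list = field(default_factory=list)  # one-to-many on purpose

def name_similarity(a: str, b: str) -> float:
    """Crude fuzzy comparison; a real version would also weigh email/institution."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_users(users, dblp_author_names, cutoff=0.85):
    """Attach every DBLP author name similar enough to each user.

    One-to-many because DBLP sometimes splits one person across
    multiple records (e.g. "Kobus" vs. "Jacobus").
    """
    for user in users:
        for author in dblp_author_names:
            if name_similarity(user.name, author) >= cutoff:
                user.dblp_names.append(author)
    return users
```

In practice we'd persist `dblp_names` back to our own DB so that manual corrections survive re-runs.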
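The optional keyword-scan step could look like the following sketch; it assumes the paper has already been converted to plain text (e.g. with a tool like pdftotext), and the keyword list is a hypothetical placeholder, not our actual testbed's name:

```python
# Sketch of the optional keyword scan over extracted paper text. Assumes the
# PDF has already been converted to plain text; the keyword list below is a
# hypothetical placeholder.
import re

KEYWORDS = ["ExampleTestbed", "testbed.example.org"]  # placeholders

def find_keyword_mentions(text, keywords=KEYWORDS):
    """Return {keyword: occurrence count} for keywords found (case-insensitive)."""
    hits = {}
    for kw in keywords:
        count = len(re.findall(re.escape(kw), text, flags=re.IGNORECASE))
        if count:
            hits[kw] = count
    return hits
```

A paper with no hits still needs a human (or the author) to confirm, since acknowledgments sections and footnotes phrase things in many ways.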
One possible place to look for code to parse DBLP data could be CSrankings, which also has some additional metadata that could be useful, like mappings of faculty to institutions. @johnsond might also have some relevant code.
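As a rough idea of what parsing the DBLP dump could involve, here is a minimal streaming sketch using only the standard library. Note that the real dblp.xml is several gigabytes and references character entities defined in dblp.dtd, so a production parser would likely need lxml (or a pre-loaded DTD); this version assumes entity-free input:

```python
# Minimal streaming sketch over a DBLP-style XML dump. The real dblp.xml
# uses entities from dblp.dtd, which ElementTree cannot resolve on its own,
# so treat this as an illustration of the iterparse pattern only.
import xml.etree.ElementTree as ET

RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis"}

def iter_publications(source):
    """Yield (key, title, year, [author names]) for each publication record."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in RECORD_TAGS:
            yield (
                elem.get("key"),
                elem.findtext("title"),
                elem.findtext("year"),
                [a.text for a in elem.findall("author")],
            )
            elem.clear()  # free memory as we go; essential on the full dump
```

The `elem.clear()` call is what keeps memory bounded, which matters given the size of the full dump.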