202505 pubScan: Explore your scientific co-authorship network
In the winter at the end of 2024, I was parsing data and creating some small-scale visualizations of the co-publication networks of selected scientists. It was pretty fast and all the data was stored locally (retrieved over the PubMed API).
And then I had an idea: it would be really cool to be able to search for any author and visualize their co-publication network. However, to reach that goal, challenges A, B and C popped up :-)
Challenge A Parsing over the PubMed API was not an option anymore, since my code would retrieve all publications of the center author (the author the user searches for), find all co-authors on those publications, and then also retrieve all the publications of these co-authors.
Not only was this too slow over the API, it was also not the way to go: one should never "attach" oneself to an API like that if there are other options.
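For illustration, here is a minimal sketch of that API-based fan-out, using the public NCBI E-utilities endpoints; the author name and limits are placeholders, not the actual pubScan code.

# a sketch of the discarded API-based fan-out (placeholder names and limits)
import requests
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_author_pmids(author, retmax=200):
    # esearch: PMIDs for one author query
    r = requests.get(f"{EUTILS}/esearch.fcgi",
                     params={"db": "pubmed", "term": f"{author}[Author]",
                             "retmode": "json", "retmax": retmax})
    return r.json()["esearchresult"]["idlist"]

def coauthors_of(pmids):
    # efetch: full records, then read the author lists from the XML
    r = requests.get(f"{EUTILS}/efetch.fcgi",
                     params={"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"})
    root = ET.fromstring(r.content)
    names = set()
    for author in root.iter("Author"):
        last = author.findtext("LastName") or ""
        fore = author.findtext("ForeName") or ""
        if last:
            names.add(f"{last} {fore}".strip())
    return names

# the fan-out that made this too slow: one extra round of requests per co-author
center_pmids = search_author_pmids("Doe John")        # hypothetical center author
for coauthor in coauthors_of(center_pmids):
    _ = search_author_pmids(coauthor)                 # hundreds of additional API calls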
Solution A
PubMed provides XML files for all the publications stored in its database. The main files are called the PubMed baseline, and then there are monthly updates called PubMed updatefiles. So I could download the data and store it locally on the server, but how to make it searchable, by PMID and by author name, and in a very fast way?
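As a rough sketch of what parsing one of those gzipped XML files can look like (the file name below is just an example; the baseline and updatefiles are published by NCBI under https://ftp.ncbi.nlm.nih.gov/pubmed/):

# a sketch of streaming one baseline file into (pmid, authors) records
import gzip
import xml.etree.ElementTree as ET

def parse_baseline_file(path):
    with gzip.open(path, "rb") as handle:
        # iterparse keeps memory usage low even for very large files
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            pmid = elem.findtext("MedlineCitation/PMID")
            authors = []
            for author in elem.iter("Author"):
                last = author.findtext("LastName") or ""
                fore = author.findtext("ForeName") or ""
                if last:
                    authors.append(f"{last} {fore}".strip())
            yield pmid, authors
            elem.clear()  # free the element once processed

for pmid, authors in parse_baseline_file("pubmed25n0001.xml.gz"):  # example file name
    print(pmid, len(authors))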
I tried several things: parsing the XML and storing PMID records in one file and author-to-PMIDs records in another file, and then searching with grep. This worked reasonably well up to a certain file size, but in the end it was not feasible. I turned to MySQL and created two tables,
mysql:authors and mysql:publications. After doing some indexing and debugging the XML parsing (ah!), I got a database that I could search in two ways: either search for an author (full name) and get all the author's PMIDs, or search for a PMID and get all the information related to a specific publication.
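Roughly, the setup looks like the sketch below; the column names and credentials are illustrative, not the actual pubScan schema.

# an illustrative version of the two tables and the two lookups
import mysql.connector  # assumes mysql-connector-python is installed

db = mysql.connector.connect(user="pubscan", password="secret", database="pubscan")
cur = db.cursor()

# one row per (author name, PMID) pair, indexed on both columns
cur.execute("""CREATE TABLE IF NOT EXISTS authors (
                   name VARCHAR(255) NOT NULL,
                   pmid BIGINT NOT NULL,
                   INDEX idx_name (name),
                   INDEX idx_pmid (pmid))""")

# one row per publication, keyed by PMID
cur.execute("""CREATE TABLE IF NOT EXISTS publications (
                   pmid BIGINT PRIMARY KEY,
                   title TEXT,
                   journal VARCHAR(255),
                   pub_year SMALLINT)""")

# lookup 1: author (full name) -> all of that author's PMIDs
cur.execute("SELECT pmid FROM authors WHERE name = %s", ("Doe John",))
pmids = [row[0] for row in cur.fetchall()]

# lookup 2: PMID -> everything stored for that publication
cur.execute("SELECT title, journal, pub_year FROM publications WHERE pmid = %s", (pmids[0],))
record = cur.fetchone()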
However, the user would start typing the author name in the browser, so I also needed a name-matching method, so that the names suggested to the user would be the ones present in the pubScan (=PubMed) database. Leaving aside the problem that authors do not have unique IDs in PubMed, I still needed a way to match names. And here, simply writing all the names to a file (one per line) and then using grep (with -E and some additional tricks) proved to work well.
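Something along these lines; the authors.txt file name and the exact grep flags are assumptions based on the description above.

# a sketch of grep-based name suggestions for the autocomplete
import re
import subprocess

def suggest_names(prefix, limit=20):
    # -i: case-insensitive, -E: extended regex, -m: stop after `limit` matches
    pattern = "^" + re.escape(prefix)
    result = subprocess.run(
        ["grep", "-i", "-E", "-m", str(limit), pattern, "authors.txt"],
        capture_output=True, text=True)
    return result.stdout.splitlines()

print(suggest_names("smith j"))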
Challenge B The size of the network for some authors was just too large. I realized the vis.js component I was using was ideal for displaying 150-200 nodes and a maximum of 3-5K edges, for the network to remain clickable and responsive.
Solution B
To keep the size of the network manageable, pubScan shows a maximum of 150 co-authors (selected by the number of publications shared with the main author). Additionally, up to 2,000 edges are included: all edges between the main author and their co-authors plus the 100 strongest co-authorship links between co-authors. If this comes to fewer than 2,000 edges, a random sample of the remaining edges is added.
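A sketch of that trimming logic, with assumed data structures (a map from co-author to shared-publication count with the main author, and a map from co-author pairs to their shared-publication count):

# a sketch of the network trimming described above (data structures assumed)
import random

MAX_COAUTHORS = 150
MAX_EDGES = 2000
STRONGEST_CO_EDGES = 100

def trim_network(main_author, shared, co_edges):
    # keep the 150 co-authors with the most publications shared with the main author
    coauthors = sorted(shared, key=shared.get, reverse=True)[:MAX_COAUTHORS]
    kept = set(coauthors)

    # always include every edge between the main author and the kept co-authors
    edges = [(main_author, c) for c in coauthors]

    # edges among the kept co-authors, strongest first
    candidates = sorted(
        (pair for pair in co_edges if pair[0] in kept and pair[1] in kept),
        key=co_edges.get, reverse=True)
    edges += candidates[:STRONGEST_CO_EDGES]

    # if there is still room under 2,000 edges, add a random sample of the rest
    remaining = candidates[STRONGEST_CO_EDGES:]
    room = MAX_EDGES - len(edges)
    if room > 0 and remaining:
        edges += random.sample(remaining, min(room, len(remaining)))
    return edges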
Challenge C Loading all the data (stored locally on my server) would still take a few seconds, or even up to 10 seconds for very large networks, so I needed some way to stream the process, meaning the user would see the progress of the download (construction) of the network.
Solution C
Here the solution was streaming: a single call from the browser client to the server API generates the response "in steps", and each step triggers an update on the client side (for example, a visual display of progress in the browser).
// streaming on the client side using fetch
function get_publications(pmids) {
    let encodedQuery = encodeURIComponent(pmids);
    const datetime = new Date().toISOString();
    fetch("https://pubscan.expressrna.org/gw/index.py?action=get_publications&pmids=" + encodedQuery + "&response_type=json&datetime=" + datetime).then(response => {
        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        let buffer = "";
        function readStream() {
            return reader.read().then(({ done, value }) => {
                if (done) {
                    // stream finished
                    return;
                }
                buffer += decoder.decode(value, { stream: true });
                let parts = buffer.split("\n");
                buffer = parts.pop(); // keep the last incomplete part in the buffer
                for (const part of parts) {
                    try {
                        const mydata = JSON.parse(part);
                        // ... handle each streamed message (e.g. update the progress display)
                    } catch (e) {
                        // skip lines that are not complete JSON
                    }
                }
                return readStream(); // keep reading until the stream is done
            });
        }
        return readStream();
    });
}
# streaming on the server side with mod_wsgi using yield
import json
import sys

class TableClass():
    ...
    def get_data(...):
        # each yield sends one JSON line to the client immediately
        test = {"instruction": "progress", "description": f"found {len(pmids)} publications for {center_name}"}
        yield self.return_string(json.dumps(test) + "\n")
        sys.stdout.flush()