Steps for downloading the full dataset from CiteSeerX:
Update:
Gregor has kindly provided a CiteSeerX fetcher that works much better than the old approach. Thanks :)
P.S. I am in the process of setting up a new web server at the lab, so I will try to make a whole dump available one of these days. If you need it sooner, email me.
Old approach:
Many thanks to this blog post.
I have added slightly more detail and made some minor changes that seemed to work better in my case.
- Download and extract the "Demo" from http://purl.oclc.org/NET/OPENSRC/downloads/oaiharvester/jars/oaiharvesterdemo.tar
- Download Xerces and place xerces.jar in the same directory as the Demo:
http://www.apache.org/dist/xerces/j/Xerces-J-bin.2.9.1.zip
- Go to the Demo directory and type the following command (all on one line) to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml":
java -classpath .:oaiharvester.jar:xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
Note that you will most likely not see any output right away, so you may want to check the file citeseerx_alldata.xml to make sure that data is being added to it. The data is about 520 MB in size, so the download may take a while.
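Since the harvester prints nothing while it runs, a small helper can make it easier to check that the file is growing. This is only a rough sketch: the function name dump_progress is made up for illustration, and the record count assumes that each harvested entry appears in the dump as a plain OAI-PMH <record> element, which I have not verified for this harvester's output.

```shell
#!/bin/sh
# dump_progress (hypothetical helper): print the size in bytes and an
# approximate record count for a partially downloaded dump file.
# Assumption: each harvested entry shows up as a literal "<record>" tag.
dump_progress() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "missing $file"
        return 1
    fi
    # wc -c reports the byte count; tr strips padding some systems add.
    size=$(wc -c < "$file" | tr -d ' ')
    # grep -o prints one line per match, so wc -l counts occurrences
    # even when several records share a single line.
    records=$(grep -o '<record>' "$file" | wc -l | tr -d ' ')
    echo "$size bytes, $records records"
}

# Example (run in a second terminal while the harvester is working):
#   while sleep 60; do dump_progress citeseerx_alldata.xml; done
```

If the byte count keeps increasing between checks, the download is still making progress; a count that stays flat for a long time suggests the harvester has stalled or finished.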
For more information, you may also want to look at the original website for the OAI harvester here.
keywords: citeseer citeseerx data set dataset dump download database db citation analysis link analysis