Google Summer of Code Ideas

From DSpace Wiki

DSpace and Google Summer of Code

Student applications are accepted Mon, March 23, 12pm – Fri, April 3, 12pm.

Please spread awareness of this program and DSpace among your best students! Point them to:

Potential Google Summer of Code projects

Feel free to review previous ideas on the Summer of Code Ideas 2008 page; if anything there seems appropriate, please bring it forward.

DSpace 2.0 Initiative

After some discussion within the DSpace 2.0 group, we've concluded that using the GSoC as a source for moving forward the DSpace 2.0 initiative is an excellent idea. If you have an idea for DSpace 2.0 development, please feel free to add it here. We will be organizing a list of ideas within the DSpace 2.0 Architecture group and presenting that list here as well. If you would like to discuss DSpace 2.0 projects explicitly, please feel free to join the dspace-architecture list as well and drop us a note there.

https://lists.sourceforge.net/lists/listinfo/dspace-architecture

  • General note on code from the GSoC developers
    • Code should be maintainable and adhere to the code style of the DS2 project
    • Code must not use any packages whose licenses are incompatible with the project's open source license (e.g. commercial, GPL, LGPL)

DSpace REST webapp

Implement a REST web application to support easy interaction with DSpace from other languages and from remote systems.

  • The task would be to define a set of REST APIs (URLs, Responses, Docs, etc.) and validate them with the mentor
  • Once validated the task would then be to implement the REST APIs using whatever technology the developer is comfortable with (this should be discussed with the mentor though)
    • Documentation must be available at the REST URLs such that a developer can easily view the docs by simply accessing root URLs
    • Documentation must be i18n compliant and easy to create translations for
    • The code must not take advantage of "insider knowledge" to get data from DSpace and must only use the available APIs
  • Aaron separated out the SAKAI REST implementation (EntityBus) into a separate project usable in both DSpace and SAKAI. We intend to be using this. Learn all about it here: http://code.google.com/p/entitybus/
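
As a rough illustration of the requirement that documentation be reachable at the root URLs and be easy to translate, the sketch below keeps a table of routes and renders a localized description for each one. Every name in it (RestRoute, RestDocs, the example paths and message keys) is hypothetical and is not part of DSpace or EntityBus; the real URL set would be agreed with the mentor first.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** One documented REST route; the paths and message keys used here are hypothetical. */
class RestRoute {
    final String method;
    final String path;
    final String docKey;

    RestRoute(String method, String path, String docKey) {
        this.method = method;
        this.path = path;
        this.docKey = docKey;
    }
}

/** Renders plain-text documentation for a set of routes in a requested language. */
class RestDocs {
    private final List<RestRoute> routes;
    // language code -> (message key -> translated text); stands in for message catalogues
    private final Map<String, Map<String, String>> messages = new LinkedHashMap<>();

    RestDocs(List<RestRoute> routes) {
        this.routes = routes;
    }

    void addTranslation(String language, String key, String text) {
        messages.computeIfAbsent(language, k -> new LinkedHashMap<>()).put(key, text);
    }

    /** Looks up a message, falling back to English and then to the raw key. */
    String message(Locale locale, String key) {
        String text = lookup(locale.getLanguage(), key);
        if (text == null) {
            text = lookup("en", key);
        }
        return text != null ? text : key;
    }

    private String lookup(String language, String key) {
        Map<String, String> catalogue = messages.get(language);
        return catalogue == null ? null : catalogue.get(key);
    }

    /** What a GET on the root URL could return: one line per route. */
    String render(Locale locale) {
        StringBuilder out = new StringBuilder();
        for (RestRoute r : routes) {
            out.append(r.method).append(' ').append(r.path)
               .append(" - ").append(message(locale, r.docKey)).append('\n');
        }
        return out.toString();
    }
}
```

Keeping the descriptions behind message keys means a translator only has to supply a new catalogue; the route table itself does not change.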

Authentication/Authorization providers

Port LDAPAuthenticator and X509Authenticator, and create new Authenticators for OpenID, OpenAuth, etc.

  • The task would be to create DS2 providers for authentication and principal extractors:
    • LDAP
    • OpenId
    • X509
    • Basic Auth
    • Kerberos
    • Active Directory
    • Shibboleth
    • CAS
  • The task would also be to create DS2 providers for authorization:
    • LDAP
    • OpenAuth
    • IP Based
    • Kerberos
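
To make the idea of pluggable providers concrete, here is a minimal sketch of a provider contract and a chain that tries providers in order. The interface and class names are assumptions made for this example and do not reflect the actual DS2 provider API; the Basic Auth provider uses a plain in-memory password table purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical provider contract: returns a principal name, or empty if it cannot decide. */
interface AuthenticationProvider {
    Optional<String> authenticate(Map<String, String> credentials);
}

/** Toy Basic Auth provider backed by an in-memory user table (illustration only). */
class BasicAuthProvider implements AuthenticationProvider {
    private final Map<String, String> passwords;

    BasicAuthProvider(Map<String, String> passwords) {
        this.passwords = passwords;
    }

    @Override
    public Optional<String> authenticate(Map<String, String> credentials) {
        String user = credentials.get("username");
        String pass = credentials.get("password");
        if (user != null && pass != null && pass.equals(passwords.get(user))) {
            return Optional.of(user);
        }
        return Optional.empty();
    }
}

/** Tries each provider in registration order and returns the first principal found. */
class AuthenticationChain implements AuthenticationProvider {
    private final List<AuthenticationProvider> providers = new ArrayList<>();

    AuthenticationChain add(AuthenticationProvider provider) {
        providers.add(provider);
        return this;
    }

    @Override
    public Optional<String> authenticate(Map<String, String> credentials) {
        for (AuthenticationProvider provider : providers) {
            Optional<String> principal = provider.authenticate(credentials);
            if (principal.isPresent()) {
                return principal;
            }
        }
        return Optional.empty();
    }
}
```

An LDAP, X509 or CAS provider would implement the same interface and be added to the chain, so the calling code would not need to know which mechanism succeeded.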

Storage Service implementations

DSpace 2 has a generalized storage service API which allows a DS2 repository to use many possible systems to store repository data (DBMS, JCR, etc.), and even other repositories (Fedora, Eprints, etc.).

  • The task would be to implement additional Storage Service providers for DS2 (some examples listed)
    • JCR (Alfresco/Xythos/jackrabbit??) REST
    • JCR (Alfresco/Xythos/jackrabbit??) RMI
    • Fedora REST
    • Filesystem + Sesame (or other TripleStore)
    • Filesystem + DBMS
    • IRoDS/SRB
    • s3/SimpleDB
    • Other?
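
A provider in this sense could look roughly like the sketch below: a small storage contract plus a filesystem-backed implementation. The StorageProvider interface is invented for this example and is not the actual DS2 storage service API, which may be shaped quite differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical storage contract; the real DS2 storage service API may differ. */
interface StorageProvider {
    void put(String id, byte[] content) throws IOException;

    Optional<byte[]> get(String id) throws IOException;

    boolean delete(String id) throws IOException;
}

/** Stores each entity as one file under a root directory. */
class FilesystemStorageProvider implements StorageProvider {
    private final Path root;

    FilesystemStorageProvider(Path root) {
        this.root = root;
    }

    // Reject ids that could escape the root directory (for example "../x").
    private Path resolve(String id) {
        if (!id.matches("[A-Za-z0-9._-]+") || id.startsWith(".")) {
            throw new IllegalArgumentException("unsafe id: " + id);
        }
        return root.resolve(id);
    }

    @Override
    public void put(String id, byte[] content) throws IOException {
        Files.createDirectories(root);
        Files.write(resolve(id), content);
    }

    @Override
    public Optional<byte[]> get(String id) throws IOException {
        Path file = resolve(id);
        return Files.exists(file) ? Optional.of(Files.readAllBytes(file)) : Optional.empty();
    }

    @Override
    public boolean delete(String id) throws IOException {
        return Files.deleteIfExists(resolve(id));
    }
}
```

A JCR, Fedora or S3 provider would implement the same three operations against its own back end, which is what lets the repository swap storage systems without changing calling code.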

Distributed repository searching system

A cloud-style search system that allows repositories to register, indicate that they are searchable, and declare which search interfaces and parameters from a standard set they support.

  • The task would be to develop an application (probably python based in appengine since it is free but any free scalable alternative is good) which repositories could register themselves with to indicate they want to participate in wider/global search
    • Defining the standard set of parameters and search options would be the first task (DC? Author/title/publish date?)
    • Defining registration information would be the second task (geolocation, name, associations, type of repo, etc.)
    • Defining the interface to register and the interface to query for searchable repositories
    • Defining the method for getting back search results (realtime, email, etc.)
    • Deciding how much caching of results and information the central system should do (should the repo be allowed to specify this)

This would mostly be a proof of concept which could be used to build a more robust and production-oriented method of inter-repository communication later on.
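
The registration and lookup steps listed above could be prototyped along these lines. This is only an in-memory sketch written in Java for consistency with DSpace, not the App Engine application the idea mentions; the parameter names, the record fields and the class names are all assumptions, since defining the real standard set is part of the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Registration record; every field here is an assumption made for illustration. */
class RepositoryRegistration {
    final String name;
    final String searchUrl;
    final Set<String> supportedParameters;

    RepositoryRegistration(String name, String searchUrl, Set<String> supportedParameters) {
        this.name = name;
        this.searchUrl = searchUrl;
        this.supportedParameters = supportedParameters;
    }
}

/** In-memory registry of repositories that opted in to wider search. */
class SearchRegistry {
    // Hypothetical standard parameter set; agreeing on the real one is the first task.
    static final Set<String> STANDARD_PARAMETERS = Set.of("author", "title", "date");

    private final Map<String, RepositoryRegistration> byName = new LinkedHashMap<>();

    /** Accepts a registration only if it stays within the standard parameter set. */
    void register(RepositoryRegistration registration) {
        if (!STANDARD_PARAMETERS.containsAll(registration.supportedParameters)) {
            throw new IllegalArgumentException("non-standard parameter in " + registration.name);
        }
        byName.put(registration.name, registration);
    }

    /** Returns the repositories that support every parameter used in a query. */
    List<RepositoryRegistration> findSearchable(Set<String> queryParameters) {
        List<RepositoryRegistration> result = new ArrayList<>();
        for (RepositoryRegistration registration : byName.values()) {
            if (registration.supportedParameters.containsAll(queryParameters)) {
                result.add(registration);
            }
        }
        return result;
    }
}
```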

Core, Logging, Event Management

Port Statistics framework off log4j logs and convert to use UsageEvent Plugin

It is a poor design choice to have statistics aggregation depend on log4j debugging logs: too many other third-party APIs depend on log4j, and we now have to mitigate that risk. We recommend a project that uses Mark Wood's UsageEvent contribution to move the statistics reporting off the log4j logs and onto a separate logging service.
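
The general shape of event-driven statistics, as opposed to log scraping, might look like the sketch below. It does not use the actual UsageEvent plugin classes; the event, listener and dispatcher types are stand-ins invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event; the real UsageEvent contribution carries different fields. */
class UsageEventSketch {
    final String action;   // for example "view" or "download"
    final String objectId; // for example an item handle

    UsageEventSketch(String action, String objectId) {
        this.action = action;
        this.objectId = objectId;
    }
}

/** Anything that wants to be told about usage events. */
interface UsageListener {
    void onEvent(UsageEventSketch event);
}

/** Fans each event out to all registered listeners. */
class UsageDispatcher {
    private final List<UsageListener> listeners = new ArrayList<>();

    void addListener(UsageListener listener) {
        listeners.add(listener);
    }

    void fire(UsageEventSketch event) {
        for (UsageListener listener : listeners) {
            listener.onEvent(event);
        }
    }
}

/** Statistics listener that counts views per object without reading any log file. */
class ViewCounter implements UsageListener {
    private final Map<String, Integer> views = new HashMap<>();

    @Override
    public void onEvent(UsageEventSketch event) {
        if ("view".equals(event.action)) {
            views.merge(event.objectId, 1, Integer::sum);
        }
    }

    int viewsOf(String objectId) {
        return views.getOrDefault(objectId, 0);
    }
}
```

With this arrangement the log4j configuration can change freely, because the statistics code never parses log output.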

Administration and Access Control

User Management - Enhancement

  • User self management
    • delete one's account
    • reminder of account
    • reminder of unfinished tasks and submissions
  • User contact (for alerts, feedback, ads)
    • all active users
    • all registered users
    • users of a group
    • users with special rights (e.g. all submitters)
  • manage non-active users, based on a set of rules
    • reminders sent
    • deletion
  • check for invalid accounts
  • validate undeliverable emails (registration, alert and so on)

Metadata Management

Extend Item templates to support creating Collection default ResourcePolicies for Items, Bundles and/or Bitstreams

Item templates currently support the metadata fields one may want in an Item by default, but do not support anything else, such as default bitstreams or default permissions on bitstreams. It would be good to extend Item templates to have more features.

Proposed by --Mark Diggory 14:58, 24 March 2009 (GMT)

Stackable Support of Naming Authority Tools

At the moment metadata is not controlled (apart from controlled vocabularies) by any authority tool. There are a lot of national tools (mostly based on national libraries) and some international projects like VNAF in this area. Enabling metadata field based stackable integration of tools for ingest would increase the quality of metadata and facilitate many other tasks, such as reporting and evaluation.

Proposed by -- Claudia Jürgen 31 March 2009


Typed Metadata Fields

DSpace is organized somewhat like an "old-fashioned" card catalogue: a database maintains strict and coherent storage of the data, and indexes are built dynamically with Lucene so that things can be found quickly in the database.

The indexing process adds a lot of value, as it ensures that users can find what they need starting from what they know. This process must be tailored to the exact needs of each application.

We would like to propose that DSpace evolve toward a concept of highly parameterized "pluggable indexation", where multilingual and multi-alphabet issues could be solved by easily accepting contributions and where new data types (duration, longitude/latitude, etc.) could also be fully supported.

This "openness" of indexation will also naturally raise the question of "openness" in the editing and display of a given value type (or language).

The proposal is therefore:

  1. to define an architecture for "data types management"
  2. to allow the definition of Java programmed Data Types Managers to handle a given type: the managers would provide validation, clean-up, tokenization, toString, toXML and toHTML methods on data values of a given type
  3. to allow the parameterization of DSpace to associate a data type to each DC (or other) fields for all languages or for a specific language
  4. in DSpace, to call the data type managers everywhere a value must be validated, indexed or displayed.
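
To give a feel for point 2, the sketch below defines a manager interface with the methods named in the proposal and one example manager for a latitude/longitude pair. The interface shape, the "lat,lon" decimal-degree format and the XML and HTML output are assumptions chosen for this example, not an agreed design.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical contract for a Data Type Manager, following the method list in the proposal. */
interface DataTypeManager {
    boolean validate(String raw);

    String cleanUp(String raw);

    List<String> tokenize(String value);

    String toString(String value);

    String toXML(String value);

    String toHTML(String value);
}

/** Example manager for a latitude/longitude pair written as "lat,lon" in decimal degrees. */
class LatLongTypeManager implements DataTypeManager {
    private static final Pattern PAIR =
            Pattern.compile("\\s*(-?\\d+(?:\\.\\d+)?)\\s*,\\s*(-?\\d+(?:\\.\\d+)?)\\s*");

    @Override
    public boolean validate(String raw) {
        Matcher m = PAIR.matcher(raw);
        if (!m.matches()) {
            return false;
        }
        double lat = Double.parseDouble(m.group(1));
        double lon = Double.parseDouble(m.group(2));
        return Math.abs(lat) <= 90 && Math.abs(lon) <= 180;
    }

    /** Removes all whitespace; the remaining methods assume a validated value. */
    @Override
    public String cleanUp(String raw) {
        return raw.replaceAll("\\s+", "");
    }

    /** Index tokens: latitude and longitude as separate terms. */
    @Override
    public List<String> tokenize(String value) {
        return Arrays.asList(cleanUp(value).split(","));
    }

    @Override
    public String toString(String value) {
        String[] parts = cleanUp(value).split(",");
        return "lat " + parts[0] + ", lon " + parts[1];
    }

    @Override
    public String toXML(String value) {
        String[] parts = cleanUp(value).split(",");
        return "<point lat=\"" + parts[0] + "\" lon=\"" + parts[1] + "\"/>";
    }

    @Override
    public String toHTML(String value) {
        return "<span class=\"geo\">" + toString(value) + "</span>";
    }
}
```

Point 3 would then amount to a configurable mapping from a metadata field (and optionally a language) to one of these managers.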

Solr

Solr could be a source of inspiration or of reuse:

JCR

A possibly efficient and "future proof" way would be to integrate a JCR compliant repository for storage, indexing and retrieval of metadata and bitstreams?

User:Christophe.Dupriez

i18n of DSpace Objects

At the moment only the static parts of the DSpace UI are internationalized (i18n). These are kept in message catalogues.

Variable metainformation is not presented in a language-dependent way.

The metainformation of an item, being stored as metadata, can in theory be presented in a language-dependent way, whereas the metadata of other DSpace objects (communities, collections, epersons, groups) is not stored as metadata in DSpace terms, so it cannot be internationalized.

There should be a distinction between properties (e.g. for config purpose) and metainformation of DSpace objects. The metainformation should be expressed as DSpace metadata objects, even for metametadata.

Management of input-forms via db

Move the input-forms from file based storage to the database and make them manageable via the UI. This should include the management of templates and sanity checks and be flexible enough to enable the below mentioned "Item type based input".

Item type based metadata collection

At present metadata collection is based on the input-forms.xml file. This file can be edited to collect different metadata for different collections. An alternative is proposed that would allow the administrator to collect different metadata based on item type. There are no doubt many ways this could be achieved, for example:

- An alternative input-forms.xml file based on item type.

or..

- An admin page that would allow the administrator to select metadata terms for each item type. Sensibly this information would be held in the database rather than a file. If I understand Christophe's proposal above then this option would dovetail nicely in that the definition of data type would no longer be done in input-forms.xml.

Robin Taylor. University of Edinburgh.

Metadata Registry Service

A stackable/pluggable service that can be used to query disparate CV, ontological and naming registries via a shared query syntax (XQuery? SPARQL?), returning XML/RDF. Our first goal is the ability to plug these services into fields in the Customizable Submission workflow pages to populate suggested values for fields. (Sources: LDAP, JDBC, DNS, XCat/OCLC/Barton, Google Scholar etc, GFR, other Metadata Registries).

Stretch it even further... DSpace itself is a managed registry of metadata. A properly architected infrastructure that allows DSpace and other registry metadata to be meshed in a "Semantic Web, Linked Open Data" way would establish a foundation for exposing DSpace instances as LOD. Providing SPARQL endpoints on both ends of the design means that DSpace instances could become both producers and consumers of LOD. --Mark Diggory 12:28, 5 April 2008 (EDT)

Add support to upload new metadata schemas as a file into the DSpace Metadata Registry.

Adding a whole new metadata schema to the DSpace metadata registry currently requires a developer. We recommend creating an upload form for the registry that creates new namespaces/fields based on at least the existing metadata schema configuration file format.

Proposed by --Mark Diggory 16:38, 24 March 2009 (GMT)

Community Tools and Community Trends

Adaptive Question Answering System based on the DSpace Mailing List Knowledge Base

Recent years have witnessed tremendous growth in the use of repository software, since the majority of scholarly content is now published in digital form, and the proliferation of DSpace instances is no exception. Popular software handles user queries through mailing lists supported by dedicated committers and contributors. A good number of the questions asked on a mailing list have already been answered there. In such cases, a Question Answering (QA) system would help users by answering a question if it has been answered before, or by suggesting related answers on the subject asked. For this, the information available in the DSpace mailing list knowledge base can be extracted using template-based extraction techniques or a rule-based system. Once extracted, it can be classified according to a taxonomical structure (i.e. Functional Overview, Installation, Upgrading, Configuration, Customization, Architecture, and Versions) that represents the DSpace system architecture/documentation. Keywords generated automatically from the message text improve the adaptive retrieval of relevant information in this QA system. A test-bed for this QA system can be built using the DSpace platform, and the taxonomical structure can be supported by the built-in controlled vocabulary feature.

Proposed by: Jayan C Kurian, Research staff, National University of Singapore, Singapore. email: Jayan@comp.nus.edu.sg, jayanntu@gmail.com

Potential Student: Ashly Markose, Post-graduate student, National University of Singapore, Singapore email : ashly@comp.nus.edu.sg, ashlymarkose@gmail.com

Research Trend Analysis using Institutional Repositories

Institutional repositories gather an organization’s scholarly content and support knowledge sharing and the dissemination of intellectual output. The communities and collections in a repository are designed according to an institution’s distribution of research centers, schools and divisions. Each collection that represents a school, division or research centre holds scholarly content produced by the academic projects mentored at those centers. By performing co-word analysis on each document collection, the individual research strengths of that division or school can be identified. The same principle can be applied to sub-collections as well as to communities in a repository. This reveals the research strengths of a particular division/school and, more generally, can be extended to determine the research strengths of an institution as a whole. In addition, a variation or trend in the research strengths of individual divisions/schools can be found by applying co-word analysis over a predetermined period of time. The feature described above can be built as an add-on to the existing DSpace framework and would support the growing use of DSpace as a research platform.

Proposed by: Jayan C Kurian, Research staff, National University of Singapore, Singapore. email: Jayan@comp.nus.edu.sg, jayanntu@gmail.com

Potential Student: Ashly Markose, Post-graduate student, National University of Singapore, Singapore email : ashly@comp.nus.edu.sg, ashlymarkose@gmail.com
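
At its core, the co-word analysis mentioned in this proposal counts how often pairs of keywords appear together in the same document. The sketch below shows only that counting step, under the assumption that each document has already been reduced to a set of keywords; real analyses usually normalise these raw counts, which is not shown. The class name and the pair key format are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CoWordCounter {
    /** Counts, for each unordered keyword pair, the number of documents containing both. */
    static Map<String, Integer> count(List<Set<String>> documents) {
        Map<String, Integer> pairs = new TreeMap<>();
        for (Set<String> keywords : documents) {
            // Sort so that a pair always gets the same key regardless of input order.
            List<String> sorted = new ArrayList<>(new TreeSet<>(keywords));
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairs.merge(sorted.get(i) + " + " + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return pairs;
    }
}
```

Running the same count over documents from different date ranges would give the per-period figures that a trend comparison needs.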


Report Generation Tool for DSpace

“Report Generation” adds value to any information management system, and institutional repositories are no exception. From an academic perspective, one of the main advantages is the ability to generate reports based on individual authors and contribution period (e.g. Jayan, 15Feb2007-25Apr2007). In addition, if publications (journals, e.g. JASIST (Wiley), IPM (Elsevier), or conferences, e.g. JCDL, ECDL) can be segregated by ranking, this would add much value from a management perspective. A summarized report of this kind across the academic disciplines of institutes of higher learning would draw strong interest from the stakeholders of institutional repositories. Our plan is to generate these reports from data extracted from the DSpace IR and to present them grouped by custom filter options such as author, contribution period and publication ranking, together with a summarized report for each academic discipline. The motivation for this proposal is feedback received from the academic community during a presentation meant to encourage contributions to repositories. Based on our initial thoughts, we plan to use "DataVision", an open source report writer tool, for report generation and to integrate it as an add-on to DSpace. We believe that such a feature would drive academic contribution to repositories, given its long-term benefits.

Proposed by: Jayan C Kurian, Research staff, National University of Singapore, Singapore. email: Jayan@comp.nus.edu.sg, jayanntu@gmail.com

Potential Student: Ashly Markose, Post-graduate student, National University of Singapore, Singapore email : ashly@comp.nus.edu.sg, ashlymarkose@gmail.com

Search, Browse, Discovery and Semantic Web

Port DSpace Crosswalk API to be a SAX/XSLT pipeline or STAX rather than JDOM

History repeats itself... the Cocoon folks learned long ago that pipelines would be both more efficient and more flexible if they were SAX driven rather than DOM driven. DSpace could take a page from that playbook and re-implement the Crosswalk API as a suite of SAX XMLReaders that take DSpace Objects and serialize them to XML. This would leverage the existing work done in the Manakin DSpace Adapter API and bring it into a new Addon, or into dspace-api directly, making it available as a much more efficient mechanism for getting DSpace Objects into XML.
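
The "object as XMLReader" idea can be sketched with the standard JAXP and SAX classes alone. In the example below a plain map of field names to values stands in for a DSpace Object, and the element names item and field are invented; it is not based on the Manakin adapter code. The reader emits SAX events directly, and an identity transformer serializes them, so no DOM tree is ever built.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** An XMLReader that emits SAX events for a metadata map instead of parsing a document. */
class ItemSaxReader extends XMLFilterImpl {
    private final Map<String, String> metadata;

    ItemSaxReader(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    /** The InputSource is ignored: the "input" is the in-memory object. */
    @Override
    public void parse(InputSource ignored) throws SAXException {
        ContentHandler out = getContentHandler();
        out.startDocument();
        out.startElement("", "item", "item", new AttributesImpl());
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "name", "name", "CDATA", entry.getKey());
            out.startElement("", "field", "field", atts);
            char[] text = entry.getValue().toCharArray();
            out.characters(text, 0, text.length);
            out.endElement("", "field", "field");
        }
        out.endElement("", "item", "item");
        out.endDocument();
    }
}

class CrosswalkSketch {
    /** Streams the reader's events through an identity transform into a string. */
    static String toXml(Map<String, String> metadata) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        identity.transform(new SAXSource(new ItemSaxReader(metadata), new InputSource()),
                new StreamResult(writer));
        return writer.toString();
    }
}
```

Replacing the identity transformer with one built from a stylesheet would turn this into the SAX/XSLT pipeline the heading describes.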

Search - Revise search to use Solr instead of plain Lucene

Solr is an Apache project that extends Lucene to provide (perhaps most notably of the list of features) faceted search and the ability to index and search specific fields.

Metadata Tracings

Enable tracings on such metadata fields as Author, Subject, etc. to launch a repository search for the metadata value from within another item or search results list (i.e. clicking 'Albert Einstein' in the author field display of one item would search the repository for all occurrences of 'dc.contributor.author=Albert Einstein').

I think DSpace JSPUI already has this feature? In XMLUI this would be xslt development and in general Manakin/XMLUI could use some more extensive development in this area --Mark Diggory 12:22, 5 April 2008 (EDT)
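
Whatever the UI layer, a tracing comes down to turning a field and a value into a search link. A minimal sketch, assuming a hypothetical /search endpoint with field and value parameters (neither is an existing DSpace URL):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class TracingLink {
    /** Builds a search URL for one metadata value; the endpoint and parameter names are hypothetical. */
    static String searchUrl(String baseUrl, String field, String value) {
        return baseUrl + "/search?field=" + encode(field) + "&value=" + encode(value);
    }

    // Form-style encoding, so a space becomes '+'.
    private static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }
}
```

The display code would wrap each author or subject value in a link built this way.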

LinkOut

The metadata of a document can be very useful to propose services to the user around the document. For instance:

  • an author name can be used to send an e-mail (if it is a DSpace user) or to run a Google Scholar search
  • an ISSN can be used to access the publisher's home page
  • the title can be used for a citation search
  • a CAS number can be used to search a chemical database (or any ID of a gene, plant, etc. can serve to make a database search).
  • an institutional Id. can link to applications around this Id providing some service
  • etc.

I would propose to be able to parameterize a link from any given metadata field to a "linked services" page (services provided by external applications OR by DSpace itself, for instance documents of the same author, on the same subject, etc.). This is a generalisation of existing features proposed by different patches. Christophe.Dupriez 14:50, 17 March 2007 (EDT)
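
One simple way to parameterize such links is a table of URL templates per metadata field, with a placeholder for the value. The sketch below assumes a {value} placeholder convention and example.org URLs, all of which are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps a metadata field to the URL templates of services that can use its value. */
class LinkOutRegistry {
    private final Map<String, List<String>> templatesByField = new LinkedHashMap<>();

    /** Registers a template such as "https://example.org/lookup?q={value}" for a field. */
    void addTemplate(String field, String urlTemplate) {
        templatesByField.computeIfAbsent(field, k -> new ArrayList<>()).add(urlTemplate);
    }

    /** Expands every template registered for the field with the URL-encoded value. */
    List<String> linksFor(String field, String value) {
        String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
        List<String> links = new ArrayList<>();
        for (String template : templatesByField.getOrDefault(field, List.of())) {
            links.add(template.replace("{value}", encoded));
        }
        return links;
    }
}
```

The "linked services" page for an item would then list the expanded links for each of its metadata values.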

Build, Installation, Testing and Running DSpace

Complete installation process (ant fresh_install, update, init_configs, etc) as Maven goals

We almost completed the process of porting the build system to Maven; however, that process stalled on several tasks that are critical for the installation/deployment of DSpace and that relied heavily on Ant. We would like to explore (1) running Ant tasks from Maven, and (2) creating a set of Maven plugins for DSpace installation, update and deployment.

Sanity Checking Framework

  • checking for prerequisites and rights
  • are all components installed and running
  • system diagnostics
  • other dependencies, like code and content
    • metadata registry, input-forms and Messages.properties
  • number of bitstreams in db and assetstore
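
A framework of this kind could start from a very small contract: each check has a name and either passes or reports a problem. The sketch below shows that contract with one example check for a writable directory. All names are hypothetical; checks for the database, the metadata registry or the bitstream counts would be further implementations of the same interface.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical contract for a single check. */
interface SanityCheck {
    String name();

    /** Returns null when the check passes, otherwise a short problem description. */
    String run();
}

/** Example check: a directory (such as an asset store) exists and is writable. */
class DirectoryWritableCheck implements SanityCheck {
    private final Path dir;

    DirectoryWritableCheck(Path dir) {
        this.dir = dir;
    }

    @Override
    public String name() {
        return "writable directory " + dir;
    }

    @Override
    public String run() {
        if (!Files.isDirectory(dir)) {
            return "not a directory";
        }
        if (!Files.isWritable(dir)) {
            return "not writable";
        }
        return null;
    }
}

class SanityReport {
    /** Runs every check and collects one line per failure. */
    static List<String> runAll(List<SanityCheck> checks) {
        List<String> failures = new ArrayList<>();
        for (SanityCheck check : checks) {
            String problem = check.run();
            if (problem != null) {
                failures.add(check.name() + ": " + problem);
            }
        }
        return failures;
    }
}
```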

Build an automated testing system (unit tests, kinda)

  • Create some automated tests to detect bugs in DSpace code, particularly in the org.dspace.content package. It does not need to be particularly fast. The test rig could load some test data into DSpace, check that it's there, perform various manipulations and check that the expected results appear. Authorisation would also be an important factor to test.

Unit tests that cover a lot of this have been written for 1.6 (though they aren't yet in trunk). What I would like to see is a mechanism for the construction of 'sandbox' environments for running such tests in. For example, proper testing would require a test database, test asset store, etc, which is all quite messy to do by hand, and tricky to automate. --James Rutherford
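
The filesystem half of such a sandbox is fairly easy to automate. The sketch below creates a throwaway directory tree with an asset store and removes it when the test finishes. The directory layout and the assetstore.dir property name are assumptions for this example, and the test database, which is the harder part, is only marked with a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Throwaway directory tree for one test run; layout and property names are assumptions. */
class TestSandbox implements AutoCloseable {
    final Path root;
    final Path assetStore;
    final Properties config = new Properties();

    TestSandbox() throws IOException {
        root = Files.createTempDirectory("dspace-test");
        assetStore = Files.createDirectories(root.resolve("assetstore"));
        config.setProperty("assetstore.dir", assetStore.toString());
        // A scratch test database would be created and configured here as well.
    }

    /** Deletes the whole tree, children before parents. */
    @Override
    public void close() throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

Used in a try-with-resources block, the sandbox is cleaned up even when a test fails part-way through.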

A useful precursor to unit tests, IMHO, would be an Eclipse project file setting. This would allow developers to build more easily in Eclipse without having to do all the IDE setup themselves. You could then use Eclipse's built-in JUnit support. Plus, of course, Eclipse would help with development generally.
