[Topaz-dev] [Topaz-pubapp-dev] Mulgara Performance Woes

Russell Uman ruman at plos.org
Tue Mar 4 09:49:47 PST 2008


> I just tried checking the sar logs, but they only go back to the 24th.
> However, it seems you upgraded on the 26th at 2:53PM? In that 
> case, the sar log from the 25th indicates about 31% avg cpu 
> usage, and on the 27th 41% avg cpu usage. If the update date 
> is correct, then this would indicate that while there is some 
> increased cpu usage in the new code, it's not huge and not 
> nearly what I thought. But if the increase is only those 10%, 
> then why are you suddenly running into these massive 
> problems? Or have I got a completely wrong picture of 
> things and you've been having big problems before the 
> upgrade? (I know you've had problems while ingesting, but now 
> you seem to have problems all the time?)

:) i certainly haven't been monitoring mulgara's load regularly. i was
just curious what evidence you had to favor an inflection point over a
slow increase in traffic. it sounds like we don't know - but at least it
wasn't a sudden jump at the release of 0.8.2.1.

the timeline iirc is as follows:

we identified the ehcache memory leak or whatever issue sometime before
0.8.2 (maybe even before 0.8.1?) and we instituted cron restarts to
prevent it from happening, and to try and get the inevitable
restart-related crashes to happen at a known time.

then, for the entire month of february, we were ingesting 12 hours a day
and performance was terrible, presumably because of ingest.

next, since launching the CJs (on 2/29, after a 2/26 upgrade to 0.8.2.1)
we've had terrible performance.

my theory is that the extra traffic from the CJs is the cause of our
load and instability now that we're not ingesting constantly.

however, we should definitely consider other possibilities. it's
definitely possible that we've introduced or uncovered some bugs related
to caching - perhaps we're flushing caches prematurely on some
operations - i'll ask the front end devs what logging options we have to
investigate this.

the scariest possible theory is that we're somehow introducing bad data
(either corrupt or semantically invalid) into mulgara when we force it
to stop with a transaction open, and the series of restarts is making
things worse and worse. i'm working on finding a time to export and
re-import the database to see if that helps at all...



