Regular expression (regex) to clean up text

Hi,

Server - MEDIAWIKI - MYSQL - CENTOS 5 - PHP5
I have imported close to a million pages from a database into my MediaWiki site.

The format that we're left with is not pretty, and I need to find a way to clean this up and present it nicely...

I think regex is the best option, as I can do a search and replace on text only via a MediaWiki extension, so I would need to know a simple regex to accomplish this.

Here is a sample of the text, which shows the problem.

{{USA Case Law |Court=1st Circuit |Docket No.=94-1950 |Case name=Clarke v. Kentucky Fried |Original Document=http://www.ca1.uscourts.gov/cgi-bin/getopn.pl?OPINION=94-1950.01A }}





July 5, 1995UNITED STATES COURT OF APPEALS
FOR THE FIRST CIRCUIT


No. 94-1950
KARIN CLARKE,

Plaintiff, Appellant,

v.

 KENTUCKY FRIED CHICKEN OF CALIFORNIA, INC.,

 Defendant, Appellee.





 ERRATA SHEET




 Theopinion ofthisCourt issuedonJune 14,1995,is
amended as follows:


 Cover sheet, underlisting ofcounsel, add: NanMyerson
Evans, Bon Tempo & Evans and David A. Robinson on brief of amicus
curiae National Employment Lawyers Association.































[Appendix not attached.Please contact Clerk's Office
for opinion with appendix.]
UNITED STATES COURT OF APPEALS
FOR THE FIRST CIRCUIT

No. 94-1950

KARIN CLARKE,

Plaintiff, Appellant,

v.

 KENTUCKY FRIED CHICKEN OF CALIFORNIA, INC.,

 Defendant, Appellee.



 APPEAL FROM THE UNITED STATES DISTRICT COURT

FOR THE DISTRICT OF MASSACHUSETTS

 [Hon. Edward F. Harrington, U.S. District Judge]


Selya, Circuit Judge,

 Campbell, Senior Circuit Judge,

 and Cyr, Circuit Judge.



 Kevin G. Powers, with whom Robert S. Mantell and LawOffice of Kevin G. Powers were on brief for appellant. Jeffrey G.Huvelle, withwhom MelissaCole,Covington&Burling,TerryPhilipSegal,BrendaR.Sharton and Segal & Feinberg were on brief for appellee.
 Nan Myerson Evans,Bon Tempo &Evans and DavidA.
Robinson on briefof amicuscuriae NationalEmploymentLawyers Association.

June 14, 1995














CYR,Circuit Judge. PlaintiffKarin Clarkeappeals CYR,Circuit Judge.
from adistrict court judgment dismissingher sexual harassment

claimagainst herformeremployer, KentuckyFried Chickenof

California,Inc. ("KFC"), forfailure to exhaust administrative

remedies,and dismissingher relatedstate-law tortclaims on

preemption grounds.We affirm the judgment.


I I

BACKGROUND BACKGROUND
While employed by defendantKFC at a fast-food restau-

rantinSaugus,Massachusetts, Clarkewassexually harassed,

physically assaulted,and subjectedto attempted rapeby other

KFC employees. Clarke quither job andinitiated thepresent

lawsuit in Massachusetts Superior Court,alleging sexual harass-

ment,negligent and recklessinfliction ofemotional distress,

and negligent hiring, retention and supervision.

After removing the caseto federal district court, see
28 U.S.C.1441, 1446; see also id. 1332 (diversity jurisdic-

tion), KFC filed a motion to dismiss all claims, see Fed. R. Civ.

P. 12(b)(6),contending thatthe sexual harassmentclaim under

Mass.Gen. L.Ann. ch.214,1C, wasbarred forfailure to

exhaustmandatory administrativeremedies beforethe Massachu-

setts Commission Against Discrimination ("MCAD"), see Mass.Gen.
L.ch. 151B,5 (prescribingsix-month limitationperiod for

MCAD claims),9 (making section 5procedure "exclusive"), and

that Clarke'scommonlawtortclaims werepreemptedbythe

Massachusetts Workers'Compensation Act,see Mass. Gen.L. ch.

2










152, 1 et seq. (Supp. 1994).The motion to dismiss was granted
in its entirety.Clarke v. Kentucky Fried Chicken of California,
Inc., No. 94-11101-EFH (D. Mass. Aug. 17, 1994).1

II II

DISCUSSION DISCUSSION
A. Sexual Harassment A. Sexual Harassment

Clarkefirst contends thatthe districtcourt should

nothavedismissed hersexualharassmentclaim, becausethe

"jurisdictional" clauseinMass. Gen.L.Ann. ch.214,1C

(1986) ("Thesuperior court shall have jurisdiction in equity to

enforcethisrightand toawarddamages.")evinces aclear

legislative intent to except such claims from compliance with the

otherwise mandatory MCAD exhaustion requirementimposed on other

employment-based discrimination claimsunder Massachusettslaw.

In order to place her contention in context, we examine pertinent

case law and statutes, see infra APPENDIX at pp. (i)-(iii). (1st Cir. 1990).

7

---------- Post updated 23-02-12 at 03:57 AM ---------- Previous update was 22-02-12 at 11:50 PM ----------

I hate to bump this, but I really could use some help here.
I need a regex search and replace to fix the format to just look normal...

Thanks, guys!

Define "Normal". Remove extraneous blank lines?
Paginate? This is your view at 10000 feet - we have to go a lot lower or we'll trash something you do not want trashed.
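
To make the "extraneous blank lines" case concrete, here is a minimal Python sketch of the most conservative cleanup: it only collapses runs of blank lines and touches nothing else. The function name collapse_blank_lines is made up for illustration, and whether your MediaWiki search-and-replace extension accepts the same pattern is an assumption you would need to check against its documentation.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce runs of blank (or whitespace-only) lines to a single blank line."""
    # Strip trailing spaces/tabs on every line so whitespace-only lines become empty.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Three or more consecutive newlines means two or more blank lines; keep one.
    return re.sub(r"\n{3,}", "\n\n", text)


if __name__ == "__main__":
    # Small made-up excerpt in the style of the sample above.
    sample = "ERRATA SHEET\n\n\n\n\nThe opinion\n \n\nis amended"
    print(collapse_blank_lines(sample))
```

If the extension takes a PCRE-style pattern, the equivalent search would presumably be \n{3,} with two newlines as the replacement, but try it on a handful of pages first. It will not fix the run-together words or the doubled headings in the sample; those need separate rules.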

According to what I just read, MediaWiki pages are XHTML, and the editor works just like editing a page on Wikipedia. The formatting information simply refers to HTML and XHTML formatting tags, etc.

Where is there documentation on using a regex to mass edit documents?
Either I don't get it or you are barking up the wrong tree.

A priori, I would get the datastream you used to import, clean it up, remove the junk and re-import. But that seems not feasible for some reason.

Since you want an answer:

 <br /> 

is the HTML tag for a line break (the equivalent of a carriage return + line feed, i.e. a new line in text on Windows). You apparently have those embedded everywhere.
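
If the stored page text really does contain literal <br /> tags (the sample above does not show them, so this is an assumption), a run of them could be collapsed into a single paragraph break. A rough Python sketch, with collapse_br_runs as a hypothetical helper name:

```python
import re

# Two or more <br>, <br/> or <br /> tags in a row, in any letter case,
# optionally separated or surrounded by whitespace (including newlines).
BR_RUN = re.compile(r"(?:\s*<br\s*/?>\s*){2,}", flags=re.IGNORECASE)


def collapse_br_runs(text: str) -> str:
    """Replace each run of repeated <br> tags with one blank line."""
    return BR_RUN.sub("\n\n", text)


if __name__ == "__main__":
    # Made-up input; a single <br /> is deliberately left alone.
    print(collapse_br_runs("ERRATA SHEET<br /><br />\n<br>The opinion<br />is amended"))
```

A lone <br /> is left untouched on purpose, since it may be a line break someone actually wants.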

Explain to me what regex you think you need (meaning what it looks for) and how the documentation says to use that regex, and we can help.